Dataset: 11.1K articles from the COVID-19 Open Research Dataset (PMC Open Access subset)
All articles are made available under a Creative Commons or similar license. Specific licensing information for individual articles can be found in the PMC source and CORD-19 metadata
More datasets: Wikipedia | CORD-19

Logo Beuth University of Applied Sciences Berlin

Made by DATEXIS (Data Science and Text-based Information Systems) at Beuth University of Applied Sciences Berlin

Deep Learning Technology: Sebastian Arnold, Betty van Aken, Paul Grundmann, Felix A. Gers and Alexander Löser. Learning Contextualized Document Representations for Healthcare Answer Retrieval. The Web Conference 2020 (WWW'20)

Funded by The Federal Ministry for Economic Affairs and Energy; Grant: 01MD19013D, Smart-MD Project, Digital Technologies

Imprint / Contact

Highlight for Query ‹COVID-19 screening

Pathogen diversity drives the evolution of generalist MHC-II alleles in human populations


The major histocompatibility complex (MHC) genes in vertebrates encode cell surface proteins and are essential components of adaptive immune recognition. MHC proteins are endowed with highly variable peptide-binding domains that bind short protein fragments. The MHC region is one of the most polymorphic gene clusters in vertebrate genomes. Co-evolutionary arms race with pathogens is considered largely responsible for the observed exceptionally high levels of genetic diversity [3–6], yet it cannot fully account for the observed geographic differences in human MHC genetic diversity [7, 8]. This indicates that, beyond MHC allelic diversity, other MHC-related factors contribute to the capacity of human populations to withstand pathogens. In this paper, we argue that peptide-binding repertoire size (or, shortly, promiscuity) of MHC alleles is one important factor.

Recent empirical studies demonstrated that there is a substantial variation in the size of the bound and presented antigen repertoire across MHC class I alleles. Certain MHC class I alleles appear to be promiscuous and are capable of binding an exceptionally large set of epitope peptide segments [9, 10]. For example, Paul and colleagues carried out bioinformatics analysis to predict the binding capacity of common HLA-A and HLA-B alleles to a set of 30,000 dengue virus–derived peptides. The analysis revealed over 16-fold variation in the number of peptides bound by the different alleles, indicating significant variation in epitope repertoire size across HLA molecules. The authors selected three alleles for further study in an in vivo transgenic mouse model. Immunization of the corresponding HLA transgenic mouse strains with a set of dengue virus–derived peptides revealed a positive relationship between epitope repertoire size and immunogenicity. Similarly, Kosmrlj and colleagues computed the fraction of self-peptides that bind to various HLA-B molecules, and found that this fraction varies extensively across four HLA-B alleles. The authors then demonstrated that the self-peptide–binding repertoire of HLA-B shapes the native repertoire of T-cell clones developed in the thymus, with implications for recognizing human immunodeficiency virus (HIV) epitopes. Their results could explain why individuals carrying HLA-B*57 alleles can maintain low HIV RNA without therapy. Remarkably, analogous MHC class I alleles with the HLA-B*27 superfamily is widespread in Chinese rhesus macaques, animals which show especially slow progression of simian immunodeficiency virus (SIV)/HIV. Finally, by focusing on seven chicken MHC class I haplotypes and four human HLA-B alleles, Chappel and colleagues demonstrated that MHC class I molecules that can bind a wide range of viral epitopes show lowered expression on the cellular surfaces of immune cells, such as monocytes and lymphocytes. The authors suggested that the breadth of epitope-binding repertoire shapes genetic susceptibility to Marek’s disease virus in chickens and HIV disease progression in humans.

More generally, by recognizing more peptide segments, promiscuous MHC molecules may promote immune response against a broader range of pathogens and are hence generalists [9, 10]. Prior case studies in chicken indicate that this may be so [14–17]. However, it remains to be established whether this relationship generally holds across MHC class I and II alleles and a wide range of infectious diseases. Specifically, we propose that in regions of high pathogen diversity, human populations should carry promiscuous MHC alleles. Moreover, as migrating human populations have been exposed to changing sets of pathogens, shifts in MHC promiscuity level should have occurred repeatedly and in a rapid manner during the course of human evolution.

To test these predictions, we first focused on the human HLA class II DRB1 gene, for several reasons. First, DRB1 is the most variable HLA class II locus, with over 2,000 registered alleles. Together with HLA-DRA, HLA-DRB1 encodes the heterodimeric HLA-DR protein complex, but HLA-DRA is basically invariant. Second, DRB1 shows the strongest general signature of selection among HLA class II loci, while at the same time showing the weakest evidence for divergent allele advantage, an alternative mechanism at the genotype level for presenting a broader set epitopes. Third, DRB1 has diversified very rapidly in the human lineage. Many of the DRB1 alleles appear to be human specific and most likely evolved after the migration of ancestral human populations out of Africa. These periods have been associated with human populations encountering numerous new pathogens [18, 23]. For other HLA class II loci, the level of genetic diversity is lower [19, 24], probably driven by selection for functions partly unrelated to pathogens. Notably, HLA-DQ has a fundamental role in the development of immune tolerance [25, 26], while HLA-DP contributes to the presentation of epitopes of intracellular origins [27–29]. Fourth, epitope-binding prediction algorithms show higher accuracy for DRB1 than for other HLA class II loci [30, 31]. Finally, the abundance of DRB1 on the cell surface is especially high compared with other HLA class II molecules [32–34]. Subsequently, we also evaluated promiscuity patterns of HLA class I molecules.

Estimates on epitope-binding promiscuity were derived from two sources: experimental assays that measured individual peptide–MHC interactions in vitro and systematic computational predictions. In a series of analyses, we show that predictions of our hypothesis are upheld, regardless of how HLA-DRB1 promiscuity level is estimated.

Estimating HLA-DRB1 promiscuity level

Given that large-scale experimental assays to measure individual peptide–MHC interactions are extremely tedious, we first employed established bioinformatics tools to predict the binding affinities of experimentally verified epitope peptides for a panel of 162 nonsynonymous HLA-DRB1 alleles, all of which are present at detectable frequencies in at least one human population [35–37]. The set of investigated epitopes was derived from the Immune Epitope Database (IEDB) and contains 2,691 peptide epitopes of 71 pathogens known to be bound by certain HLA class II alleles (S1 Data). Epitopes showing high levels of amino acid similarity to each other were excluded from the analysis (See Methods). Most included epitopes are 15 to 20 amino acids long and are found in only one of the 71 pathogens (S1 Fig, S1 Data). The NetMHCIIpan algorithm was used to predict individual epitope–MHC interactions, not least because it outperforms other prediction algorithms. The breadth of epitope-binding repertoire or, shortly, the level of promiscuity of individual HLA-DRB1 alleles was estimated as the fraction of epitopes with a binding affinity stronger than 50 nM to the given MHC molecule. This threshold corresponds to high-affinity binding, which is frequently observed in MHC molecules displaying immunodominance. We found large variation in promiscuity levels across HLA-DRB1 alleles (S2 Fig, S2 Data). Using a smaller dataset with information from both approaches, we show that the computationally predicted and the empirically estimated promiscuity values are strongly correlated with each other (Spearman’s rho: 0.78, P = 0.004, S3 Fig). Moreover, our results are robust to changes in the affinity threshold (S4A to

S4C Fig), usage of other prediction algorithms (S4D Fig), and variation in the epitope data employed (S4E Fig). As expected, promiscuous HLA-DRB1 alleles can present epitopes from a broader range of pathogen species (S5 Fig). Reassuringly, there was no correlation between allele promiscuity values and the amount of data per allele used for the training of the algorithm (Spearman’s rho: −0.38, P = 0.21).

Global distribution of promiscuous HLA-DRB1 alleles

Taking advantage of the confirmed reliability of computational predictions, we next investigated the geographic distribution of HLA-DRB1 alleles. We first collected high-quality HLA-DRB1 allele prevalence data of 96 human populations residing in 43 countries from two databases and an article [35–37]. The weighted average of promiscuity level in each population was calculated based on the promiscuity values and allele frequencies of individual alleles in the population (See Methods). The analysis revealed a large variation in mean promiscuity across geographical regions and the corresponding human populations (S1 Table). Importantly, several distantly related but highly promiscuous alleles contribute to this pattern (S1 Table). Notably, an especially high allelic promiscuity level was found in Southeast Asia, an important hot spot of emerging infectious diseases. To minimize any potential confounding effect of high genetic relatedness between neighboring populations, we merged populations with similar HLA allele compositions for all further analyses (See Methods).

Link between pathogen diversity and HLA-DRB1 promiscuity level

Using the Global Infectious Diseases and Epidemiology Network (GIDEON), we compiled a dataset on pathogen richness in the corresponding 43 geographic regions. It consists of 95 diseases caused by 168 extracellular pathogens, including diverse bacterial species, fungi, protozoa, and helminthes. Using the same protocol, we additionally compiled a dataset on the prevalence of 149 diseases in the same regions caused by 214 viral and other obligate intracellular pathogens. The dataset and methodology employed for the analysis are standardized and have been used previously in similar contexts [7, 8, 42].

We report a strong positive correlation between extracellular pathogen diversity and mean promiscuity: HLA-DRB1 alleles that can bind epitopes from a broader range of pathogens are more likely to be found in regions of elevated pathogen diversity (Fig 1A). This pattern is unlikely to be explained by confounding factors, such as country size or HLA-DRB1 genetic diversity across countries (S2 Table). By contrast, we found no significant association between HLA-DRB1 promiscuity level and diversity of intracellular pathogens (Fig 1B). We conclude that the geographical distribution of promiscuous HLA-DRB1 alleles has been mainly shaped by the diversity of extracellular pathogens.

The above results hold—and are even stronger—when estimates on promiscuity were derived from empirical in vitro MHC binding data (shortly, in vitro promiscuity), downloaded from the IEDB database (Fig 1C and 1D, S2 Table and S3 Data). However, these results do not exclude the possibility that the geographical link between pathogen diversity and promiscuity is indirect. More direct support on the causal relationship between the two variables comes from analysis of prior human genetic studies. To investigate this issue, we focused on two geographically widespread allelic groups with exceptionally high (HLA-DRB1*12) and exceptionally low (HLA-DRB1*03) promiscuity values, respectively, and conducted literature mining on their reported associations with infectious diseases (S3 Table). As expected, HLA-DRB1*12 was associated with protection against at least five infectious diseases, while HLA-DRB1*03 was associated with susceptibilities to eight infectious diseases, which is highly unlikely by chance (Fisher test, P = 0.003) (S3 Table, S5 Data). The data also indicate local adaptation towards elevated promiscuity under diverse pathogen pressure. The HLA-DRB1*12:02 allele is prevalent in specific regions of Southeast Asia. Compared with other alleles detected in this region, HLA-DRB1*12:02 has a relatively high promiscuity value (top 20%, Fig 2A, S2 Data). The high frequency of HLA-DRB1*12:02 has been previously suggested to reflect pathogen-driven selection during the migration of a Mongolian population to South China. Indeed, this allele is associated with protection from recurrent pulmonary tuberculosis, recurrent typhoid fever, and hepatosplenic schistomiasis (S5 Data), all of which are endemic diseases in Southeast Asia [44–46]. Remarkably, the frequency of this allele increases with extracellular pathogen diversity in this region (Fig 2B). Together, these observations support the hypothesis that promiscuous epitope binding of HLA-DRB1 alleles is favored by selection when extracellular pathogen diversity is high.

Evolution of promiscuous HLA alleles

An important unresolved issue is how promiscuity has changed during the course of human evolution. Under the assumption that local pathogen diversity drives the evolution of epitope recognition of HLA class II alleles, promiscuity as a molecular trait should have evolved rapidly as human populations expanded into new territories. To investigate this issue, we combined an established phylogeny of HLA-DRB1 alleles with predicted epitope-binding promiscuity values. We found that alleles with a high promiscuity level have a patchy distribution across the tree (S6 Fig), indicating that high promiscuity has multiple independent origins. To investigate this observation further, we selected a set of 96 HLA-DRB1 alleles with a detectable frequency in at least one human population and appropriate sequence data (see Methods). A comparison of all pairs of these alleles revealed that even very closely related alleles show major differences in promiscuity levels (Fig 3A). For example, alleles belonging to the HLA-DRB1*13 group show over 98% amino acid sequence identity to each other, but display as much as 57-fold variation in the predicted promiscuity levels. We conclude that the switch between high and low promiscuity levels has occurred repeatedly and in a rapid manner during the allelic diversification of the HLA-DRB1 locus.

We next asked how selection on promiscuity has shaped the genetic diversity along the epitope-binding region of HLA molecules. To quantify protein sequence variability at each amino acid position, we calculated the Shannon entropy index based on the alignment of the 96 selected HLA-DRB1 alleles from above. For each position, we also calculated promiscuity fragility, that is, the median impact of single amino acid substitutions on promiscuity (see Methods). A strong positive correlation was found between Shannon entropy and promiscuity fragility (Fig 3B, Spearman’s rho = 0.76, P = 0.0001). Importantly, this conclusion does not depend on how sequence polymorphism was estimated (S7 Fig). Accordingly, amino acid positions with a large impact on epitope-binding promiscuity are highly variable in human populations. Furthermore, those sites in the epitope-binding region that are under positive selection tend to have high promiscuity fragility values (Wilcoxon rank sum test, P = 0.0012, Fig 3B; see also S8 Fig). The above data suggest a link between allele promiscuity and HLA-DRB1 diversification, probably as an outcome of selection for locally optimized promiscuity levels.

Finally, we note that several variable molecular sites in the binding region of HLA-DRB1 affect epitope-binding characteristics without any major impact on promiscuity per se. For example, our computational analysis indicates that mutations at amino acid site numbers 9 and 47 do not seriously affect promiscuity level (Fig 3B). However, several mutations at these sites are associated with binding self-peptides and thereby shape vulnerability to specific autoimmune diseases [49, 50].

Pathogen diversity and HLA class I promiscuity

The relationship between pathogen diversity and epitope-binding promiscuity may be more general, as similar results hold for the HLA-A locus. HLA-A is one of the three types of classical human MHC class I molecules and is mainly involved in the presentation of epitopes from intracellular pathogens. In agreement with expectation, we report a positive correlation between local intracellular pathogen diversity and the HLA-A promiscuity level of the corresponding human populations (S9A and S9B Fig, S3 Data). No marked positive correlation was found for two other MHC class I genes (HLA-B and HLA-C, see S9C to

S9F Fig, and S3 Data). Therefore, other unrelated evolutionary forces may shape the geographical distribution of promiscuous HLA-B and HLA-C alleles (S1 Text).


Central players of the adaptive immune system are the groups of proteins encoded in the MHC. By binding short peptide segments (epitopes), MHC molecules guide both immune response against pathogens and tolerance to self-peptides. The genomic region encoding these MHC molecules is of special interest, for two reasons. It harbors more disease associations than any other regions in the human genome, including associations with infectious diseases, autoimmune disorders, tumors, and neuropsychiatric diseases [52, 53]. A growing body of literature is now revealing that certain MHC class I alleles can bind a wider range of epitopes than others, but the functional implications of this variation remain largely unknown. By recognizing a larger variety of epitopes, such promiscuous MHC alleles promote immune response against a broader range of pathogens at the individual level. Therefore, promiscuous epitope binding of MHC molecules should be favored by selection in geographic regions where extracellular pathogen diversity is high. Importantly, this mechanism is conceptually distinct from the well-established concept of heterozygote advantage at the MHC, as it concerns individual alleles and not allele combinations or genotypes.

To test this hypothesis, we combined data on the geographic distribution of human MHC class II alleles and prevalence of extracellular pathogens, empirical/computational estimates of epitope-binding promiscuity, and phylogenetic analyses. Our main findings, strongly supporting our hypothesis, are as follows.

First, in geographical regions of high extracellular pathogen diversity, human HLA-DRB1 alleles have exceptionally high epitope-binding repertoires. This suggests that the geographical distribution of promiscuous HLA-DRB1 alleles has been shaped by the diversity of extracellular pathogens. The HLA-DRB1*12:02 allele highlights this point. HLA-DRB1*12:02 is a promiscuous allele that has been associated with protection from certain infectious diseases (S5 Data). As expected, this allele is especially prevalent in regions of Southeast Asia with elevated pathogen load (Fig 2B).

It is well established that antigens presented by HLA class II molecules derive mainly from extracellular proteins. However, HLA class II molecules have well-established roles in controlling immune response against viruses [55, 56]. Additionally, viral peptides are reported to be processed and presented also by the HLA class II pathway. Therefore, it remains to be established why intracellular pathogen diversity has no major impact on the global distribution of HLA-DRB1 alleles.

Notably, the relationship between pathogen load and epitope-binding promiscuity may be more general, as similar results hold for the HLA-A locus: we found a positive correlation between local intracellular pathogen diversity and the HLA-A promiscuity level of the corresponding human populations (S9A and S9B Fig, S3 Data).

Second, a phylogenetic analysis revealed major differences in promiscuity levels of very closely related HLA-DRB1 alleles. This suggests that high promiscuity level in HLA-DRB1 has evolved rapidly and repeatedly during human evolution. Finally, amino acid positions with a prominent role in shaping HLA-DRB1 promiscuity level are especially variable in human populations and tend to be under positive selection. In sum, we conclude that HLA promiscuity level is a human trait with paramount importance during adaptation to local pathogens.

Our work has important ramifications for future studies. MHC is the most variable region of the human genome, and the variation is associated with numerous infectious and immune-mediated diseases [52, 53, 58–62]. The impact of MHC promiscuity level on population allelic diversity is an interesting area for future research. In a similar vein, MHC allelic diversity is associated with olfaction-based mating preferences in human and other animals. The roles of MHC promiscuity in mating success and mating preferences are a terra incognita.

We note that the most promiscuous HLA-DRB1 alleles are rare in certain human populations (S1 Table; S2 Data). This suggest that these alleles are not particularly favored by natural selection in these areas. Why should it be so? First, high promiscuity may not be able to cope with the rise of novel and highly virulent pathogens. In such cases, displaying a particular epitope might be the most efficient way to achieve resistance, and high promiscuity might be suboptimal due to a reduced specificity [9, 10]. Second, high promiscuity level may elevate the risk of immune reactions against host tissues and non-harmful proteins [9, 64]. Clearly, future work should elucidate the evolutionary trade-offs between protection from pathogens and genetic susceptibility to autoimmune diseases. This will require high-throughput experimental methods to determine epitope-binding repertoire, and HLA transgenic mice studies on the role of promiscuity in immune response.

Finally, genetic variation within particular MHC genes influences vaccine efficacy, rejection rates of transplanted organs, susceptibility to autoimmune diseases, and antitumor immunity [28, 69, 70]. Our work raises the possibility that, by altering the maturation and functionality of the immune system, the size of the epitope-binding repertoire of MHC alleles itself could have an impact on these processes. The exact role of MHC promiscuity in these crucial public health issues is an exciting future research area.

Computational prediction of epitope-binding promiscuity

The IEDB has collected the results of individual and systematic studies on epitope binding by MHC alleles. The experimental studies include HLA-binding assays, T-cell activation assays, and immunopeptidomic studies as well. Epitopes of all available viral, bacterial, and eukaryotic pathogens known to be bound by at least one HLA-I or HLA-II allele were extracted from IEDB. Reference proteomes of pathogenic species that carry at least one of the collected epitope sequences were retrieved from the Uniprot database (102 for HLA-I and 71 for HLA-II epitopes). Only epitopes of these species were analyzed further. All proteomes were scanned for each epitope sequence, and epitope sequences found in only one proteome (i.e., species-specific epitopes) were kept for further analysis. Highly similar epitope sequences were identified using Clustal Omega and excluded as follows. A protein distance matrix was created and epitopes were discarded iteratively. In each iteration, the epitope pairs with the lowest k-tuple distance were identified. Then, the epitope with the highest average similarity to all other sequences was excluded. Iterations were repeated until distance values less than 0.5 (corresponding to greater than approximately 50% sequence identity) were eliminated from the matrix. Note that this filtering procedure was carried out separately for epitope sequences bound by HLA-I and HLA-II.

Binding affinities of the remaining 3,265 HLA-I epitope sequences to 346 HLA-A, 532 HLA-B and 225 HLA-C alleles were predicted with the NetMHCpan-4.0 algorithm. The binding of 2,691 HLA-II epitope sequences to 162 HLA-DRB1 alleles was predicted using the NetMHCIIpan-3.1 algorithm. All 162 alleles are present in at least one of the human populations studied here (see below). The “pep” sequence input format was used for both HLA-I and HLA-II epitope-binding prediction. A binding affinity threshold of 50 nM was applied, yielding peptides that are likely to be immunodominant. For alternative binding threshold definitions, see S4 Fig. For each binding threshold, epitope-binding promiscuity was defined as the fraction of the epitope set bound by a given allele.

Calculating epitope-binding promiscuity using in vitro data

To determine the epitope-binding promiscuity of HLA-DRB1 alleles based on previously published experimental data, we used the IEDB database. Specifically, we downloaded all MHC ligand and T-cell assay data, which was available for 48 HLA-DRB1 alleles. Binding data of 20 alleles screened for at least 100 ligands each were further analyzed. The epitope set of each allele was filtered for highly similar sequences, as described above. As the majority of in vitro assay data were available in a binary format (i.e., presence or absence of binding), promiscuity was calculated as the fraction of positive binding assays for a given allele.

Calculating promiscuity levels of human populations

To calculate population-level promiscuity values, we obtained HLA allele frequency data from the Allele Frequency Net Database (AFND) and the International Histocompatibility Working Group (IHWG) populations [35, 36]. Haplotype-level data of the 13th International HLA and Immunogenetics Workshop (IHIW) populations were downloaded from dbMHC (National Center for Biotechnology Information [NCBI]; Additionally, allele frequency data of the 14th and 16th IHIW populations, as published by Riccio and colleagues, and populations in the AFND were used in the analyses. To avoid potential confounding effects of recent genetic admixture and migration, we focused on native populations, similarly to previous studies (S1 Table) [7, 8]. We excluded IHWG populations reported to deviate from Hardy-Weinberg equilibrium. Among the AFND populations and IHWG populations without haplotype-resolution data (14th and 16th IHIW), those comprising less than 100 genotyped individuals or those in which the sum of allele frequencies deviated from 1 by more than 1% were excluded. Populations reported in multiple databases were included only once in the analysis.

For each HLA loci, we calculated mean population promiscuity by averaging promiscuity values of alleles weighted by their relative frequencies in the populations. In all of these calculations, we used standardized (i.e., z-score) promiscuity values to make the in silico and in vitro values more easily comparable. Finally, when calculating population-level promiscuity based on in vitro promiscuity data, we excluded populations for which in vitro promiscuity values could be assigned to less than 50% cumulative allele frequency.

To tackle the issue of nonindependence of data points, we focused on populations instead of countries and grouped those populations that have highly similar HLA allele compositions, based on standard measures of genetic distance (see below). We merged populations with highly similar HLA allele compositions, allowing us to avoid pseudoreplication of data points while retaining informative allele frequency differences between populations residing in the same broad geographical areas.

To this end, we first generated a genetic distance matrix between populations with the adegenet R library using allele frequency data of the examined locus. We used the Rogers’ genetic distance measure because it does not assume that allele frequency changes are driven by genetic drift only, an unlikely scenario for HLA genes. Next, populations were merged using a network-based approach. Populations were treated as nodes and two nodes were connected if their genetic distance was under a cutoff value. Populations were grouped in an iterative manner. In each iteration, all maximal cliques (i.e., subsets of nodes that are fully connected to each other) in the network were identified. Maximal cliques represent groups of populations in which all populations have similar allele compositions to each other. Then, mean genetic distance within each clique was calculated. The clique with the lowest average distance was selected and its populations were grouped together. Then, this clique was deleted from the network. Iterations were repeated until no maximal cliques remained in the network. Grouping of populations was carried out using different distance value cutoffs (1st, 5th, 10th, and 15th rank percentile of all distance values). The resulting population groups and the individual populations that remained in the network were treated as independent data points in subsequent statistical analyses. Mean promiscuity level in population groups was calculated by averaging population promiscuity values.

Unless otherwise indicated, all figures are based on population groups using the 15th percentile genetic distance cutoff value. Importantly, using different cutoffs has no impact on our results (S3 Data). Finally, we note that genetic differences among human populations mostly come from gradations in allele frequencies rather than from the presence of distinctive alleles. Therefore, traditional clustering of populations based on HLA composition would have been ill-suited for our purposes.

Pathogen diversity

Data on 309 infectious diseases were collected from GIDEON. For each disease, the number of causative species or genera (when species were not listed for the genus) was determined using disease information in the GIDEON database, as described previously. Causative agents were classified into obligate intracellular and extracellular pathogen groups based on a previous study and literature information. Putative facultative intracellular pathogens were excluded from the analysis. Diseases caused by agents that could not be clearly classified were also excluded from the analysis. Extracellular and intracellular pathogen diversity (richness) of each country was approximated by the number of identified endemic extracellular and intracellular species, respectively.

Finally, we assigned country-level measures of pathogen and HLA diversity to population groups as follows. For each population group, extracellular and intracellular pathogen counts were calculated by averaging the corresponding country-level values across the populations within the group. For example, if a population group contained two populations residing in neighboring countries, then we assigned the average pathogen diversity of the two countries to it.

Literature mining of associations between HLA alleles and infectious diseases

To examine associations between selected HLA allele groups and infectious diseases, we carried out a systematic literature search on PubMed database using the following terms:

“assoc* drb1 12 02”, “assoc* drb1 1202”, “assoc* drb1 12 01”, “assoc* drb1 1201”, “assoc* dr12”, “assoc* drb1*12”, “assoc* drb1 03 01”, “assoc* drb1 0301”, “assoc* dr3”, “assoc* drb1*03”, “assoc* dr17”, “infect* drb1 12 02”, “infect* drb1 1202”, “infect* drb1 12 01”, “infect* drb1 1201”, “infect* dr12”, “infect* drb1*12”, “infect* drb1 03 01”, “infect* drb1 0301”, “infect* dr3”, “infect* drb1*03”, and “infect* dr17”.

“assoc*” and “infect*” refer to any word beginning with these letters.

Each resulting paper containing HLA association data was examined, and statistically significant associations between allele groups (DRB1*03 or DRB1*12) or common alleles in allele groups (DRB1*12:01, DRB1*12:02, DRB1*03:01) and infectious diseases were collected. Associations with diseases caused by intracellular pathogens were excluded from the analysis. HLA-disease associations were classified as beneficial or detrimental, if all related studies supported the beneficial or detrimental role of HLA allele/allele group in the development or course of the given disease. Otherwise, the association was classified as controversial. The results were summarized (S3 Table), and statistical association between beneficial/detrimental effects and high/low promiscuity across allele groups was determined by a Fisher’s exact test.

Amino acid distance between DRB1 alleles

We used amino acid distance as a proxy for phylogenetic distance between pairs of DRB1 alleles. To this end, nucleotide sequences of DRB1 alleles that contained full exon 2 and 3 regions were downloaded from the IPD-IMGT/HLA database. To limit our analyses to alleles that have an impact on the inferred promiscuity level of a population, we considered only those sequences that had a nonzero frequency in at least one human population (see above). From allele groups that code for the same protein sequence (synonymous differences, differentiated by the third set of digits in the HLA nomenclature), one of the alleles was randomly chosen. This selection procedure resulted in 96 alleles. Multiple alignment of nucleotide sequences was performed using the MUSCLE algorithm as implemented in the MEGA software and converted to protein sequence alignments. Amino acid distance was calculated using the Jones-Taylor-Thornton substitution model in MEGA (Fig 3A). Epitope-binding region sites—as defined previously—were excluded when calculating amino acid distance. The rationale behind this exclusion is that these sites are known to be under positive selection [78, 79] and are therefore less informative on evolutionary distance. Additionally, by removing these sites, the amino acid distance remains independent of promiscuity predictions. Finally, as intragenic recombination may distort the inference of evolutionary distance, we identified such events across all alleles following the protocol of Satta and colleagues using GENECONV and RDP algorithms as implemented in the RDP software. Recombinant alleles were removed when calculating amino acid distance.

Sequence diversity and promiscuity fragility

We first defined the epitope-binding region of HLA-DRB1 alleles, as previously. To estimate sequence diversity along the epitope-binding region, we employed two measures: standard Shannon entropy and nucleotide diversity (π), a widely employed measure of genetic variation.

Using the protein sequence alignment of the 96 alleles defined above, we calculated amino acid sequence variability as the Shannon entropy of the given amino acid site as follows:


where Pi is the fraction of residues of amino acid type i at a given site, and M is the number of amino acid types observed at that site.

Nonsynonymous nucleotide diversity (πA) measures the average number of nonsynonymous nucleotide differences per nonsynonymous site between two randomly chosen protein coding DNA sequences from the same population [85, 86]. πA was calculated for each amino acid site in the epitope-binding region for each population using DnaSP software and custom-written R scripts. Nucleotide sequences of DRB1 alleles were downloaded from the IPD-IMGT/HLA database.

The calculation is based on the work of Nei and colleagues using the equation


where xi and xj are the frequencies of the ith and jth alleles in the population, respectively, and πAij is the number of nonsynonymous nucleotide differences per nonsynonymous nucleotide site between the two codon sequences of the given amino acid site in the ith and jth alleles. To calculate πA for each population, allele frequency data of human populations were obtained, as described earlier (see above). An overall nucleotide diversity index was calculated by averaging πA across populations.

To calculate each amino acid site’s impact on epitope-binding promiscuity (promiscuity fragility), promiscuity was predicted for each 19 possible amino acid change along the epitope-binding region of each of 96 alleles. The fold difference in promiscuity resulting from each amino acid substitution was calculated. The median promiscuity fold difference of each possible allele and amino acid change combination (96 × 19) was used to estimate promiscuity fragility at each amino acid position. As some of the 19 possible amino acid changes are not accessible via a single nucleotide mutation, and the accessible amino acid changes can have different likelihoods based on the codon sequence of the site and the genetic code, we also calculated promiscuity fragility based on each nonsynonymous nucleotide substitution of the codon instead of each amino acid substitution of the site.

Statistical analysis and graphical representation

All statistical analyses were carried out in R version 3.2.0. Smooth curves were fitted using the cubic smoothing spline method.