Middle East respiratory syndrome coronavirus (MERS-CoV) was first isolated from a patient in Saudi Arabia in 2012 and has been shown to cause severe acute respiratory illness, including fever, cough, and shortness of breath (Zaki et al. 2012). As of March 1, 2016, 1638 laboratory-confirmed cases (587 deaths; 36% case fatality rate [CFR]) have been reported to the World Health Organization. A South Korean outbreak of MERS began in May 2015, and its transmission continued until early July, resulting in 186 laboratory-confirmed cases with 38 deaths (20.4% CFR). In contrast to previous studies, which have suggested limited person-to-person transmissibility of MERS-CoV (Breban et al. 2013; Cotten et al. 2013b), many secondary and tertiary cases of transmission occurred during the South Korean outbreak. Importantly, more than half of the tertiary cases were transmitted from one particular super-spreader, called Patient 14 in this study. This unusual transmission pattern raised questions related to transmissibility as well as the potential adaptations of MERS-CoV to the human host. To address these questions, several researchers have investigated MERS-CoV sequences. However, all previous studies on the South Korean outbreak have focused only on the consensus sequences of MERS-CoVs (Wang et al. 2015; Kim et al. 2016a,b; Park et al. 2016; Seong et al. 2016).
The evolutionary dynamics of RNA viruses are complex because of their high mutation rates, rapid replication rates, and large population sizes. Multiple rounds of replication of a given viral genome can generate a cloud of diverse variants or a heterogeneous viral population. Previous reports on other RNA viruses have suggested that quasispecies diversity, rather than the selection of individual variants, correlates with pathogenicity and enables adaptation to new environments (Vignuzzi et al. 2006; Lauring and Andino 2010). The importance of characterizing viral populations as swarms with similar variants has led several groups to use next-generation sequencing (NGS) technologies on clinical samples to explore the complexity of the spectrum of mutant viruses, including HIV-1 and hepatitis B virus (Bushman et al. 2008; Eriksson et al. 2008; Margeridon-Thermet et al. 2009). Although the description of MERS-CoV quasispecies is far from completion, previous studies that recovered whole MERS-CoV genome sequences from dromedary camels revealed the existence of intrahost single-nucleotide variants (Briese et al. 2014; Borucki et al. 2016).
To date, no intrahost MERS-CoV single-nucleotide variants have been identified in humans, except by Cotton et al. (2013a), where data from one specimen showed the presence of non–consensus variants (<5%). Currently, it is hard to discern whether only consensus sequences have been reported or whether human MERS-CoV sequences represent almost clonal virus populations within individual cases. By analyzing non–consensus sequences in 35 specimens from 24 patients infected during the South Korean outbreak (2015), we demonstrated the existence of intrapatient heterogeneity and investigated its functional implications.
Patients and Sample Collection
Among the 24 cases, 14 were patients, four were caregivers, and six were health-care workers (HCWs) (Table 1). Patient 14 was a second-generation case who had been exposed to the index patient. Exposure to Patient 14 led to a second wave of the outbreak, resulting in a total of 81 third-generation infections (Oh et al. 2015). Specimens from 20 of these third-generation cases were analyzed in this study (Supplemental Fig. S1). Patients 162, 164, and 169 were HCWs and fourth-generation case patients who had been exposed to Patient 135.
To determine whether viral genomic variability was related to disease severity, patient cases were categorized as mild (symptomatic cases without pneumonia; n = 3), moderate (cases with pneumonia; n = 11), and severe (cases with respiratory failure or death; n = 10) (Table 1). In eight cases, multiple lower respiratory tract specimens were obtained (Supplemental Table S1). Specimens from patients 50, 77, and 135 were obtained before and after respiratory failure.
Characterization of MERS-CoV Genome
Specimens from eight patients who had been identified as MERS-CoV-positive were sequenced. The samples were found positive by both upstream-of-the-envelope gene (upE) and open reading frame 1a (orf1a) real-time polymerase chain reaction (PCR) assays, with cycle threshold (Ct) values averaging 17.3 (15.4–22.6) and 18.3 (15.4–23.0), respectively (Supplemental Table S1). For each patient, a MERS-CoV consensus sequence that indicated the major nucleotide at each genomic position was generated. The genome sequences of the eight isolates differed from each other at only seven positions (Table 2). The nucleotide substitutions were all nonsynonymous and occurred in the orf1ab (n = 3) and S (n = 4) genes (Table 2). The sequences of the eight isolates had high nucleotide identities (ranging from 99.96% to 100%, with 99.98% to 100% sequence identities in ORF 1a and 1b [orf1ab] and 99.85% to 100% identities in the spike glycoprotein gene) with recently published sequences of MERS-CoV isolates from the outbreak in South Korea (Kim et al. 2015; Lu et al. 2015; Seong et al. 2016). The E, M, and N genes were 100% identical to the previously described MERS-CoV isolates from South Korea. The sequence from Patient 14 in this study differed from the sequence from a previous study at two positions (Seong et al. 2016). Because the frequencies of the major nucleotides at these positions were close to 50% in this study, the differences between the consensus sequences may have arisen from small technical variations in measurement.
Using 105 previously published genome sequences of MERS-CoVs, including those from recent cases in South Korea, we analyzed the phylogenetic relationships of eight newly sequenced isolates. The new sequences clustered together with the other MERS-CoV isolates from the 2015 South Korean outbreak (Supplemental Fig. S2; Kim et al. 2015; Lu et al. 2015; Seong et al. 2016).
Presence of Intrapatient Heterogeneity in MERS-CoVs
Analyzing the MERS-CoV genome sequences, we noticed that a significant number of nucleotide sites had mixed bases. Because deep sequencing not only generates a consensus sequence but also identifies non–consensus nucleotides at each position, Cotton et al. (2013a) analyzed nucleotide variants present at >1% frequency to determine the variants important for intrahost evolution. In the previous study, the nucleotide variants present at a frequency greater than the sequencing error rate (i.e., 1%) were considered to be true variants, although this estimate may be conservative for most applications. However, errors introduced during reverse transcription and PCR amplification can significantly distort estimates of allele frequencies, especially if a limited amount of RNA is used as an input. Thus, in this study, we evaluated the reproducibility of allele frequencies of non–consensus nucleotides. For this purpose, duplicate libraries generated by independent complementary DNA (cDNA) synthesis using the same RNA samples were sequenced and compared. Allele frequencies of duplicate samples from Patients 80 and 162 were significantly correlated, implying that the majority of mixed bases were a reflection of intrapatient heterogeneity rather than a result of sequencing errors (Fig. 1A,B).
Despite the relatively high reproducibility, we focused on nucleotide variants present at a relatively high frequency (i.e., ≥30%) to minimize false positives among non–consensus variants. Positions where a nucleotide variant was present at a frequency of ≥30% in any of the eight isolates were selected for further investigation using additional samples. Targeted deep sequencing was performed on an additional 27 samples targeting 3003-bp regions that included the variable sites. Eleven libraries from independent cDNA syntheses of five samples, including four libraries in duplicate and one in triplicate, were constructed to evaluate the technical noise in measuring allele frequency. Only variants with frequencies significantly greater than those from technical noise were considered true variants. Technical noise is described in Methods and was based on a 5% significance level after performing Bonferroni adjustment. A total of 16 positions displaying intrapatient heterogeneity were identified through statistical testing. The results from seven specimens were validated using Sanger sequencing. We tested nine sites and confirmed concordant mixed bases at seven positions—namely, positions 6884, 7317, 11257, 21726, 22356, 22984, and 23041 (Fig. 1C), but found discrepancies at positions 7322 and 19075. Although we were not able to validate all variants by Sanger sequencing owing to a limited sample volume, the results from the test set showed a 78% validation rate. In addition, the frequencies of the two variants at positions 22984 and 23041 (i.e., G at position 22984 and T at 23041; A at 22984 and C at 23041) were highly correlated in the specimens (Supplemental Fig. S3), suggesting a tight linkage between these variants instead of random sequencing errors.
Fourteen positions exhibiting mixed bases are presented in Figure 2. These include all seven nucleotide variants identified in the full-genome sequences of the eight MERS-CoV isolates, indicating that intrapatient MERS-CoV heterogeneity was closely associated with isolate-specific variants in the consensus sequences. Taken together, our data demonstrate the intrapatient heterogeneity of MERS-CoVs.
Interactions Between Genetic Variants
As described above, there was a tight linkage between positions 22984 and 23041—the variant at one position and the wild type at the other position or vice versa (i.e., D510G and I529; D510 and I529T). Based on the genetic linkage, a functional relationship was expected between the variants at these two positions. In fact, these two mutations were located in one of the two major binding patches in the receptor-binding domain (RBD), where escape mutations from neutralizing antibodies were previously reported (Tang et al. 2014).
Based on the tight correlation of base frequencies at positions 22984 and 23041, our data suggested that the double mutant carrying D510G and I529T is rare. Therefore, we selected sequencing reads covering both positions in each specimen to measure the frequencies of the double mutants (i.e., D510G, I529T), the single mutants (i.e., D510G, I529 and D510, I529T), and the wild type (i.e., D510, I529). On average, the frequency of the double mutant was only 1.4% ± 0.3%, whereas the single mutants and the wild type were present at 87.7% ± 1.9% and 6.5% ± 1.7%, respectively (Fig. 3). Recently, both D510G and I529T mutations in RBD were shown to reduce its affinity for human CD26 compared with the wild-type RBD (Kim et al. 2016c). Although the previous study did not test the binding affinity of the double mutant, in our data the two mutations are mutually exclusive, suggesting that the double mutant severely impairs viral fitness. Notably, although the frequency of each single mutant varied greatly among specimens, the combined frequency of both single mutants was consistently high in most samples. At the same time, the frequency of the wild type was no more than 10% in most samples (31 of 35 specimens). Thus, our data suggest that there was strong selection pressure favoring the variants of the spike glycoprotein with reduced affinity for the host receptor.
Another notable nonsynonymous substitution was A107V (at position 11257), which occurred in the nonstructural protein 6 (nsp6) coding region within orf1ab. This A107V variation in nsp6 and the I529T substitution in the spike glycoprotein were frequently found in the South Korean isolates and appeared to be genetically linked (Fig. 2; Supplemental Table S2). Nsp6 is a membrane-spanning protein and is an integral component of the viral replication complex involved in double-membrane vesicle formation (Lundin et al. 2014). Although the functional importance of these MERS-CoV variants in viral replication and their interaction with the host immune system remain to be elucidated, our data showed that the selection of these variants may not be independent from each other, suggesting the unit of selection may be a combination of variants rather than individual variants.
In this study, we demonstrated the intrapatient heterogeneity of MERS-CoVs, ruling out the possibility of technical noise. Based on the statistical test described above, we first selected 16 non–consensus variant candidates with frequencies significantly higher than those from technical noise. Among those variant candidates, we tested nine candidates by Sanger sequencing and validated seven variants, resulting in a 78% validation rate. After removal of the two nonvalidated candidates, we ultimately listed a total of 14 non–consensus variants, expecting an 89% validation rate. There was a tight correlation between the frequencies of the two variants at positions 22984 and 23041. These results suggest that technical artifacts were highly unlikely to have generated such non–consensus variants. Our analysis was limited to relatively high-frequency variants because the sequencing error rate presents certain limitations for the study of RNA as opposed to DNA viruses. Technical advances such as the use of unique molecular identifiers to remove PCR errors may produce more accurate descriptions of MERS-CoV populations in patients (Kinde et al. 2011). Despite the limitations, we demonstrated the presence of heterogeneous MERS-CoV populations in clinical samples from MERS-CoV-infected patients.
Given the large number of cases that resulted from exposure to Patient 14, it was intriguing that Patient 14 displayed the highest intrapatient heterogeneity. Among the 10 single-nucleotide variants that were observed in more than one patient, six variants were present at frequencies >0% in Patient 14 (Fig. 2). Thus, the majority of intra- and interpatient heterogeneity observed in different patients seemed to originate from the variants present in Patient 14. Our results showed that intrapatient heterogeneity can be transmitted in some cases, indicating that humans can be simultaneously infected with more than one MERS-CoV.
Because most of the mixed bases observed in Patient 14 were not completely fixed in the subsequent generation of cases, the single-nucleotide variants might not have had significantly higher fitness than wild type. For example, the sequence from Patient 14 had a mixed base C/T at position 11257. Among sequences from patients exposed to Patient 14, a subset displayed either C or T alone with little intrapatient heterogeneity, whereas others showed similar intrapatient heterogeneity to Patient 14. In addition, we could not find any significant differences in genetic variants based on disease severity group (Supplemental Tables S3 and S4). Taken together, genetic variant composition in each patient varied, regardless of the transmissibility or disease severity, suggesting that transmission of individual genetic sequences was stochastic rather than selective. Even if the genetic variants had a selective advantage, the advantage of individual genetic sequences might have been weak and/or varied depending on the patient.
Although we could not provide a clear picture of the functional interactions among distinct genetic variants in the population, the high ratio of nonsynonymous to synonymous substitutions offered a clue to their functional impact. The 14 variants consisted of 12 nonsynonymous and two synonymous substitutions. Assuming that the unit of selection might not be an individual variant but the population as a whole, we calculated dN (the number of nonsynonymous changes per nonsynonymous site) and dS (the number of synonymous changes per synonymous site) by the Pamilo–Bianchi–Li method (Li 1993; Pamilo and Bianchi 1993). The dN/dS ratio was 1.03, hinting at the possibility of positive selection. Notably, six of the 12 nonsynonymous substitutions were found in the spike glycoprotein-coding region. As quasispecies theory proposes, the unit of selection might not be an individual virus but the population as a whole. Our data showed that the frequencies of the single mutants (D510G, I529 and D510, I529T) significantly fluctuated among specimens, but the combined frequency of the single mutants was consistently high in most samples, suggesting a combination of variants as the unit of selection.
Recently, in an analysis of 13 MERS-CoV genomes associated with the 2015 outbreak in South Korea, Kim et al. (2016c) reported that 11 of those genomes had an I529T mutation in RBD, and one had a D510G mutation. The study showed that D510G and I529T mutations resulted in reduced affinity of RBD for the human CD26 receptor compared with the wild-type RBD. Based on the spread of a mutant MERS-CoV with reduced affinity for the receptor, the authors suggested that MERS-CoV adaptation during human-to-human spread might be driven by host immunological pressure such as neutralizing antibodies (Kim et al. 2016c). Consistent with the previous analysis, our data showed that the I529T and D510G mutations were observed in the consensus sequences from 29 and 4 of 35 samples, respectively. Furthermore, our analysis at an intrapatient level showed that the frequency of the wild type at both positions was only 6.5% ± 1.7% on average, supporting a strong selection pressure favoring the variants over the wild type.
Thus, we questioned whether the selection pressure was exerted by a host immune response such as neutralizing antibodies, as previously suggested (Kim et al. 2016c). In such cases, a reduction in host immune pressure might increase the frequency of the wild type. Four specimens displaying relatively high frequency (i.e., >10%) of the wild type belonged to Patients 77 and 80. We noticed dramatic changes in routine blood test results such as white blood cell (WBC) counts and C-reactive protein (CRP) levels in these patients. We analyzed 19 serial specimens from eight patients. For this, we used normalized WBC count values expressed as a percentage of the first of a series of samples from each patient. Serial samples from Patients 77 and 80 displayed a dramatic decrease in the normalized WBC count and a simultaneous increase in the frequency of the wild-type allele (Supplemental Fig. S4). Whereas WBC counts significantly decreased in those patients, the CRP level increased during the period of MERS-CoV infection, indicating an impaired immune response. Although these data are limited owing to the small sample size and the lack of direct measurement of host immunological pressure, our results suggest that the selection pressure exerted by the host immune response might favor variants with reduced affinity to the host receptor; therefore, a reduction in this selection pressure resulted in the expansion of viruses with the wild-type allele. The characterization and quantification of neutralizing antibodies in patients over time is required to determine their association with the mutants and to validate the hypothesis. Moreover, more accurate descriptions of MERS-CoV populations within patients will advance our understanding of the complex molecular evolution of MERS-CoVs.
Because the evolution of a virus is a continuous process that takes place during intrapatient infection and interpatient transmission, the evolution of MERS-CoVs should be studied at both intra- and interpatient levels to obtain a better understanding of the principles underlying the complex molecular evolution of MERS-CoVs in natural populations. However, our knowledge of MERS-CoV evolution primarily depends on analysis at the interpatient level, ignoring genetic diversity within individual patients. In this study, we demonstrated intrapatient heterogeneity in human MERS-CoV isolates. Based on the analyses of the genetic diversity of MERS-CoVs at both the intra- and interpatient levels, our results shed light on the evolutionary dynamics of MERS-CoVs associated with the South Korean outbreak.
Collection of Clinical Specimens
A total of 35 clinical specimens from 24 patients were determined positive for MERS-CoV and were included in this study. Clinical information and information about possible exposure to other MERS-CoV patients was collected from the electronic medical records and publicly available data from multiple sources including the South Korea Centers for Disease Control and Prevention and the South Korea Ministry of Health and Welfare. These data included age, gender, epidemiologic link, dates of suspected exposure to MERS patients, initial symptoms, date of symptom onset, and clinical courses.
MERS-CoV Real-Time Reverse Transcription PCR Assays
RNA was extracted from clinical specimens using either a QIAamp DSP Viral RNA Mini Kit (Catalog No. 61904, QIAGEN GmbH) or an automated MagNAPure 96 extraction instrument with a total nucleic acid isolation kit (Roche). The extraction was performed according to the manufacturers’ instructions and the extracted RNA was stored at −70°C.
MERS-CoVs were detected in specimens using real-time reverse transcription (rRT)-PCR. Extracted RNAs were screened by targeting the upE region, and positive results were confirmed by a subsequent amplification of orf1a using a PowerChek MERS Real-Time PCR kit (Kogene Biotech, Seoul, South Korea). All rRT-PCR reactions were performed using the 7500 Fast Real-Time PCR System (Applied Biosystems) with a total reaction volume of 20 μL (15 μL of PCR reaction mixture and 5 μL of template RNA). The thermocycling conditions included a reverse transcription reaction for 30 min at 50°C, followed by 10 min at 95°C, and then 40 cycles of 15 sec at 95°C and 60 sec at 60°C. A positive viral template control and a nontemplate control were included in each run. The glyceraldehyde-3-phosphate dehydrogenase (GAPDH) gene was amplified simultaneously as an internal control to monitor PCR inhibition. A positive result was identified by a well-defined exponential fluorescence curve that crossed the defined threshold at ≤35 cycles in both the upE and ORF1a assays.
Reverse Transcription and PCR Amplification for Sequencing
Five microliters of viral RNA from each sample was used as a template for cDNA synthesis using a Superscript III First-Strand Synthesis System (Life Technologies). Equal amounts of cDNA product were used to perform PCR with a Herculase II Fusion DNA polymerase (Agilent Technologies). For full-genome sequencing, primers described by Cotten et al. (2013a) were used with minor modifications for efficient amplification. A set of 60 primers was used to generate fifteen 2.5-kb overlapping amplicons (four primers per amplicon) and three additional primers were added for the extreme termini of the genome (Supplemental Table S5). Primers used for the targeted sequencing are listed in Supplemental Table S6. Primers and nucleotides were removed from the final PCR products using a MiniElute PCR purification kit (QIAGEN).
The PCR products from each patient were combined into a pool of approximately equal molarity and subjected to paired-end library construction with a TruSeq Nano DNA Sample prep kit (Illumina). Libraries were sequenced using an Illumina MiSeq sequencer (Illumina), generating 150-bp paired-end reads. On average, 4113× and 9361× raw read depths were achieved for full-genome and targeted deep sequencing, respectively. Sequencing metrics including mean depth of coverage of each sample are summarized in Supplemental Table S7. Bases were called with MiSeq reporter software (ver. 2.4.60). The reads were quality filtered using the command “fastq_quality_filter,” which required a minimum of 90% bases with a quality score of 20 or higher. Paired-end reads were aligned with the National Center for Biotechnology Information (NCBI) MERS-CoV reference sequence (NC_019843.3) using the Burrows–Wheeler Aligner (BWA) (Li and Durbin 2010). SAMtools v0.1.19 (Li et al. 2009), GATK v2.4-7 (McKenna et al. 2010), and Picard v1.93 were used for sorting SAM/BAM files, local realignment, and duplicate markings, respectively. We used the SAMtools “mpileup” command to determine the read depth. The consensus sequence was determined from the most commonly expressed nucleotide at each position.
We downloaded a total of 105 human MERS-CoV genome sequences from the NCBI database (http://www.ncbi.nlm.nih.gov/genomes/VirusVariation/Database/nph-select.cgi?cmd=database&taxid=1335626). Combining the eight MERS-CoV genome sequences from our study with the downloaded sequences, a phylogenetic analysis was performed after aligning the sequences using MUSCLE (http://drive5.com/muscle). A total of 30,130 aligned nucleotides were input into MEGA 7.0 (http://www.megasoftware.net) (Tamura et al. 2013), and a phylogenetic tree was constructed using the neighbor-joining method. The GenBank accession number followed by a brief description of the sequence, including the virus variant name and the country of isolation, is provided for each sequence in Supplemental Figure S2. Evolutionary distances were computed using the Kimura two-parameter model. The scale bar at the bottom of the figure indicates nucleotide substitutions per site.
Determination of Technical Noise and Significance Test for Variants
We used Xijbr to denote the allele frequency for the ith sample, the jth chromosomal position, the base (A, T, G, or C), and the rth replicated experiment. If the total depth was <100× at a given ijr, we treated it as missing, irrespective of the b value. We defined the background noise value (Nijb) as the average of the absolute differences between replicated data for all combinations:
Each variant allele frequency was tested by Z with a mean of
and a variance of
when replicates were available for only n samples.
Data Deposition and Access
The seven full-length and one partial virus genomes were deposited at GenBank (http://www.ncbi.nlm.nih.gov/genbank/) under accession numbers KX034093–KX034100. Raw sequencing data were deposited in the Sequence Read Archive (SRA; http://www.ncbi.nlm.nih.gov/sra) under accession number SRP072910.
This study was approved by The Samsung Medical Center (Seoul, South Korea) institutional review board (IRB) (no. 2015-07-005) and all methods were carried out in accordance with the approved guidelines. The IRB approved a waiver to document informed consent because the study planned to (1) use residual viral RNA samples without further collection of clinical samples, (2) analyze only viral RNA without human genome analysis, and (3) anonymize and de-identify the specimen and patient information before the analysis.
We thank the technical staff of the Samsung Genome Institute for next-generation sequencing.
D.P. and H.J.H. managed data processing, designed the experiment, and coordinated the analyses. Y.J.K. designed the experiment. Y.J.K. and H.-J.J. performed the experiments. D.-S.S. performed data processing, analyses, and statistical tests. E.-H.I. analyzed the consensus sequence data. J.-W.K., N.Y.L., E.-S.K., C.I.K., and D.R.C. collected the clinical data and analyzed the data. J.-H.A., S.S.C., and Y.-J.K. advised on analyses and the manuscript. K.R.P., C.-S.K., and W.-Y.P. conceived of the study and participated in its design. D.P., H.J.H., C.-S.K., and W.-Y.P. wrote the manuscript. All authors read and approved the final manuscript.
This study was supported by a research grant from the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (HI13C2096 to W.-Y.P.).
Competing Interest Statement
The authors have declared no competing interest.