High-Throughput Sequencing (HTS), also known as Next-Generation Sequencing (NGS), has many advantages for pathogen detection as compared to traditional methods such as Polymerase Chain Reaction (PCR), serological assays, and/or culture-based methods. Metagenomic sequencing is the high-throughput sequencing of nucleic acid from complex samples rather than from purified microorganisms. Metagenomic sequencing is much less biased than other methods and allows for the detection of fastidious or nonculturable organisms as well as multiple unrelated pathogens within a single sample. Moreover, detection via HTS is much less susceptible to false-negative results caused by antigenic drift or signature erosion. Despite these advantages, one of the technical challenges encountered with respect to metagenomic sequencing is obtaining adequate depth and breadth of coverage from pathogens like RNA viruses that i) typically have small genomes, and ii) are typically present at low titers amidst the background ‘noise’ of the host and commensals. Genome size directly affects sensitivity of detection by HTS because the sampling of sequence fragments within a sample depends on the prevalence of those fragments and organisms with larger genomes typically contribute more fragments, therefore being sampled more often than organisms with smaller genomes. In other words, organisms with larger genomes have the potential to contribute a larger proportion to the overall number of sequencing reads even when the plaque-forming units (PFU), or colony-forming units (CFU) in the case of bacteria, are equivalent to that of an organism with a smaller sized genome.
Although conventional shotgun sequencing allows for the detection of all domains of life, it rarely returns robust coverage of a small viral genome when taken from a very complex sample. A variety of possible strategies exist to enhance the sensitivity of HTS for virus detection and characterization, including purification of specific viral fractions by physical methods such as filtration and ultracentrifugation, amplicon-based target enrichment, and hybridization-based target enrichment. Purification of viral fractions is ideal in some cases, although it can be laborious, and for certain sized samples (for instance clinical samples of very limited volume) it may not be realistic. The use of hybridization-based target enrichment could be preferable to the aforementioned technologies because it has the potential to yield sequence data covering the entire genome of multiple viruses with just one sequencing reaction by using genome-wide probes designed against multiple viruses to specifically select for viral cDNA prior to sequencing. Amplicon sequencing of viral genomes is a technique that has been widely used, but it has some disadvantages, which were articulated by Metsky et al. in a recent study of Zika virus (ZIKV). First, traditional amplicon sequencing typically requires technically challenging normalization and pooling of individual amplicons to cover the entire genome of one specific virus. However, recently a protocol was published for efficient amplicon primer design and multiplex amplicon generation in a single tube for sequencing in the MinION or Illumina platforms. Although this method obviates the amplicon normalization and pooling steps and is effective for producing whole genome sequence data from a low titer ZIKV sample, this method has not been demonstrated for production of whole genome sequence data for multiple diverse viruses from a single complex sample. Additionally, amplicon sequencing typically requires as much as 40 cycles of PCR amplification [4–6], which can introduce sequence errors. Furthermore, amplicon-based sequencing is vulnerable to false negative results caused by mutations in primer binding sites, as was recently demonstrated for Dengue virus (DENV) as well as false positive variant results possibly caused by low and/or uneven coverage. By contrast, the use of probes tiled along the entire length of a viral genome to hybridize and select for virus-specific fragments has the potential to produce less false-negative pathogen detection results by virtue of many more potential binding sites along an individual genome and resulting more uniform coverage.
Ebola virus (EBOV) is one specific example of a pathogen for which false-negative PCR results can have devastating consequences and for which available PCR-based assay effectiveness has been shown to be affected by drift. Therefore, a panel of 80-mer oligonucleotide probes designed against eight Filovirus genomes was recently used for post-sequencing library enrichment in a HTS-based study of a recent EBOV outbreak in West Africa and in an investigation of potential genetic variation of EBOV in experimentally infected nonhuman primates [9–11]. In this protocol, viral enrichment is coupled with the RNA Access kit, developed by Illumina, Inc. The technical advancements of the RNA Access kit had already enabled the sequencing of previously unsequencable materials such as those of low concentration and formalin-fixed paraffin-embedded (FFPE) tissue, and now this protocol has been employed not only for the detection and characterization of EBOV from clinical samples but also for detection and characterization of respiratory viruses in clinical samples [13, 14]. In general, hybridization-based viral target enrichment has been successfully employed to characterize viruses found within both contrived samples and clinical samples [3, 9, 10, 13, 15–20]. The performance of the Respiratory Virus Panel (RVP) version of this method, which uses probes for 34 common respiratory viruses in conjunction with the TruSeq RNA Access protocol, was recently investigated and it was demonstrated to work well overall when tested on human clinical samples. Specifically, the authors reported successful enrichment for 30 of 33 human clinical samples tested. Importantly, RT-PCR was conducted on those same samples and Ct values of respiratory viruses in those clinical samples ranged from 21 to 33, which provides a framework for beginning to assess the limits of detection of hybridization-based enrichment sequencing. Herein, we extend this approach by i) expanding this viral probe panel to include viruses of biosurveillance and biodefense concern and ii) employing carefully constructed mock clinical samples to systematically assess this technique’s performance in a variety of conditions, such as deeper multiplexing for cost effectiveness as well as more extensive co-infection scenarios. We demonstrate the sensitivity and reproducibility of hybridization-based viral enrichment sequencing despite virus divergence and we show that this sensitivity is maintained even with extensive multiplexing of samples to decrease cost. Herein we demonstrate that within one reaction tube, this technique can even be used to detect and discriminate between multiple serotypes of a virus within a clinical sample or to detect and discriminate amongst multiple unrelated viruses that present similar clinical symptoms, and we demonstrate this performance at clinically relevant concentrations of virus.
Hybridization-based target enrichment enhances sensitivity of HTS for detection of virus in complex environmental samples
Enhanced detection of viruses from various clinical sample types using filovirus- or respiratory virus-specific probes has recently been demonstrated [9, 10, 13]. To expand the range of viruses that could be detected, in this study those two probe panels were combined with new probes for 41 additional viruses that are of biosurveillance and biodefense concern, for a full panel targeting 83 diverse viruses (Table 1). In order to test this newly expanded probe panel and to specifically assess the effect of hybridization-based viral enrichment on the sensitivity of HTS for detection of a single virus within a complex environmental sample, commercial bat guano was spiked with increasing concentrations of Influenza virus (IFV). Spiked samples were split into two parts each, with each part being processed in parallel with unbiased, shotgun sequencing versus target enrichment sequencing using an expanded panel of probes.
As expected, a dose-dependent effect in the proportion of sequencing reads derived from IFV was observed as the number of spiked genome copies increased (Fig. 1a and Additional file 1: Table S1), in both the unbiased shotgun sequence data as well as the virus enriched sequence data. However, in this context, hybridization-based target enrichment resulted in approximately 20- to 100-fold more sensitivity for detection of IFV as compared to detection via unbiased, shotgun sequencing. At the lowest concentration tested (1250 genome equivalents (GE) per mL), only 0.5% of sequencing reads produced by unbiased shotgun sequencing were derived from IFV (‘on target’ reads), whereas by stark contrast, the majority of reads produced by target enrichment sequencing (54.4%) were derived from IFV.
Given the dramatic increase in sensitivity observed when complex samples were spiked with an individual virus’s genetic material and subjected to target enrichment, we next sought to evaluate whether these effects would still be observed in the presence of an additional virus and at lower concentrations of IFV gRNA overall. Therefore, IFV gRNA was spiked into total RNA derived from Middle Eastern Respiratory Syndrome Coronavirus (MERS-CoV) cell culture lysate at an overall lower range of increasing concentrations than IFV was spiked in the prior experiment. As before, the samples were aliquoted into two parts that were processed by each method. In these synthetic co-infection samples, target enrichment sequencing resulted in a simultaneous increase in sensitivity for both viruses as compared to unbiased, shotgun sequencing (Fig. 1b). As expected, a constant high proportion of reads mapping to MERS-CoV was observed and the proportion of reads mapping to IFV increased in a dose-dependent fashion with the number of genome equivalents spiked (Additional file 2: Table S2).
Detection and discrimination of related viruses in clinical samples
We next tested the sensitivity for detection of three clinically-relevant viruses that can co-circulate in tropical regions, can present with similar symptoms, and can be difficult to detect at low titers. Mock clinical samples were constructed containing combinations of ZIKV, CHIKV, and DENV at loads that correlate with real clinical loads from human specimens. Briefly, varying titers of ZIKV, Dengue virus 2 (DENV-2), and Chikungunya virus (CHIKV) were spiked into RNA extracted from human serum, in duplicate, to create synthetic co-infection samples. Negative control samples consisted solely of RNA extracted from human serum. Viruses were spiked-in at concentrations corresponding to Ct values from standard curves generated via RT-qPCR. The targeted spike-in values were chosen based on reports in the literature for clinical samples containing each virus to mimic a realistic co-infection scenario [22–25]. Given that in the literature there is at least one report of clinical samples being probed singly rather than pooled and probed and it is not well known how pooling may affect virus detection levels, in this experiment we also sought to evaluate whether probing singly or within a pool would affect our ability to identify virus. Therefore, total RNA extracted from these samples was probed singly (“pool of 1”) and also pooled in groups of four and 12 with singly-spiked and mock-spiked serum samples consisting of the other components of the pool. The resulting sequence reads were mapped to the reference genomes for each of these three viruses. In all cases, the three co-infecting viruses were able to be detected in each sample at relatively consistent proportions regardless of the number of samples within a pool (Fig. 2a). Even DENV-2 was detectable within each sample it was spiked, despite the low concentration of viral RNA (estimated 100 genome equivalents per mL).
Although each targeted virus was represented by enough sequencing reads to be easily detected, there were differences in the depth and breadth of genome coverage observed. Whereas the full CHIKV genome was recovered from all spiked samples at a very high depth of coverage (Fig. 2c), in the case of DENV-2, an average of 92.2% of the genome was recovered from co-infected samples (Fig. 2d) and the average ZIKV linear genome coverage was lower, at 68.8% (Fig. 2b). These patterns in coverage were similar for both replicates (Additional file 3: Figure S1). Overall, the proportion of reads mapping to a given virus was consistent and reproducible regardless of whether a sample was probed singly or probed within a group of four or 12.
In addition to evaluating whether sensitivity and reproducibility are maintained despite multiplexing in pools of four and 12, we also sought to evaluate the probe panel’s performance in the context of strain-level, and even species-level, genetic variation as well as differing concentrations of viral genetic material. Specifically, synthetic clinical samples were also constructed to contain a different strain of ZIKV than the strain the probes were designed against (strain R116265 rather than strain MR766, which is the strain whose reference genome was used for probe design; Fig. 3a and b) and a different species of Human Adenovirus (HAdV) than the probes were designed to target (HAdV-51, a member of species D; as opposed to species C, B, and E, which the probes target specifically; Fig. 3g and h). These samples were also constructed to include biological replicates and were probed singly or in pools of four or 12. In all cases, the spiked-in virus was detectable, although there was some variation in depth of coverage among multiplexed samples. As might be expected, samples that were multiplexed in sets of 12 yielded the lowest depth of coverage compared to samples that were multiplexed in sets of four or probed singly (Fig. 3b, d, f). The target genomes were completely covered in the majority of on-target samples. The exception to this rich, consistent coverage included both strains of ZIKV, which, although both were detectable, did not achieve 100% linear coverage and exhibited lower depth of coverage than the other viruses that were spiked in at similar levels (Fig. 3b).
It should be noted that in this experiment, both purified RNA as well as total nucleic acid samples containing HAdV-4 and HAdV-51 genetic material were processed and sequenced (Fig. 3g and h). The total nucleic acid samples were processed without a DNase step to allow for potential detection of both the DNA viral genome as well as viral transcripts. In the case of both HAdV-4 (species E) and HAdV-51 (species D, not targeted specifically by probes), the vast majority of the resulting sequencing reads were virus-specific and the reads were well-distributed along the length of the genome in coding regions, and in the case of the total nucleic acid samples, noncoding regions as well (Fig. 3h), a phenomenon that was consistent between replicates (replicate coverage data is shown in Additional file 4: Figure S2).
Strain-specific detection of DENV at titers below limit of detection by conventional shotgun HTS or amplicon-based sequencing
It can be difficult to detect DENV-1 and DENV-2 in clinical samples when the Ct value crosses above 29. Therefore, spiked samples were created using two serotypes of DENV with Ct values corresponding to low titer, and the samples were subjected to hybridization-based enrichment and sequencing. The resulting sequence reads were found to cover the entirety of each target genome, even for the samples corresponding to Ct value 32 (estimated 1000 genome equivalents per mL). A dose-dependent response was observed in the percentage of DENV-specific reads as the Ct value decreased (Fig. 4a). For each serotype, the depth of coverage was greater than 50x even when the Ct value crossed 29 (Fig. 4b and c). As expected, the remaining reads that did not map to DENV-1 or DENV-2 were derived from human genes in spiked-in human serum RNA extract that were pulled down by the control probes, as well as the sequencing control library for PhiX.
A major challenge faced in virus detection as well as virus sequencing is the difficulty to detect divergent strains of viruses typically present at low titers amongst a robust host or environmental background. Small viral genomes present at low concentrations are effectively drowned out by signal from host nucleic acid and from commensal microorganisms. A variety of methods have been employed to increase the viral signal in high-throughput sequence data, including amplicon sequencing, but for viruses like DENV, with its genome of less than 11 kb in size, even amplicon sequencing is regarded as an inefficient approach for samples with Ct values of 29 or higher. Such limitations have been of particular concern for U.S. Department of Defense (DoD) laboratories tasked with biosurveillance and biodefense activities in regions with limited material resources and human expertise. Part of the motivation for this effort was to provide DoD laboratories operating in austere environments new tools aimed at enhancing on-site sequencing capacity when engaged in Force Health Protection (FHP) activities.
We demonstrate here that hybridization-based viral target enrichment yields robust coverage of small genomes from clinical samples, even yielding full-length, deeply covered genomes at concentrations whereby current amplicon sequencing protocols may be expected to fail. Moreover, we demonstrate that hybridization-based target enrichment can allow for not only detection, but also genetic characterization such as strain-level discrimination, even at very low concentrations of virus. The capability to detect and discriminate between multiple serotypes of a virus within a complex sample at clinically relevant concentrations by using this enrichment method increases the utility of high throughput sequencing for biosurveillance and for infectious disease diagnostics. For both biosurveillance and clinical sequencing, assay cost and time are important considerations. We have demonstrated that more extensive pooling and multiplexing can be performed to reduce cost and time without sacrificing the assay’s ability to detect at least two strains of related virus and a variety of unrelated viruses in one sequencing reaction.
To date, the published viral target enrichment studies vary in focus and include characterization of EBOV during a recent outbreak in West Africa [9, 10] and detection of multiple viral families within clinical samples [13, 16]. While the probes employed in the studies published to date vary in length from 50- to 120-mers, enrichment methods also can differ by the number of probes and target viruses included in a set. An additional potential protocol difference is the number of samples pooled, which ranges from a single sample to 12 [9, 10, 15]. The current recommendation by Illumina for viral enrichment is to pool four samples.
The experiments described here systematically test enrichment of a single library as well as pools of four or 12 libraries and include a variety of titers of as many as three viruses within a single sample and as many as 12 samples within an enriched pool. For all conditions, even with more extensive pooling and multiplexing, we observed a dose-dependent response to varying Ct values even in co-infected clinical samples. A dose dependent response was also observed by O’Flaherty et al. in two co-infected samples containing Respiratory Syncytial virus and human Coronavirus OC43 spiked-in each at Ct 28 or 32. Interestingly, although there was the expected dose-dependent effect on the proportion of sequencing reads derived from IFV as the concentration of spiked IFV gRNA increased, there was a slight decrease in the proportion of MERS-CoV-derived sequencing reads in samples at the upper end of the IFV gRNA concentration range. This was not expected given that MERS-CoV was present in each replicate at a constant, high level. We hypothesize that this may be due to saturation of the streptavidin beads used to capture probe-cDNA hybridized fragments. Further experimentation will be required to test that hypothesis.
Efficacy of probes varies by homology to the viral target, as evidenced by our results and those in the literature [13, 15]. For example, the reference sequence used to design the probe set for DENV-1 exhibits 74% nucleotide identity over 35% of the length of the closest sequenced reference for the DENV-2 strain that was spiked. It is possible that this overlap, which is not shared by the DENV-2 probes and DENV-1 spike-in, contributed to an overrepresentation of DENV-1 reads in the experiments presented in Figs. 3 and 4. Additionally, by comparison to the other richly covered strains of virus tested in multiplexed samples, there was an underwhelming coverage for both strains of ZIKV. It is possible that the quality of RNA from these viruses was less than the other RNA spike-ins, or that the probe panel for ZIKV is less efficacious when used in combination with the entirety of the probe set. The probes for ZIKV were synthesized and added later after all the other probes were combined (in response to the recent outbreak) and therefore it is possible that the comparatively lower performance of the ZIKV probes is due to a difference in quantitation of the ZIKV probe set.
We observed that HAdV-51 (species D) genetic material was efficiently enriched even though the only adenoviruses used to design the probe panel were species B, C, and E. This experiment indicates that the protocol works as well for this particular DNA virus as it does for RNA viruses. Limits of detection may vary by viral target, which may explain why previously published experiments showed differences between DNA and RNA viruses. Nucleotide identity between human adenovirus species D (the species to which HAdV-51 belongs) and the other human adenovirus species HAdV-A, B, C, and E was reported by Kaneko et al. to range from 58.73 to 69.35%. In our study, the probe panel containing probes for species HAdV-B, C, and E effectively enriched for the entirety of the HAdV-D genome. This suggests that when using long (80-mer) probes designed against several species of virus, related non-targeted species may also be enriched without being specifically included in the panel, if the nucleotide identity among them is at least 60–70%, and if multiple related species are targeted by the probes in the panel (in this case three species). This cross-reactivity for related human pathogens could prove to be a useful feature, by allowing for enrichment of more relevant viruses without added cost spent to increase the number of probes.
Our findings demonstrate that breadth of coverage does not suffer from extensive pooling but that deeper depth of coverage is gained by limiting the number of samples pooled. Extensive pooling makes hybridization-based enrichment sequencing more economical. Viral target enrichment could be applied as an economical approach to sequencing viruses known to mutate quickly and therefore evade other assays, fastidious organisms, or complex samples of limited volume. For example, this method could be prescribed to a scenario in which multiple serotypes of a virus such as DENV are expected to be present in a sample but detection is prohibited via conventional methods such as amplicon sequencing due to low titers. Viral target enrichment designed for a broad panel of targets could also be useful to the infectious disease field by enabling detection of low-titer viruses present in clinical samples taken from patients suffering from symptoms of unknown etiology. Another applicable use of a broad probe panel could be to perform environmental sampling. The aforementioned applications often involve complex samples of limited volume, for which this method is ideal. An important caveat to this approach is that while viral target enrichment is an economical method by which to reduce background noise in a metagenomic sample, probe design requires prior knowledge of the closest-sequenced genome for each viral target. Amplicon sequencing may be the best approach for previously known samples and unbiased whole shotgun sequencing may be more appropriate for a virus-rich sample. None of these approaches obviates the use of amplification by polymerase chain reaction or the potential introduction of sequence errors, and so standard quality analyses by computational methods should always be employed. As the development and testing of probe sets for viral target enrichment expands and continues, the application of this technique could make HTS more economical for routine use in Force Health Protection activities including biosurveillance, biodefense and outbreak investigations.
Preparation of contrived metagenomic samples and nucleic acid extraction and quality control
IFV (H1N1) particles (A/Swine/Iowa/15/30; ATCC, Manassas, VA), MERS-CoV RNA (Jordan-N3/2012; NAMRU-3), HAdV nucleic acid extract from particles (RI-67 and Bom; ATCC, Manassas, VA), ZIKV RNA extract from particles (MR766 and R116265; ATCC, Manassas, VA), CHIKV RNA (gift from LTC Richard Jarman, Walter Reed Army Institute of Research), DENV-1 RNA extract from particles (TH-SMAN; ATCC, Manassas, VA) and DENV-2 RNA extract from particles (New Guinea C; ATCC, Manassas, VA) were spiked into relevant matrices to construct contrived metagenomic samples for testing.
To prepare guano samples five-gram quantities of commercial Jamaican bat guano (Planet Natural; Bozeman, MT) were placed in 50 mL of sterile-filtered Hank’s Balanced Salt Solution (HBSS), vortexed to mix, centrifuged at 3,100 x g for 10 min, and filtered sequentially through 0.45 μm and 0.22 μm filters prior to spiking with IFV particles. Post-addition of IFV, samples were centrifuged at 39,000 x g for three hours at 10 °C to concentrate spiked and native virus particles, supernatant was removed, and total RNA was extracted from the pellet using the QIAampViral RNA Isolation Kit (QIAGEN; Valencia, CA). After elution in 30 μL buffer AVE, a second elution using 20 μL of the eluate was performed.
To prepare cell culture matrix samples, a nucleic acid spiking approach was used. In this case, decreasing amounts of IFV RNA were spiked into a constant mass of total RNA that had been extracted from Vero cells infected with MERS-CoV. Aliquots of the same Vero cell culture were used for each sample. Genome equivalents of IFV spiked into samples were calculated based on RNA mass extracted from virus particles and genome size.
For serum samples, RNA was extracted from Human Serum (BioIVT, Westbury, NY) and mixed with viral RNA.
Total nucleic acid was extracted from adenovirus particles using the QIAGEN QiAMP MinElute Virus Spin Kit, omitting carrier RNA. The samples were eluted in 24 μl buffer AVE. Viral RNA was extracted from virus particles and human serum using the QIAampViral RNA Isolation Kit as described above. The Qubit double stranded DNA Broad-Range Assay Kit and the Qubit RNA Broad-Range Assay Kit (Thermo Fisher Scientific; Waltham, MA) were used to assay extracts.
Quantitative reverse transcription PCR
For quantitative reverse transcription PCR, SuperScriptIII RT/Platinum Taq Mix (Thermo Fisher Scientific; Waltham, MA), dNTP mix, MgSO4, ROX Reference Dye (Thermo Fisher Scientific; Waltham, MA), and the primers and probes listed in Table 2 were used to assay in the CFX Connect Real-Time PCR Detection System (Bio-Rad; Hercules, CA) using the following conditions: 50 °C for 15 min, 95 °C for two minutes, and 40 cycles of 95 °C for 15 s and 60 °C for one minute. Standard curves for DENV-1, DENV-2, CHIKV and ZIKV were generated from titrated viral RNA extract.
Library preparation, virus enrichment, and sequencing
For virus enriched sequencing, TruSeq RNA Access libraries were created as per manufacturer’s protocol (Illumina; San Diego, CA), with the following two modifications: i) rather than the standard CEX oligonucleotides that are designed for enrichment of human genes, a custom pool of oligonucleotides was used that includes probes along the entire genome length of 83 viruses (Table 1) as well as probes specific for several human house-keeping genes, and ii) in the second PCR amplification, 17 cycles were used rather than ten. Samples were probed singly or in pools of four or 12 and multiplexed for sequencing on the Illumina MiSeq platform using v3 chemistry, 2X75 bp read lengths.
For conventional HTS (shotgun) sequencing, TruSeq libraries were pooled and sequenced on the MiSeq platform using v3 chemistry, 2X300 bp read lengths. For the guano samples, these consisted of an aliquot of each of the TruSeq RNA Access libraries from the step prior to virus enrichment. For the MERS-CoV-Vero cell matrix samples, these consisted of shotgun libraries made with the TruSeq RiboZero Gold Library Preparation Kit (Illumina; San Diego, CA).
Virus enrichment probe design
A composite panel of 80-mer DNA probes was assembled using a previously described panel for respiratory viruses [13, 14, 20], a previously described panel for Filoviruses [9–11], plus an additional panel of newly designed probes for 41 viruses of biosurveillance and biodefense concern, for a total of 19,077 probes. The methods employed for capture oligo design were essentially as described in O’Flaherty at al, although the design varied somewhat across the target viral genomes. For instance, in the case of the previously described respiratory virus panel, the design was focused on coding regions. In general, genomes were tiled with capture oligos in a way so as to avoid low-complexity sequences and repetitive sequences. Probe spacing and overlap vary per virus due to attempts to design probes that cover multiple related virus strains resulting in overlapping tiled design around more variable regions, whereas regions more conserved among multiple strains resulted in probes more or less tiled end-to-end. All probes were biotinylated on the 5′ end. Sequences of viral capture probes are provided in Additional file 5.
Quality control, de novo assembly, taxonomic classification, and reference-based analyses were conducted using EDGE Bioinformatic software v 2.0 with default parameters and host removal of human reference GRCh38 and also with CLC Genomics Workbench v11 (QIAGEN Bioinformatics; Redwood City, CA). The reference mapping parameters in CLC were modified from defalt settings to 0.8 length fraction and 0.8 similarity fraction with global alignment and random mapping of non-specific matches. BLAST was also used to further investigate specific datasets. The ggplot2 R package was used to generate depth of coverage plots.