Diagnostic metagenomics with high-throughput sequencing (HTS) techniques continuously gains importance for broad and swift identification of pathogens in human, animal, and food samples1. While for known pathogens, highly sensitive and specific diagnostic methods like real-time quantitative PCR (qPCR) are in routine use and deliver reliable results, the identification of unrecognized pathogens, meaning unexpected or newly emerging pathogens or pathogens that are only distantly related with known ones, can be very difficult. In this respect, metagenomics using HTS are much more promising than routine diagnostics. Unrecognized pathogens, especially newly emerging zoonoses, may cause serious infectious diseases, and a delay of medical treatment or development of vaccines might have fatal consequences for the affected patients and animal stocks. Such delays can be caused by performing numerous laborious screening tests until the potential pathogen is found instead of a single comprehensive screening test. Prominent cases of emerging infectious diseases caused by novel or varying viruses for instance are the discovery of the Middle East respiratory syndrome (MERS) coronavirus in 20122, the tremendous Ebola outbreak in 20143, the report on a novel zoonotic bornavirus4, or the detection of the new Schmallenberg virus5 affecting domestic and wild ruminants. However, infectious diseases with clinical signs like high fever, diarrhoea, or encephalitis − often life-threatening − can be caused by very different infectious agents6,7, not only viruses. In such puzzling cases, a generic approach that works likewise successful and efficient for all pathogen groups, as sketched in Fig. 1, is essential. In a number of review articles1,8–12, valuable considerations for this approach have been summarized. In addition to the overall workflow outlined in Fig. 1, after metagenomic analysis, it is desirable to confirm the initial sequencing-based suspicion by other methods and in the ideal case by fulfilling the Henle–Loeffler–Koch postulates13 as for example done in case of the Schmallenberg virus5.
Metagenomics for generic pathogen detection, so-called diagnostic metagenomics, in its pure form is a broad and undirected approach to find gene sequences or sequence fragments of infectious agents within sequence data sets of the whole community of a sample generated by high-throughput sequencing. A crucial point when handling diagnostic sequencing approaches is to explicitly distinguish between i) high-throughput well-standardisable routine diagnostics for expected known pathogens and ii) diagnostic metagenomics for all including unrecognized pathogens. The first approach is easier to design based on the spectrum of known sequences, testing with specific oligonucleotide primers for pathogens of known identity14,15, known tropism, and maybe also of a known proportion within a sample. The latter one − used for unrecognized pathogens like in the case of Schmallenberg virus5 − should be a generic approach that can ideally be applied to all samples and pathogens. This approach can indeed be developed and tested using samples of known origin and pathogen content. In case of an emergency, however, there might be no information about nature and proportion of the pathogen and only clinical data could be available, and the sample is a closed book. For these cases, metagenomics is optimally applicable as a sophisticated all-in-one solution even in cases where the nature of the pathogen is not known. If sample preparation is designed to be as unspecific as possible to capture all nucleic acids regardless of their source, this approach is applicable simultaneously for viruses, bacteria, and parasites since all three pathogen groups retain their genetic information in form of nucleic acids. Depending on the sample type, the anticipated pathogen, and the research question, one could extract either DNA or RNA. DNA is suitable for most purposes. However, when metagenomics is used for the detection of unrecognized pathogens, it is recommendable to use RNA as initial template to avoid a priori exclusion of RNA viruses. Targeting RNA will not only capture all cellular organisms but also many relevant RNA viruses, e.g., Coronaviridae (SARS and MERS coronavirus), Filoviridae (Ebola virus), Flaviviridae (Zika virus, hepatitis C virus, tick-borne encephalitis virus), Orthomyxoviridae (Influenza A virus), or Paramyxoviridae (measles virus).
However, for various matrices and matrix-pathogen combinations, established and validated protocols are missing. Previous diagnostic metagenomics studies dealt with selected sample types (e.g., stool16,17; intraocular fluids18) or were focused to specific pathogen groups (viruses, bacteria, or parasites). Therefore, the improvement and harmonization of pathogen-independent metagenomics to be used in human and animal health and food safety19,20 is necessary. To apply our metagenomics workflow that was originally developed for the detection of viruses4,5 to other pathogen groups, namely bacteria and parasites, we tested, refined, and verified the protocols. For that purpose, as much as feasible, different sample types were tested using virus-, bacteria-, or parasite-containing as well as uninfected control samples. We also included conventional food samples − untreated and highly processed − since they were seldom handled and evaluated for metagenomics before9. As a result of this effort, we present here in detail a well-harmonized sample processing workflow for diagnostic metagenomics without dedicated amplification steps enabling the detection of diverse pathogens in a broad range of different matrices, applicable with both Illumina or Ion Torrent platforms.
Important characteristics of the procedure
The sample processing workflow as depicted in Fig. 2 and described in detail in Supplementary File 1, is applicable with RNA or DNA as input and has been proven with respect to diagnostic metagenomics in veterinary medicine4,5,21–27. Here, the workflow was further tested for different sample types and pathogens as described below.
The workflow starts with a sample disintegration followed by RNA extraction. With only a few exceptions, which are discussed later, the provided protocol is suitable for the extraction of RNA from a broad range of sample types. The protocol proceeds from purified RNA to the final sequencing library with only a single intermediate purification step. This ensures maximum preservation of the information content of the sample.
Routinely, 500 ng (100–1,000 ng) purified total RNA are used for the synthesis of double stranded cDNA in a one-tube reaction, but the protocol is also suitable for extremely low input of RNA, even if the amount cannot be determined. Preferably, RNA solutions with concentrations lower than 10 ng/µl should be concentrated (option in Fig. 2.; Supplementary File 1, Procedure, optional steps 31–38 and Troubleshooting). After cDNA synthesis, the DNA is fragmented without prior purification to avoid loss of material.
Depending on the selected sequencing platform, we provide two possibilities for library preparation (Fig. 2A), one detailed manual procedure for sequencing with Ion Torrent (Supplementary File 1, Procedure, steps 59–77) and one automated procedure for sequencing with Illumina MiSeq (Supplementary File 1, Procedure, steps 78–92). For optimal sequencing results, the library fragment size should be within the specified range of the used sequencing platform and protocols. For both presented sequencing platforms, we apply a target peak size of 550 bp with a size range of 300–1,000 bp. This is achieved with a single two-step size selection procedure using solid-phase paramagnetic bead technology. Because the size of the bound DNA depends on the buffer concentration, calibration of the paramagnetic beads (Supplementary File 1, Reagent setup) is a prerequisite for a reproducible size selection.
Sample disintegration to extract high quality nucleic acids
Since the availability of the nucleic acids for library preparation is the determinant for the prospect of success of the effort, we compared three different sample disintegration techniques for their suitability to ensure the release of the nucleic acids from the sample material. The applied methods were two bead-beating techniques, one usually conducted in lysis buffer at room temperature, here represented by the TissueLyser, and one conducted with deep-frozen samples, here represented by the Micro-Dismembrator. The third applied technique was cryofracturing using the cryoPREP device. These techniques were tested using a number of different sample matrices and hard-to-break target species. Figure 3A shows the RNA quality achieved with the three techniques in disintegrating suspensions of exponentially growing bacterial cells or hard-shelled Gram-positive endospores (Fig. 3A, see also Supplementary File 1, Fig. A1). Clear bands of small and large subunits of ribosomal RNA, suggestive of high quality nucleic acid, were observed using the cryoPREP impactor or the Micro-Dismembrator grinding mill. In addition, we found a statistically significant (Fisher’s Exact Test, p ≤ 2.2E-16) increase in the proportions of mycobacterial reads in datasets derived from one tap water sample processed with cryoPREP compared to the dataset for the same sample without cryoPREP treatment (compare graphs for library IDs 2093 and 2094 in Supplementary File 2). Likewise, comparing the same datasets, we found statistically significant increases of obligate intracellular Coxiella species (3-fold, p ≤ 2.2E-16), of Parachlamydia-related species of amoebae (2-fold, p ≤ 2.39E-12), of Legionella species (7-fold, p ≤ 2.2E-16), and of Gram-positive Bacillaceae (5-fold, p ≤ 2.2E-16). Like for pure bacterial suspensions shown above, in case of pig faeces, the TissueLyser-disintegrated sample also showed the strongest degradation of RNA i.e. very short RNA fragments (Fig. 3B). In contrast, when pools of midges (insect vectors of orthobunyaviruses like Schmallenberg virus or orbiviruses like bluetongue virus) were disintegrated, cryoPREP and TissueLyser resulted in high quality RNA but not the Micro-Dismembrator (Fig. 3C). Moreover, Mycobacterium-containing tissues (lymph nodes and intestine) were used to assess the effectiveness of disintegration using the cryoPREP in comparison with the TissueLyser. The Cq values obtained with DNA extracted after cryoPREP disintegration were for a number of samples substantially lower than those after TissueLyser treatment (Table 1). In summary, generally the best results were achieved for all tested matrices and pathogens with deep-frozen samples using either the Micro-Dismembrator or the cryoPREP device.
Workflow verification with samples containing known verified pathogens
For the verification of the workflow, various routine diagnostic samples with pre-diagnosed pathogens were analysed. These samples comprised liquids, tissues, faeces, and foods. Table 2 summarizes results from 15 previously published and 12 new samples. We observed a substantial variation regarding the portion of reads representing the respective expected pathogen. Clearly, the observed variation is mainly caused by the strong background (see Supplementary File 2 for the 100 most abundant families found in each data set) naturally comprised in organ material and other samples (e.g., faeces or different foods) that reduces the pathogen signal in the dataset. Despite of this background, using the provided metagenomics sample processing workflow and a subsequent RIEMS28 analysis, it was in all but two cases possible to detect the expected pathogens even if the pathogen load was rather low like in case of chest cavity fluid from a VSBV-1-infected squirrel or for the MAP infected lymph node (Table 2). In the former case, the Cq for VSBV-1 was around 304 and 2 reads representing VSBV-1 were detected. In the latter, the Cq was around 27 after cryoPREP disintegration or nearly 41 after TissueLyser treatment and 0.001% of the reads represented MAP. The workflow works well not only for viral and bacterial pathogens but was also suitable for parasite detection (stool and wild boar in Table 2), albeit the proportion of parasite reads was very low. In case of ethanol-fixed stool samples, Blastocystis (0.04%) and Giardia (0.0009%) could be detected in a library generated from RNA template yielding little more than 230,000 reads. For the mycobacteria-containing samples (compare Table 1), we detected Mycobacterium reads only in the higher laded lymph node sample (Cq 26.7; 23 Mycobacterium reads in a total of 2.36E + 6 reads) in contrast to the intestine sample (Cq 36.8; 0 Mycobacterium reads in a total of 1.68E + 6 reads). For two samples presented in Table 2 (chicken liver with Sendai virus, library IDs 1949, 1950, 1951 and wild boar muscle with liver fluke, library IDs 2019 and 2043), technical replicates were processed and sequenced using the present workflow. In both cases, the results are congruent with regard to both the portion of pathogen and unclassified reads. Noteworthy, in all analysed samples, the proportion of unclassifiable reads was very low (Table 2).
Determination of the reagent specific background
In order to determine the inherent background of the workflow originating from the used consumables, we extracted both DNA and RNA from selected consumables and prepared and sequenced libraries. With a single exception (library generated from DNA extracted from pooled enzymes of the cDNA synthesis kit, 2.2E +6 reads), sequencing of the libraries generated from the selected consumables resulted in only a few reads (428–4,777 reads) by sequencing the complete extracted material. In the RIEMS28 analysed data sets, viral, prokaryotic, and eukaryotic reads were detected (compare Supplementary File 2) with the most frequently detected viral sequences belonging to the Retroviridae. All RNA-derived bacterial profiles were rather similar (Fig. 4). Bacterial groups with the highest read abundances were Enterobacteriaceae (7–21%) and Pseudomonadaceae (3–7%), followed by Burkholderiaceae (2–6%), Propionibacteriaceae (2–4%), Comamonadaceae (1–4%), Bradyrhizobiaceae (about 1%), and Staphylococcaceae (Fig. 4). In contrast, the profiles obtained for the DNA datasets differed substantially between the different reagents. Moreover, the bacterial profiles determined for the DNA and RNA derived data for both the DNase and the RNeasy column were clearly distinguishable. While in the RNA datasets sequences related to the Enterobacteriaceae and the Pseudomonadaceae clearly dominated, in both mentioned DNA samples, the highest proportion was found to be related to the Comamonadaceae (about 4%). Contrarily to the situation in DNase and RNeasy column, the DNA and RNA based bacterial profiles obtained for the pooled enzymes from the cDNA synthesis kit were similar, also resembling the RNA derived profiles obtained for the DNase and the RNeasy column, especially with regard to Enterobacteriaceae and Pseudomonadaceae. All datasets contained eukaryotic reads, mostly mammalian sequences indicating contamination probably from the production process and/or laboratory handling.
Workflow assessment with various matrices with initially unknown pathogen content
Table 3 lists results obtained from various sample matrices with unrecognized pathogen content from 13 published and 15 new samples. A major group of samples are typical diagnostic materials like different tissues, faeces, and liquids. In a number of the presented examples, novel pathogens were detected using the presented workflow and subsequently confirmed by other methods. In the other cases, sequences putatively representing pathogens were detected (see Table 3); however, these were not confirmed yet.
Arthropod vectors represent an individual type of sample matrix that might require a special treatment to ensure successful analysis (compare Fig. 3). Therefore, alternative homogenization and extraction options are provided here for ticks and midges (see options A and B of the Procedure). Different tick species (Ixodes ricinus, Ornithodoros porcinus, and Rhipicephalus bursa) were subjected to the described procedure and in the generated datasets, Rickettsia spp. were re-detected (previously detected via PCR29). In addition, the known tick-transmitted bacterial human and animal pathogens Anaplasma spp., Francisella spp. and Mycobacterium spp. were found (compare results for library IDs 1163 and 1164 in Supplementary File 2).
In addition to the aforementioned specimens, we also tested a number of highly processed food samples (Table 3). In all cases (meat loaf, pizza, crude ham), the resulting DNA libraries were of high quality, allowing the taxonomic classification of the vast majority (>98.5%) of the obtained reads. The proportion of unclassified reads ranged between 0.7% and 1.5%. As expected, in case of crude ham (see Table 3) we did not detect any sequences potentially representing pathogens within a dataset of roughly 600,000 reads (Supplementary File 2). Of these reads, RIEMS28 classified the vast majority of the reads as mammalian sequences and most of the remainder (558 reads) as Lactobacillus spp. Roughly 99% of the viral sequences detected in the crude ham were eukaryotic rRNA sequences misclassified as Arenavirus sequences. Further putative viral sequences represented phages.
Here, a wide array of matrices was processed successfully, as proven by the presented results. Nevertheless, challenges for sampling and sample processing remain. Materials that change their physical condition upon deep freezing, like e.g. gummy bears that become glass-like and spawn sharp-edged shivers when being cryofractured, ultimately destroying the Covaris TissueTUBE. Also, samples with a low pH that interferes with nucleic acid extraction (compare Table 2, norovirus polluted frozen berries, library ID 1962), appeared to pose a problem without pH adjustment, although failure to detect norovirus might have had other reasons as it was already reported that noroviruses are hard to detect by metagenomic sequencing42.
Two main areas generally impose problems for metagenomics pathogen detection; namely, the necessary reference sequences available in public databases (see next paragraph) and the available sample materials. While the former needs a concerted action of the scientific community to improve, the latter is in the hands of the individual labs. Usually, the best-suited raw material for metagenomic analysis is fresh or fresh-frozen and untreated, since nucleic acid integrity is compromised by prolonged storage or fixation. Although pathogen detection may still be possible despite fixation (compare library ID 2178 in Table 2), awareness needs to be raised that untreated aliquots of samples should be stored deep-frozen when intending metagenomic analysis. Highly processed food samples may likewise be difficult since their RNA content and integrity seems to be inherently low, probably due to the processing. Irrespective of processing, some foods impose difficulties as for instance fatty matrices like cheese or milk, or fruits with low pH (see above). If compatible with the respective workflow, countermeasures like defatting or pH adjustment may be introduced. Hard-to-break matrices like feathers, skin, or plant materials with high fibre content (e.g., oat-flakes) may need a dedicated assessment of the suitability of disintegration procedures. Applying our workflow, the detection of yet unconfirmed plant pathogens in the obtained datasets from both untreated and processed foods (compare Table 3) was possible. Regardless of sample type and workflow, problems can arise when faced with low pathogen loads and hence disadvantageous pathogen-host ratio, like in the MAP samples (Table 2, library IDs 2099 and 2100) and the fresh-frozen liver contaminated with liver fluke (Table 2, library IDs 1949-50). In the latter case, the problem was potentially caused by the patchy distribution of this relatively large parasite, resulting in samples with varying pathogen load. The problem of unfavourable pathogen/host ratios might be compensated by enhanced sequencing depth (see8) or host depletion/target enrichment. As pointed out already, the latter may lead to loss of information. In case of low total sample input, the DNA and/or RNA potentially contaminating the used consumables can significantly outcompete the target nucleic acids. Sequencing of kit components used in our workflow revealed retroviral sequences from the cDNA synthesis kit and bacteria also previously found in blank controls43 (Fig. 4). Moreover, cross-contamination of libraries due to adapter swapping44 or carry-over between runs8,45 has to be considered.
For comprehensive and reliable metagenomic analysis, reliable reference sequences are required. However, according to Klimke et al.46, “different annotation procedures, numerous databases, and a diminishing percentage of experimentally determined gene functions have resulted in a spectrum of annotation quality”. Many organisms (hosts, symbionts, and pathogens) have not yet been sequenced and hence no reference sequences are available. Furthermore, the taxonomic identity associated with some sequences in public repositories47,48 have been found to be questionable. This appears to be also the case with the Arenavirus reference (Accession KF478765) that leads to frequent false positive detection of Areanviruses (compare Supplementary File 2). This is likely caused by an extension of the viral genome with a ribosomal sequence. These problems need to be solved in the future to further improve the use of metagenomics for pathogen detection.
Building on the previous experience from virus discovery, we extended the use of the presented workflow for the detection of pathogens other than viruses and tested a broad range of (diagnostic) sample materials. The resulting workflow we present is largely pathogen- and matrix-independent, i.e. it is applicable to at least the tested sample matrices and can potentially be used for all pathogen groups. It is important to mention that a key issue in sample preparation is to make all nucleic acids accessible to sequencing, here tried to achieve by using an efficient but gentle disintegration method. We routinely use this approach as “one serves all” analytical framework20 in cases where causative agents of animal diseases and zoonotic infections are unrecognized.
The performance of the overall workflow or of its individual modules was assessed using a spectrum of different matrices that can be grouped into the five categories liquids, faeces, tissue, vectors, and food. The processed samples were mostly diagnostic specimens representating liquids (serum, cell-culture supernatant, bacterial suspensions, swab samples, tap water, and rumen); faeces (pig, bird and human) as example of a complex inhibitor-rich matrix; organs like brain, heart, liver, lymph nodes, kidney, lung, and intestine to test the efficiency of the protocol on tissue; pools of midges and ticks, respectively, representing arthropod vectors; rocket, mushrooms, ham, meat loaf, pizza, strawberries as examples for different foods. In addition, TissueLyser and cryoPREP disintegrations were compared using goat lymph nodes and intestine from animals infected with Mycobacterium avium paratuberculosis (MAP; lymph nodes and intestine) that was available from an approved (Committee on the Ethics of Animal Experiments and the Protection of Animals of the State of Thuringia, Germany; Permit Number: 04-002/12) and previously published animal trial carried out in accordance with relevant guidelines and regulations49.
We used samples containing a pre-diagnosed pathogen (see Table 2) and samples with unrecognized pathogen content (see Table 3). The known pathogens comprised in the samples represented the groups eukaryotic parasites, bacterial pathogens, and viruses. In addition, bacterial suspensions of exponentially growing Bacillus subtilis, Staphylococcus aureus, and Escherichia coli, representing Gram-positive and Gram-negative bacteria, respectively, and an endospore suspension of B. subtilis as example of nucleic acids protected by highly resistant envelopes were processed. For selected samples, a sequencing library was generated according to the Supplementary File 1 (Procedure, steps 48–120) and sequenced following the respective manufacturer’s instructions.
In addition, we sequenced selected consumables used in our workflow to investigate their impact on the final sequencing outcome. The samples are an RNeasy column taken from the RNeasy Kit (Qiagen), the DNase (Qiagen) as used for the workflow, and the enzymes from the cDNA synthesis kit (Roche). The latter are the components “vial 2” (AMV RT), “vial 4” (Protector RNase Inhibitor), “vial 10” (2nd strand enzyme) and “vial 11” (T4 DNA Polymerase). For all samples, we extracted RNA as described and DNA using the QIAamp DNA Mini Kit (Qiagen) and prepared libraries as described in the Supplementary File 1. In addition, as a blank control, an 800-µl water sample (Roth) was processed with the present workflow.
Sample processing procedure
Detailed easy-to-follow single protocols (modules) for all steps depicted in Fig. 2A including necessary chemicals and important remarks (reagent setup, troubleshooting, anticipated results) are given as Supplementary File 1. In the following, only procedures supplementing the detailed protocol for comparisons are outlined.
We compared different sample disintegration techniques, namely the laboratory grinding mill Micro-Dismembrator (Sartorius, Göttingen, Germany), the TissueLyser (Qiagen, Hilden, Germany), and the cryoPREP impactor (Covaris, Brighton, UK). Sample disintegration using the Micro-Dismembrator was essentially performed as described27. Using the TissueLyser, tubes prepared with the sample material, a steel grinding ball and 200–1000 µl AL buffer (Qiagen, Hilden, Germany) were shaken for 150 s with a frequency of 30 Hz as previously described50,51. With a Micro-Dismembrator (Sartorius, Göttingen, Germany), the samples were ground frozen in liquid nitrogen for 2 min at 2000 rpm in a 3 ml PTFE shaking flask with a 10 mm stainless steel ball and the frozen homogenate was further processed according to the detailed protocol (Supplementary File 1, from step 9). The cryoPREP protocol is given in detail in Supplementary File 1 (Procedure, steps 1–10). RNA was extracted following the detailed protocol and quality was checked with a Bioanalyzer (Agilent) using a RNA 6000 pico assay according to the manufacturer’s instructions. DNA from Mycobacteria containing samples was extracted using the QIAamp DNA Mini Kit (Qiagen, Hilden, Germany) and quantified regarding the content of mycobacterial DNA via real-time PCR (insertions element IS90052).
Bioinformatic analysis of metagenomic datasets
Obtained raw reads were analysed using the software RIEMS28 to get an overview of the taxonomic composition of reads.