Dataset: 11.1K articles from the COVID-19 Open Research Dataset (PMC Open Access subset)
All articles are made available under a Creative Commons or similar license. Specific licensing information for individual articles can be found in the PMC source and CORD-19 metadata
.
More datasets: Wikipedia | CORD-19

Made by DATEXIS (Data Science and Text-based Information Systems) at Beuth University of Applied Sciences Berlin

Deep Learning Technology: Sebastian Arnold, Betty van Aken, Paul Grundmann, Felix A. Gers and Alexander Löser. Learning Contextualized Document Representations for Healthcare Answer Retrieval. The Web Conference 2020 (WWW'20)

Funded by The Federal Ministry for Economic Affairs and Energy; Grant: 01MD19013D, Smart-MD Project, Digital Technologies

# Highlight for Query ‹Bovine coronavirus infectionmedication›

## A binning tool to reconstruct viral haplotypes from assembled contigs

Related work

Although a number of contig binning algorithms have been developed [15–21], they all possess limitations in distinguishing contigs from different viral strains of the same species. Most of the existing contig binning tools for microbiome sequencing data are designed for bacteria. These methods usually estimate the bin number by aligning metagenomic data to a pre-established marker gene database, and then assign assembled contigs to different bins using sequence composition information and read coverage levels. For example, MaxBin uses both tetranucleotide frequencies and contig coverage levels to assign assembled contigs into different bins.

Some binning tools leverage co-abundance of genes across multiple metagenomic samples. The rationale is that if two contigs are from the same bin, their coverage profiles across multiple samples should be highly correlated.

Recently, there are a couple of newly developed tools for strain level analysis from metagenomic data, such as Constrain and StrainPhlAn. Both rely on species identification using clade-specific genes, then zoom in to identify the strains. However, both tools were mainly tested on bacteria.

Our method is designed to cluster contigs produced by existing assembly tools. There are another group of methods conducting haplotype reconstruction via read clustering [22, 23], which groups variant sites obtained by read mapping against reference genomes. These tools don’t usually output contigs and thus do not use contig binning.

Here we present VirBin, a method designed specifically for binning contigs derived from viral quasispecies data. The input to VirBin is a set of contigs derived from assembly tools. The output includes the estimated number of haplotypes, the grouped contigs for each haplotype, and the corresponding relative abundances. Unlike many bacterial contig binning tools that require multiple samples, our method works on a single sample.

Data simulation

HIV haplotype construction: There are many sequenced HIV strains in the HIV Sequence Database. However, many of the strains do not possess sufficiently high similarity to be included in simulated quasispecies. Thus, we use both real and simulated strain sequences to simulate haplotypes of high similarity. Simulated strains were produced by mutating, deleting, or inserting bases at random positions from a real strain in the HIV database. As a result, the five-haplotype dataset contains a HIV-1 strain FJ061 from HIV Sequence Database, 3 simulated haplotypes from FJ061, and another HIV-1 strain FJ066. The sequence similarity between the simulated haplotypes and its originating sequence is 97%. The average sequence similarity between all the five haplotypes is around 93%, which is comparable to the sequence similarity between haplotypes in a mock HIV quasispecies dataset.

Reads simulation using different haplotype abundance distributions: With available HIV haplotypes, simulated reads were generated from them by ART-illumina as error-containing MiSeq paired-end reads, with read length of 250 bp, average insert size of 600 bp, and standard deviation of 150 bp. With the total coverage of 1000-x, three sets of reads are produced using different abundance distributions. The first one is based on the power law equation. The second and the third sets of reads represent challenging cases where different haplotypes have similar abundances, which create difficulties for abundance-based binning algorithms. The abundance differences in the second and third data set are 0.06 and 0.03, respectively. In total, there are 38,914 simulated reads for 5 HIV haplotypes. The relative abundance for five haplotypes in each read set can be found in Table 1. As the total coverage is 1000-x, the sequencing coverage of each haplotype is the product of the total coverage and the relative abundance.

Contig simulation: For each reference genome (denote its length as L), we randomly generated a list of location pairs (p1,p2), where 1≤p1 Each location pair represents a candidate contig’s starting and ending position. Then, in the simulated contigs, we only keep the ones above 500 bp (i.e. p2−p1+1≥500). In addition, we would like to simulate the hard case where the contigs cannot be extended any more using large overlaps. Thus, we sort all the remaining contigs by p1 and remove the ones that have overlaps of size above 100 bp with previous contigs in the sorted list. The five sets of simulated contigs have different N50 values and are referred to as “1000” to “5000”, indicating the upper bound of the contig length in each set. Additional file 1: Table S1 shows the detailed properties of the five sets of contigs. All the simulated data sets can be downloaded from VirBin’s Github repository.

Haplotype number estimation

According to our methods, the haplotype number estimation only depends on the alignment results of contigs. For all five sets of simulated contigs with different average lengths, the consensus window depth of the 50 longest windows is 5 for all. The histogram of window depth for 5 simulated contig sets is shown in Fig. 1. It is clear that window depth 5 dominates longest windows. Thus, the estimated number of haplotypes is 5, reflecting the truth for our data sets. In general, the percentage of windows with depth 5 increases with increasing contig lengths.

Clustering

We applied VirBin to cluster contigs into 5 groups. Since the ground truth about the haplotype membership of each contig is known, we were able to evaluate the clustering results by calculating the precision and recall at the base level. The evaluation results are shown in Fig. 2. The performance of clustering is worst for shortest contig set (denoted as 1000 along the Y-axis). With increasing contig lengths, the clustering performance becomes better for all three different abundance distributions. When the contigs are long, the clustering performance for haplotypes with different abundance distributions is comparable.

The results were compared with MaxBin, which is a binning tool for metagenomic contigs based on tetranucleotide frequencies and reads coverage levels. MaxBin requires marker genes to identify seed contigs for binning. We were able to run the core clustering program of MaxBin by inputting both the number of haplotypes (i.e. 5) and the seed contigs manually. We randomly chose one contig from each haplotype as the seed contig and calculated the contigs’ abundances by mapping reads to them using Bowtie2. Although the haplotype number was explicitly provided to MaxBin, empty clusters can be produced by MaxBin. The results from MaxBin are shown in Fig. 3.

For the shortest contig set, MaxBin only reported two clusters with one contig for each cluster, leaving 59 (96.7%) contigs unclassified. For contigs sets from 2000 - 5000, MaxBin was able to generate five clusters, but with 30% contigs unclassified. The results of MaxBin usually have lower precision or recall values than VirBinİn addition, for contig sets from 1000 - 4000, there are haplotypes without correctly assigned contigs. The lower sensitivity of MaxBin could be caused by its dependency on both sequence composition and contig coverage for clustering. Due to high sequence similarities between viral haplotypes, sequence composition is not informative enough in differentiating contigs from different viral strains. Instead, MaxBin could mistakenly cluster contigs from the homogeneous regions of the viral genome, leading to more chimeric clusters.

StrainPhlAn is also able to to characterize the genetic structure of viral strains in metagenomes. It takes the raw sequencing reads and MetaPhlAn2 database of species-specific reference sequences as input and aims to output the most abundant strain for each sample. However, it failed to detect any viral species at the first step running MetaPhlAn2. ConStrains is another tool designed to identify strain structures from metagenomic data. It uses bowtie2 to map reads to a set of universal genes and infers the within-species strains abundances by utilizing single-nucleotide polymorphism (SNP) patterns. This tool again did not get enough mapped reads to report any strain abundance. And it takes considerable efforts for us to modify their codes for our inputs. Thus, we cannot report the results from StrainPhlAn or ConStrains.

Relative abundance computation: Once the iterative clustering algorithm converges, the abundance of each cluster can be computed as the weighted average abundances for all contigs from this cluster. Fig. 4 compares the known haplotype abundance profiles with our computed ones. There are three read sets with different abundance distributions (Table 1). For each distribution, there are five sets of contigs (Additional file 1: Table S1). Thus, three plots of five curves are presented to compare the ground truth and the computed abundance. In addition, we applied χ2-test to compare the ground-truth distribution and each computed abundance distribution. The p-values from all the tests are larger than 0.99, indicating that the distributions are not statistically different. As MaxBin only correctly clustered several contigs, we did not include the abundance comparison.

Results on assembled contigs

In addition to simulated contigs, we also tested VirBin on assembled contigs by two de novo assembly tools: a generic assembly tool SGA and a viral haplotype reconstruction tool PEHaplo. The assembled contigs were evaluated by MetaQuast and the results are listed in Additional file 1: Table S2. Both PEHaplo and SGA produced enough contigs to cover almost all the five haplotypes. Contigs produced by PEHaplo have larger N50 values than contigs by SGA. While both tools apply overlap graph for assembly, PEHaplo utilizes a paired-end information guided method for path finding, which can potentially connect some of the nodes together. Therefore, it produced longer contigs than SGA. The contigs are paired with haplotypes based on the highest sequence similarity. All of the contigs and their originating haplotypes have similarity of at least 98%.

For all three sets of contigs assembled by PEHaplo and SGA on three sets of reads, the consensus window depth of the 50 longest windows is 5, revealing the actual haplotype number.

Figure 5a presents the clustering results on contigs generated by PEHaplo and SGA. It shows that VirBin achieved good clustering results on contigs assembled by both assembly tools. The clustering results on SGA’s contigs are similar to PEHaplo’s contigs, with both high precision and recall. This observation is consistent with the results on simulated contigs that when the contigs are long enough, VirBin can produce good results.

Again we compared our results with MaxBin. The clustering results of MaxBin on assembled contigs are shown in Fig. 5b. For contigs assembled by PEHaplo, MaxBin correctly clustered all corresponding contigs to the strain FJ061-h2 as the recall is 1.0. However, this cluster also involves many contigs from other strains as the precision value is low.

The comparison between the predicted abundance by VirBin and the ground-truth on two sets of assembled contigs is presented in Additional file 1: Figure S2.

We also simulated reads from 10 haplotypes and tested VirBin on this data set. The data generation and also the detailed results, clustering results is presented in Additional file 1: Table S3 and the relative abundances for each haplotype are shown in Additional file 1: Figure S3, can be found in the Section 3 of Additional file 1.

Mock HIV population MiSeq data set

In this experiment, we applied VirBin to a mock HIV quasispecies data set (SRR961514), sequenced from the mix of five HIV-1 strains (89.6, HXB2, JRCSF, NL43, YU2) with Illumina MiSeq sequencing technology. This data set contains 1,429,988 (250 bp) of reads that cover the five strains to 20,000x. The raw data set was pre-processed with FaQCs/1.3 and Trimmomatic to trim and filter low-quality reads or adapters. The remaining reads were then error-corrected with Karect. After pre-processing, 774,044 reads were left. By mapping pre-processed reads to the available 5 reference genomes by bowtie2, we were able to estimate the abundance for each haplotype as shown in Fig. 6.

We use the contigs assembled by PEHaplo as input for VirBin. PEHaplo produced 24 contigs from the real MiSeq HIV data set that can cover about 92% of the five HIV-1 strains. These contigs have a N50 value of 2223 bp and the longest contig is 9133 bp.

Haplotype number estimation: VirBin was applied to the aligned contigs for haplotype number estimation. All the windows were sorted in descending order of window length. Out of the top 50 windows, 27 contain 5 contigs, 16 contain 6 contigs, and 2 contain 4 contigs. Out of the top 25 windows, 17 contain 5 contigs, 5 contain 6 contigs, and 1 contains 4 contigs. Similar to the simulated data, using the consensus window depth (i.e. 5) correctly predicted the haplotype number.

Clustering results: The clustering algorithm was applied to cluster the contigs into 5 groups. For each contig, its originating haplotype is determined by comparing the contig with all reference genomes. The haplotype with the highest similarity and above 98% is assigned. The outputs of VirBin and MaxBin are shown in Table 2. StrainPhlAn and ConStrains were applied on this real HIV data set. StrainPhlAn was able to identify the HIV species, but could not report any strain information. ConStrains could not align enough reads to marker genes for further strain-level analysis.

Compared to the simulated contigs or assembled contigs using simulated reads, the results of VirBin on the real sequencing data have generally lower sensitivity and precision. There are two major reasons. First, the assembled contigs for real reads are more likely to contain errors. Second, this data set has several haplotypes with very similar average abundances. Referring to Fig. 6, the abundance difference between the 2 least abundant haplotypes is <2%. Thus, the clustering algorithm could mix contigs from these haplotypes.

For the mock data experiment, we also present the recall and precision at contig level in Additional file 1: Table S4.

We again compared the predicted abundance profile with the known one in Fig. 6. The χ2-test output p-value 0.9999 and 0.9995 for VirBin and MaxBin, respectively, indicating that the predicted abundance distributions by VirBin and MaxBin are not statistically different from the ground truth.

Discussion

Overall, VirBin can cluster more contigs into their originating haplotypes than MaxBin. While VirBin focuses on sub-contigs that are more likely unique to one haplotype, MaxBin clusters whole contigs, which could contain regions common to multiple haplotypes and makes read coverage more heterogeneous. In addition, sequence composition-based features such as tetranucleotide frequencies are not effective in distinguishing highly similar viral strains. Our experimental results show that VirBin works better for longer contigs that can cover bigger regions of the underlying genomes. When the genome coverage by the contigs is below 70%, the performance of VirBin deteriorates because it becomes harder to estimate the correct number of haplotypes. In addition, the empirical experience shows that it is difficult to classify two viral strains when the abundance difference between them is below 3%. Thus, although we have demonstrated much better contig binning performance for distinguishing viral haplotypes than other contig binning tools, genome-scale viral haplotype construction is still a challenging problem.

Conclusion

In this work, we presented VirBin, a new contig binning tool for distinguishing contigs from different viral haplotypes with high sequence similarity. By conducting experiments on multiple simulated data sets of different haplotype abundance distribution, different number of haplotypes, and different sets of simulated or assembled contigs, we demonstrated that our tool can produce more accurate clustering results than popular contig binning tools for viral haplotype reconstruction.

Step 1: estimate the number of haplotypes via contig alignment

Although the high similarity between haplotypes presents a barrier to adoption of kmer-based features for distinguishing contigs from different haplotypes, it brings opportunities for haplotype number estimation. With stringent alignment threshold, contigs that can be aligned with each other usually come from the same region of the virus and thus the number of aligned contigs can be carefully used for haplotype number estimation.

We progressively build multiple sequence alignments using contigs’ pairwise alignments. In this step, base-level accuracy of the alignment is not a major concern and thus progressive construction of the alignment between contigs can serve the purpose well. We first sort the contigs by their lengths in descending order. The longest contig will be used as the first reference. All the other contigs will be aligned to the reference using blast+ to generate an alignment profile similar to multiple sequence alignment. Two types of alignments are kept from the output of blast+. One is the alignment that is local to the reference but global to the shorter contigs. The other is overlap alignment, which is the alignment between the suffix/prefix strings of the contigs. If not all the shorter contigs can be aligned to the reference contig, this process will continue by using the second longest contig as the reference until all the contigs are used. Figure 8c shows the alignment between contigs from three haplotypes in Fig. 8a using the longest contig as the reference, which is usually produced for the most abundant haplotype.

Each multiple alignment can be divided into many windows, which are formed whenever there is a change of the sequences in the alignment. We define the number of contigs inside each window as the window depth, d. Based on these definitions, we have the following observations.

When each position of the underlying haplotypes can be covered by at least one contig, d is equal to or larger than the number of haplotypes. Note that the common regions between different haplotypes are regarded as different positions and thus should be covered by different contigs. This conclusion can be proved by contradiction easily.

Fig. 8b shows the contigs satisfying the conditions in the ideal case. There are three haplotypes with different abundances. They only contain mutations at three positions that are far away from each other. Because of the long common regions among them, assembly programs usually won’t be able to recover all the three genomes. Instead, they can generate short but correct contigs. In Fig. 8b, each position in the three haplotypes is covered by at least one contig. In this case, all the windows have depth of at least 3. If every position of a haplotype is only covered by one contig, the windows will have depth N, which is the number of haplotypes. As some positions can be covered by multiple contigs, the overlaps between contigs contribute to window depth larger than N. For example, in Fig. 8b, the middle window contains the overlaps between two contigs from each haplotype and thus has depth 6. In this ideal case, we can choose the smallest window depth as the number of haplotypes in a sample.

In practice though, the assumptions about the contigs’ properties are not always true. Thus, in our implementation, we will use the consensus window depth as the number of haplotypes, by assuming that most windows cover all haplotypes and contain haplotype-specific mutations. For the contigs shown in Fig. 8c, window depth 3 is the most frequent one.

In the implementation, we first sort all the windows in descending order of window length. Then we choose the most frequent window depth of the top X windows as the number of haplotypes. The default value of X is 50 in our implementation. We will present the results of haplotype number estimation using consensus window depth for both simulated and real quasispecies data.

Step 2: contig clustering based on relative abundance distribution

Let the number of haplotypes estimated by step 1 be N. The final goal of the pipeline is to divide the contigs into N groups so that each group contains contigs originating from the same haplotype. In this step, we conduct clustering on subcontigs from windows with depth N.

Unlike many existing contig clustering tools, our clustering is not applied to a complete contig. Because each contig can contain both haplotype-specific region and shared regions among different haplotypes, using the read coverage profile of the whole contig will confuse the clustering algorithm and makes the convergence slow or leads to wrong assignment of the objects. For example, in Fig. 8c, the contig “ A

S2” contains mutation A from one haplotype and also a shared region S2. Thus, significantly more reads will be mapped to the shared region and make the coverage for this contig highly heterogeneous. Thus, the objects as input to the clustering algorithm are “sub-contigs” in windows of depth N, where the sub-contigs are substrings of the contigs in these windows. They are more likely to represent the relative abundance of one haplotype. After clustering on sub-contigs, a post-processing step will be applied to get the cluster membership for each contig based on its sub-contigs’ memberships.

The clustering algorithm we adopt is prototype-based clustering and is essentially an augmented K-means algorithm. In a standard K-means algorithm, the centroid of the objects in a cluster is the prototype of the cluster. In our algorithm, the prototype is a distribution that is derived from the sub-contigs and empirically describes the relative abundance distribution.

The clustering algorithm will assign each sub-contig to one cluster based on the posterior probability of the abundance distribution. Although different clustering methods such as Gaussian mixture model can be applied to cluster the sub-contigs, the augmented K-means as shown below has the fastest convergence with better clustering accuracy according to our tests. Before we describe the main components, we first introduce the notations. The average relative abundance (denote as \documentclass[12pt]{minimal}

\usepackage{amsmath}

\usepackage{wasysym}

\usepackage{amsfonts}

\usepackage{amssymb}

\usepackage{amsbsy}

\usepackage{mathrsfs}

\usepackage{upgreek}

\setlength{\oddsidemargin}{-69pt}

\begin{document}$\bar {c}$\end{document}c¯) for a sub-contig ci in a window of depth N is calculated as:

1\documentclass[12pt]{minimal}

\usepackage{amsmath}

\usepackage{wasysym}

\usepackage{amsfonts}

\usepackage{amssymb}

\usepackage{amsbsy}

\usepackage{mathrsfs}

\usepackage{upgreek}

\setlength{\oddsidemargin}{-69pt}

\begin{document} $$\bar{c_{i}} = \frac{S(c_{i})}{\sum_{j=1}^{N}{S(c_{j})}}$$ \end{document}ci¯=S(ci)∑j=1NS(cj)

where S(ci) is the total number of reads covering sub-contig ci. Similarly, we can calculate the position-specific relative abundance vector \documentclass[12pt]{minimal}

\usepackage{amsmath}

\usepackage{wasysym}

\usepackage{amsfonts}

\usepackage{amssymb}

\usepackage{amsbsy}

\usepackage{mathrsfs}

\usepackage{upgreek}

\setlength{\oddsidemargin}{-69pt}

\begin{document}$\vec {c}$\end{document}c→ for a sub-contig ci as

2\documentclass[12pt]{minimal}

\usepackage{amsmath}

\usepackage{wasysym}

\usepackage{amsfonts}

\usepackage{amssymb}

\usepackage{amsbsy}

\usepackage{mathrsfs}

\usepackage{upgreek}

\setlength{\oddsidemargin}{-69pt}

\begin{document} $$\vec{c_{i}}[\!k] = \frac{c_{i}[\!k]}{\sum_{j=1}^{N}{c_{j}[\!k]}}, k=1..|c_{i}|$$ \end{document}ci→[k]=ci[k]∑j=1Ncj[k],k=1..|ci|

where ci[ k] represents the reads coverage at position k of sub-contig ci. |ci| is the number of bases in the sub-contig.

N is the number of bins or haplotypes estimated by Step 1. VirBin utilizes the position-specific relative abundances of sub-contigs in windows with depth N to estimate the probability that a sub-contig belongs to a bin. Let the N bins be H1,H2,…,HN. Let \documentclass[12pt]{minimal}

\usepackage{amsmath}

\usepackage{wasysym}

\usepackage{amsfonts}

\usepackage{amssymb}

\usepackage{amsbsy}

\usepackage{mathrsfs}

\usepackage{upgreek}

\setlength{\oddsidemargin}{-69pt}

\begin{document}$\tilde E_{i}(x)$\end{document}E~i(x) be the probability density function of relative abundance for sub-contigs from bin i, where x is an observed abundance.

The iterative clustering algorithm contains four steps as shown below:

Initialization: Initialize N groups by randomly assign sub-contigs to them.

Updating the empirical abundance distribution\documentclass[12pt]{minimal}

\usepackage{amsmath}

\usepackage{wasysym}

\usepackage{amsfonts}

\usepackage{amssymb}

\usepackage{amsbsy}

\usepackage{mathrsfs}

\usepackage{upgreek}

\setlength{\oddsidemargin}{-69pt}

\begin{document}$\tilde E_{i}$\end{document}E~i: For each bin i, the component sub-contigs’ relative abundance profiles \documentclass[12pt]{minimal}

\usepackage{amsmath}

\usepackage{wasysym}

\usepackage{amsfonts}

\usepackage{amssymb}

\usepackage{amsbsy}

\usepackage{mathrsfs}

\usepackage{upgreek}

\setlength{\oddsidemargin}{-69pt}

\begin{document}$\vec {c}$\end{document}c→s are aggregated to calculate the empirical probability density function \documentclass[12pt]{minimal}

\usepackage{amsmath}

\usepackage{wasysym}

\usepackage{amsfonts}

\usepackage{amssymb}

\usepackage{amsbsy}

\usepackage{mathrsfs}

\usepackage{upgreek}

\setlength{\oddsidemargin}{-69pt}

\begin{document}$\tilde E_{i}$\end{document}E~i. The aggregation is performed by calculating the normalized histograms for these relative abundance profiles, so that the summation of histogram values will be 1. The number of bars of the histogram is a parameter that can be set by users. The default number is 1000. Given an abundance value x, it is normalized to its closest bar value to get the corresponding probability \documentclass[12pt]{minimal}

\usepackage{amsmath}

\usepackage{wasysym}

\usepackage{amsfonts}

\usepackage{amssymb}

\usepackage{amsbsy}

\usepackage{mathrsfs}

\usepackage{upgreek}

\setlength{\oddsidemargin}{-69pt}

\begin{document}$\tilde E_{i}(x)$\end{document}E~i(x).

Re-assignment of the sub-contigs: Once \documentclass[12pt]{minimal}

\usepackage{amsmath}

\usepackage{wasysym}

\usepackage{amsfonts}

\usepackage{amssymb}

\usepackage{amsbsy}

\usepackage{mathrsfs}

\usepackage{upgreek}

\setlength{\oddsidemargin}{-69pt}

\begin{document}$\tilde E_{i}$\end{document}E~i is derived, the relative likelihood of cj being produced from the ith prototype distribution can be calculated as \documentclass[12pt]{minimal}

\usepackage{amsmath}

\usepackage{wasysym}

\usepackage{amsfonts}

\usepackage{amssymb}

\usepackage{amsbsy}

\usepackage{mathrsfs}

\usepackage{upgreek}

\setlength{\oddsidemargin}{-69pt}

\begin{document}$\tilde E_{i}(\bar {c_{j}})$\end{document}E~i(cj¯). The prior probability of each bin (or haplotype) is a weighted sum of the likelihoods of all the component sub-contigs. The weights are determined by the total bases in the sub-contigs. This weighted sum enables us to incorporate both the total sub-contig bases and also the associated abundance generation likelihood for estimating the prior probability of a particular haplotype. The prior probability Pr(Hi) for bin i is:

3\documentclass[12pt]{minimal}

\usepackage{amsmath}

\usepackage{wasysym}

\usepackage{amsfonts}

\usepackage{amssymb}

\usepackage{amsbsy}

\usepackage{mathrsfs}

\usepackage{upgreek}

\setlength{\oddsidemargin}{-69pt}

\begin{document} $$Pr(H_{i}) = \frac{\sum_{c_{j} \in H_{i}} \tilde E_{i}(\bar{c_{j}}) |c_{j}|}{\sum_{i= 1.. N}\sum_{c_{j} \in H_{i}} \tilde E_{i}(\bar{c_{j}}) |c_{j}|}$$ \end{document}Pr(Hi)=∑cj∈HiE~i(cj¯)|cj|∑i=1..N∑cj∈HiE~i(cj¯)|cj|

With both likelihood and prior from iteration t, the expected probability that cj belongs to haplotype Hi at iteration t+1 can be calculated as likelihood∗prior, that is

4\documentclass[12pt]{minimal}

\usepackage{amsmath}

\usepackage{wasysym}

\usepackage{amsfonts}

\usepackage{amssymb}

\usepackage{amsbsy}

\usepackage{mathrsfs}

\usepackage{upgreek}

\setlength{\oddsidemargin}{-69pt}

\begin{document} $$P(c_{j} \in H_{i}|\bar{c_{j}}) \propto E_{i}(\bar{c_{j}})*Pr(H_{i})$$ \end{document}P(cj∈Hi|cj¯)∝Ei(cj¯)∗Pr(Hi)

With the posterior probabilities calculated for each group distribution, we can reassign the sub-contig ci to the haplotype with the maximum posterior probability. The same reassigning procedures are applied for all the sub-contigs. With the assignment results, the distribution Ei and prior probability P(Hi) can be updated.

Iteration: Iterate step 2 and 3 until the clustering results do not change or the maximum number of runs have been achieved. The default maximum number of runs is 100.

Step 3: post-processing

The output of the augmented K-means is the clustered sub-contigs. For each cluster, its average abundance is calculated as the weighted average of the abundances of all sub-contigs in the cluster and the weight is determined by the length of a sub-contig. The haplotypes’ abundances are the average abundances of the clusters.

As each contig can contain multiple sub-contigs, which could have different membership, the contig’s membership is determined by the dominant membership of its sub-contigs. For example, if a sub-contig is not in the window of depth N, it is not an input to the clustering step and will not be clustered. This could happen when a region of a contig is common to multiple haplotypes. It is also possible that the sub-contigs of a contig are assigned to different clusters, which could be caused by assembly errors.