
Host factor prioritization for pan-viral genetic perturbation screens using random intercept models and network propagation

Introduction

Genetic perturbation screens, such as RNA interference (RNAi) and CRISPR-Cas9 screens, allow for the detection of host dependency and restriction factors by perturbing a target gene or transcript and observing its impact on the life cycle of a pathogen. In RNAi screens, genes are perturbed with small interfering RNAs (siRNAs). These are 20-25 nucleotides in length, complementary to mRNAs, and cause post-transcriptional gene silencing [1, 2]. The absence of certain host proteins has been shown to have an impact on the life cycle of pathogens [3, 4, 5], e.g., by reducing the ability of the pathogen to grow or by enhancing it.

Positive-sense ssRNA viruses (in the following also called group IV viruses according to the Baltimore classification), such as the Hepatitis C virus, all share some common steps in their replication cycle. First, the virus enters the host cell and releases its RNA genome into the cytoplasm. Translation of the RNA results in the expression of viral (nonstructural) proteins that assemble into a replication complex that drives the synthesis of new viral RNA. Newly synthesized genomic RNA is encapsulated by capsid protein. Eventually, new virions are assembled and released from the infected cells [7, 8, 9]. For virtually all of these steps, the virus strongly depends on host proteins because small RNA virus genomes have limited coding capacity. Another common feature of +RNA viruses is that their RNA synthesis takes place in specialized structures that are associated with modified host membranes. In order to understand the virus-host interplay, reliable identification of potential host factors involved in virus replication is crucial.

However, statistical inference of these host factors is often complicated for multiple reasons. For example, siRNA-mediated knockdown can cause off-target effects such that not only the transcript of interest is degraded but also other transcripts, resulting in a phenotype that is not gene-specific [11, 12, 13, 14]. Furthermore, in cell-based assays different cellular states or cell contexts might lead to heterogeneous readouts [15, 16, 17].

So far, statistical identification of host factors has been conducted either for single viruses [4, 8, 18, 19, 20], for two viruses of the same genus [5, 21] or family [22, 23], or for a group of only very remotely related pathogens. Prioritizing host factors on a viral group level, such as the group of positive-sense ssRNA viruses, has until now not been pursued in detail, even though it seems promising, because viruses of the same group often have very similar replication cycles. Pathogens of one group might utilize the same, or at least functionally related, host factors and cellular pathways for replication. Consequently, development of anti-viral drugs targeting common host factors would have the potential for broad-spectrum activity. Despite this potential, there are only very few pan-viral drugs under clinical investigation, for instance inhibitor development for PI4Kβ targeting various human enteroviruses. One of the reasons could be that the overall success rate for inferring pan-viral hits seems to be low, since even for single viruses the identified host or restriction factors have been shown to be highly variable between different studies (e.g. between and). Interestingly, if hits found against one virus are tested against other viruses of the same group, it may well be observed that they are effective in the other viruses as well, which supports the hypothesis that analyses on a pathway level could be promising or even necessary approaches.

Yet in most studies, statistical analysis is limited to gene- or siRNA-wise hypothesis tests, e.g., using t-tests or hyper-geometric tests [26, 27, 28, 29], and does not consider a priori information in the form of biological networks, such as protein-protein interaction networks or co-expression networks. Network approaches have been used for various gene prioritization tasks [30, 31, 32, 33], but have so far received little attention in virology. For instance, Maulik et al. have presented a clustering approach to detect modules in a bipartite viral-host protein-protein interaction network to identify host factors. Amberkar et al. use a meta-analysis approach based on network modules for RNAi screens. Wang et al. use a scoring system based on integration of several RNAi screens to account for false positives and negatives. However, while these approaches include a priori knowledge, they cannot be used to detect genes on a pan-pathogen level.

Here, we present a two-stage procedure for pan-pathogen host dependency and restriction factor identification (Fig 1), and apply it to RNAi screening data sets comprising four different positive-sense ssRNA viruses, i.e. Hepatitis C virus (HCV), Chikungunya virus (CHIKV), Dengue virus (DENV) and SARS-coronavirus (SARS-CoV). First, we apply a maximum likelihood approach for joint analysis of viral host factors using a random effects model. Then, we propagate this information over a biological graph using network diffusion with Markov random walks in order to account for genes of importance on a pathway level, reduce the number of false negatives and possibly stabilize the ranking of host factors. With our approach it is possible to detect novel pan-pathogen host factors, while also considering prior information in the form of networks. Our model has been designed for heterogeneous data sets by accounting for various confounding factors within the data. When applying our method to six different RNAi screening data sets of the four positive-sense ssRNA viruses, CHIKV, DENV, HCV and SARS-CoV, we found that the procedure is able to recover the host factors for single viruses that have been described in the literature before, and to predict novel pan-pathogen host factors. For the host factors for which compounds were commercially available, we performed experimental validation using pharmacological inhibition screens for five viruses, i.e., HCV, DENV, CHIKV, Middle-East respiratory syndrome coronavirus (MERS-CoV) and Coxsackie B virus (CVB). Moreover, we validated the newly predicted host factors, UBC, EP300 and PLCG1, using another siRNA knockdown on the Hepatitis C virus.

Gene effect ranking

The model defined in Eq (1) allows identification of potential host dependency and restriction factors on a pan-pathogen level, i.e., detection of host genes that potentially alter and impact pathogen growth. The strength of the effect of a gene knockdown (the effect size) on the replication cycle of a group of pathogens is given by the estimated random effect γg for a gene g. A negative gene effect γg < 0 means that knockdown of gene g restricts viral replication. A positive gene effect γg > 0 means that knockdown of gene g promotes viral replication. Furthermore, we estimate the pathogen-specific gene effect as ρvg = γg + δvg.
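The sign convention and the pathogen-specific effect ρvg = γg + δvg can be illustrated with a small sketch. The gene and virus names and all numbers below are hypothetical placeholders, not estimates from this study, and the helper names are ours.

```python
def pathogen_specific_effects(gamma, delta):
    """Combine pan-viral effects gamma[g] with virus-specific deviations
    delta[(v, g)] into rho[(v, g)] = gamma[g] + delta[(v, g)]."""
    return {(v, g): gamma[g] + d for (v, g), d in delta.items()}


def direction(effect):
    """Interpret the sign of an effect as described in the text."""
    if effect < 0:
        return "knockdown restricts replication"
    if effect > 0:
        return "knockdown promotes replication"
    return "no effect"


# Hypothetical values for illustration only.
gamma = {"GENE_A": -1.0, "GENE_B": 0.5}
delta = {("V1", "GENE_A"): 0.25, ("V2", "GENE_A"): -0.5, ("V1", "GENE_B"): -0.75}

rho = pathogen_specific_effects(gamma, delta)
for (virus, gene), value in sorted(rho.items()):
    print(virus, gene, value, direction(value))
```

In this toy input GENE_B has a positive pan-viral effect but a negative effect for V1, which shows how a pan-viral estimate and a pathogen-specific estimate can point in different directions.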

Gene effect network propagation

We employ network diffusion to inform our estimates on a pathway level after inference, in order to account for host genes missing in the analysis (for instance, unscreened genes) and potential false negatives, and to stabilize gene rankings using prior information. The diffusion is applied after estimation of gene effect sizes using the random effects model from Eq (1). The Markov random walk is run over a network of genes where edges represent biological relationships. These relationships can, for example, be encoded as interaction strengths between proteins, gene co-expression patterns, or common transcription factor binding sites. Using network diffusion it is possible to spread the information of single starting nodes, i.e. genes for which gene effects γg have been estimated (Eq (1)), to their surrounding neighbours, so that further candidate genes enter the list of host factors and the ranking of genes given by their effect strengths γg becomes more stable.

Instead of choosing neighbors of a gene directly, which would potentially introduce false positives, Cowen et al. argue that a diffusion approach has the advantage of down-weighting new predictions that are only supported by few edges or edges with low weight. Furthermore, genes that are connected to the prior list of genes by several edges or edges with high weights have stronger support.

We initialize the starting distribution over N network nodes of the Markov chain as:

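$$p_0 = \left( \frac{|\gamma_1|}{\sum_i |\gamma_i|}, \ldots, \frac{|\gamma_G|}{\sum_i |\gamma_i|}, 0, \ldots, 0 \right)^{T}, \qquad (2)$$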

where G ≤ N is the number of genes estimated using Eq (1), i.e. the number of genes with estimated effects γg. Using p0 the Markov chain is run until convergence with updates,

$$p_t = (1 - r)\, W p_{t-1} + r\, p_0, \qquad (3)$$

where r is a user-defined restart probability, i.e., the chance that the random walk returns to its initial state, and W is a left stochastic transition matrix derived from a biological network. In this study we use the functional protein interaction network from. They define a functional interaction as one in which two proteins are involved in the same biochemical reaction as an input, catalyst, activator, or inhibitor, or as two members of the same protein complex, i.e. functionally significant molecular events in cellular pathways and not mere protein-protein interactions, which rarely show direct evidence of being involved in biochemical events. The network consists in part of expert-curated, high-quality functional edges and in part of edges that have been trained and validated with a naive Bayes classifier. Unlike in many other biological networks, the high quality of the annotations means that edges do not have to be selected with care, for example to exclude edges derived from computational annotation or from inference with older yeast-two-hybrid technologies, which are frequently false positives. Moreover, due to the biological interpretability of the edges in a pathway context, a functional network like this should serve as a good choice to infer novel restriction and dependency factors and stabilize our rankings, because it associates genes connected with a disease and separates genes with mere physical interaction as in conventional pairwise networks. We normalize the weighted adjacency matrix of this network to be left stochastic and use the normalized matrix as transition matrix W. After convergence of the Markov chain, we use its stationary distribution p∞ as the new ranking of host factors by sorting genes accordingly.
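The updates in Eqs (2) and (3) can be sketched in a few lines of dependency-free Python. This is a minimal illustration on a toy four-node path graph, not the implementation used for the analysis; the function names, the convergence tolerance and the toy adjacency matrix are assumptions made for the example.

```python
def column_normalize(adj):
    """Scale every column of a weighted adjacency matrix to sum to 1,
    which yields a left stochastic transition matrix W."""
    n = len(adj)
    col_sums = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    return [[adj[i][j] / col_sums[j] if col_sums[j] > 0 else 0.0
             for j in range(n)] for i in range(n)]


def random_walk_with_restart(adj, gene_effects, r=0.35, tol=1e-10, max_iter=10_000):
    """Iterate p_t = (1 - r) W p_{t-1} + r p_0 until the L1 change is below tol.

    adj          -- N x N weighted adjacency matrix (list of lists)
    gene_effects -- estimated effects of the first G <= N nodes; the
                    remaining N - G nodes start with probability 0 (Eq (2))
    """
    n = len(adj)
    w = column_normalize(adj)
    total = sum(abs(g) for g in gene_effects)
    p0 = [abs(g) / total for g in gene_effects] + [0.0] * (n - len(gene_effects))
    p = p0[:]
    for _ in range(max_iter):
        p_next = [(1 - r) * sum(w[i][j] * p[j] for j in range(n)) + r * p0[i]
                  for i in range(n)]
        if sum(abs(a - b) for a, b in zip(p_next, p)) < tol:
            return p_next
        p = p_next
    return p


# Toy path graph 0 - 1 - 2 - 3; only nodes 0 and 1 have estimated effects.
adj = [[0, 1, 0, 0],
       [1, 0, 1, 0],
       [0, 1, 0, 1],
       [0, 0, 1, 0]]
p_inf = random_walk_with_restart(adj, gene_effects=[-2.0, 1.0])
ranking = sorted(range(len(p_inf)), key=lambda i: p_inf[i], reverse=True)
```

Nodes 2 and 3 start with zero probability but receive mass through the diffusion, which is how unscreened genes can enter the ranking. Because the starting distribution uses absolute effect sizes, the resulting ranking carries no sign information.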

For a random walk on a network that uses restarts, the length of the walk, l, i.e., the number of edges it travels, can be modelled as a geometric random variable:

$$\Pr(l) = (1 - r)^{l-1}\, r, \quad l \in \{1, 2, \ldots\}$$

that is parametrized by a success probability r ∈ (0, 1] and models the number of Bernoulli trials l needed for a success. The mean of the geometric distribution, E[l] = 1/r, directly relates to the average length of the random walk. For instance, choosing a success probability of r = 0.5 results in on average 2 trials until success. For a success probability of r = 0.2 the average number of trials is E[l] = 5, which yields an average path length of 5. Consequently, choosing a high success probability automatically reduces the average number of edges travelled and ranks the starting genes higher than genes farther away. We chose a restart probability of r = 35%, opting for on average approximately 3 travelled edges. Restart probabilities higher than 50% prioritize the data over the network information, while restart probabilities lower than 20% give too much weight to the prior knowledge.
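For completeness, the mean walk length follows from the standard power series identity $\sum_{l \ge 1} l\, x^{l-1} = (1 - x)^{-2}$ for $|x| < 1$, applied with $x = 1 - r$:

$$E[l] = \sum_{l=1}^{\infty} l\, (1 - r)^{l-1}\, r = \frac{r}{\bigl(1 - (1 - r)\bigr)^{2}} = \frac{1}{r}.$$

For the restart probability chosen here this gives $E[l] = 1 / 0.35 \approx 2.86$, i.e. roughly 3 travelled edges, while the bounds of 50% and 20% mentioned above correspond to average walk lengths of 2 and 5 edges, respectively.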

Data simulation

We simulated data using the procedures described in Supplement S2 and S3 Texts. Briefly, we sampled random vectors of effects for genes, viruses and screen types and took all possible combinations over the three random vectors. Then, we replicated every observation 8 times to guarantee convergence of the solver. We created three data sets by adding low, medium, or high i.i.d. Gaussian white noise (ϵ ∼ N(0, σ²), σ² ∈ {1, 2, 5}, respectively) separately to every observation.
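A simplified sketch of this simulation scheme is given below. The exact procedure is described in the Supplement; here we assume, purely for illustration, standard normal effects, an additive combination of the three effects, and infection stage as the screen type. The function name and default sizes are hypothetical.

```python
import itertools
import random


def simulate_screen(n_genes=20, n_viruses=4, screen_types=("early", "late"),
                    n_replicates=8, noise_var=1.0, seed=0):
    """Simulate readouts for all gene x virus x screen-type combinations.

    Assumptions (not taken from the Supplement): effects are drawn from
    N(0, 1) and combined additively before noise is added.
    """
    rng = random.Random(seed)
    gene_fx = [rng.gauss(0, 1) for _ in range(n_genes)]
    virus_fx = [rng.gauss(0, 1) for _ in range(n_viruses)]
    type_fx = {s: rng.gauss(0, 1) for s in screen_types}
    noise_sd = noise_var ** 0.5
    rows = []
    for g, v, s in itertools.product(range(n_genes), range(n_viruses), screen_types):
        signal = gene_fx[g] + virus_fx[v] + type_fx[s]
        for rep in range(n_replicates):
            # i.i.d. Gaussian noise is added separately to every replicate.
            rows.append({"gene": g, "virus": v, "screen_type": s,
                         "replicate": rep,
                         "readout": signal + rng.gauss(0, noise_sd)})
    return rows, gene_fx


# Three data sets with low, medium and high noise variance.
data_sets = {var: simulate_screen(noise_var=var)[0] for var in (1, 2, 5)}
```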

Performance measures for stability analysis

We bootstrap every simulated or biological data set 10 times. For every bootstrap sample we sort the gene effects from the hierarchical model by their absolute effect sizes, and the equilibrium distributions of the network diffusion by their probabilities. For every bootstrap sample j we take the top n ∈ {10, 25, 50, 75, 100} gene effects as well as the top n equilibrium probabilities. We then take each pair (j, k) of bootstrap samples and compare the top n gene effect vectors and highest n equilibrium probability vectors. For every pair (A, B) of the top n elements of either gene effects or equilibrium distributions, we compute the Jaccard index as J(A, B) = |A ∩ B| / |A ∪ B| and Spearman’s correlation coefficient (Supplement S2 Text and Supplement S1 Code).
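The pairwise comparison can be sketched as follows. The Jaccard index follows the definition above. How the Spearman correlation is aligned between two top-n lists is specified in the Supplement; this sketch assumes it is computed over the union of both top-n gene sets, and all function names and toy scores are hypothetical.

```python
from itertools import combinations


def top_n(scores, n):
    """Names of the n genes with the largest absolute score."""
    return sorted(scores, key=lambda g: abs(scores[g]), reverse=True)[:n]


def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out


def spearman(x, y):
    """Pearson correlation of the ranks of x and y."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


def pairwise_stability(bootstrap_scores, n):
    """(Jaccard, Spearman) for every pair of bootstrap samples."""
    results = []
    for s1, s2 in combinations(bootstrap_scores, 2):
        t1, t2 = top_n(s1, n), top_n(s2, n)
        genes = sorted(set(t1) | set(t2))  # assumption: compare on the union
        rho = spearman([abs(s1[g]) for g in genes], [abs(s2[g]) for g in genes])
        results.append((jaccard(t1, t2), rho))
    return results


# Two toy bootstrap samples with hypothetical gene effects.
samples = [{"A": 2.0, "B": -1.5, "C": 0.1}, {"A": 1.8, "B": -0.2, "C": 1.0}]
stability = pairwise_stability(samples, n=2)
```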

Performance measures to assess predictive performance

We use 10-fold cross-validation to compare the predictive performance of our random effects model (Eq (1)) and PMM. We split the data into ten folds, iteratively train on nine folds, and predict gene effects on the held-out test fold. Finally, we compute the mean squared error for every fold for each of the two models (Supplement S3 Text and Supplement S1 Code).
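The cross-validation loop itself is independent of the model being evaluated and can be sketched as below. Note that the per-gene mean predictor is only a stand-in so that the example runs; it is neither the random effects model of Eq (1) nor PMM, and all names are hypothetical.

```python
import random


def k_fold_mse(rows, fit, predict, k=10, seed=0):
    """Return one mean squared error per fold.

    rows    -- list of (features, readout) pairs
    fit     -- function mapping training rows to a model
    predict -- function mapping (model, features) to a predicted readout
    """
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    errors = []
    for test_idx in folds:
        held_out = set(test_idx)
        train = [rows[i] for i in idx if i not in held_out]
        model = fit(train)
        squared = [(predict(model, rows[i][0]) - rows[i][1]) ** 2 for i in test_idx]
        errors.append(sum(squared) / len(squared))
    return errors


# Stand-in model (NOT the paper's model): predict the training mean per gene.
def fit_gene_means(train):
    sums = {}
    for gene, readout in train:
        total, count = sums.get(gene, (0.0, 0))
        sums[gene] = (total + readout, count + 1)
    return {gene: total / count for gene, (total, count) in sums.items()}


def predict_gene_mean(model, gene):
    return model.get(gene, 0.0)


# Noise-free toy data: 5 genes, 20 replicates each, readout equal to gene id.
rows = [(gene, float(gene)) for gene in range(5) for _ in range(20)]
errors = k_fold_mse(rows, fit_gene_means, predict_gene_mean, k=10)
```

Passing a different `fit` and `predict` pair evaluates another model on exactly the same folds, which is what makes the per-fold errors of two models comparable.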

Data sets and normalization

We integrated data from six RNAi perturbation screens covering the four positive-sense ssRNA viruses HCV, DENV, CHIKV and SARS-CoV. These screens were generated under different biological conditions (Table 1). Following the definition in Eq (1), we distinguish different stages of infection, i.e., either ‘early’ when the screen was conducted for detection of host factors that are essential for viral entry and replication, or ‘late’ when the host factors are required for viral assembly and release. The screens were conducted on MRC5 cells for CHIKV, Huh7 cells for DENV, Huh7.5 cells for HCV, and 293/ACE2 cells for SARS-CoV. The screens used either libraries of Dharmacon SMART-pools (4 siRNAs per well/gene) for CHIKV and SARS-CoV or unpooled Ambion libraries for HCV and DENV. We filtered the six RNAi data sets for genes that are available for every virus, which left a data set with a total of 714 genes and controls (Fig 1). For each of the screens, siRNAs were placed on 384- or 96-well plates. Cells were seeded and, after transfection with siRNAs, infected with the respective reporter virus (Table 1). Univariate readouts are measurements of either viral or reporter protein (GFP/Luciferase).

In order to have comparable phenotypes, i.e., fluorescence and luciferase readouts, special emphasis has to be put on normalizing the screens, because different cell types (MRC5/Huh7/Huh7.5/293/ACE2) can lead to slightly different gene expression and knockdown patterns. Furthermore, in addition to high between-screen variability in RNAi perturbations, high variance between plates from the same screen has to be taken into consideration (Fig 2). Before normalization, plates are not comparable due to highly varying plate effects (Fig 2a). After normalization, the data are in a final step centered and scaled to unit variance, yielding comparable phenotypes (Fig 2b).

High variability of phenotypes is mainly due to batch effects, stochasticity in transfection and knockdown, and spatial effects in rows and columns, i.e., when wells on the margin on average have higher or lower readouts compared to wells in the center. To account for these effects, we use a combination of different normalization techniques for every screen separately (Supplement S1 Text for details). Briefly, the CHIKV and SARS-CoV screens use a pooled Dharmacon library on 96-well plates. We normalized the two data sets by first taking the natural logarithm over all samples, then subtracting the mean background signal and finally computing a robust Z-score over the whole plate readout. The procedure has been applied separately for every plate. Since genes were not randomized on plates we did not use B-scoring or other methods that account for spatial effects [2, 29, 38]. For the HCV and DENV genome screens, we computed the natural logarithm for every readout of the complete data set, B-scored the plates using two-way median polish and, in a last step, calculated robust Z-scores. The HCV and DENV kinome screens have been normalized by first taking the natural logarithm of the well readouts and then fitting a local regression model to correct for cell counts. Since the HCV and DENV screens have randomized plate designs, we also corrected for spatial effects by B-scoring with two-way median polish and eventually computed robust Z-scores (see Supplement S1 Code for the exact procedures).
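The two building blocks named above, two-way median polish and the robust Z-score, can be sketched as follows. This is a generic illustration of the techniques, not the code from Supplement S1 Code: the function names, the fixed number of polish iterations, and folding the B-score scaling into the final robust Z-score step are assumptions. The factor 1.4826 is the usual constant that makes the median absolute deviation (MAD) consistent with the standard deviation under normality.

```python
import math
from statistics import median


def robust_z(values):
    """Robust Z-score: (x - median) / (1.4826 * MAD)."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        raise ValueError("MAD is zero; robust Z-score is undefined")
    return [(v - med) / (1.4826 * mad) for v in values]


def median_polish_residuals(plate, n_iter=10):
    """Two-way median polish: alternately subtract row and column medians
    and return the residuals, which are free of additive row/column effects."""
    res = [list(row) for row in plate]
    for _ in range(n_iter):
        for row in res:
            m = median(row)
            for j in range(len(row)):
                row[j] -= m
        for j in range(len(res[0])):
            m = median([row[j] for row in res])
            for row in res:
                row[j] -= m
    return res


def normalize_plate(plate):
    """log -> median polish residuals -> robust Z-score over the whole plate.
    Raises ValueError if the residuals have zero MAD."""
    logged = [[math.log(v) for v in row] for row in plate]
    residuals = median_polish_residuals(logged)
    flat = [v for row in residuals for v in row]
    z = robust_z(flat)
    n_cols = len(plate[0])
    return [z[i:i + n_cols] for i in range(0, len(z), n_cols)]
```

Median polish only makes sense when positions on the plate carry no biological meaning, which is why it is restricted here to the screens with randomized plate designs.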

Stability analysis

The models described by Eqs (1) and (3) estimate gene effects γg and an equilibrium distribution p∞g for every gene g. To assess the reproducibility of these estimates, i.e., the consistency of the rankings of gene effects and equilibrium distributions, we applied the model to several simulated data sets as well as to the pan-viral biological data set introduced above.

Simulated data. We simulated data as described before and validated the consistency of the rankings of these data sets (Fig 3a). For low error variances the stability of both the random effects model and the network diffusion is high between bootstrap samples. Increasing the error levels for the hierarchical model only seems to reduce the Jaccard index, while the Spearman correlations remain stable. For high error levels and the first n = 10 genes, two sets of bootstrap samples have on average 60% similarity and a correlation of around 90% for the random effects model. The network diffusion, on the other hand, seems to be robust to increasing error variances, with similar Jaccard indexes and correlations for medium and high error variance, which supports the previous argument regarding the stabilizing function of the network diffusion.

Biological data. We performed a similar analysis on the biological data set. Instead of comparing different noise levels, we examined how the number of viruses influences the different rankings. We bootstrapped the data set again and computed the Jaccard index and Spearman’s correlation coefficient for every pair of bootstrap samples. For both models, increasing the number of viruses from 2 to 4 does not significantly alter the Jaccard indexes for any number of genes (Fig 3b). However, increasing the number of viruses reduces correlations for both models. While the reductions are only marginal for higher gene numbers for the random effects model, they are stronger for the network diffusion. Lower correlations can be explained by the fact that RNAi screens are highly variable and, as a consequence, different bootstrap samples give varying estimates of gene effects.

Analysis of predictive performance

In order to validate the predictive performance of the random effects model on novel data, we used a simulated data set and the biological data set as before, and benchmarked the predictive performance using 10-fold cross-validation. We compare our method against another random effects model, called PMM.

Simulated data. We created three data sets using the procedure described in Supplement S3 Text. As before, the data sets can be distinguished by the amount of noise that has been added to every observation. Our hierarchical model consistently outperforms PMM for different levels of variance and different validation methods (Supplement S4a Fig). This is largely due to the fact that our model was tailored to heterogeneous RNAi screens in which different infection stages are present, while PMM does not make this distinction.

Biological data. For the biological analysis we used the integrated pan-viral RNAi screen as before. In this benchmark, our model slightly outperforms PMM (Supplement S4b Fig). Our model achieves a lower mean residual sum of squares on all test sets. Furthermore, increasing the number of viruses from two to four leads to a decrease of the mean residual sum of squares.

Gene effect ranking

Given the results from the stability analysis and analysis of predictive performance, we concluded that the proposed random effects model is preferable to PMM, because it captures more of the variance in the data, for instance, when strong infection stage effects are visible, and because it allows distinguishing between genes that influence the viral replication cycle in the early stages of replication and those that act in the later stages.

We applied the hierarchical model to the pan-viral data set and inferred the gene effects γg (of which the top 25 are shown in Supplement S5 Text). We then used the estimated gene effects γg and propagated these using the Markov random walk described in Eq (3). After diffusion we obtain a ranking of all genes in the network (Table 2). While the majority of genes had already been selected by the random effects model, we also discovered novel hits, such as UBC (rank 1), EP300 (rank 9), and PLCG1 (rank 13), using the network diffusion. Among the strongest effectors derived from the hierarchical model are DYRK1B (rank 3), a nuclear-localized protein kinase participating in cell-cycle regulation, and PKN3 (rank 11), a little-studied kinase that has been implicated in Rho GTPase regulation and PI3K-Akt signaling. UBC encodes ubiquitin, which is involved in numerous cellular processes, most prominently protein degradation. PLCG1 is crucially involved in signal transduction from receptor-mediated tyrosine kinases (e.g. Src) and catalyzes the formation of the second messengers IP3 and DAG. Recently PLCG1 was also found to impact progression of HCC, the HCV replication cycle, as well as receptor-mediated inflammation and innate immunity. EP300 is an acetyltransferase that acts as a transcriptional co-activator and has not been studied in detail so far.

We compared the strongest gene effects γg inferred by the hierarchical model (Supplement S5 Fig) to the virus-specific gene effects ρvg (for which RNAi screens have mostly been used; Fig 4) and found that for some of the estimates for the gene effects γg the pathogen-specific effects are not consistent over all pathogens. For example, while perturbation of gene CDK5R2 has a beneficial impact on CHIKV replication, it has a restricting effect on the other three viruses. On the other hand perturbation of DYRK1B, PKN3, CDK6, or CSNK2B has either an all-negative or all-positive impact on the replication cycle of the ensemble of viruses. Genes that upon perturbation show the same consistent effect, i.e. suppression of early or late stages of the viral replication cycle, could be targets for the development of broad-spectrum antiviral drugs.

Validation of identified host factors

We validated some of the top genes from Table 2 using pharmacological inhibitors to verify whether the predicted genes are indeed host factors that are involved in viral replication. In short, we searched the literature for inhibitors and conducted a screen for the proteins for which compounds were commercially available (see Supplement S4 Text for details on the experimental setup and Supplement S6 Fig for results). In order to assess if the top inferred gene products really have a pan-viral effect, inhibitors were tested on DENV, CHIKV and HCV as before and two novel positive-strand ssRNA viruses, MERS-CoV and CVB. Of the top 20 host factors from Table 2, inhibitors were available for the dependency factors CAMKK2, CDK5R2, DGKE, DUSP1, DYRK1B, PIK4CA, PKN3 and PLK1. The inhibitors were tested in dose-response CPE reduction assays on cells infected with the viruses. In parallel we assessed cytotoxicity of the compounds and discarded measurements that led to a significant reduction in cell viability (below 75% of the signal obtained for untreated control cells). For every host factor, virus and compound concentration, we tested if inhibition of a protein significantly reduced viral replication in comparison to a negative control (one-sided two-sample Wilcoxon test). We adjusted all p-values for multiple testing using the Benjamini-Hochberg correction. We found that inhibition of several host factors showed significant reductions in replication on subsets of the five viruses and specific compound concentrations. For instance, CDK5R2, PKN3 and DYRK1B were significant at the 10% level after multiple testing correction for at least some compound concentrations in four of the five viruses. However, none of the tested compounds had a significant effect on the replication of all of the five viruses (Supplement S6 Fig). Note that PLK1 was discarded due to cytotoxicity of the inhibitor at higher compound concentrations. For that reason, we point out that PLK1 should possibly also be discarded in the analysis of the primary screens.
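The Benjamini-Hochberg adjustment applied to the p-values can be sketched as follows. This shows the standard step-up procedure only; the p-values in the example are made up for illustration and are not results from the inhibitor screen.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order.

    The i-th smallest of m p-values is scaled by m / i; taking a running
    minimum from the largest p-value downwards keeps the result monotone.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values, e.g. one per compound concentration.
raw = [0.01, 0.04, 0.03, 0.20]
adjusted = benjamini_hochberg(raw)
significant = [p <= 0.10 for p in adjusted]
```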

Furthermore, we validated the three genes that were newly identified by the network model (UBC, PLCG1, EP300) for HCV using two different siRNAs per gene. In particular, we were interested to see whether knockdown of these three genes would significantly impact viral replication (see Supplement S5 Text for experimental details, data normalization and statistical analysis). We found that knockdown of UBC and PLCG1 caused a significant inhibition of replication at a level of α = 5% (Fig 5) in comparison to a negative control for all tested siRNAs (two-sided two-sample Wilcoxon test). However, EP300 was not confirmed at the same significance level for both siRNAs tested.

Discussion

In this work, we have integrated RNAi screening data of a group of four different positive-sense ssRNA viruses and presented a two-stage procedure to prioritize pan-viral host dependency and restriction factors from genetic perturbation screens. The result of our method is a ranking of genes that are predicted to impact the life cycle of an entire group of pathogens. We implemented the two-stage procedure in an R-package called perturbatr which is designed for the analysis of large-scale high-throughput perturbation screens of multiple data sets and is available on GitHub and Bioconductor.

We experimentally validated the host factors for which pharmacological inhibitors were commercially available by treating cells infected with five positive-sense ssRNA viruses with these compounds, and additionally validated the three newly predicted genes by another siRNA knockdown on HCV.

Our procedure first infers a list of possible host factors using a random effects model where we model the readout of a genetic perturbation screen as a linear dependency on a virus, a pan-viral gene effect γg, and a sum of other random effects to capture the heterogeneity of the data. With a likelihood-based formulation, jointly analyzing RNAi screens of different viruses is straightforward in comparison to a meta-analysis, since in the latter case every virus is analyzed independently and results have to be aggregated, thereby potentially discarding common host factors. Furthermore, the noise model and the inclusion of random effect terms allow accounting for high variance in the data sets.

The list of gene effects γg is then propagated over a functional interaction network using a Markov random walk with restarts. Functional interaction networks, such as the one used in this study, allow incorporating true biological associations in a pathway context into the analysis and stabilizing the rankings. By subsequently applying a network diffusion approach it is possible not only to account for genes that have not been in the primary RNAi screens, but also to re-rank genes using pathway information, which potentially reduces the number of false negative predictions.

The analysis produced a set of host factors, such as DYRK1B, UBC, PLCG1 and PKN3, that likely impact the replication cycle of a broad range of positive-sense ssRNA viruses. Of the top 20 host factors (Table 2), we were able to find commercially available compounds for nine of them, which we then biologically validated. While the screen confirmed the importance of these genes for the replication cycle of subsets of viruses, no host factor could be found that is significant for all viruses. In general, viruses usurp defined cellular pathways. Even closely related viruses may use different entry points to the pathway. One example is the pair of Dengue and Zika viruses, which both depend on the host factor STT3A, but only DENV requires STT3B for replication. The degree of similarity of the molecular biology of the viruses seems to determine the success of finding pan-viral genes in contrast to finding relevant pathways. While it makes theoretical sense that all positive-sense ssRNA viruses use the same host factors, detection of these has proven to be complicated and, as already mentioned in the introduction, yields variable results even for the same virus. A lack of overlap between screens, flexibility of the cell in several aspects and the possibility of viruses to just take different routes to achieve replication corroborate this hypothesis and make pathway analyses even more important. The broader the targeted group of viruses, the more central a target gene would have to be (e.g. UBC), but in that case it gets increasingly unlikely to find an inhibitor condition that only harms the virus but not the host cell. For bacteria, antibiotics are only specific to a more or less related group of bacteria (e.g. gram-positives), because of the metabolic similarity of the group. For viruses, it is likely that these groups need to be much narrower because in many cases only closely related viruses might actually share enough similarity in the metabolic or regulatory pathways they exploit. Additionally, it has to be emphasized that a protein inhibition screen like the one we conducted cannot fully validate the inferred genes and their function in the replication cycle of the viruses. Thus a more rigorous validation could shed light on the biological importance of these genes.

We validated the three newly found host factors, UBC, PLCG1, and EP300, using siRNA knockdown for HCV and could confirm UBC and PLCG1 to be proviral host factors. Generally, host dependency and restriction factors are not necessarily crucial for host cell survival, i.e. host factors can be knocked down without inducing cell death. Exceptions are single candidates such as UBC, which is a central player in cell biology. Ubiquitination of proteins can target them for degradation in the proteasome, which is an important homeostatic process in every cell. The proteasome has come up frequently as a host factor for many viruses, albeit not always with the same genes. Although the proteasome is vital for the cell, its inhibition is already used therapeutically, for instance in cancer treatment, or in studies for antiviral treatment. Consequently, the inhibition of host factors that are also crucial for the host cell can be achieved, even though it is a matter of fine balancing between cytotoxicity to the cell and efficacy against disease.

The proposed procedure to infer pan-pathogen host factors could aid in the development of broad-spectrum antiviral drugs for a group of viruses or even bacteria that could allow the treatment of multiple diseases (Table 2) with the same substance. In addition, our model generates estimates of gene effect sizes for the single viruses.

In this work we selected a group of positive-sense ssRNA viruses for analysis. The replication cycles of any subgroup of positive-sense ssRNA viruses consist of notably similar steps and, given the similarities of how they replicate, we hypothesized that they share the same host dependency or restriction factors or, at least, the same pathways (hence the network analysis). While our model can be applied to any group of pathogens, finding relevant host factors for a highly diverse group of pathogens is less likely to succeed. In addition, the experimental design of such a study, a factor which we did not emphasize enough, is critical: contributing factors might be quality of interventions, number of replicates, or the type of readout, e.g. GFP signals of viral growth or cell death, or even sequencing data in CRISPR screens.

Our two-stage procedure also has some limitations. In our case the integrated data set showed strong heterogeneity and variance between the different biological conditions, which necessitated the inclusion of random effects. For data sets with less variance a random effects model might not be needed at all. Moreover, utilizing biological prior knowledge in the form of protein-protein interaction networks could possibly bias and corrupt results, especially when networks with incorrect edges are used. The use of multiple, different networks may improve this situation.

Since we apply a stochastic approach for network diffusion we cannot gain information about whether genes are dependency or restriction factors. This could be addressed by developing a network diffusion model applying state probabilities for pro-viral and anti-viral effects. Finally our method does not provide estimates for statistical significance for the genes, but only a ranking of genes.

Currently our model can be used for RNAi screens with continuous readouts, but can readily be generalized to sequencing-based perturbation screening methods, such as CRISPR, where read counts are usually modelled as negative binomial or Poisson random variables.