Dataset: 11.1K articles from the COVID-19 Open Research Dataset (PMC Open Access subset)
All articles are made available under a Creative Commons or similar license. Specific licensing information for individual articles can be found in the PMC source and CORD-19 metadata
More datasets: Wikipedia | CORD-19

Logo Beuth University of Applied Sciences Berlin

Made by DATEXIS (Data Science and Text-based Information Systems) at Beuth University of Applied Sciences Berlin

Deep Learning Technology: Sebastian Arnold, Betty van Aken, Paul Grundmann, Felix A. Gers and Alexander Löser. Learning Contextualized Document Representations for Healthcare Answer Retrieval. The Web Conference 2020 (WWW'20)

Funded by The Federal Ministry for Economic Affairs and Energy; Grant: 01MD19013D, Smart-MD Project, Digital Technologies

Imprint / Contact

Highlight for Query ‹COVID-19 risk

In silico approach to screen compounds active against parasitic nematodes of major socio-economic importance

Recent efforts in anthelmintic drug discovery

An excellent review on the current anthelmintics and existing research gaps that need to be addressed in order to discover novel anthelminthic drugs are summarized recently by Keiser and Utzinger. Kaminsky et al.[13] reported a new class of synthetic anthelmintics, amino–acetonitrile derivatives (AADs) that are active against a variety of livestock pathogenic nematode species. The authors reported that the optimized AADs were able to eliminate fourth larval stages of H. contortus, T. colubriformis in sheep and Cooperia oncophora, Ostertagia ostertagi in cattle at a single oral dose of 20 mg racemate kg-1. The authors surmised that a unique group of nematode specific nAChR protein from acr-23 gene is responsible for AAD efficacy. Hu et al.[14] have demonstrated that the mechanism of action of a novel anthelminthic drug, tribendimidine, approved recently in P.R. China. They concluded that tribendimidine is an L-subtype nAChR agonist, similar to levamisole pyrantel. The anthelminthic properties of cyclooctadepsipeptides have also been reported recently in vitro and in vivo[15,16]. Mefloquine is an antimalarial drug and has been used successfully for past four decades to treat prophylaxis of malaria. However, recent research revealed promising antischistosomal properties of mefloquine in Schistosoma mansoni- and Schistosoma japonicum-infected mouse models. Ponce-Marrero et al. introduced a novel approach for in silico design of new anthelmintic drugs using linear discriminant analysis to obtain a quantitative model that classified anthelmintic drug-like from non-anthelmintic compounds. The developed model correctly classified 88.18% of the compounds in external test set. The model was then used for virtual screening and several compounds from Merck Index and Negwer’s handbook were identified by the model as anthelmintic. Subsequently in vivo test were carried out to validate the predictions.

Overview of the ligand-based virtual screening methods

Antiparasitic drugs historically have been discovered by experimental screening against intact parasites, but due to the enormity of the task and availability of better computational facilities there has been a shift towards computational screening. Computational screening (also known as virtual screening) has inherent advantage over traditional and even experimental high throughput screening (HTS) due to its massive parallel processing ability; millions of compounds per week can be tested. Virtual screening (VS) has been widely used to discover new leads by computationally identifying compounds with higher probability of strong binding affinity to the target protein. Successful studies have led to the identification of molecules either resembling the native ligands of a particular target or novel compounds. VS methods can be classified into structure-based and ligand-based approaches based on the amount of structural and bioactivity data available. If the 3D structure of the receptor is known, a structure-based VS methods that can be used is high-throughput docking but where the information on the receptor is scant, ligand-based methods like similarity searching and machine learning techniques are commonly used. Docking involves a complex optimization task of finding the most favourable 3D binding conformation of the ligand to the receptor molecule. Being computationally intensive, docking is not suitable for very large virtual screening experiments. On the other hand, ligand-based methods are popular because they are computationally inexpensive and easy to use. Furthermore, the assumption that structurally similar molecules exhibit similar biological activity than dissimilar or less similar molecules is generally valid. Thus, ligand-based methods are increasingly playing an important role at the beginning of the drug discovery projects especially where little 3D information is available for the receptor. Particularly interesting are machine learning based approaches such as neural networks, genetic algorithms and support vector machines (SVM). SVM is a powerful classification technique that has found numerous applications in chemistry such as drug design, quantitative structure property prediction and chemical data mining. Many studies in the past have shown SVM to be one of the best methods for correctly classifying molecules. Zernov et al. used SVM and neural networks to predict the drug-likeness and agrochemical-likeness for large compound collections. They showed that for both kinds of data, SVM outperformed all neural networks under the same training conditions. Warmuth et al.[25] investigated a large collection of compounds to find those that bind to the target of interest in as few iterations of biochemical testing as possible. The authors compared various search strategies including maximum margin hyperplane, generated by SVM. They concluded that the strategies based on SVM clearly outperform the simpler ones. Similarly, Burbidge et al. carried out a comparative study that involved prediction of the inhibition of dihydrofolate reductase by pyrimidines, using SVM, ANN and decision trees. They found that SVM outperformed the other methods, except in a manually capacity-controlled ANN, which required significantly longer training time. Nonetheless, ligand-based VS still remains an unproven approach in the discovery of antiparasitic medicines.

In this investigation, we have developed an in silico classification model using SVM to predict potential anthelmintic leads targeted towards parasitic nematodes. Our model has an estimated accuracy of ~82.0% for the test dataset. We have applied this model to a large public database to predict novel anthelmintic compounds and identified a set of 45 compounds, of which six are promising as potential therapeutic agents.

Preparation of the dataset

The quality of the data available largely determines the quality of any machine learning model. Our primary dataset contains 295 unique compounds (148 actives and 147 inactives). The library of active molecules (compounds active against parasitic nematodes) was carefully collated from PubChem and other literature sources. For inactive compounds, we searched the DrugBank database for similar molecules to the ones present in the active set with a Tanimoto cut-off range from 0.25 to 0.75. As a result, compounds from various pharmacological uses (anticancer, antibacterial, sedatives, antifungal) were collected into the inactive dataset. Since no true negatives (compounds without any anthelmintic activity) are reported in the literature, inactive compounds used in this study may possess residual anthelmintic activity. In Figure 1, we present representative active and inactive compounds used in this study for developing models. Further, the primary dataset was divided into training (80%) and testing sets (20%). The sampling was carried out at random and compounds in the test set were excluded from model development. In Table 2, we present the composition of the datasets used in this study. The training dataset was used for optimizing and training the SVM classifier in order to predict compounds from an unseen test set. The training dataset contains 240 compounds (126 active and 114 inactive). The test dataset on the other hand was used for evaluating the performance of the SVM method and contains 55 compounds (22 active and 33 inactive). All the training set and test set compounds are available in Additional file 1. Based on our previous study, where we reported that the ChEMBL database is quite diverse, contains many drug-like and interesting compounds, therefore, we used the ChEMBL database compounds for prediction set. Currently, the database holds over 650,000 compounds with calculated physicochemical properties (log P, molecular weight, Lipinski properties) and abstracted bioactivities (binding constant, pharmacology and ADMET data). We downloaded the ChEMBL dataset in SD format. After cleaning the dataset of any inconsistencies and inorganic structures, we removed the compounds with 0.8 or greater Tanimoto similarity to the compounds in primary dataset. Then we clustered the dataset to remove similar structures. Cluster centres were selected from each cluster while singletons were retained as such. For clustering, we employed the functional class substructural fingerprint as implemented in Pipeline Pilot software with the Tanimoto cut-off value 0.7. This reduced our dataset to around 300,000 compounds. Finally, we randomly selected 10,000 compounds from ChEMBL dataset for descriptor calculation and further analysis.

Defining scaffolds

In order to study the patterns in chemical compounds, it is important to decompose the molecules into fragments. There are a number of ways to fragment molecules as discussed elsewhere. We describe below the specific method used in this study to obtain molecular scaffolds, where the term scaffold describes the core structure of the molecule (carbon skeleton). To obtain the carbon skeleton of the molecule, all the heavy atoms are represented as carbon and all bonds are converted to single bonds as shown in Figure 2.

Descriptor calculation and selection

The determination of relevant features is an important step in any machine learning process. Moreover, with hundreds of descriptors available it is essential to choose the best subset of descriptors because many of the descriptors are noisy and some are irrelevant to the target activity. Feature selection is the effective way to remove noisy or irrelevant descriptors and reduce the dimensionality of the feature space to avoid overfitting. This leads to simple and robust computational models with improved prediction accuracy.

There are two main approaches for feature selection in a supervised learning context. The first one is the filter approach. It consists of selecting the best subset of features in an independent way, with ad hoc criteria. Filter methods are fast and can be easily implemented; however, there is no guarantee that the best subset of descriptors has been selected. The second method is the wrapper approach which evaluates the performance of a predetermined learning algorithm and uses it as an evaluation criterion to select the optimum subset of features.

The Molecular Operating Environment (MOE) software was used for descriptor calculation. It calculates 333 descriptors, which are classified as one-dimensional (physicochemical properties), two-dimensional (topological) and three-dimensional (volume and surface area) descriptors. In Figure 3 we show the overall methodology adopted for descriptor calculation and selection. Due to the large number of descriptors available, we first filtered out constant and near constant descriptors (descriptors with <0.3 standard deviation). This resulted in the removal of 81 descriptors. Following this, we removed descriptors with a correlation coefficient greater than or equal to 0.8. The removal of correlated descriptors resulted in a set of 113 descriptors. Before performing univariate analysis, we normalized the dataset using the z-transformation. We then performed the normality test and those descriptors that passed the normality test were retained while the others were rejected. This reduced our previous set of 113 descriptors to 34 descriptors. For further selection of descriptors, we used the Stepwise Discriminant Analysis (SDA) using a free data mining tool Tanagra. SDA is often associated with discriminant analysis but in fact, it can be applied to various linear models such as linear SVM and logistic regression. However, it is not suitable for non-linear models such as multi-layer neural networks and nearest neighbours. We implemented SDA with both forward and backward elimination strategies. In the forward approach, at each step, all the variables are evaluated to determine which variables contribute maximum to the discrimination between the groups. Variables with significant contributions are included and the process starts again till there is no attribute to add to the model. In the backward approach, all the descriptors are included in the model and then, at each step, the descriptor that contributes least to the discrimination is eliminated, terminating when there is no descriptor to remove. For our problem, we found that the forward approach performs better than the backward elimination strategy. We used F statistics as the termination criterion, with a predefined threshold value of 3.44, where the F value for a descriptor indicates its statistical significance to discriminate between the positive and negative data groups. This resulted in the selection of final 14 descriptors out of 34. In Table 3 we present the final 14 descriptors used in this study.

SVM algorithm

The SVM algorithm was developed by Vapnik. Recently, SVM has been applied to chemoinformatics, due to its robustness and ability to classify objects into two classes as a function of their features. Many studies in the past have shown SVM to be one of the best methods for correctly classifying molecules. A standard application of SVM involves defining two classes of objects, determining the set of features that distinguish these objects and use the trained SVM model to predict the classes of unknown data. Detailed accounts of the SVM methodology are present in literature. Briefly, SVM is a new algorithm and is based on structural risk minimization principle from statistical learning theory. Each molecule to be classified by SVM is represented by a feature vector xi (i=1,2…N) of M real numbers (descriptors) with the corresponding label yi ε {+1,-1}, where yi = -1 means inactive and yi = +1 means active. To classify the data, the SVM attempts to find the optimal hyperplane {x ε Rm: w.x +b =0} that best separates the input data into two classes in M dimensional space. The optimal hyperplane is defined in such a way that margin of separation between positive {x ε Rm: w.x +b ≥ 0} and negative {x ε Rm: w.x +b ≤ 0} examples is maximized with minimal error; where w is the normal vector of the hyperplane and b is the scalar. In other words, the optimal hyperplane passes through the “midpoint” between these sets. The decision function for new predictions on unseen examples is given in equation 1:(1)

where K(xi.xj) is the kernel function and the parameters are determined by maximizing the following equation 2:(2)

under the conditions (equation 3):(3)

The penalty constant C serves as a regularization parameter and represents the trade-off between minimizing the training set error and maximizing the margin. Higher number of support vectors is due to a small C and vice versa. If we use a very small C value, then almost all the samples would influence the model equally to build a decision boundary regardless of their position. As a result, virtually all the samples become support vectors. On the other hand, if we use a large C it may cause overfitting.

Since there are different types of kernels present (linear, polynomial, radial basis function, sigmoid) we explored various kernels for the efficacy of SVM prediction. From our analysis we note that radial basis function (RBF) kernel (equation 4) was found to be most effective (data not shown) therefore we have chosen the RBF kernel for further analysis.(4)

Two parameters viz., γ which determines the capacity of the RBF kernel and the regularization parameter, C are required for optimization of SVM classifiers. To optimize the SVM parameters, C and γ, we carried out an extensive grid search to build accurate models. The resulting optimized parameters were C = 1.4 and γ = 0.43.

Model validation

The prediction accuracy of the models developed was tested using ten-fold cross-validation technique. In a ten-fold cross-validation, the dataset was split into ten subsets of equal proportions. One of the subsets was used as the test set while the rest were used for training the classifier. The trained classifier was tested using the test set. This was repeated ten times using a different subset for testing and thus ensuring that every compound was used in prediction once.

Performance measure

The prediction results from SVM were evaluated for the test dataset using the following statistical measures.

• TP, true positive – the number of correctly classified active compounds.

• TN, true negative – the number of correctly classified non-active compounds.

• FP, false positives – the number of incorrectly classified non-active compounds.

• FN, false negative – the number of incorrectly classified active compounds.

Using the variables above, a series of metrics were computed sensitivity (SN), specificity (SP), balanced accuracy (BA), F−measure and Matthews correlation coefficient (MCC).

The recall rate for the members of positive class (actives) is given by sensitivity, equation 5:(5)

Similarly, the recall rate for the members of the negative class (inactives) is given by the specificity, equation 6:(6)

Accuracy measures the ratio of correct predictions to the total number of classes evaluated. We calculated balanced accuracy which is given by the equation 7:(7)

Further, we calculated the F−measure, which is given by equation 8:(8)

Finally we calculated MCC from equation 9; the coefficient returns a value between +1 and -1. The higher the value of MCC, the better the classification result.(9)

Results and discussion

The main aim of this study was to classify and predict novel compounds active against parasitic nematodes. The various molecular descriptors (333 in total) were calculated initially, using MOE. After removing insignificant attributes (standard deviation ≤ 0.3) and applying a correlation test with a cutoff value of 0.8 we were able to reduce the total number of attributes to 113. Subsequently the SDA algorithm was applied and finally a set of 14 descriptors was selected for the development of classification model (details in Methods section).

The obtained model correctly classified 87.56% of the active compounds and 85.30% of the inactive compounds with the overall accuracy of 86.43% in the training set while 81.82% in the test set. The F−measure of the training and test sets are 86.52% and 79.17% respectively. Table 4 shows the result of the classification for the training and testing sets. All the predicted compounds can be found in Additional File 2.

The machine learning systems such as this could clearly reduce the cost involved in experimental methods involved in drug discovery pipeline. As the SVM algorithm has been effectively applied in various classification problems, we investigated the utility of SVM approach for the prediction of potential anthelmintic lead compounds. The accuracy of the model on the training dataset may indicate the effectiveness of a prediction model however; it may not be able to accurately show how the model will perform on novel compounds. Therefore, it is critical to test the model on an independent dataset, not used in training. In our case we trained and optimized the SVM classifier separately using the entire training set and evaluated the model on the test set. As shown in Table 4, the SVM model obtained an accuracy of 81.79% for the test set. On careful examination of our prediction result, we find that structural similarity of many false positives to the compounds in the active set is quite high, which may suggest a lower accuracy figure for the test set, due to our stringent threshold values. Further, we also note that a few false negatives are at the borderline and are thus classified as inactive by our model. To best of our knowledge, there are not many reported studies on the prediction of anthelmintic compounds therefore we were able to compare our results with only one study. We find that our results are comparable to that study. Marrero-Ponce et al.[19] used linear discriminant analysis to classify anthelmintic drug-like from non-anthelmintic compounds. The authors reported the accuracy of around 90.4 % in the training set while 88.2% in the test set which is slightly higher than ours. However, we believe our model is more robust because our selection criterion to pick inactive compounds was quite stringent. We selected molecules within the Tanimoto range of 0.25 to 0.75 of the compounds present in the active set which would make it relatively difficult to classify than if chosen randomly. The idea was to build a robust model that can classify compounds into separate groups even with structural similarity. Further, we surmise that since DrugBank covers most of the FDA approved drugs, the inclusion of DrugBank compounds in our inactive dataset would allow us to navigate to the unexplored regions of drug-like chemical space.

The results obtained are particularly interesting from a clinical perspective. From our scaffold analysis we note that even though the size of both the dataset (active and inactive) is approximately same, the number of unique scaffolds found in the inactive set is almost twice the number of unique scaffolds found in active set. This clearly indicates that the inactive set is more diverse than the active set. The number of unique scaffolds, along with the relative percentage according to the total number of molecules present in the dataset is reported in Table 5. In Figure 4, we report the top ten molecular scaffolds in both the datasets. We note that, over 70.0% of the active compounds are represented by the top 10 scaffolds whereas only 51.1% of the inactive compounds are represented by the same number of scaffolds. This again suggests high scaffold diversity in inactive dataset. It should also be noted that five of the top ten scaffolds shown in Figure 4 are shared by both datasets.

In the 45 predicted compounds, we note that piperazine-like substructures appear frequently suggesting that the nitrogen atom in the piperazine ring might be involved in binding to the receptor. Figure 5 shows an example set of predicted active compounds. Also, we note that many predicted compounds either contain benzimidazole scaffold or are derived from it e.g. in Figure 5, six compounds out of twelve are a derivative product of the benzimidazole scaffold. This shows the validity of the above method since the benzimidazole class of compounds are well recognized for anthelmintic activity. Further, we searched the ChEMBL database for the binding affinity, assay type and target information of the identified compounds. We note that many predicted compounds bind to targets of interest in model organisms but experimental validation in the case of nematodes needs to be further carried out. Out of the total 45 predicted compounds six compounds are particularly interesting. Compound 3 with antiviral activity, compound 10 with inhibitory activity against Ancylostoma ceylanicum (a nematode), compound 12, compound 37 with antimicrobial activity against Staphylococcus aureus, compound 26 with activity to inhibit SARS-CoV 3CL protease enzyme and compound 40 with activity against Rhinovirus. In addition, there are compounds that bind to nicotinic acetylcholine receptor and tubulin β-1 chain in rats or humans. Since these two receptors are successful targets in nematodes, predicted compounds that bind to these targets can be used as leads to design novel compounds with high binding affinity to nematodes nicotinic acetylcholine and tubulin β-1 chain receptor.


We were able to compile an extensive dataset of anthelmintic compounds for the development and validation of support vector machine model. We thoroughly tested the SVM approach for identifying the potential compounds with anthelmintic activity. From our results we conclude that SVM method is well suited for the prediction of anthelmintic (or antiparastic) compounds. We were also able to identify a number of interesting compounds with potential activity against parasitic nematodes however; experimental validation of the predicted compounds is needed.

Authors' contributions

VK curated the datasets and conducted the analysis work, SR directed the study and both the authors prepared the manuscript.

Conflict of interest

None declared.