Disease emergence and bioterrorism events, especially since 2001, have highlighted some of the shortcomings of traditional surveillance, generally based on laboratory test results and direct reporting. Focus has shifted to earlier detection of pathogen introduction in human or animal populations, leading to the implementation of new techniques using data sources upstream of those typically used in traditional surveillance, especially pre-diagnosis data that are already available and automatically collected, such as sales of over-the-counter medicine, absences from work or school, and patients’ chief complaint upon visits to an emergency center.
Due to the lack of specificity of pre-diagnostic data, surveillance systems using this information target general groups of diseases, or syndromes, and are therefore often referred to as “syndromic surveillance”. Grouping pre-diagnostic data into syndromes is the first step of implementing a syndromic surveillance system. Valid, reliable, and automatic classification of syndromes was an essential component of early computerized epidemic detection systems. When data are structured using standardised codes, such as the Logical Observation Identifiers Names and Codes (LOINC®) used in laboratories, the International Classification of Diseases (now on its 10th revision, ICD-10), or the Systematized Nomenclature of Medicine (SNOMED®), syndrome classification can be performed by mapping those codes into syndromes. However, text mining or other machine learning tools can be invaluable when free-text or semi-structured data are being used. Naïve Bayes classifiers have frequently been used in syndromic surveillance when the input data are chief complaints (free text typed in by nurses) at emergency facilities.
Rule-based methods were widely used before the computational capacity of common computers made it possible for machine learning methods to be widely adopted. Nevertheless, they have remained a popular choice in the health field due to their transparency and interpretability. In the 2008 challenge organized by i2b2 (Informatics for Integrating Biology to the Bedside), which consisted of automatic classification of obesity and comorbidities from discharge summaries, the top ten solutions were dominated by rule-based approaches, demonstrating their efficacy.
Decision trees are a third type of classification algorithm, recommended when results must be delivered to a broader audience, such as health workers, as they are also relatively simple to interpret. Other machine learning algorithms used in the medical field include Artificial Neural Networks (ANN) and Support Vector Machines (SVM). These methods are powerful, but both adopt a “black-box” approach, so that the way in which decisions are made by the classifier is not transparent. They have been used in more complex medical tasks, such as the interpretation of radiographs and studies of drug performance. However, to the authors’ knowledge, the use of these algorithms to classify health data for the purposes of syndromic surveillance has not been documented in the peer-reviewed literature.
In contrast to laboratory test results, on which traditional surveillance is based, laboratory test orders can be a valuable data source for syndromic surveillance, since they are collected and stored electronically in an automated manner, but are more timely for surveillance purposes than laboratory test results. Laboratory submission data have, for example, been incorporated into CDC’s BioSense Early Event Detection and Situation Awareness System. Moreover, because there are fewer laboratories than sites of clinical care, the use of laboratory databases can provide more complete records covering larger areas. Besides changing focus to early diagnosis, modern surveillance systems are evolving into complete biosurveillance systems. This term is intended to imply a broadening focus, addressing not only human health but all conditions that may threaten public health, such as a disruption in the food supply, or large social and economic disruptions resulting from outbreaks of diseases in animals. Besides their role in the food supply and agricultural economy, animals could serve as sentinels for the detection of certain zoonotic diseases that may be recognized earlier in animals than in humans.
Animal data have been incorporated into a few surveillance systems for human populations, including the Electronic Surveillance System for the Early Notification of Community-based Epidemics (ESSENCE), the North Dakota Electronic Animal Health Surveillance System and the Multi-Hazard Threat Database (MHTD). Glickman et al. (2006) and Shaffer et al. (2008) have investigated the value of animal health data as sentinels for public health. Despite the less frequent requests for laboratory analyses made by veterinarians compared to human clinicians, the authors hypothesized that, “the consistency of test orders over time is such that increases in cases of disease will result in detectable increases in the number of test orders submitted by veterinarians that can be identified using prospective analysis” (Shaffer, 2008, page 2).
An overview of the development of syndromic surveillance systems in the veterinary context has been provided in a recent review of the literature. This review indicated that initiatives using laboratory data had been based on establishing direct relationships between test codes and syndromic groups. The use of clinical data has typically relied on the syndrome definition being provided by the veterinarian. Machine learning or rule-based methods applied to the identification of syndromes in animal health data had not been documented. This paper describes the exploratory analysis of such methods to extract syndromic information from laboratory test requests submitted to a veterinary diagnostic laboratory. These steps are part of the development of a syndromic surveillance system taking advantage of the centralized, computerized, and routinely updated sources of data provided by the Animal Health Laboratory in the province of Ontario, Canada. The initial phase of implementation, described here, focused on cattle sample submissions.
The Animal Health Laboratory (AHL) at the University of Guelph is the primary laboratory of choice for veterinary practitioners submitting samples for diagnosis in food animals in the province of Ontario, Canada. The number of unique veterinary clients currently in the laboratory’s database (2008 to 2012) is 326. The AHL has a laboratory information management system (LIMS) that is primarily used for reporting the results of diagnostic tests.
Three years of historical data from the AHL were available, from January 2008 to December 2010. Cattle were chosen as the pilot species due to high volume of submissions from dairy and beef herds in Ontario. All laboratory test orders for diagnoses in cattle were extracted from the database; all farm identification elements had been removed from these data.
Test requests are entered into the AHL database on a daily basis. Individual test requests are recorded as unique data entries. A common case code (submission number) is given to all samples from the same herd on any given day, allowing identification of samples related to the same health event. In human health, a case usually refers to one person at a time, such that two people with the same medical complaint, living in the same household and submitting samples on the same day, would be counted as two cases. In veterinary medicine, which often works with herds or flocks, samples submitted from one, two or more animals of the same type, from the same herd (“household”), with the same medical complaint on the same day would be counted as one case.
The nature of the diagnostic sample is identified in the database by two fields: the sample type field, in which the laboratory staff choose from a pre-set list (blood, feces, brain tissue, etc.); and the client sample ID, a free-text field used to enter the source animal identifier given by the client. The diagnostic tests are identified by codes pre-set in the system. All codes are textual.
Table 1 shows a sample of the data. Only the fields relevant for medical information extraction are shown. Submission numbers have been removed, but samples from the same submission are represented in the table with consecutive rows in the same shading.
All of the historical data available were reviewed manually to identify the potential for syndromic classification at the time of sample submission. Veterinarians do not often provide detailed case history information. Therefore the identification of syndromes was based only on the type of diagnostic test requested, and the type of sample submitted, which allowed identification of the organ system targeted for diagnosis.
A syndromic group was defined as a group of test requests that: (i) are related to diseases from the same organ system; (ii) are all diagnostic tests for the same specific disease, in cases of tests requested so frequently that their inclusion in another group would result in their being, alone, responsible for the majority of submissions; or (iii) tests that have little clinical relevance and should be filtered out (e.g., tests in environmental samples, general haematology profiles, as well as a range of “non-specific” submissions). Despite the absence of clinical information, the sample description allows identification of abortion cases through keywords such as “placenta” or “fetus”. “Abortion” is therefore the only syndromic group defined based on a clinical syndrome, rather than using the three criteria listed above. Based on those criteria, an initial list of syndromic groups was compiled and then reviewed by a pathologist (BJM), a bacteriologist (CAM) and a clinician (DK). Following this review, all historical data were manually classified into syndromic groups to serve as training examples for the machine learning algorithms. Syndromic definition and manual classification were discussed until consensus was achieved among all experts.
Each submitted case (one or more test requests from a herd on a given day) could have multiple types of samples and/or multiple diagnostic tests requested. Syndromic classification was performed for each individual database entry (test request), and later collapsed by case submission numbers, eliminating repeated syndromes within the same case. As a result, a given case could be associated with multiple syndromes by virtue of clues relating to multiple organ systems found in the same submission.
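The collapsing step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation; the submission numbers in the example are hypothetical, and the syndromic group labels are used only as placeholders.

```python
from collections import defaultdict


def collapse_by_case(classified_requests):
    """Group per-request syndromes by submission number, dropping repeats.

    classified_requests: iterable of (submission_number, syndromic_group)
    pairs, one pair per individual test request.
    Returns a dict mapping each submission number to the sorted list of
    distinct syndromic groups found in that case.
    """
    syndromes_per_case = defaultdict(set)
    for submission_number, syndrome in classified_requests:
        syndromes_per_case[submission_number].add(syndrome)
    return {case: sorted(groups) for case, groups in syndromes_per_case.items()}


# Hypothetical example: four test requests belonging to two cases.
requests = [
    ("S001", "Mastitis"),
    ("S001", "Mastitis"),  # repeated syndrome within the same case
    ("S001", "GIT"),
    ("S002", "Abortion"),
]
cases = collapse_by_case(requests)

# Number of "syndromic cases": one per distinct (case, syndrome) pair.
syndromic_cases = sum(len(groups) for groups in cases.values())
```

In this toy example, two cases yield three syndromic cases, because the first case carries clues for two different syndromic groups.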
Mapping of Test Codes
Based on the aforementioned list of syndromic groups, a list of all diagnostic test codes that could be mapped into a syndromic group was established. Mapping is used here to describe the direct relationship: “if test requested is X, then syndromic group is Y”, and mapping rules of this type were established for all test request codes that could be classified into only one syndromic group with certainty. This is typically the case for serological tests, where the veterinarian specifies the pathogen or disease to be confirmed, and the sample type is not informative of the organ system affected, as it is “serum” or “blood”.
This mapping was built as a model in RapidMiner 5.0 (Copyright 2001–2010 by Rapid-I and contributors), an open source data mining package, which provides tools for data integration, analytical ETL (extract, transform, load), data analysis and reporting. RapidMiner includes an option to code any learned model in XML format, which can subsequently be directly manipulated.
Observations where test code was not associated with any mapping rule were assigned “Unknown” as the syndromic group at this stage in the processing. These were test requests such as “bacterial culture”, which are not informative of the disease suspicion or organ system targeted by the veterinarian. These observations formed an unmapped subset of the data.
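The direct mapping step and the creation of the unmapped subset can be sketched as a simple lookup. The sketch below is illustrative only: the test codes and group names are hypothetical placeholders, and the actual mapping was built as a RapidMiner model rather than in Python.

```python
# Hypothetical test codes and syndromic groups, for illustration only.
TEST_CODE_TO_SYNDROME = {
    "SERO_A": "Group A",
    "SERO_B": "Group B",
}


def map_test_request(test_code, mapping=TEST_CODE_TO_SYNDROME):
    """Apply 'if test requested is X, then syndromic group is Y'.

    Codes without a mapping rule are labelled 'Unknown'.
    """
    return mapping.get(test_code, "Unknown")


def split_mapped_unmapped(test_codes, mapping=TEST_CODE_TO_SYNDROME):
    """Return (mapped, unmapped) lists of (test_code, syndromic_group)."""
    mapped, unmapped = [], []
    for code in test_codes:
        group = map_test_request(code, mapping)
        (unmapped if group == "Unknown" else mapped).append((code, group))
    return mapped, unmapped


# "BACT_CULT" stands in for an uninformative request such as a bacterial culture.
mapped, unmapped = split_mapped_unmapped(["SERO_A", "BACT_CULT", "SERO_B"])
```

Only the observations in the second list would be passed on to the text-mining classifiers.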
Algorithms for Automated Syndrome Classification
For the unmapped subset, text mining was used to separate all words found in the fields describing the sample type (client sample ID and sample type, Table 1) in the three years of available data. A tokenization process was applied using any non-letter character as a break point to separate words. The list of all mined words in the historical data was manually reviewed to construct a dictionary of medically relevant terms, as well as frequently used acronyms and common misspellings. This is similar to dictionary-building processes described in previous work.
Once the dictionary was built, all data tokenization was performed searching only for those specific tokens. For each observation being evaluated, the fields sample type and client sample ID were tokenized, and a vector was created to designate the binary occurrence of each word in the dictionary. These vectors were then used by the classifier algorithms to learn from the training dataset and to classify test data.
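A minimal sketch of the tokenization and binary vector construction is given below. The dictionary terms are hypothetical examples, and lower-casing the tokens is an assumption made here for convenience; the sketch also tokenizes first and then checks dictionary membership, which is a simplification of searching only for the dictionary tokens.

```python
import re

# Hypothetical dictionary of medically relevant terms.
DICTIONARY = ["milk", "feces", "fetus", "placenta", "lung"]


def tokenize(text):
    """Split on any run of non-letter characters and lower-case the tokens."""
    return [tok.lower() for tok in re.split(r"[^A-Za-z]+", text) if tok]


def binary_vector(sample_type, client_sample_id, dictionary=DICTIONARY):
    """Return a 0/1 list marking which dictionary terms occur in either field."""
    tokens = set(tokenize(sample_type)) | set(tokenize(client_sample_id))
    return [1 if term in tokens else 0 for term in dictionary]


tokens = tokenize("Milk-LF#2 cow12")
vector = binary_vector("Tissue", "fetus lung 23")
```

Because words are only separated at non-letter characters, two words typed without a space between them (for example "fetuslung") would form a single token and match neither dictionary term.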
The rule induction algorithm in RapidMiner [Repeated Incremental Pruning to Produce Error Reduction (RIPPER)] was used. Information gain was used as the criterion for selecting attributes and numerical splits. The sample ratio and pureness were set at 0.9 and the minimal prune benefit at 0.25. Using the XML model of rules induced by the RIPPER algorithm as a template, a manually modified set of rules was also explored.
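The general form of an ordered rule set applied to tokens can be sketched as follows. This does not reproduce RIPPER or the authors' custom rules: the rules for "milk" and "feces" are hypothetical, the first-match-wins ordering is an assumption, and "Nonspecific" is assumed here as the default label.

```python
# Ordered rules: (set of required tokens, syndromic group). First match wins.
RULES = [
    ({"fetus"}, "Abortion"),
    ({"placenta"}, "Abortion"),
    ({"milk"}, "Mastitis"),  # hypothetical rule
    ({"feces"}, "GIT"),      # hypothetical rule
]
DEFAULT_GROUP = "Nonspecific"  # assumed fallback when no rule fires


def classify(tokens, rules=RULES, default=DEFAULT_GROUP):
    """Return the group of the first rule whose required tokens are all present."""
    token_set = set(tokens)
    for required, group in rules:
        if required <= token_set:
            return group
    return default


example_abortion = classify(["fetus", "lung"])
example_default = classify(["tissue"])
```

Placing more specific rules (those requiring several tokens) ahead of general ones is one way a manually edited rule set can handle requests that contain multiple medically relevant words.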
The Naïve Bayes learner available in RapidMiner was used to develop and apply a Naïve Bayes classifier. The learner requires no parameter settings other than an indication of whether a Laplace correction should be used to prevent high influence of zero probabilities. Laplace correction was not used.
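To illustrate what the Laplace correction changes, the following is a small Naïve Bayes sketch over binary word vectors. It is a generic Bernoulli formulation written for illustration, not RapidMiner's implementation; the training vectors and labels are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(vectors, labels, laplace=False):
    """Estimate class priors and per-class P(term present) from 0/1 vectors."""
    n_terms = len(vectors[0])
    class_counts = Counter(labels)
    present = defaultdict(lambda: [0] * n_terms)
    for vec, label in zip(vectors, labels):
        for i, bit in enumerate(vec):
            present[label][i] += bit
    k = 1 if laplace else 0  # Laplace correction: one pseudo-count per outcome
    model = {}
    for label, n in class_counts.items():
        prior = n / len(labels)
        probs = [(present[label][i] + k) / (n + 2 * k) for i in range(n_terms)]
        model[label] = (prior, probs)
    return model


def _log(p):
    """Logarithm that maps a zero probability to minus infinity."""
    return math.log(p) if p > 0 else float("-inf")


def predict(model, vec):
    """Return the class with the highest log posterior score."""
    def score(label):
        prior, probs = model[label]
        return _log(prior) + sum(
            _log(p if bit else 1 - p) for p, bit in zip(probs, vec)
        )
    return max(model, key=score)


# Hypothetical training data over the dictionary ["milk", "feces"].
vectors = [[1, 0], [1, 0], [0, 1], [0, 1]]
labels = ["Mastitis", "Mastitis", "GIT", "GIT"]
model = train_naive_bayes(vectors, labels)
smoothed = train_naive_bayes(vectors, labels, laplace=True)
```

Without the correction, a word never seen with a class gets probability zero and rules that class out entirely; with the correction, the estimate is pulled away from zero (here from 0 to 0.25).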
Decision trees were constructed using gain ratio as the criterion for selecting attributes and numerical splits. The minimal size for split was set at 4, minimal leaf size 2, minimal gain 0.1, maximal depth 20, confidence 0.25, and up to 3 pre-pruning alternatives.
The XML code of the models used, as well as the set of customised rules for classification, are available upon request from the first author.
Assessing Algorithms Performance
Due to the large variability in the free-text entered by veterinarians to describe the samples submitted, it was deemed important to have a large test set, in order to assure that classification would be satisfactory once applied to new data. Manually classified historical data were split in half. After sorting sample submissions according to date and submission number, observations were alternately assigned to two different sets. Each classification algorithm was trained using one of the two sets, and then used to classify the alternative set. The process was then repeated switching training and test subsets.
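The alternating split and the two training and testing rounds can be sketched as below. The field names `date` and `submission` are hypothetical, and dates are assumed to be ISO-formatted strings so that they sort chronologically.

```python
def alternating_split(records):
    """Sort by date and submission number, then assign records alternately."""
    ordered = sorted(records, key=lambda r: (r["date"], r["submission"]))
    return ordered[0::2], ordered[1::2]


def two_fold_rounds(records):
    """Return the two (training set, test set) pairs, with roles switched."""
    set_a, set_b = alternating_split(records)
    return [(set_a, set_b), (set_b, set_a)]


# Hypothetical records, deliberately out of order.
records = [
    {"date": "2008-01-03", "submission": "S3"},
    {"date": "2008-01-02", "submission": "S1"},
    {"date": "2008-01-02", "submission": "S2"},
    {"date": "2008-01-04", "submission": "S4"},
]
rounds = two_fold_rounds(records)
first_train_ids = [r["submission"] for r in rounds[0][0]]
first_test_ids = [r["submission"] for r in rounds[0][1]]
```

Each record appears in the test set of exactly one round, so the pooled test results cover all of the manually classified data.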
Based on a comparison to the manual classification which had been carried out with the help of experts, the following performance measures were assessed for each classifier (using overall results from both test datasets): recall (the fraction of relevant instances correctly identified by the algorithm); precision (the fraction of the identified instances that were correct); and F1-score, the harmonic mean of recall and precision, i.e. $F_1 = 2 \cdot \mathrm{precision} \cdot \mathrm{recall} \cdot (\mathrm{precision} + \mathrm{recall})^{-1}$. After computing recall, precision and F1-score for each of the classes, these measures were averaged over all classes to give macro-averaged scores. An average weighted according to the number of records in each of the classes was also calculated, often referred to as micro-averaged scoring.
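The performance measures can be computed as in the sketch below, which follows the definitions given above. The labels in the example are hypothetical, and two choices are assumptions made here: classes are taken from the manual classification, and a class that is never predicted is given a precision (and F1-score) of zero.

```python
from collections import Counter


def per_class_scores(true_labels, predicted_labels):
    """Return {class: (precision, recall, f1, support)} for each true class."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1
            fn[truth] += 1
    support = Counter(true_labels)
    scores = {}
    for cls in support:
        predicted = tp[cls] + fp[cls]
        precision = tp[cls] / predicted if predicted else 0.0
        recall = tp[cls] / support[cls]
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[cls] = (precision, recall, f1, support[cls])
    return scores


def macro_f1(scores):
    """Unweighted mean of the per-class F1-scores."""
    return sum(s[2] for s in scores.values()) / len(scores)


def weighted_f1(scores):
    """Mean of per-class F1-scores weighted by the number of records per class."""
    total = sum(s[3] for s in scores.values())
    return sum(s[2] * s[3] for s in scores.values()) / total


# Hypothetical example: one of two "GIT" records is misclassified as "Mastitis".
truth = ["Mastitis"] * 8 + ["GIT"] * 2
predicted = ["Mastitis"] * 8 + ["Mastitis", "GIT"]
scores = per_class_scores(truth, predicted)
```

In this example the macro average (about 0.80) is lower than the weighted average (about 0.89), because the weaker small class counts as much as the large class in the former but contributes only in proportion to its size in the latter.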
Stability was investigated by producing slightly different training subsets (for instance removing small random samples from the training set, or eliminating individual syndromic groups at a time), and assessing the resulting difference in the performance of the classifier.
The three years of historical data contained 23,221 cases (samples from the same herd on a given day), consisting of a total of 218,795 individual test requests from cattle (i.e. bovine, dairy or beef animals of any age).
Based on an evaluation of these three years of historical data, and input from experts, the syndromic groups listed in Table 2 were defined. The table also lists the criteria for syndromic group creation and the number of test requests and cases assigned to each syndromic group following manual classification.
After classifying all sample submissions, and eliminating repeated syndromic instances within the same case, the final number of “syndromic cases” in the historical dataset was 30,760. Given that there were 23,221 initial herd investigations, this implies an average of 1.32 recorded syndromes per case. The distribution of syndromes per case is shown in Figure 1.
Of all the samples submitted, 75.7% (165,649) could be directly mapped into syndromic groups based on the test request information alone.
For the syndromic groups created based on clinical signs, non-specific signs or specific organ systems (see Table 2), Figure 2 illustrates the percentage of test requests which could be allocated to a syndromic group via direct mapping versus those that fell into the unmapped subset. Around 25% (53,146) of all instances in the database could not be directly mapped into a syndromic group and these provided the material for which automated classification was explored. Although these unmapped instances contain 16 of the original 22 defined syndromic groups, the syndromic group “Mastitis” alone is responsible for over 70% of these instances, and three groups (“Mastitis”, “Nonspecific” and “GIT”) account for over 90% of the data, as shown in Table 3. For the groups “Mastitis” and “GIT”, 94% and 77% of the unmapped observations, respectively, refer to the test “Bacterial culture”. Unmapped observations which are ultimately classified as “Nonspecific” contain a greater variety of test names, including the following which occur frequently: “Bacterial culture” (18%), “Histology” (27%) and “Necropsy” (18%).
The results of automated classification using different algorithms are shown in Table 4 and described in detail below.
The use of rule induction (RIPPER) achieved only moderate performance overall. Three groups with a low frequency of test requests (“Environmental samples”, “Skin”, and “Eyes and Ears”) were not included in the rules, but as shown in Table 3 these groups represent only 0.3% of all instances subjected to automated classification. The F1-macro average was 0.677, but because the unlearned groups account for such a small proportion of the submissions, when the classes’ performance is averaged accounting for the weight of each class, the F1-micro is 0.979 (Table 4). Upon manual review of the rules created by the algorithm, it was found that the main source of error was failure of the algorithm to establish good decision rules when multiple medically relevant words were found in the same test request. This method was easy to implement and the rules generated are transparent and easily interpreted.
The rules produced by the RIPPER algorithm were manually modified to account for some of the relationships missed, producing a set of custom rules. Running the custom rule set against the entire unmapped subset resulted in an F1-macro score of 0.997, and F1-micro score of 0.9995 (Table 4). The remaining errors tended to be due to use of abbreviations not common enough to have been incorporated in the rules, misspellings or the absence of a space between two words, resulting in the tokenization process failing to identify these words.
The performance of the Naïve Bayes algorithm was high (F1-macro of 0.955 and F1-micro 0.994), as shown in Table 4. The main performance issue associated with this algorithm was its instability. Slightly different datasets resulted in very different performances (results not shown). With unbalanced training and test datasets, for instance, rather than assigning the label “Nonspecific” to samples that could not be classified, the Naïve Bayes algorithm would assign these samples, as well as misclassified samples from other groups, into one of the groups with a small number of submissions.
The classifier based on Decision Trees performed reasonably well on the micro-averaged score (F1-micro score of 0.923). However, the classifier failed to learn 9 classes, which are biologically relevant despite accounting for only 2% of the unmapped instances (which explains the high micro average). Moreover, the models appeared to be unstable: slight changes in the training data could result in a completely different ‘shape’ of decision tree, and a similar phenomenon was observed when the initial parameters for minimal gain and confidence were varied.
Real-time monitoring of animal health data depends on establishing reliable models that reflect medical knowledge and that can be applied in an automated manner. Such models should be efficient, but also comprehensible to end users.
In this study the structured format of laboratory data, and the use of standard test codes, allowed for classification of approximately 75% of test requests into syndromic groups using direct mapping. For the remainder of the data, high performance (F1-macro = 0.997) was achieved through the use of a rule-based syndrome classifier. Induced rules had to be manually modified during the construction phase, but the result was clear interpretability of the decisions and of the resulting classification. While the use of rules was easy to implement and interpret, the construction of a dictionary of medically relevant terms and the manipulation of rules were time-consuming steps. Implementation of similar systems making use of other sources of laboratory data should become easier as standardized languages are more widely adopted in animal health laboratories, avoiding the repetition of this process for every new database.
The use of a custom rule set limits the potential for automatic revision of the classification model. Further research is required to establish internal validation rules, possibly based on the results available from historical data, in order to define automated ways to carry out model updates in the future.