Toward unsupervised outbreak detection through visual perception of new patterns


Interest in syndromic surveillance has been fueled in recent years by the 9/11 attack on the US, which revived fears of bioterrorism, and by the threat of emerging diseases. Syndromic surveillance is defined as "an investigational approach where health department staff, assisted by automated data acquisition and generation of statistical alerts, monitor disease indicators in real-time or near real-time to detect outbreaks of disease earlier than would otherwise be possible with traditional public health methods"[1]. To achieve this goal, the places to observe and the databases to analyze must meet certain prerequisites: observations must be made where patients first seek immediate care, i.e., at general practitioners (GPs), hospital emergency departments (HEDs) or pharmacies. Data must be accessible in real time and collected on a routine basis. Good examples of such data are over-the-counter drug sales; visits to emergency care units, which, in most hospitals, are recorded in real time; and consultations with GPs in private practice, when a system of real-time data collection is available, as is the case for the French Sentinel Network.

Once the databases are available in appropriate settings, outbreak-analysis algorithms are needed. An outbreak is an unexpected cluster of cases of a certain category, given the past experience in the same place and under the same conditions. Thus, outbreak detection can be considered a problem of pattern recognition.

Since the advent of artificial neural network methods of pattern recognition, it has become customary to distinguish "supervised" from "unsupervised" methods. This distinction is particularly relevant in the context of outbreak detection: supervised methods are used when the pattern to recognize has been defined previously, e.g., the detection of seasonal influenza outbreaks. In this situation, data collection must rely on a well-defined set of symptoms. Then, a statistical algorithm is used to determine when an excess of observed cases indicates an outbreak. Numerous statistical techniques belonging to this class of supervised methods are routinely used in surveillance (e.g., periodic regression models, which are now available on the Web), as long as the patterns to recognize have been defined a priori. For example, the Centers for Disease Control and Prevention (CDC) defined 11 classes of syndromes associated with bioterrorism, and several syndromic surveillance algorithms were devised to optimally assign each new case to one of those classes.

While the supervised approach is straightforward, by definition, it only identifies those events that have been defined a priori. A second class of pattern-recognition techniques is the class of "unsupervised methods". In this class, the patterns to separate have not been defined previously. The challenge is to distinguish them – when they exist – from background "noise". In epidemiological terms, unsupervised approaches are suited to the detection of outbreaks of emerging diseases, for which no prior description is available, or bioterrorist attacks using unconventional biological weapons, i.e., modified biological agents causing novel unknown symptoms. Supervised methods are inapplicable in this critical role of epidemiological surveillance.

Herein, we describe an unsupervised outbreak-detection method that relies on the human visual capacity to detect new patterns. This strategy is based on two components: the first is an adequate visual representation of the clinical encounters during outpatient consultations with a GP or at an HED; the second is human, in that we suppose that "epidemiology watchers" could be trained to identify, on their air-controller-like monitors, the novel patterns that correspond to new epidemiological events of potential interest.

Outline of the method

The principle of the method relies on translating the medical linguistic information collected during the consultation into a visual signal. To do so, the first step is to encode that information using standardized medical terminology. For this study, we chose the International Classification of Primary Care, 2nd version (ICPC-2), which was specifically developed to code clinical consultations of patients in general practice. Then, these code counts, corresponding to the number of consultations with those symptoms, are presented within a graphic reference frame, which contains all the codes of the terminology ordered in such a way as to mimic body anatomy. This ordering facilitates the epidemiology watcher's interpretation of the images, which we call ICPCviews.


Each patient-practitioner encounter is described with a chain of linguistic information describing the chief purpose of the consultation. That information is then translated into ICPC-2 code. It can be coded automatically or by an expert.


In ICPC-2, the codes are arranged along three axes: symptoms, diagnoses and processes. Within these axes, 17 social-nosological categories (social, psychiatry, neurology, etc.) are defined. ICPC-2 comprises a total of 745 different codes, compared with the 10,795 codes making up the 10th International Classification of Diseases (ICD-10). In this study, we used the symptoms and diagnoses, corresponding to 685 codes.


The aim of the method is to provide an instantaneous visualization of the whole set of codes. Herein, we applied an approach that we previously used for other medical classifications: each ICPC-2 code is assigned to a cell located in the graphic reference frame defined below. A patient population is then represented as an image, in which each cell corresponds to one ICPC-2 code and the number of affected patients is materialized by the color of this cell.
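The step from coded consultations to an image can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `cell_of` mapping and the tiny 1 × 3 grid are hypothetical placeholders for the real reference frame, and codes without an assigned cell are simply ignored here, which is an assumption.

```python
from collections import Counter


def build_view(encounter_codes, cell_of, n_rows, n_cols):
    """Return a grid of per-cell consultation counts for one patient population.

    encounter_codes: iterable of ICPC-2 codes, one per coded consultation.
    cell_of: dict mapping a code to its (row, column) cell in the reference
             frame (hypothetical layout supplied by the caller).
    """
    counts = Counter(encounter_codes)
    view = [[0] * n_cols for _ in range(n_rows)]
    for code, n in counts.items():
        if code in cell_of:  # assumption: codes without a cell are skipped
            row, col = cell_of[code]
            view[row][col] = n
    return view


# Hypothetical layout: only the two respiratory codes named in the text.
layout = {"R06": (0, 0), "R09": (0, 2)}
view = build_view(["R06", "R09", "R06", "UNMAPPED"], layout, n_rows=1, n_cols=3)
```

The resulting grid holds raw counts; turning counts into colors is a separate step.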

The graphic reference frame (Figure 1A and Additional file 1) was built using three criteria. First, a binary criterion splits the reference frame into two symmetrical parts separated by a medial dividing line: the diagnoses are placed on the left and the symptoms on the right. Second, the different ICPC-2 headings are arranged vertically in successive rectangles, each representing a particular system, in which the codes are organized in rows corresponding to tumor pathologies, degenerative pathologies, traumas, inflammatory and infectious pathologies, and pathologies specific to the system. Third, individual ICPC-2 codes are placed horizontally. The codes that are easiest to recognize, because they are the most serious or because they correspond to well-defined nosological entities, are placed closest to the medial vertical line. For example, the code R06, epistaxis, is adjacent to the medial line, while the code R09, sinus symptom, has a more distant position. An enlargement of the respiratory system codes is shown in Figure 1B.

Each cell's color is coded by a number reflecting the frequency of the diagnosis/symptom, from white (absent) to bright red (= 255). Assume that, in a given cell, a frequency x was observed. The color code N(x) assigned to x is defined by N(x) = integer part of [(x / (Max - Min)) × 255], where Max and Min are, respectively, the maximum and minimum numbers of medical consultations associated with the symptom codes corresponding to the image cells for this population of patients.
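The color rule can be written out as a short sketch. It follows the formula as stated in the text; the clamping to the 0 to 255 range, the handling of a flat image (Max equal to Min), and the white-to-red RGB ramp in `to_rgb` are assumptions added for illustration only.

```python
def color_code(x, min_count, max_count):
    """N(x) = integer part of (x / (Max - Min)) * 255, as stated in the text."""
    span = max_count - min_count
    if span <= 0:
        return 0  # assumption: a flat image is drawn entirely white
    n = int(x / span * 255)
    # assumption: clamp, because x / (Max - Min) can exceed 1 when Min > 0
    return max(0, min(255, n))


def to_rgb(n):
    """Assumed ramp: 0 -> white (255, 255, 255), 255 -> bright red (255, 0, 0)."""
    return (255, 255 - n, 255 - n)


half = color_code(20, min_count=0, max_count=40)  # 20 / 40 * 255 = 127.5 -> 127
```

With Min equal to 0, an absent code maps to white and the most frequent code maps to bright red.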


We used two databases to illustrate our method: one from GPs in private practice and the other from an HED.

Data from GPs were obtained through the French Sentinel Network, which has monitored online a series of common communicable diseases since 1984 and, since August 1997, all patient referrals to hospitals by Sentinel GPs. A program then converts, when possible, these referrals, which are expressed in free text, into ICPC-2 codes. A total of 17,896 consultations were reported between 1997 and 2004: half were coded automatically by the software and the other half by a medical resident.

Data on outpatients consulting at an HED were obtained during 2006 (n = 45,055) in a major university hospital in Paris. The chief complaints of every consulting patient during 2006 were recorded in free text by the triage nurse. To determine whether characteristic patterns of influenza were visible before the 2006 outbreak, we selected 4 random subsets of 200 consultations each, corresponding respectively to: the week just before the flu period (week 3/2006), the week of the flu outbreak peak (week 6/2006), the week after the flu period (week 14/2006) and the rest of the year. The time of the epidemic peak and its duration were provided by the routine periodic regression software used for real-time data from the Sentinel Network. The chief complaints of these 800 patients were coded by an experienced ICPC-2-coding medical secretary.


The first example of epidemiological detection of events of interest obtained with the GPs' data, shown in Figure 2A, focused on the summer of 2003, when France was hit by a heat wave that killed more than 15,000 persons. A specific pattern (Figures 2A and 2B) corresponding to general and metabolic symptoms (fever, weakness, impaired general condition and dehydration) can be easily recognized.

We also analyzed the GPs' data concerning influenza-like surveillance over a 7-year period (see Additional file 2, which contains a slide show composed of 82 successive images). The visual changes of disease-associated color patterns generated by the successively entered codes that paralleled flu epidemics are apparent. Note that this example is not meant to suggest that the visual method should replace, in this instance, the classical supervised statistical outbreak-detection methods, which are used routinely. It is given as a proof of concept of the proposed unsupervised method, as it enables the recognition of influenza outbreaks that are objectively defined with the supervised techniques.

In the second example, we used data from an HED to evaluate whether the technique was able to detect an outbreak of flu-like disease in a timely manner in another setting, with different patients, coded differently. This example also shows that standard methods of image analysis (such as subtraction of images, smoothing, etc.) can be used in the present application. We examined the ICPCviews corresponding, as above, to: the week just before the outbreak (week 3/2006), the week at the outbreak peak (week 6/2006), the week after the outbreak (week 14/2006), and the rest of the year taken as a control period. We then subtracted the control ICPCview from each of the three ICPCviews being considered. The ICPCview pattern for the week preceding the outbreak clearly foreshadows that seen during the week of the outbreak peak (Figures 3A and 3B). This pattern combines general symptoms (fever) and digestive symptoms (abdominal pain and vomiting), which are to be expected during an influenza-like or gastroenteritis outbreak. At this point, field epidemiologists and reference laboratories are needed to finalize the characterization of the epidemic detected.
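The subtraction of a control ICPCview from a weekly ICPCview can be sketched as a cell-by-cell difference. This is an illustrative sketch under stated assumptions: both views are grids of counts built on the same reference frame from samples of equal size (as with the 200-consultation subsets), and the `excess_cells` helper with its `threshold` parameter is a hypothetical addition, not part of the described method.

```python
def subtract_views(view, control):
    """Return the signed cell-by-cell difference view - control."""
    same_shape = len(view) == len(control) and all(
        len(a) == len(b) for a, b in zip(view, control)
    )
    if not same_shape:
        raise ValueError("views must share the same reference frame")
    return [
        [a - b for a, b in zip(row_v, row_c)]
        for row_v, row_c in zip(view, control)
    ]


def excess_cells(diff, threshold=0):
    """List (row, col, excess) for cells exceeding the control by more than threshold."""
    return [
        (r, c, v)
        for r, row in enumerate(diff)
        for c, v in enumerate(row)
        if v > threshold
    ]


# Toy 2 x 2 views with made-up counts.
diff = subtract_views([[5, 1], [0, 3]], [[2, 1], [1, 0]])
hot = excess_cells(diff)
```

Keeping the signed difference leaves the choice of display (for example, showing only excesses) to a later step.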


The usual syndromic surveillance methods are supervised and based on statistical tools. Herein, we described a novel method that could be used when the supervised approach is not applicable. That situation occurs when we are faced with the detection of "unexpected" events, which, by definition, are of major interest for epidemiological alert. Indeed, our primary goal is to help recognize, as early as possible, totally unexpected epidemiological patterns. The detection-triggering signal can be the mere increase of an isolated diagnosis code. In that case, a regression method with a threshold would have performed better only under the very restrictive condition that this code had been identified in advance. However, the signal can also be an unusual association of different color patches on the monitor, which appears novel to the observer and triggers an in-depth epidemiological investigation.

Our proposed model is similar to what already happens in an air-traffic-control room: most of the routine tasks are now automated and the attention of the human observers is now focused on "unexpected" events. Likewise, we propose relying on the classical supervised methods for the usual situations that happen regularly (e.g., seasonal flu epidemics), and we seek to improve our detection of the unexpected epidemiological events that are extremely critical from a public health perspective, precisely because they are unexpected.

An important technical problem is the choice of the time resolution of the display on the monitor. For the French Sentinel Network, the time resolution was one month: that timeframe is clearly unsuitable for prospective surveillance and was only used to show the potential of our method to recognize a very special event (i.e., the health impact of the 2003 heat wave). Similarly, for the HED, the choice of weekly resolution was only illustrative and was imposed by the amount of data available per day, keeping in mind that several hundred cases are needed to create an informative image. In a real-world application, the choice of the temporal resolution would depend on the nature of the class of events to be identified: hourly resolution or, at worst, daily resolution would be desirable to recognize a terrorist attack-associated disease. The temporal resolution chosen is also tied to the spatial resolution, because the number of cases observed decreases as either the spatial or the temporal resolution becomes finer.

For example, the HED data we used were collected in real time. The hospital that provided those data has ~150 consultations per day. In contrast, a surveillance system based on the network of all Paris-region public hospitals (Assistance Publique-Hôpitaux de Paris), which collects real-time data on 4000 patients per day (i.e., ~150 patients/hour), would allow a much shorter timeframe, on the order of a few hours.
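The link between data volume and achievable time resolution can be made concrete with a small calculation. This is a rough sketch: the target of 500 cases per image is an assumed stand-in for "several hundred cases", and arrivals are assumed to be spread evenly over 24 hours.

```python
def window_hours(cases_needed, cases_per_day):
    """Hours needed to accumulate cases_needed at a steady daily rate."""
    if cases_per_day <= 0:
        raise ValueError("cases_per_day must be positive")
    return cases_needed / cases_per_day * 24.0


CASES_PER_IMAGE = 500  # assumption standing in for "several hundred cases"

single_hospital = window_hours(CASES_PER_IMAGE, 150)    # about 80 hours
regional_network = window_hours(CASES_PER_IMAGE, 4000)  # 3.0 hours
```

Under these assumptions, a single hospital needs several days per image, whereas the regional network reaches the same count within a few hours.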

Furthermore, for the method we propose, we chose to code diagnoses and symptoms with the ICPC-2 system, because it was developed precisely for primary care patients, who are the best target for surveillance of emerging diseases or bioterrorist attacks. However, the same paradigm developed herein could be used with other classification methods.

In a first test example, we showed that visual inspection of the ICPCviews obtained from Sentinel Network GPs' transmissions during the 2003 heat wave in France would likely have raised suspicion that something unusual was occurring at that time. Indeed, in light of the public health and political scandal that ensued, it is highly rewarding that the images generated with our model heralded the high morbidity and mortality (later documented) that passed unnoticed. At the time of the event, the only public health warnings came from newspapers and funeral parlors, not from the health information systems, which were therefore far from the ideal real-time systems we described above. Imagine that a wall of monitors had displayed patterns similar to those seen in Figure 2, derived from data collected throughout the country. We are convinced that trained "epidemiology watchers" would have detected the unexpected patterns and would have triggered the investigations that were so sorely lacking.

The second example we used was the detection of a flu-like outbreak. Flu-like symptoms are observed at the onset of many diseases, during bioterrorist attacks (e.g., smallpox, plague, anthrax), and for emerging diseases (e.g., severe acute respiratory syndrome, Chikungunya, flu pandemic). Numerous supervised techniques have proved successful at recognizing seasonal influenza outbreaks, and the goal of our technique is not to compete with those methods in this situation.

Now, imagine an outbreak of influenza-like syndromes occurring in August, or the onset of a new disease heralded by symptoms such as epistaxis or purpura; the supervised methods would be, by definition, unsuited to detecting them, while an unsupervised technique, like the one we proposed here, could work. Finally, the method is designed to have the highest sensitivity possible, in order to detect rare, unusual and unexpected signals. Achieving a good positive-predictive value would require, in addition, a "back room", where human experts would validate the signals based on appropriate field epidemiological investigations.

One caveat of the method is that it relies, by definition, on human observers. Hence, its effectiveness will depend upon the quality of these observers and their training. The system's quality and that of the epidemiology watcher could be measured with a research protocol based on simulated datasets. This approach has been successfully used in epidemiological surveillance to test new algorithms. Simulated datasets could be generated by adding a given number of codes of interest (e.g., those compatible with an anthrax attack) to an existing database (e.g., the present HED database). Epidemiology watchers would then be shown the successive monitors displaying the evolution of the images within the graphic reference frame, and asked to indicate whether and when they could identify an outbreak. Such a design would allow easy computation of the sensitivity and specificity of the system (as a function of the number of simulated codes added to the database). Standard statistical techniques would also allow assessment of intra- and interobserver variabilities.
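The evaluation protocol outlined above could be prototyped along these lines. This is a hedged sketch, not a validated design: `inject_codes` and `sensitivity_specificity` are hypothetical helper names, the codes used are arbitrary placeholders, and the watcher's answers are represented simply as one boolean per displayed sequence.

```python
import random


def inject_codes(baseline_codes, outbreak_code, n_added, seed=0):
    """Return a shuffled copy of the baseline with n_added simulated outbreak codes."""
    rng = random.Random(seed)  # fixed seed so a simulated dataset is reproducible
    mixed = list(baseline_codes) + [outbreak_code] * n_added
    rng.shuffle(mixed)
    return mixed


def sensitivity_specificity(truth, calls):
    """truth[i]: an outbreak was simulated; calls[i]: the watcher flagged one."""
    tp = sum(1 for t, c in zip(truth, calls) if t and c)
    fn = sum(1 for t, c in zip(truth, calls) if t and not c)
    tn = sum(1 for t, c in zip(truth, calls) if not t and not c)
    fp = sum(1 for t, c in zip(truth, calls) if not t and c)
    sens = tp / (tp + fn) if (tp + fn) else None  # undefined without positives
    spec = tn / (tn + fp) if (tn + fp) else None  # undefined without negatives
    return sens, spec


simulated = inject_codes(["R06", "R09", "R06"], "SIM", n_added=2)
sens, spec = sensitivity_specificity(
    truth=[True, True, False, False, True],
    calls=[True, False, False, True, True],
)
```

Repeating the computation for increasing values of `n_added` would give sensitivity as a function of the number of simulated codes, as suggested in the text.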


If one accepts that an epidemiological alert system must be able to detect unexpected events, then huge efforts must be made to develop unsupervised methods precisely designed with this effect in mind. Herein, we described an attempt in this direction. The use of visual perception that we advocate here is not the only possible solution. Unsupervised pattern recognition is a prolific field of research that takes advantage of the ever-increasing power of computers and the new methods of machine learning. Those will be new avenues for epidemiological research into efficient warning systems.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

A-JV proposed using the present visual method for outbreak detection. PL defined the visual graphic reference frame and selected the databases used to exemplify the method. A-JV and PL wrote the paper jointly.

Pre-publication history

The pre-publication history for this paper can be accessed here: