Influenza is a major global health concern that accounts annually for tens of thousands of deaths in North America [1, 2] and approximately 400,000 deaths worldwide. Much effort has been invested in mathematical modeling of influenza dynamics in order to design improved control strategies [4, 5] and to estimate their economic impacts. A fundamental limitation of such studies is a lack of quantitative information concerning the relationships between influenza infection and health care system utilization. In particular, how should we expect changes in influenza transmission dynamics induced by control measures [5, 7, 8] to translate into (i) changes in influenza-associated visits to physicians’ offices or hospitals and (ii) health care system burdens from influenza-associated hospitalizations?
Here we reduce this knowledge gap by exploiting an unusual data set made available to us through a special agreement with Statistics Canada. These data include linked health records over a number of years, allowing us to estimate the distributions of times from influenza diagnosis to hospital admission, and hospital admission to discharge or death.
Three MOHLTC administrative databases were available to us: OHIP Ontario Health Insurance Program claims, as recorded in Medical Services Files. This database records all fee-for-service billings for physician services in Ontario; for each procedure, there is a service code and a diagnostic code. DAD Discharge Abstract Database, which contains hospital inpatient records (all data years) and day procedure records (data years prior to 2003/04), including diagnosis codes (one of which is flagged as “most responsible”). Inpatient records include the date of admission and the date of discharge or death. NACRS National Ambulatory Care Reporting System. Since 2003/04, Ontario hospitals have reported day procedure events to NACRS rather than the DAD.
All OHIP records included an EHIN, but a small number (≈2%) of DAD records were excluded from our analysis because an EHIN was not available (likely because the individuals in question were not Ontario residents at the time of hospitalization). For hospital events, we restricted attention to inpatients, so the NACRS database was not used.
Access was provided to data from 1994 to 2010. However, format changes and the difficulty of working with large volumes of data using the computing infrastructure available in the RDC made it impractical to analyze all the data. The data years used in the present study were 2006/2007, 2007/2008, 2008/2009, and 2009/2010. Thus, the data covered the three influenza seasons preceding the 2009 pandemic and the pandemic year itself. The data file format was identical for these four years.
We defined the start and end dates of data years via the “influenza year”, running from 1 April to 30 March of the next year. Thus, for example, “data year 2006” refers to the time period from 1 April 2006 to 30 March 2007. This definition captures the typical seasonal influenza epidemic in the northern hemisphere from November to March.
Relevant diagnosis codes
We restricted attention to records with diagnosis codes corresponding to influenza, pneumonia or other diseases of the respiratory system (Table 1). Ontario began to implement ICD10 (the tenth revision of the International Classification of Diseases) before our study period; however, the diagnosis codes recorded in the administrative databases used ICD9 for the duration of the study period. All records for individuals with at least one record with a code in Table 1 were included.
Linking the data by EHIN made it possible to obtain the anonymized individual-level temporal sequences of OHIP claims, hospital admissions, hospital discharges, and death, for Ontario residents during the data years investigated. First we filtered individuals with at least one influenza-like illness diagnosis in the OHIP database and aggregated to the earliest date within an influenza year (individuals with multiple claims in the full study period may be counted multiple times, but will only be counted once per influenza year). Then, we filtered all individuals with influenza-like illnesses that had a hospitalization record in DAD. We were therefore able to establish relationships between physician visits associated with influenza-like illness and hospitalization (even if influenza or pneumonia was listed in only one record of the sequence for a given patient). Note that individuals who have influenza-like illnesses and present with pneumonia are often given a diagnosis of pneumonia rather than influenza.
In principle, it was straightforward to construct the desired distributions, i.e.,
(i)the distribution of times from physician diagnosis to hospitalization;(ii)the distribution of times from hospitalization to discharge or death.
The primary practical challenge that we faced was working with the large event-data files.
In order to contribute to understanding of seasonal patterns of respiratory illnesses, we aggregated events according to diagnosis and billing date. Because multiple billings are frequently associated with the same individual during a single illness, a given individual is likely to contribute to counts in multiple categories on multiple days during any given illness. Thus, the aggregated counts provide a time series of health service utilization (which counts the frequency of each diagnosis observed per day) rather than an epidemic time series (which would count each case of a disease once).
We used SAS to filter, link and extract data from the MOHLTC administrative databases available to us, and R (version 3.5.3) for all analyses. R packages glmmTMB (version 0.2.3,) and depmixS4 (version 1.3-5,) were used to fit Gamma and finite mixture Poisson distributions to the delay distributions (see Additional file 1).
Over the four data years (2006–2010), there were a total of 1.34 billion OHIP billing records. Of these, 31.9 million (2.39%) contained at least one of the ARI or P&I diagnosis codes listed in Table 1; after restricting attention to this subset, there were 7.60 million unique individuals in the database. During the same period, 4.27 million hospital inpatient stays were recorded in the DAD.
Most individuals were associated with a small number of billing records. Table 2 provides a quantile summary of the number of billing records (with diagnosis codes listed in Table 1) per person. Given the size of the data set (7.6 million individuals), the upper 2.5% tail in Table 2 (individuals with >17 billing records in the data set) corresponds to approximately 190,000 individuals.
Table 3 lists precise counts of records and patients, broken down by diagnoses of pneumonia or influenza (not for all diagnoses in Table 1) and by year.
From Table 3 it is straightforward to estimate various outcome probabilities given specific events having occurred during the focal data year (Table 4). For example, P(Hospitalized | P&I diagnosis) is defined as the probability that an individual who was diagnosed with pneumonia or influenza during a given influenza year was hospitalized during that year; the reason for hospitalization may not have been related to pneumonia or influenza. Similarly, P(P&I diagnosis | Hospitalized) denotes the probability that someone who was hospitalized during the influenza year was diagnosed with pneumonia or influenza that year; the P&I diagnosis might have occurred before, during or after the hospitalization. The one exception is the final row of Table 4, where P(Died in hospital | Hospitalized) denotes the probability that someone who was hospitalized during a given influenza year died in hospital that year, which could have happened only after hospitalization (though potentially after multiple distinct hospitalizations); note that P(Hospitalized | Died in hospital) would not be exactly 1 because some individuals who died in hospital in a given influenza year were hospitalized before the beginning of that influenza year.
Service billing by day of week
The daily time series of billings show a small-amplitude weekly oscillation. Figure 1 shows that most acute respiratory illnesses have a characteristic weekly pattern of service billings (generally decreasing from Monday through Sunday, with a slight dip on Wednesday and a peak on Thursday). Pneumonia stands out as having smaller fluctuations over the course of the week.
Seasonal patterns by week
Figure 2 shows the weekly billings time series over each of the four influenza years in our data set. All the diagnoses display an annual oscillation. Pneumonia and tonsilitis have the least pronounced cycle while influenza has the most pronounced cycle. All diagnoses, including pneumonia, show a stronger seasonal peak during the 2009–2010 influenza season, which was dominated by pandemic H1N1. The dips just before the New Year are presumably driven by the holiday season (and associated delays in processing records), as is the case for well-known childhood disease time series.
We note that virological testing practices for influenza changed in Ontario during the 2009 pandemic. While fairly systematic testing was conducted for the first few months of the outbreak, testing restrictions were imposed in mid June 2009. The extent to which this policy change influenced the number of OHIP billing diagnoses is not clear.
Time delay distributions
Finally, we exploited the linked data to measure the elapsed time between various health-based outcomes. The resulting frequency distributions of these delays are shown in Fig. 3. We refer to the time of the first relevant diagnosis (influenza, pneumonia, or “other respiratory”) as the “service time”. The five panels of Fig. 3 show the frequency distributions of time delays for:
service to hospital admission,
hospital admission to hospital discharge,hospital admission to death,service to hospital discharge,service to death.
Figure 3 reveals that short delays (e.g., <30 days) are most common overall; the exceptions are a very long tail in the distribution of delays from service to admission, and a roughly flat distribution for influenza-related service to mortality (for which sample sizes may be too small to draw confident conclusions). Delay distributions are similar for influenza and “other respiratory” diseases; pneumonia generally has a longer time from admission to discharge or mortality. The latter pattern may be driven by the more substantial negative tail of delays from service to admission, which represents people who were admitted to hospital during the focal season but were first billed for services relating to one of the diagnostic codes listed in Table 1 some time after their admission.
Normalizing the distributions in Fig. 3 by the total counts in the diagnostic category yields the estimated probability densities in Fig. 4; these densities are normalized, and shown, on a 30 day timescale rather than the 100 day timescale presented in Fig. 3 (in order to focus on pairs of events that are more likely to be causally connected). The leveling-off of the influenza distribution at a higher value than the pneumonia and other respiratory diagnoses in most of the plots (admission to discharge, admission to mortality) is an artifact of the smaller number of influenza cases; if the total number of cases is 10,000, then the non-zero values of the distribution cannot go below 10−4.
Figure 5 summarizes and facilitates comparison of the various delay distributions. Delays corresponding to the 2.5%, 25%, 50%, 75%, 97.5% quantiles are interpolated from the cumulative delay distributions.
In addition to summarizing the delay distributions by their quantiles, we also fitted parametric distributional models (see Additional file 1). Gamma models, using only two parameters (shape and scale), did a poor job fitting the delay distributions, which tend to have both sharp peaks and heavy tails relative to a Gamma distribution. We also fitted four-component finite mixtures of Poisson distributions; with 7 parameters (four rate parameters and three parameters determining the relative weights of the mixture components) these captured much more of the overall pattern, but still left some features unexplained. Other models, such as a finite mixture of negative binomial distributions, may provide better summaries; how complex a summary is most useful (e.g. a simple Gamma vs. a many-parameter finite mixture) will depend on particular applications. Additional file 1 provides the full details of the delay distributions fits. Full delay data are avalible at https://zenodo.org/record/3340201 for researchers wishing to derive other summaries of the data.
Our study was restricted to the Canadian province of Ontario, a region where the population is primarily concentrated in a large megalopolis (the Greater Toronto Area). The extent to which the delay distributions we have estimated can be applied to regions other than Ontario is not known. However, it is reasonable to expect that they are similar in other parts of Canada, and in many other industrialized countries. Ideally, it would be better to repeat our study using data for particular populations of interest.
Forecasting of influenza epidemics and associated patterns of health care system utilization can be conducted with greater confidence using the empirically estimated delay distributions that we have presented here (Figs. 3, 4 and 5; data in Additional file 1). While quantifying delay distributions was our primary goal, we were able to obtain a number of statistics that are of independent interest (Table 4). For example, approximately 5% of people admitted to hospital died there, and if they died in hospital then the probability that they had been diagnosed with influenza or pneumonia within the year was 10%.
The primary challenge in conducting the kind of research presented here is the size of the data sets and the requirement to analyze them in a secure environment. As computing power and the sophistication and availability of software for analyzing “big data” increases, much more extensive studies of linked individual-level health-related databases should become feasible.