Data Collection and Analysis in Outbreaks
Outbreak investigations depend on careful data collection and analysis to find the source and spread of a disease. Without solid data, you can't distinguish a true risk factor from a coincidence. This section covers what data gets collected, how it's gathered, and the key analytical tools used to make sense of it all.
Data Collection in Outbreak Investigations
Types of outbreak investigation data
Investigators collect three broad categories of data. Each serves a different purpose in building the full picture of an outbreak.
Case information describes who is sick and how they're sick:
- Demographic data (age, sex, occupation) helps identify patterns across groups
- Clinical data (symptoms, onset date, duration) characterizes the illness itself
- Laboratory results confirm the diagnosis and identify the specific pathogen
- Travel history traces where cases may have been exposed
Exposure data focuses on what cases came into contact with before getting sick:
- Food consumption history can pinpoint contaminated items, sometimes down to a specific restaurant or meal
- Water sources (wells, municipal supply) may reveal contamination
- Animal contact suggests zoonotic transmission, such as from petting zoos or livestock
- Environmental exposures uncover non-food sources like swimming pools or contaminated soil
Environmental samples provide direct physical evidence of pathogens in the environment:
- Food samples from suspected items (e.g., raw chicken from a supplier)
- Water samples tested for contamination (e.g., coliform counts in a well)
- Surface swabs from high-touch areas like doorknobs or kitchen counters
- Air samples for airborne pathogens (e.g., in cooling towers)

Methods of outbreak data collection
How you collect data matters just as much as what you collect. Different methods suit different situations.
Interviews are the backbone of most outbreak investigations:
- Face-to-face interviews allow detailed questioning and let the interviewer observe nonverbal cues
- Telephone interviews reach geographically dispersed cases more quickly
- Structured questionnaires ensure every case gets asked the same questions in the same order, which keeps data consistent
Surveys help gather information from larger groups:
- Online surveys enable rapid data collection from large populations
- Paper-based surveys work better in areas with limited internet access
- Household surveys capture family-level exposure information, which is useful when transmission may occur within homes
Medical record reviews fill in clinical details that patients may not remember or report accurately:
- Electronic health records provide comprehensive patient histories
- Paper-based charts offer detailed clinical notes
- Laboratory reports confirm diagnoses and identify specific pathogens
Observational methods let investigators see conditions firsthand:
- Environmental assessments identify hazards at the outbreak setting (e.g., a restaurant kitchen)
- Workplace inspections reveal occupational exposure risks
Data Analysis in Outbreak Investigations

Tools for organizing outbreak data
Before any statistical analysis, raw data needs to be organized. Two tools are standard in virtually every outbreak investigation.
Line lists are spreadsheets where each row represents one case and each column represents a variable (name, age, symptom onset date, exposures, lab results, etc.). They let you sort and filter data quickly to spot patterns. For example, sorting by onset date might reveal a cluster of cases on the same day, suggesting a common exposure event.
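A line list can be sketched as a simple list of records, here in plain Python (the field names and dates are illustrative):

```python
from datetime import date

# Minimal line list: one dict per case, one key per variable.
line_list = [
    {"case_id": 3, "age": 34, "onset": date(2024, 6, 12), "ate_salad": True,  "lab_confirmed": True},
    {"case_id": 1, "age": 52, "onset": date(2024, 6, 11), "ate_salad": True,  "lab_confirmed": True},
    {"case_id": 2, "age": 8,  "onset": date(2024, 6, 14), "ate_salad": False, "lab_confirmed": False},
]

# Sort by onset date to look for clustering in time.
by_onset = sorted(line_list, key=lambda case: case["onset"])

# Filter to lab-confirmed cases who reported the suspect exposure.
confirmed_exposed = [c for c in by_onset if c["lab_confirmed"] and c["ate_salad"]]
```

In practice the same sorting and filtering is done in a spreadsheet or statistical package, but the structure (rows = cases, columns = variables) is identical.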
Epidemic curves (epi curves) are bar charts that plot the number of new cases (y-axis) over time (x-axis). They're one of the most useful visual tools in outbreak investigation because the shape of the curve tells you something about transmission:
- A point source outbreak (single shared exposure) produces a tight peak
- A propagated outbreak (person-to-person spread) shows successive waves
- A continuous source outbreak shows a plateau that persists until the source is removed
The time units on the x-axis depend on the disease's incubation period (hours for norovirus, weeks for hepatitis A).
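A rudimentary epi curve can be built by binning onset dates and counting cases per day; this sketch (with made-up dates) prints a text bar chart:

```python
from collections import Counter
from datetime import date, timedelta

# Onset dates pulled from a line list.
onsets = [date(2024, 6, 11), date(2024, 6, 12), date(2024, 6, 12),
          date(2024, 6, 12), date(2024, 6, 13), date(2024, 6, 15)]

# Bin cases by onset day.
counts = Counter(onsets)

# Print one row per day, one '#' per case.
start, end = min(onsets), max(onsets)
day = start
while day <= end:
    print(f"{day}  {'#' * counts[day]}")
    day += timedelta(days=1)
```

The tight peak on a single day in this toy data is the signature of a point source; for a real outbreak the bin width would match the disease's incubation period.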
Statistical analysis of outbreak data
Statistical analysis helps you move from "these cases seem related" to "this exposure is significantly associated with illness."
Attack rates measure how frequently disease occurs in a specific group:

Attack rate = (number of people ill ÷ number of people in the group) × 100

You calculate attack rates separately for exposed and unexposed groups. If 40 out of 100 people who ate the potato salad got sick (40% attack rate) versus 5 out of 80 who didn't eat it (6.25% attack rate), the potato salad looks suspicious.
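The potato salad numbers work out like this (a hypothetical helper function):

```python
def attack_rate(ill, total):
    """Attack rate = ill / total in the group, as a percentage."""
    return 100 * ill / total

ar_exposed = attack_rate(40, 100)    # ate the potato salad
ar_unexposed = attack_rate(5, 80)    # did not eat it

print(ar_exposed)    # 40.0
print(ar_unexposed)  # 6.25
```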
Relative risk (RR) directly compares those attack rates:
- RR = 1 means no difference in risk between groups
- RR > 1 means the exposed group has higher risk (e.g., RR of 2.5 means 2.5 times the risk)
- RR < 1 means the exposure may actually be protective
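Continuing the potato salad example, the relative risk is simply the ratio of the two attack rates:

```python
def relative_risk(ill_exposed, total_exposed, ill_unexposed, total_unexposed):
    """RR = risk in the exposed group / risk in the unexposed group."""
    risk_exposed = ill_exposed / total_exposed        # 40/100 = 0.40
    risk_unexposed = ill_unexposed / total_unexposed  # 5/80   = 0.0625
    return risk_exposed / risk_unexposed

rr = relative_risk(40, 100, 5, 80)
print(rr)  # 6.4: those who ate the salad had 6.4 times the risk
```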
Odds ratio (OR) is used in case-control studies, where you can't calculate true incidence rates because you start by selecting cases and controls rather than following a population over time. When the disease is rare, the OR approximates the RR.
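For a 2×2 table (exposed/unexposed × ill/well), the OR is the cross-product ratio. Using the same counts as the potato salad example (40 of 100 exposed ill, 5 of 80 unexposed ill):

```python
def odds_ratio(a, b, c, d):
    """OR for a 2x2 table:
        a = exposed & ill,   b = exposed & well
        c = unexposed & ill, d = unexposed & well
    """
    return (a * d) / (b * c)

or_value = odds_ratio(40, 60, 5, 75)
print(or_value)  # 10.0
```

Note that the OR (10.0) overstates the RR (6.4) here: with 45 of 180 people ill, the disease is too common in this cohort for the rare-disease approximation to hold.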
Chi-square test assesses whether an observed association between exposure and disease is statistically significant or could have occurred by chance alone.
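For a 2×2 table the chi-square statistic has a closed form; this sketch uses the shortcut formula without a continuity correction, again with the potato salad counts:

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square statistic for a 2x2 table (no continuity correction):
    chi2 = n * (a*d - b*c)**2 / [(a+b) * (c+d) * (a+c) * (b+d)]
    """
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

chi2 = chi_square_2x2(40, 60, 5, 75)
print(chi2)  # 27.0 -- far above 3.84, the 5% critical value at 1 degree of freedom
```

A value this large means an association this strong would essentially never arise by chance, so the exposure-disease link is statistically significant.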
Logistic regression handles more complex situations where multiple exposures might contribute to disease. It lets you evaluate several risk factors simultaneously while controlling for confounders.
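In practice you would fit logistic regression with a statistics package, but a from-scratch sketch shows the mechanics. With a single binary exposure, the fitted exposure coefficient equals the log odds ratio, so exp(b1) recovers the OR from the same 2×2 counts (40/60 exposed ill/well, 5/75 unexposed):

```python
import math

# Rebuild individual records from the 2x2 counts: (exposure, ill)
data = [(1, 1)] * 40 + [(1, 0)] * 60 + [(0, 1)] * 5 + [(0, 0)] * 75

def fit_logistic(data, lr=0.5, iters=10000):
    """Fit P(ill) = sigmoid(b0 + b1 * exposure) by gradient ascent
    on the log-likelihood."""
    b0 = b1 = 0.0
    n = len(data)
    for _ in range(iters):
        g0 = g1 = 0.0
        for x, y in data:
            p = 1 / (1 + math.exp(-(b0 + b1 * x)))
            g0 += (y - p)        # gradient w.r.t. intercept
            g1 += (y - p) * x    # gradient w.r.t. exposure coefficient
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

b0, b1 = fit_logistic(data)
print(round(math.exp(b1), 2))  # ~10.0, the odds ratio for the exposure
```

With several exposures you would add one coefficient per variable; each exp(coefficient) is then an odds ratio adjusted for the others, which is how confounding is controlled.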
Data management in outbreak investigations
Good analysis is only possible with good data management. Errors introduced during data entry or storage can distort your results.
Data management practices:
- A centralized database keeps all data accessible and consistent across the investigation team
- Standardized data entry protocols (e.g., consistent date formats like MM/DD/YYYY) prevent confusion
- Data cleaning procedures catch errors such as outliers, duplicates, and missing values
- Secure storage with encryption and regular backups protects sensitive patient information
Quality control measures catch mistakes before they affect analysis:
- Double data entry (two people enter the same data independently, then compare) reduces transcription errors
- Validation checks ensure values meet predefined criteria (e.g., age must be between 0 and 120)
- Consistency checks flag logically conflicting entries (e.g., pregnancy recorded for a male patient)
- Range checks flag implausible numerical values (e.g., body temperature of 50°C)
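These checks can be codified as simple validation rules run over each record (field names and limits here are illustrative):

```python
def validate_record(rec):
    """Return a list of quality-control flags for one case record."""
    flags = []
    # Range check: plausible age
    if not (0 <= rec.get("age", -1) <= 120):
        flags.append("implausible age")
    # Range check: plausible body temperature in Celsius
    temp = rec.get("temp_c")
    if temp is not None and not (30.0 <= temp <= 45.0):
        flags.append("implausible temperature")
    # Consistency check: pregnancy recorded for a male patient
    if rec.get("sex") == "M" and rec.get("pregnant"):
        flags.append("pregnancy recorded for male patient")
    return flags

print(validate_record({"age": 34, "temp_c": 38.2, "sex": "F"}))  # []
print(validate_record({"age": 150, "temp_c": 50.0, "sex": "M", "pregnant": True}))
```

Running rules like these at data-entry time, rather than just before analysis, catches errors while the interviewer can still re-check the source.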
Training and documentation keep the process reproducible:
- Standardized interview techniques reduce interviewer bias
- Codebooks define every variable so anyone reviewing the data later understands what was recorded and how
- Methodology documentation records collection procedures for future reference and for writing up the investigation
Regular data audits throughout the investigation help identify and correct errors early, and assess whether data collection is complete enough to support reliable conclusions.