Data Collection and Analysis in Outbreaks
Outbreak investigations depend on careful data collection and analysis to find the source and spread of a disease. Without solid data, you can't distinguish a true risk factor from a coincidence. This section covers what data gets collected, how it's gathered, and the key analytical tools used to make sense of it all.
Data Collection in Outbreak Investigations
Types of outbreak investigation data
Investigators collect three broad categories of data. Each serves a different purpose in building the full picture of an outbreak.
Case information describes who is sick and how they're sick:
- Demographic data (age, sex, occupation) helps identify patterns across groups
- Clinical data (symptoms, onset date, duration) characterizes the illness itself
- Laboratory results confirm the diagnosis and identify the specific pathogen
- Travel history traces where cases may have been exposed
Exposure data focuses on what cases came into contact with before getting sick:
- Food consumption history can pinpoint contaminated items, sometimes down to a specific restaurant or meal
- Water sources (wells, municipal supply) may reveal contamination
- Animal contact suggests zoonotic transmission, such as from petting zoos or livestock
- Environmental exposures uncover non-food sources like swimming pools or contaminated soil
Environmental samples provide direct physical evidence of pathogens in the environment:
- Food samples from suspected items (e.g., raw chicken from a supplier)
- Water samples tested for contamination (e.g., coliform counts in a well)
- Surface swabs from high-touch areas like doorknobs or kitchen counters
- Air samples for airborne pathogens (e.g., in cooling towers)

Methods of outbreak data collection
How you collect data matters just as much as what you collect. Different methods suit different situations.
Interviews are the backbone of most outbreak investigations:
- Face-to-face interviews allow detailed questioning and let the interviewer observe nonverbal cues
- Telephone interviews reach geographically dispersed cases more quickly
- Structured questionnaires ensure every case gets asked the same questions in the same order, which keeps data consistent
Surveys help gather information from larger groups:
- Online surveys enable rapid data collection from large populations
- Paper-based surveys work better in areas with limited internet access
- Household surveys capture family-level exposure information, which is useful when transmission may occur within homes
Medical record reviews fill in clinical details that patients may not remember or report accurately:
- Electronic health records provide comprehensive patient histories
- Paper-based charts offer detailed clinical notes
- Laboratory reports confirm diagnoses and identify specific pathogens
Observational methods let investigators see conditions firsthand:
- Environmental assessments identify hazards at the outbreak setting (e.g., a restaurant kitchen)
- Workplace inspections reveal occupational exposure risks
Data Analysis in Outbreak Investigations

Tools for organizing outbreak data
Before any statistical analysis, raw data needs to be organized. Two tools are standard in virtually every outbreak investigation.
Line lists are spreadsheets where each row represents one case and each column represents a variable (name, age, symptom onset date, exposures, lab results, etc.). They let you sort and filter data quickly to spot patterns. For example, sorting by onset date might reveal a cluster of cases on the same day, suggesting a common exposure event.
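A line list can be sketched as a simple list of records, here in plain Python (the field names and dates are illustrative):

```python
from datetime import date

# Minimal line list: one dict per case, one key per variable.
line_list = [
    {"case_id": 3, "age": 34, "onset": date(2024, 6, 12), "ate_salad": True,  "lab_confirmed": True},
    {"case_id": 1, "age": 52, "onset": date(2024, 6, 11), "ate_salad": True,  "lab_confirmed": True},
    {"case_id": 2, "age": 8,  "onset": date(2024, 6, 14), "ate_salad": False, "lab_confirmed": False},
]

# Sort by onset date to look for clustering in time.
by_onset = sorted(line_list, key=lambda case: case["onset"])

# Filter to lab-confirmed cases who reported the suspect exposure.
confirmed_exposed = [c for c in by_onset if c["lab_confirmed"] and c["ate_salad"]]
```

In practice the same sorting and filtering is done in a spreadsheet or statistical package, but the structure (rows = cases, columns = variables) is identical.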
Epidemic curves (epi curves) are bar charts that plot the number of new cases (y-axis) over time (x-axis). They're one of the most useful visual tools in outbreak investigation because the shape of the curve tells you something about transmission:
- A point source outbreak (single shared exposure) produces a tight peak
- A propagated outbreak (person-to-person spread) shows successive waves
- A continuous source outbreak shows a plateau that persists until the source is removed
The time units on the x-axis depend on the disease's incubation period (hours for norovirus, weeks for hepatitis A).
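A rudimentary epi curve can be built by binning onset dates and counting cases per day; this sketch (with made-up dates) prints a text bar chart:

```python
from collections import Counter
from datetime import date, timedelta

# Onset dates pulled from a line list.
onsets = [date(2024, 6, 11), date(2024, 6, 12), date(2024, 6, 12),
          date(2024, 6, 12), date(2024, 6, 13), date(2024, 6, 15)]

# Bin cases by onset day.
counts = Counter(onsets)

# Print one row per day, one '#' per case.
start, end = min(onsets), max(onsets)
day = start
while day <= end:
    print(f"{day}  {'#' * counts[day]}")
    day += timedelta(days=1)
```

The tight peak on a single day in this toy data is the signature of a point source; for a real outbreak the bin width would match the disease's incubation period.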
Statistical analysis of outbreak data
Statistical analysis helps you move from "these cases seem related" to "this exposure is significantly associated with illness."
Attack rates measure how frequently disease occurs in a specific group:

Attack rate = (number of people ill ÷ number of people in the group) × 100

You calculate attack rates separately for exposed and unexposed groups. If 40 out of 100 people who ate the potato salad got sick (40% attack rate) versus 5 out of 80 who didn't eat it (6.25% attack rate), the potato salad looks suspicious.
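The potato salad numbers work out like this (a hypothetical helper function):

```python
def attack_rate(ill, total):
    """Attack rate = ill / total in the group, as a percentage."""
    return 100 * ill / total

ar_exposed = attack_rate(40, 100)    # ate the potato salad
ar_unexposed = attack_rate(5, 80)    # did not eat it

print(ar_exposed)    # 40.0
print(ar_unexposed)  # 6.25
```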
Relative risk (RR) directly compares those attack rates:
- RR = 1 means no difference in risk between groups
- RR > 1 means the exposed group has higher risk (e.g., RR of 2.5 means 2.5 times the risk)
- RR < 1 means the exposure may actually be protective
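Continuing the potato salad example, the relative risk is simply the ratio of the two attack rates:

```python
def relative_risk(ill_exposed, total_exposed, ill_unexposed, total_unexposed):
    """RR = risk in the exposed group / risk in the unexposed group."""
    risk_exposed = ill_exposed / total_exposed        # 40/100 = 0.40
    risk_unexposed = ill_unexposed / total_unexposed  # 5/80   = 0.0625
    return risk_exposed / risk_unexposed

rr = relative_risk(40, 100, 5, 80)
print(rr)  # 6.4: those who ate the salad had 6.4 times the risk
```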
Odds ratio (OR) is used in case-control studies, where you can't calculate true incidence rates because you start by selecting cases and controls rather than following a population over time. When the disease is rare, the OR approximates the RR.
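For a 2×2 table (exposed/unexposed × ill/well), the OR is the cross-product ratio. Using the same counts as the potato salad example (40 of 100 exposed ill, 5 of 80 unexposed ill):

```python
def odds_ratio(a, b, c, d):
    """OR for a 2x2 table:
        a = exposed & ill,   b = exposed & well
        c = unexposed & ill, d = unexposed & well
    """
    return (a * d) / (b * c)

or_value = odds_ratio(40, 60, 5, 75)
print(or_value)  # 10.0
```

Note that the OR (10.0) overstates the RR (6.4) here: with 45 of 180 people ill, the disease is too common in this cohort for the rare-disease approximation to hold.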
Chi-square test assesses whether an observed association between exposure and disease is statistically significant or could have occurred by chance alone.
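For a 2×2 table the chi-square statistic has a closed form; this sketch uses the shortcut formula without a continuity correction, again with the potato salad counts:

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square statistic for a 2x2 table (no continuity correction):
    chi2 = n * (a*d - b*c)**2 / [(a+b) * (c+d) * (a+c) * (b+d)]
    """
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

chi2 = chi_square_2x2(40, 60, 5, 75)
print(chi2)  # 27.0 -- far above 3.84, the 5% critical value at 1 degree of freedom
```

A value this large means an association this strong would essentially never arise by chance, so the exposure-disease link is statistically significant.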
Logistic regression handles more complex situations where multiple exposures might contribute to disease. It lets you evaluate several risk factors simultaneously while controlling for confounders.
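In practice you would fit logistic regression with a statistics package, but a from-scratch sketch shows the mechanics. With a single binary exposure, the fitted exposure coefficient equals the log odds ratio, so exp(b1) recovers the OR from the same 2×2 counts (40/60 exposed ill/well, 5/75 unexposed):

```python
import math

# Rebuild individual records from the 2x2 counts: (exposure, ill)
data = [(1, 1)] * 40 + [(1, 0)] * 60 + [(0, 1)] * 5 + [(0, 0)] * 75

def fit_logistic(data, lr=0.5, iters=10000):
    """Fit P(ill) = sigmoid(b0 + b1 * exposure) by gradient ascent
    on the log-likelihood."""
    b0 = b1 = 0.0
    n = len(data)
    for _ in range(iters):
        g0 = g1 = 0.0
        for x, y in data:
            p = 1 / (1 + math.exp(-(b0 + b1 * x)))
            g0 += (y - p)        # gradient w.r.t. intercept
            g1 += (y - p) * x    # gradient w.r.t. exposure coefficient
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

b0, b1 = fit_logistic(data)
print(round(math.exp(b1), 2))  # ~10.0, the odds ratio for the exposure
```

With several exposures you would add one coefficient per variable; each exp(coefficient) is then an odds ratio adjusted for the others, which is how confounding is controlled.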
Data management in outbreak investigations
Good analysis is only possible with good data management. Errors introduced during data entry or storage can distort your results.
Data management practices:
- A centralized database keeps all data accessible and consistent across the investigation team
- Standardized data entry protocols (e.g., consistent date formats like MM/DD/YYYY) prevent confusion
- Data cleaning procedures catch errors such as outliers, duplicates, and missing values
- Secure storage with encryption and regular backups protects sensitive patient information
Quality control measures catch mistakes before they affect analysis:
- Double data entry (two people enter the same data independently, then compare) reduces transcription errors
- Validation checks ensure values meet predefined criteria (e.g., age must be between 0 and 120)
- Consistency checks flag logically conflicting entries (e.g., pregnancy recorded for a male patient)
- Range checks flag implausible numerical values (e.g., body temperature of 50°C)
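These checks can be codified as simple validation rules run over each record (field names and limits here are illustrative):

```python
def validate_record(rec):
    """Return a list of quality-control flags for one case record."""
    flags = []
    # Range check: plausible age
    if not (0 <= rec.get("age", -1) <= 120):
        flags.append("implausible age")
    # Range check: plausible body temperature in Celsius
    temp = rec.get("temp_c")
    if temp is not None and not (30.0 <= temp <= 45.0):
        flags.append("implausible temperature")
    # Consistency check: pregnancy recorded for a male patient
    if rec.get("sex") == "M" and rec.get("pregnant"):
        flags.append("pregnancy recorded for male patient")
    return flags

print(validate_record({"age": 34, "temp_c": 38.2, "sex": "F"}))  # []
print(validate_record({"age": 150, "temp_c": 50.0, "sex": "M", "pregnant": True}))
```

Running rules like these at data-entry time, rather than just before analysis, catches errors while the interviewer can still re-check the source.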
Training and documentation keep the process reproducible:
- Standardized interview techniques reduce interviewer bias
- Codebooks define every variable so anyone reviewing the data later understands what was recorded and how
- Methodology documentation records collection procedures for future reference and for writing up the investigation
Regular data audits throughout the investigation help identify and correct errors early, and assess whether data collection is complete enough to support reliable conclusions.