Regression models let you analyze relationships between variables while controlling for confounders. In epidemiology, three types come up constantly: linear regression for continuous outcomes, logistic regression for binary outcomes, and survival analysis for time-to-event data. Each produces a different measure of association (coefficients, odds ratios, hazard ratios), and knowing when to use which model is essential for interpreting study results.
Linear Regression for Continuous Outcomes
Linear Regression Modeling
Linear regression models the relationship between a continuous outcome (dependent variable) and one or more predictors (independent variables). You'd use it when your outcome is something measurable on a continuous scale, like blood pressure, BMI, or cholesterol level.
The simple linear regression equation is:

Y = β₀ + β₁X + ε

- Y = the outcome variable
- X = the predictor variable
- β₀ = the intercept (the predicted value of Y when X = 0)
- β₁ = the slope, or regression coefficient (the predicted change in Y for each one-unit increase in X)
- ε = the error term (random variation not explained by the model)
Multiple linear regression extends this to include two or more predictors:

Y = β₀ + β₁X₁ + β₂X₂ + … + βₖXₖ + ε
This is where the epidemiologic value really kicks in. By adding variables to the model, you can adjust for potential confounders. For example, if you're studying the effect of physical activity on systolic blood pressure, you can include age and smoking status as additional predictors to isolate the independent effect of activity level.
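To make the adjustment concrete, here is a minimal simulation sketch of that activity-and-blood-pressure scenario. The data, coefficients, and variable names are all hypothetical; the point is that the crude (unadjusted) coefficient for activity is distorted by age, while the adjusted model recovers something close to the true effect built into the simulation:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical simulated data: age confounds the activity -> blood pressure link.
age = rng.uniform(30, 70, n)
activity = 10 - 0.1 * age + rng.normal(0, 1, n)               # older people exercise less
sbp = 110 + 0.8 * age - 2.0 * activity + rng.normal(0, 5, n)  # true activity effect: -2.0

def ols(predictors, y):
    """Least-squares coefficients, with an intercept column prepended."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

crude = ols([activity], sbp)          # ignores age: activity coefficient is biased
adjusted = ols([activity, age], sbp)  # controls for age: close to the true -2.0
```

The crude coefficient absorbs part of the age effect (older participants have both lower activity and higher blood pressure), so it overstates the benefit of activity; adding age to the model isolates the independent effect.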
Assumptions and Estimation
Linear regression relies on four key assumptions. If these are violated, your results may be unreliable:
- Linearity — The relationship between each predictor and the outcome is linear. A scatterplot of residuals vs. fitted values should show no clear pattern.
- Independence — Observations are independent of each other. This can be violated in clustered data (e.g., patients within the same hospital).
- Homoscedasticity — The variance of residuals is constant across all levels of the predictors. If the spread of residuals fans out as predicted values increase, you have heteroscedasticity.
- Normality of residuals — The residuals (not the raw data) should be approximately normally distributed. Check this with a Q-Q plot or histogram of residuals.
Coefficients are estimated using ordinary least squares (OLS), which finds the line that minimizes the sum of squared residuals.
The coefficient of determination (R²) tells you what proportion of the variance in the outcome is explained by your model. It ranges from 0 to 1. An R² of 0.35 means the predictors explain 35% of the variation in the outcome. In epidemiology, R² values are often modest because health outcomes are influenced by many unmeasured factors.
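The OLS fit and R² can be computed directly from their definitions. This sketch uses made-up numbers (blood pressure vs. weekly exercise hours, purely illustrative) and solves the least-squares problem with NumPy, then computes R² as 1 minus the ratio of residual to total sum of squares:

```python
import numpy as np

# Toy data (hypothetical): systolic BP vs. weekly exercise hours.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([140.0, 138.0, 133.0, 132.0, 128.0, 126.0])

# OLS: find the intercept and slope minimizing the sum of squared residuals.
X = np.column_stack([np.ones_like(x), x])
(b0, b1), *_ = np.linalg.lstsq(X, y, rcond=None)

fitted = b0 + b1 * x
resid = y - fitted

# R^2 = 1 - SS_residual / SS_total
r2 = 1 - np.sum(resid**2) / np.sum((y - y.mean())**2)
```

With these numbers the slope is negative (each extra hour of exercise predicts lower BP) and R² is high, as expected for nearly linear toy data.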
Logistic Regression for Binary Outcomes
Logistic Regression Modeling
When your outcome is binary (yes/no, disease/no disease, dead/alive), linear regression won't work because it can predict values outside the 0–1 range. Logistic regression solves this by modeling the probability of the outcome occurring.
The model uses the logistic function to keep predicted probabilities between 0 and 1:

p = 1 / (1 + e^−(β₀ + β₁X₁ + … + βₖXₖ))

Equivalently, the log odds (logit) is linear in the predictors: ln(p / (1 − p)) = β₀ + β₁X₁ + … + βₖXₖ.
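A short sketch makes the bounding behavior visible. The coefficients below are invented for illustration (they are not from any fitted model); the point is that no matter how extreme the linear predictor gets, the logistic function returns a value strictly between 0 and 1:

```python
import math

def logistic(z):
    """Map any real-valued linear predictor to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients: intercept -6.0, +0.05 per year of age, +0.1 per BMI unit.
def p_diabetes(age, bmi):
    return logistic(-6.0 + 0.05 * age + 0.1 * bmi)

# Probabilities stay between 0 and 1 and rise with the risk factors.
probs = [p_diabetes(a, b) for a, b in [(30, 22), (55, 30), (80, 40)]]
```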
Common epidemiologic examples include modeling the probability of developing diabetes based on age, BMI, and family history, or modeling the probability of death within 30 days of hospital admission.
Odds Ratios and Model Fit
The raw coefficients (β) in logistic regression represent the change in the log odds of the outcome for a one-unit increase in the predictor, holding other variables constant. Log odds aren't intuitive, so we exponentiate them.
The odds ratio (OR) for a predictor is calculated as OR = e^β. This is the key output you'll interpret:
- OR = 1: no association between the predictor and the outcome
- OR > 1: the predictor is associated with higher odds of the outcome
- OR < 1: the predictor is associated with lower odds of the outcome
For example, if a logistic regression of lung cancer on smoking status yields β = 1.39 for smoking, then OR = e^1.39 ≈ 4. This means smokers have 4 times the odds of lung cancer compared to non-smokers, after adjusting for other variables in the model.
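The coefficient-to-OR conversion is a one-liner; this sketch (with hypothetical coefficients) shows it in both directions, including a protective factor whose OR falls below 1:

```python
import math

# A coefficient on the log-odds scale exponentiates to an odds ratio.
beta_smoking = math.log(4)   # hypothetical adjusted coefficient for smoking
or_smoking = math.exp(beta_smoking)   # back to 4.0

# Going the other way: a negative coefficient means lower odds.
beta_protective = -0.51      # hypothetical coefficient for a protective factor
or_protective = math.exp(beta_protective)  # about 0.60, i.e. ~40% lower odds
```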
Model fit for logistic regression is assessed differently than for linear regression since there's no straightforward R²:
- Likelihood ratio test — Compares the fit of your model to a model with no predictors
- Hosmer-Lemeshow test — Tests whether predicted probabilities match observed frequencies across groups
- Pseudo-R² values (e.g., McFadden's, Cox and Snell) — Approximate the concept of R² but should be interpreted cautiously; they don't have the same "proportion of variance explained" meaning
- Wald test — Tests whether individual coefficients are significantly different from zero

Survival Analysis for Time-to-Event Data
Kaplan-Meier Method and Log-Rank Test
Survival analysis is used when your outcome is the time until an event occurs, such as time from diagnosis to death, time from treatment to disease recurrence, or time from enrollment to dropout. What makes this different from simply measuring "did the event happen?" is that it accounts for when it happened and handles incomplete follow-up.
The Kaplan-Meier (KM) method is a non-parametric way to estimate the survival function S(t), which gives the probability of surviving (remaining event-free) beyond time t. The KM curve is a step function that drops at each time point when an event occurs. Steeper drops indicate periods of higher event rates.
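The KM estimator is simple enough to sketch by hand: at each observed event time, multiply the running survival probability by (1 − deaths / number at risk), and remove both events and censored observations from the risk set as time passes. The tiny dataset below is hypothetical (times in months, event=1 for an observed event, event=0 for right-censoring):

```python
# A minimal Kaplan-Meier sketch on hypothetical data.
times  = [2, 3, 3, 5, 7, 8, 8, 10]
events = [1, 0, 1, 1, 0, 1, 1, 0]

def kaplan_meier(times, events):
    """Return (time, S(t)) steps; S drops only at observed event times."""
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    s = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        d = sum(e for tt, e in data if tt == t)        # events at time t
        removed = sum(1 for tt, _ in data if tt == t)  # events + censorings leave the risk set
        if d > 0:
            s *= 1 - d / n_at_risk
            curve.append((t, s))
        n_at_risk -= removed
        i += removed
    return curve

curve = kaplan_meier(times, events)
# Steps at t = 2, 3, 5, 8; censored times (7, 10) shrink the risk set but cause no drop.
```

Note how the participants censored at months 7 and 10 still contribute: they stay in the risk set up to their censoring time, which is exactly the information a naive binary analysis would throw away.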
The log-rank test compares survival curves between two or more groups (e.g., treatment vs. placebo). It tests the null hypothesis that there's no difference in survival between groups across the entire follow-up period. It's a good starting point, but it doesn't adjust for confounders.
Cox Proportional Hazards Model and Censoring
The Cox proportional hazards model is the regression equivalent for survival data. It lets you examine the effect of multiple predictors on survival time simultaneously, which means you can adjust for confounders just like in linear or logistic regression.
The key output is the hazard ratio (HR), which represents the relative rate of the event occurring at any given time:
- HR = 1: no difference in hazard between groups
- HR > 1: the predictor is associated with a higher rate of the event (worse survival)
- HR < 1: the predictor is associated with a lower rate of the event (better survival)
For example, if a Cox model comparing a new drug to placebo yields HR = 0.60, the treatment group has a 40% lower hazard of death at any given time compared to placebo.
The Cox model is called semi-parametric because it makes no assumption about the shape of the baseline hazard function, but it does assume that hazard ratios remain constant over time. This is the proportional hazards assumption, and it needs to be checked (using Schoenfeld residuals or log-log plots).
Censoring is a defining feature of survival analysis. It occurs when you don't observe the exact event time for some participants:
- Right-censoring (most common) — The study ends before the participant experiences the event, or they're lost to follow-up. You know they survived at least until a certain time, but not how much longer.
- Left-censoring — The event already occurred before the participant entered the study, so the exact time is unknown.
Survival analysis methods are specifically designed to handle censored observations without discarding them, which is a major advantage over simply analyzing binary outcomes at a fixed time point.
Regression Model Interpretation and Diagnostics
Coefficient Interpretation and Significance
Interpreting coefficients correctly depends on the type of regression:
| Model | Coefficient Meaning | Exponentiated? |
|---|---|---|
| Linear | Change in outcome per one-unit increase in predictor | No |
| Logistic | Change in log odds per one-unit increase in predictor | Yes → gives OR |
| Cox | Change in log hazard per one-unit increase in predictor | Yes → gives HR |
In all three models, the interpretation assumes other variables in the model are held constant. This "adjusted" interpretation is what makes regression so valuable for controlling confounding.
Statistical significance of individual coefficients is assessed using:
- t-tests in linear regression
- Wald tests in logistic regression and Cox models
In each case, the p-value tests the null hypothesis that β = 0 (no association).
Always report confidence intervals alongside p-values. A 95% CI for an odds ratio that ranges from 0.9 to 5.2 tells you much more than just knowing that p > 0.05. The CI gives you the range of plausible values for the true effect.
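For ratio measures like ORs and HRs, the Wald CI is built on the log scale and then exponentiated: exp(β ± 1.96·SE). The coefficient and standard error below are hypothetical, chosen to roughly reproduce the 0.9-to-5.2 interval mentioned above:

```python
import math

# Hypothetical logistic-regression output: coefficient and its standard error.
beta, se = 0.77, 0.45

or_point = math.exp(beta)               # point estimate of the OR
ci_low   = math.exp(beta - 1.96 * se)   # lower 95% limit, just under 1
ci_high  = math.exp(beta + 1.96 * se)   # upper 95% limit, above 5
```

Because the interval is symmetric on the log scale, it is asymmetric around the OR itself, which is why an OR CI of (0.9, 5.2) is perfectly normal.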
Model Fit and Diagnostic Tests
Different models require different diagnostics, but the core questions are the same: Does the model fit the data well? Are the assumptions met? Are any observations unduly influencing the results?
For linear regression:
- Residual plots (residuals vs. fitted values, Q-Q plots) to check linearity, homoscedasticity, and normality
- R² and adjusted R² for overall model fit
For logistic regression:
- Hosmer-Lemeshow test for calibration
- Pseudo-R² values and likelihood ratio tests for model fit
- ROC curves and area under the curve (AUC) for discrimination
For Cox regression:
- Schoenfeld residuals to test the proportional hazards assumption
- Martingale residuals to check functional form of continuous predictors
Across all models:
- Variance inflation factor (VIF) detects multicollinearity. A VIF above 5–10 suggests that predictors are too highly correlated, which inflates standard errors and makes individual coefficients unstable.
- Cook's distance identifies influential observations, meaning individual data points that disproportionately affect the model's results. High Cook's distance values warrant closer inspection.
- Cross-validation (e.g., k-fold) tests whether the model generalizes to new data or is overfitting to the sample at hand.
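The VIF definition (1 / (1 − R²) from regressing each predictor on all the others) can be sketched directly. The simulated predictors below are hypothetical: x1 and x2 are independent, while x3 is nearly a copy of x1, so x1 and x3 should show inflated VIFs while x2 stays near 1:

```python
import numpy as np

def vif(X):
    """VIF for each column of predictor matrix X: 1 / (1 - R^2) from
    regressing that column on all the others (intercept included)."""
    n, k = X.shape
    out = []
    for j in range(k):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1 - resid.var() / y.var()
        out.append(1 / (1 - r2))
    return out

rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)          # independent of x1 -> VIF near 1
x3 = x1 + rng.normal(0, 0.1, 200)  # nearly duplicates x1 -> large VIF
vifs = vif(np.column_stack([x1, x2, x3]))
```

Against the 5–10 rule of thumb from above, x1 and x3 would clearly be flagged; in practice you would drop one of them or combine them into a single variable.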