Logistic Regression for Binary Outcomes
Logistic regression predicts binary outcomes (yes/no, 0/1) by modeling the probability of an event occurring based on one or more predictor variables. Unlike ordinary linear regression, it doesn't try to predict a continuous number. Instead, it maps predictors to a probability between 0 and 1 through a logistic (sigmoid) function.
This technique shows up constantly in applied work: predicting whether a patient develops a disease, whether a customer churns, or whether a loan defaults. Understanding it well also sets you up for Poisson regression and other generalized linear models covered later in this unit.
Overview and Applications
A binary outcome variable has exactly two categories (success/failure, 0/1). Logistic regression models the probability that an observation falls into one of those categories, given a set of predictors.
The key idea is that the relationship between predictors and the outcome probability is not a straight line. Instead, it follows a sigmoid curve that naturally stays bounded between 0 and 1. This makes logistic regression far more appropriate for probability modeling than ordinary least squares, which could produce predicted probabilities below 0 or above 1.
Logistic regression is widely used in medical research (e.g., predicting disease presence), marketing (e.g., predicting purchase behavior), credit scoring, and the social sciences.
Model Characteristics and Assumptions
Logistic regression can handle both continuous and categorical predictor variables. Its core assumptions are:
- Independence of observations. Each data point should be independent of the others.
- No severe multicollinearity. Predictor variables shouldn't be highly correlated with each other, as this inflates standard errors and makes coefficient estimates unstable.
- Linearity in the logit. The log-odds of the outcome must be linearly related to the predictor variables. This is the model's linearity assumption; it does not require a linear relationship between the predictors and the probability itself.
Note that logistic regression does not assume normally distributed errors or constant variance, which distinguishes it from ordinary linear regression.
Logistic Regression Equation Components

The Logistic Regression Equation
The model works by linking predictors to the log-odds (logit) of the outcome. Here's how the pieces fit together:
Odds are the ratio of the probability of the event occurring to the probability of it not occurring:
odds = p / (1 − p)
The logit is the natural logarithm of the odds:
logit(p) = ln(p / (1 − p))
The logistic regression equation models this logit as a linear combination of predictors:
ln(p / (1 − p)) = β₀ + β₁x₁ + β₂x₂ + … + βₖxₖ
where p is the probability of the outcome, β₀ is the intercept, and β₁, …, βₖ are the regression coefficients for predictors x₁, …, xₖ.
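To make the pieces concrete, here is a small Python sketch for a hypothetical two-predictor model. The coefficient values are made up for illustration; the point is that exponentiating the logit recovers the odds, and the odds determine the probability.

```python
import math

# Hypothetical coefficients for a two-predictor model
# (values are made up for illustration, not from fitted data).
b0, b1, b2 = -1.5, 0.8, 0.3

def log_odds(x1, x2):
    """The logit: a linear combination of the predictors."""
    return b0 + b1 * x1 + b2 * x2

# The logit is the natural log of the odds, so exponentiating it
# recovers the odds, and the odds determine the probability.
z = log_odds(1.0, 2.0)
odds = math.exp(z)
p = odds / (1 + odds)
```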
Interpreting Coefficients and Odds Ratios
- Intercept (β₀): the log-odds of the outcome when all predictors equal zero.
- Regression coefficients (βⱼ): the change in the log-odds of the outcome for a one-unit increase in xⱼ, holding all other predictors constant.
Log-odds aren't intuitive on their own, so coefficients are typically exponentiated to produce odds ratios:
ORⱼ = exp(βⱼ)
- An odds ratio greater than 1 means the odds of the outcome increase as xⱼ increases.
- An odds ratio less than 1 means the odds decrease.
- An odds ratio of exactly 1 means xⱼ has no effect on the odds.
For example, if βⱼ = 0.7, then exp(0.7) ≈ 2.01. A one-unit increase in xⱼ roughly doubles the odds of the outcome.
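A quick sketch of that interpretation. The coefficient values below are illustrative, not from any fitted model:

```python
import math

# Exponentiating each coefficient gives its odds ratio
# (names and values are made up for illustration).
coefs = {"age": 0.7, "dose": -0.2, "placebo": 0.0}
odds_ratios = {name: math.exp(b) for name, b in coefs.items()}

# age:     exp(0.7)  ~ 2.01 -> each unit roughly doubles the odds
# dose:    exp(-0.2) ~ 0.82 -> each unit lowers the odds
# placebo: exp(0.0)  = 1.0  -> no effect on the odds
```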
Getting predicted probabilities. To convert back from log-odds to a probability, use the inverse logit function:
p = 1 / (1 + e^−(β₀ + β₁x₁ + … + βₖxₖ))
This is the sigmoid function, and it's what keeps predicted probabilities bounded between 0 and 1.
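A minimal sketch of the inverse logit, showing that even extreme log-odds values map to probabilities strictly inside (0, 1):

```python
import math

def inverse_logit(z):
    """Sigmoid: maps any log-odds value to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Extreme log-odds still land strictly inside (0, 1):
low, mid, high = inverse_logit(-10), inverse_logit(0), inverse_logit(10)
```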
Maximum Likelihood Estimation for Logistic Regression

Estimation Method
Logistic regression coefficients are estimated using maximum likelihood estimation (MLE), not ordinary least squares. MLE finds the parameter values that make the observed data most probable under the model.
Here's the logic in steps:
- Each observation is treated as a Bernoulli trial: the outcome is either 0 or 1, with a probability determined by the logistic regression equation and that observation's predictor values.
- The likelihood function is the product of these individual probabilities across all observations. It represents how "likely" the observed data are, given a particular set of coefficients.
- In practice, we work with the log-likelihood (the natural log of the likelihood) because sums are easier to optimize than products. Maximizing the log-likelihood is mathematically equivalent to maximizing the likelihood.
- There is no closed-form solution, so iterative algorithms find the maximum. The most common are the Newton-Raphson method and Fisher scoring (iteratively reweighted least squares).
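The Bernoulli-likelihood steps above can be sketched for a one-predictor model. The toy data and coefficient values are made up; the point is that coefficients that fit the data better yield a higher log-likelihood.

```python
import math

def log_likelihood(beta0, beta1, xs, ys):
    """Sum of Bernoulli log-probabilities under a one-predictor model."""
    ll = 0.0
    for x, y in zip(xs, ys):
        # Probability of y = 1 from the logistic regression equation
        p = 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))
        # Contribution: log(p) if y == 1, log(1 - p) if y == 0
        ll += y * math.log(p) + (1 - y) * math.log(1 - p)
    return ll

# Toy data (made up): larger x values go with y = 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0, 0, 1, 1]

ll_null = log_likelihood(0.0, 0.0, xs, ys)   # coefficients at zero
ll_fit = log_likelihood(-3.0, 2.0, xs, ys)   # coefficients that fit better
```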
Optimization and Standard Errors
The iterative process works like this:
- Start with initial parameter estimates (often all zeros).
- Update the estimates using the gradient and curvature of the log-likelihood function.
- Repeat until convergence: the change in the log-likelihood or in the parameter estimates drops below a specified threshold.
Once the algorithm converges, standard errors for each coefficient are obtained from the inverse of the observed information matrix (the matrix of second derivatives of the log-likelihood) evaluated at the MLE estimates. These standard errors are essential for constructing confidence intervals and running hypothesis tests.
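The loop above can be sketched for a one-predictor model, with standard errors taken from the inverse of the 2×2 information matrix. This is a teaching sketch on made-up toy data, not a production implementation (real software adds safeguards such as step-halving and separation checks).

```python
import math

def fit_logistic(xs, ys, iters=25):
    """Newton-Raphson for logit(p) = b0 + b1*x.
    Returns estimates plus standard errors from the inverse of the
    2x2 observed information matrix at the final iterate."""
    b0 = b1 = 0.0                      # start from zero
    for _ in range(iters):
        g0 = g1 = 0.0                  # gradient of the log-likelihood
        i00 = i01 = i11 = 0.0          # information matrix entries
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            i00 += w
            i01 += w * x
            i11 += w * x * x
        det = i00 * i11 - i01 * i01
        # Newton step: beta_new = beta + I^{-1} * gradient
        step0 = (i11 * g0 - i01 * g1) / det
        step1 = (i00 * g1 - i01 * g0) / det
        b0, b1 = b0 + step0, b1 + step1
        if abs(step0) + abs(step1) < 1e-10:
            break                      # converged
    # Standard errors: sqrt of the diagonal of I^{-1}
    se0 = math.sqrt(i11 / det)
    se1 = math.sqrt(i00 / det)
    return b0, b1, se0, se1

# Toy, non-separable data (made up)
xs = [0.0, 1.0, 1.0, 2.0, 2.0, 3.0]
ys = [0, 1, 0, 1, 0, 1]
b0_hat, b1_hat, se0, se1 = fit_logistic(xs, ys)
```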
Predictor Significance in Logistic Regression
Wald Test
The Wald test assesses whether an individual predictor has a statistically significant effect on the outcome. It tests the null hypothesis that a coefficient equals zero (H₀: βⱼ = 0), meaning the predictor has no effect.
The test statistic is:
W = (β̂ⱼ / SE(β̂ⱼ))²
where β̂ⱼ is the estimated coefficient and SE(β̂ⱼ) is its standard error.
Under H₀, this statistic follows a chi-square distribution with 1 degree of freedom. You then compare the resulting p-value to your significance level α (commonly 0.05):
- If p < α, reject H₀. The predictor is statistically significant.
- If p ≥ α, you fail to reject H₀. There isn't sufficient evidence that the predictor affects the outcome.
One caution: the Wald test can be unreliable when coefficient estimates are very large (it tends to underestimate significance in those cases). The likelihood ratio test, which compares nested models, is generally more robust but requires fitting an additional model.
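The Wald calculation can be sketched directly. The estimate and standard error below are illustrative, not from a fitted model; the p-value comes from the equivalent z-statistic via the standard normal tail.

```python
import math

# Wald test for a single coefficient (values are illustrative).
beta_hat, se = 0.7, 0.25
wald = (beta_hat / se) ** 2          # chi-square with 1 df under H0

# Equivalent z-statistic; two-sided normal p-value via math.erfc,
# since 2 * (1 - Phi(|z|)) == erfc(|z| / sqrt(2)).
z = beta_hat / se
p_value = math.erfc(abs(z) / math.sqrt(2))
significant = p_value < 0.05
```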
Confidence Intervals
A 95% confidence interval for a coefficient is:
β̂ⱼ ± z × SE(β̂ⱼ)
where z = 1.96 for a 95% confidence level.
If this interval does not contain zero, the predictor is statistically significant at the 5% level. You can also exponentiate the endpoints to get a confidence interval for the odds ratio, which is often more interpretable.
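A quick sketch of both intervals (the estimate and standard error are illustrative):

```python
import math

# 95% confidence interval for a coefficient (values are illustrative).
beta_hat, se = 0.7, 0.25
z = 1.96                              # z-value for 95% confidence
lo, hi = beta_hat - z * se, beta_hat + z * se

# Interval excludes zero -> significant at the 5% level.
# Exponentiate the endpoints for an odds-ratio interval:
or_lo, or_hi = math.exp(lo), math.exp(hi)
```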
Keep in mind that statistical significance doesn't automatically mean practical importance. A predictor can be statistically significant but have a tiny odds ratio that barely matters in context. Always consider the magnitude of the effect alongside the p-value.