🥖Linear Modeling Theory Unit 14 Review

14.1 Logistic Regression for Binary Outcomes

Written by the Fiveable Content Team • Last updated August 2025

Logistic Regression for Binary Outcomes

Logistic regression predicts binary outcomes (yes/no, 0/1) by modeling the probability of an event occurring based on one or more predictor variables. Unlike ordinary linear regression, it doesn't try to predict a continuous number. Instead, it maps predictors to a probability between 0 and 1 through a logistic (sigmoid) function.

This technique shows up constantly in applied work: predicting whether a patient develops a disease, whether a customer churns, or whether a loan defaults. Understanding it well also sets you up for Poisson regression and other generalized linear models covered later in this unit.

Overview and Applications

A binary outcome variable has exactly two categories (success/failure, 0/1). Logistic regression models the probability that an observation falls into one of those categories, given a set of predictors.

The key idea is that the relationship between predictors and the outcome probability is not a straight line. Instead, it follows a sigmoid curve that naturally stays bounded between 0 and 1. This makes logistic regression far more appropriate for probability modeling than ordinary least squares, which could produce predicted probabilities below 0 or above 1.

Logistic regression is widely used in medical research (e.g., predicting disease presence), marketing (e.g., predicting purchase behavior), credit scoring, and the social sciences.

Model Characteristics and Assumptions

Logistic regression can handle both continuous and categorical predictor variables. Its core assumptions are:

  • Independence of observations. Each data point should be independent of the others.
  • No severe multicollinearity. Predictor variables shouldn't be highly correlated with each other, as this inflates standard errors and makes coefficient estimates unstable.
  • Linearity in the logit. The log-odds of the outcome must be linearly related to the predictor variables. This is the linearity assumption of logistic regression. It does not assume a linear relationship between predictors and the probability itself.

Note that logistic regression does not assume normally distributed errors or constant variance, which distinguishes it from ordinary linear regression.

Logistic Regression Equation Components

The Logistic Regression Equation

The model works by linking predictors to the log-odds (logit) of the outcome. Here's how the pieces fit together:

Odds are the ratio of the probability of the event occurring to the probability of it not occurring: $\text{odds} = \frac{p}{1 - p}$

The logit is the natural logarithm of the odds: $\text{logit}(p) = \ln\left(\frac{p}{1 - p}\right)$

The logistic regression equation models this logit as a linear combination of predictors:

$$\text{logit}(p) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k$$

where $p$ is the probability of the outcome, $\beta_0$ is the intercept, and $\beta_1, \beta_2, \ldots, \beta_k$ are the regression coefficients for predictors $X_1, X_2, \ldots, X_k$.
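The odds and logit definitions above can be sketched in a few lines of Python (the probability value is just an illustration):

```python
import math

def odds(p):
    # Odds: probability of the event over probability of no event
    return p / (1 - p)

def logit(p):
    # Log-odds: natural log of the odds
    return math.log(odds(p))

p = 0.8          # hypothetical event probability
o = odds(p)      # 4 to 1: the event is four times as likely as not
z = logit(p)     # ln(4), about 1.386
```

Note that a probability of 0.5 corresponds to odds of 1 and a logit of 0, so a logit of zero marks the "even odds" point.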

Interpreting Coefficients and Odds Ratios

  • Intercept ($\beta_0$): the log-odds of the outcome when all predictors equal zero.
  • Regression coefficients ($\beta_j$): the change in the log-odds of the outcome for a one-unit increase in $X_j$, holding all other predictors constant.

Log-odds aren't intuitive on their own, so coefficients are typically exponentiated to produce odds ratios: $\text{OR}_j = e^{\beta_j}$

  • An odds ratio greater than 1 means the odds of the outcome increase as $X_j$ increases.
  • An odds ratio less than 1 means the odds decrease.
  • An odds ratio of exactly 1 means $X_j$ has no effect on the odds.

For example, if $\beta_1 = 0.7$, then $e^{0.7} \approx 2.01$. A one-unit increase in $X_1$ roughly doubles the odds of the outcome.
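That arithmetic is easy to check directly (the coefficient value is hypothetical):

```python
import math

beta1 = 0.7                    # hypothetical fitted coefficient
odds_ratio = math.exp(beta1)   # about 2.01
# Each one-unit increase in X1 multiplies the odds by odds_ratio,
# i.e. roughly doubles them
```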

Getting predicted probabilities. To convert back from log-odds to a probability, use the inverse logit function:

$$p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k)}}$$

This is the sigmoid function, and it's what keeps predicted probabilities bounded between 0 and 1.
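A minimal sketch of the inverse-logit conversion, using made-up coefficient values:

```python
import math

def predicted_probability(beta0, betas, x):
    # Sigmoid (inverse logit) of the linear predictor
    eta = beta0 + sum(b * xi for b, xi in zip(betas, x))
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical fitted model with one predictor: beta0 = -1.5, beta1 = 0.7
p = predicted_probability(-1.5, [0.7], [2.0])
# Linear predictor is -1.5 + 0.7 * 2.0 = -0.1, so p sits just below 0.5
```

However extreme the linear predictor gets, the result stays strictly between 0 and 1.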

Maximum Likelihood Estimation for Logistic Regression

Estimation Method

Logistic regression coefficients are estimated using maximum likelihood estimation (MLE), not ordinary least squares. MLE finds the parameter values that make the observed data most probable under the model.

Here's the logic in steps:

  1. Each observation is treated as a Bernoulli trial: the outcome is either 0 or 1, with a probability $p_i$ determined by the logistic regression equation and that observation's predictor values.
  2. The likelihood function is the product of these individual probabilities across all observations. It represents how "likely" the observed data are, given a particular set of coefficients.
  3. In practice, we work with the log-likelihood (the natural log of the likelihood) because sums are easier to optimize than products. Maximizing the log-likelihood is mathematically equivalent to maximizing the likelihood.
  4. There is no closed-form solution, so iterative algorithms find the maximum. The most common are the Newton-Raphson method and Fisher scoring (iteratively reweighted least squares).
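Steps 1–3 can be sketched directly; the data and coefficients below are made up for illustration:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(beta0, beta1, xs, ys):
    # Sum of Bernoulli log-probabilities for a one-predictor model
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(beta0 + beta1 * x)   # P(Y = 1 | x) under the model
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

# Toy data: the outcome flips from 0 to 1 as x grows
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
ys = [0, 0, 0, 1, 1, 1]

# Coefficients that track the data give a higher log-likelihood than
# coefficients that ignore it; MLE searches for the maximizing pair
better = log_likelihood(-3.0, 1.7, xs, ys)
worse = log_likelihood(0.0, 0.0, xs, ys)
```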

Optimization and Standard Errors

The iterative process works like this:

  1. Start with initial parameter estimates (often all zeros).
  2. Update the estimates using the gradient and curvature of the log-likelihood function.
  3. Repeat until convergence: the change in the log-likelihood or in the parameter estimates drops below a specified threshold.

Once the algorithm converges, standard errors for each coefficient are obtained from the inverse of the observed information matrix (the matrix of second derivatives of the log-likelihood) evaluated at the MLE estimates. These standard errors are essential for constructing confidence intervals and running hypothesis tests.
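The loop above, together with the standard-error calculation, fits in a short NumPy sketch. The data are invented, and a real analysis should use a vetted library such as statsmodels or scikit-learn:

```python
import numpy as np

def fit_logistic(X, y, tol=1e-8, max_iter=50):
    # Newton-Raphson / IRLS for logistic regression.
    # X must include a leading column of ones for the intercept.
    beta = np.zeros(X.shape[1])                 # step 1: start at zero
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))   # fitted probabilities
        W = p * (1.0 - p)                       # Bernoulli variances
        grad = X.T @ (y - p)                    # gradient of log-likelihood
        info = X.T @ (X * W[:, None])           # information matrix X'WX
        step = np.linalg.solve(info, grad)      # step 2: Newton update
        beta = beta + step
        if np.max(np.abs(step)) < tol:          # step 3: convergence check
            break
    # Standard errors: square roots of the diagonal of the inverse
    # information matrix evaluated at the final estimates
    se = np.sqrt(np.diag(np.linalg.inv(info)))
    return beta, se

# Toy data: y = 1 becomes more common as x grows
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0])
X = np.column_stack([np.ones_like(x), x])
beta_hat, se = fit_logistic(X, y)   # slope estimate should be positive
```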

Predictor Significance in Logistic Regression

Wald Test

The Wald test assesses whether an individual predictor has a statistically significant effect on the outcome. It tests the null hypothesis that a coefficient equals zero ($H_0: \beta_j = 0$), meaning the predictor has no effect.

The test statistic is:

$$W = \left(\frac{\hat{\beta}_j}{SE(\hat{\beta}_j)}\right)^2$$

where $\hat{\beta}_j$ is the estimated coefficient and $SE(\hat{\beta}_j)$ is its standard error.

Under $H_0$, this statistic follows a chi-square distribution with 1 degree of freedom. You then compare the resulting p-value to your significance level (commonly 0.05):

  • If $p < 0.05$, reject $H_0$. The predictor is statistically significant.
  • If $p \geq 0.05$, you fail to reject $H_0$. There isn't sufficient evidence that the predictor affects the outcome.

One caution: the Wald test can be unreliable when coefficient estimates are very large. In that situation the estimated standard error inflates, which shrinks the test statistic and understates significance (the Hauck-Donner effect). The likelihood ratio test, which compares nested models, is generally more robust but requires fitting an additional model.
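A sketch of the Wald calculation, using a hypothetical coefficient estimate; the p-value is the chi-square(1) upper tail, which equals the two-sided normal tail and can be written with `math.erfc`:

```python
import math

def wald_test(beta_hat, se):
    # Wald chi-square statistic for H0: beta_j = 0
    w = (beta_hat / se) ** 2
    # P(chi-square_1 > w) = P(|Z| > sqrt(w)) = erfc(sqrt(w / 2))
    p_value = math.erfc(math.sqrt(w / 2.0))
    return w, p_value

# Hypothetical estimate: beta_hat = 0.7 with standard error 0.25
w, p = wald_test(0.7, 0.25)
# z = 2.8, so w = 7.84 and p is about 0.005: significant at the 5% level
```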

Confidence Intervals

A 95% confidence interval for a coefficient is:

$$\hat{\beta}_j \pm z_{\alpha/2} \times SE(\hat{\beta}_j)$$

where $z_{\alpha/2} = 1.96$ for a 95% confidence level.

If this interval does not contain zero, the predictor is statistically significant at the 5% level. You can also exponentiate the endpoints to get a confidence interval for the odds ratio, which is often more interpretable.
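Both intervals can be computed in a couple of lines (the estimate and standard error are hypothetical):

```python
import math

def coef_confint(beta_hat, se, z=1.96):
    # 95% CI for the coefficient, plus the exponentiated (odds-ratio) CI
    lo, hi = beta_hat - z * se, beta_hat + z * se
    return (lo, hi), (math.exp(lo), math.exp(hi))

# Hypothetical estimate: beta_hat = 0.7, SE = 0.25
(b_lo, b_hi), (or_lo, or_hi) = coef_confint(0.7, 0.25)
# The coefficient CI (0.21, 1.19) excludes 0, so the odds-ratio CI
# (about 1.23 to 3.29) excludes 1: significant at the 5% level
```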

Keep in mind that statistical significance doesn't automatically mean practical importance. A predictor can be statistically significant but have a tiny odds ratio that barely matters in context. Always consider the magnitude of the effect alongside the p-value.