
🎲 Data Science Statistics

Key Concepts in Inferential Statistics


Why This Matters

Inferential statistics is the bridge between your sample data and the broader population you actually care about. Every time you see a poll's "margin of error," read about a drug trial's results, or encounter a claim that one algorithm outperforms another, you're seeing inferential statistics at work. In data science, you're constantly being tested on your ability to quantify uncertainty, compare groups rigorously, and draw defensible conclusions from limited data.

The concepts here aren't just isolated techniques—they form an interconnected toolkit. Hypothesis testing and confidence intervals are two sides of the same coin. Point estimation feeds into maximum likelihood, which connects to Bayesian inference. Understanding these relationships is what separates memorization from mastery. Don't just learn what each method does—know when to use it, what assumptions it requires, and how it relates to the others.


Estimation: From Sample to Population

Before you can test anything, you need to estimate population parameters from your sample. These methods answer the fundamental question: given what I observed, what can I infer about the true underlying value?

Point Estimation

  • Single-value estimates—one "best guess" for a population parameter, such as using $\bar{x}$ to estimate $\mu$
  • Common estimators include the sample mean, sample proportion, and sample variance, each targeting its population counterpart
  • No uncertainty information—point estimates alone don't tell you how confident you should be in that single number
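
A minimal sketch of these estimators in NumPy; the library choice and the synthetic sample are illustrative assumptions, not part of the guide:

```python
import numpy as np

# Hypothetical sample of 20 measurements (illustrative data only)
rng = np.random.default_rng(42)
sample = rng.normal(loc=50, scale=10, size=20)

x_bar = sample.mean()          # point estimate of the population mean mu
s2 = sample.var(ddof=1)        # unbiased point estimate of the population variance
p_hat = (sample > 50).mean()   # sample proportion estimating P(X > 50)

print(f"mean: {x_bar:.2f}, variance: {s2:.2f}, proportion: {p_hat:.2f}")
```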

Confidence Intervals

  • Range of plausible values—a 95% CI means if you repeated sampling many times, about 95% of intervals would contain the true parameter
  • Width reflects precision—narrower intervals indicate more certainty, driven by larger sample sizes and lower variability
  • Interpretation matters—the parameter is fixed; it's the interval that's random across repeated samples
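
A quick sketch of a t-based 95% confidence interval for a mean, assuming SciPy is available and using made-up data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=100, scale=15, size=30)   # illustrative data

n = len(sample)
x_bar = sample.mean()
se = sample.std(ddof=1) / np.sqrt(n)              # standard error of the mean

# 95% t-based confidence interval for the population mean
t_crit = stats.t.ppf(0.975, df=n - 1)
ci_low, ci_high = x_bar - t_crit * se, x_bar + t_crit * se
print(f"95% CI for mu: ({ci_low:.2f}, {ci_high:.2f})")
```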

Maximum Likelihood Estimation

  • Finds parameters that maximize $L(\theta \mid \text{data})$—the likelihood function measures how probable your observed data is under different parameter values
  • Asymptotically optimal—MLEs are consistent, efficient, and approximately normal for large samples
  • Foundation for advanced models—used in logistic regression, GLMs, survival analysis, and most modern statistical software
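
A hedged example of MLE: numerically minimizing the negative log-likelihood for an exponential rate with SciPy on synthetic data, with the closed-form MLE $1/\bar{x}$ printed as a sanity check. The distribution and data are assumptions for illustration:

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
data = rng.exponential(scale=2.0, size=200)   # true rate is 0.5 (illustrative)

def neg_log_likelihood(lam):
    # L(lambda | data) is the product of exponential densities; minimize its negative log
    return -np.sum(stats.expon.logpdf(data, scale=1 / lam))

result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 10), method="bounded")
print(f"numerical MLE: {result.x:.3f}, closed-form MLE 1/x_bar: {1 / data.mean():.3f}")
```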

Compare: Point Estimation vs. Confidence Intervals—both estimate population parameters, but point estimates give a single value while CIs quantify the uncertainty around that estimate. FRQs often ask you to interpret CIs correctly—remember, it's about the procedure's reliability, not the probability that the parameter falls in a specific interval.


Hypothesis Testing Framework

Hypothesis testing formalizes the process of using data to make decisions. The core logic: assume nothing interesting is happening (null hypothesis), then ask whether your data are surprising enough to reject that assumption.

Hypothesis Testing

  • Null vs. alternative—$H_0$ represents "no effect" or "no difference"; $H_a$ is what you're trying to find evidence for
  • Significance level $\alpha$—typically 0.05, this is your threshold for how much Type I error (false positive) risk you'll accept
  • p-value interpretation—the probability of observing data this extreme if $H_0$ were true; small p-values suggest $H_0$ is unlikely
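
A minimal one-sample t-test sketch with SciPy; the synthetic data and the null value of 5.0 are assumed examples:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
sample = rng.normal(loc=5.3, scale=1.2, size=40)   # illustrative data

# H0: mu = 5.0  vs  Ha: mu != 5.0
t_stat, p_value = stats.ttest_1samp(sample, popmean=5.0)

alpha = 0.05
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
print("reject H0" if p_value < alpha else "fail to reject H0")
```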

Power Analysis

  • Determines required sample size—calculates how many observations you need to detect a true effect with high probability
  • Balances four quantities—effect size, $\alpha$, sample size $n$, and power ($1 - \beta$); fixing three determines the fourth
  • Minimizes Type II errors—ensures your study can actually detect meaningful effects rather than being underpowered
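
A short power-analysis sketch, assuming statsmodels is available; the effect size, $\alpha$, and power values are example inputs, not recommendations:

```python
from statsmodels.stats.power import TTestIndPower

# Solve for the per-group sample size needed to detect a medium effect
# (Cohen's d = 0.5) with alpha = 0.05 and 80% power in a two-sample t-test.
analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(f"required n per group: {n_per_group:.1f}")   # roughly 64 per group
```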

Effect Size Estimation

  • Quantifies practical significance—statistical significance doesn't mean the effect matters; effect size tells you how much it matters
  • Common measures include Cohen's $d$ (standardized mean difference), $r^2$ (variance explained), and $\eta^2$ (for ANOVA)
  • Essential for interpretation—a tiny effect can be statistically significant with large $n$; always report effect sizes alongside p-values
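
A small hand-rolled Cohen's $d$ sketch with NumPy, using the pooled standard deviation; the two groups are synthetic:

```python
import numpy as np

def cohens_d(group1, group2):
    """Standardized mean difference using the pooled standard deviation."""
    n1, n2 = len(group1), len(group2)
    s_pooled = np.sqrt(((n1 - 1) * np.var(group1, ddof=1) +
                        (n2 - 1) * np.var(group2, ddof=1)) / (n1 + n2 - 2))
    return (np.mean(group1) - np.mean(group2)) / s_pooled

rng = np.random.default_rng(3)
a = rng.normal(10.0, 2.0, size=50)   # illustrative groups
b = rng.normal(9.2, 2.0, size=50)
print(f"Cohen's d: {cohens_d(a, b):.2f}")
```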

Compare: p-values vs. Effect Sizes—p-values tell you whether an effect exists, while effect sizes tell you whether it's meaningful. A study with $n = 100{,}000$ might find $p < 0.001$ for a practically negligible difference. If an FRQ asks about study interpretation, address both.


Comparing Groups: Tests for Means and Variances

When you need to determine whether groups differ, these tests provide the statistical machinery. The choice of test depends on how many groups you're comparing, whether data are paired, and what assumptions you can justify.

t-Tests

  • Compares means of two groups—independent t-tests for separate groups, paired t-tests for matched or repeated measurements
  • Test statistic $t = \frac{\bar{x}_1 - \bar{x}_2}{SE}$—measures how many standard errors apart the means are
  • Assumptions—normality (relaxed for large $n$) and equal variances (Welch's t-test handles unequal variances)
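
A sketch of Welch's t-test with SciPy on synthetic groups with unequal spread; the group sizes and parameters are illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
control = rng.normal(50, 8, size=35)      # illustrative groups with unequal spread
treatment = rng.normal(54, 12, size=40)

# Welch's t-test (equal_var=False) does not assume equal variances
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)
print(f"Welch t = {t_stat:.2f}, p = {p_value:.4f}")

# Paired version for matched or repeated measurements: stats.ttest_rel(before, after)
```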

z-Tests

  • Used when $\sigma$ is known or $n$ is large—relies on the standard normal distribution rather than the t-distribution
  • Test statistic $z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$—appropriate for proportions and large-sample means
  • Converges with t-test—as $n \to \infty$, the t-distribution approaches the standard normal
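
A one-proportion z-test sketch, computed by hand from the standard normal in SciPy; the counts and null proportion are made up for illustration:

```python
import numpy as np
from scipy import stats

# One-sample proportion z-test: H0: p = 0.50 (illustrative numbers)
successes, n, p0 = 130, 220, 0.50
p_hat = successes / n

se = np.sqrt(p0 * (1 - p0) / n)          # standard error under H0
z = (p_hat - p0) / se
p_value = 2 * stats.norm.sf(abs(z))      # two-sided p-value
print(f"z = {z:.2f}, p = {p_value:.4f}")
```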

Analysis of Variance (ANOVA)

  • Compares means across 3+ groups simultaneously—avoids inflated Type I error from multiple t-tests
  • F-statistic $F = \frac{\text{MS}_{\text{between}}}{\text{MS}_{\text{within}}}$—ratio of between-group variance to within-group variance
  • Assumptions—normality, independence, and homogeneity of variances (Levene's test can check the latter)
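
A one-way ANOVA sketch with SciPy on three synthetic groups, with Levene's test as the homogeneity-of-variance check mentioned above:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
g1 = rng.normal(70, 10, size=30)   # illustrative scores for three groups
g2 = rng.normal(74, 10, size=30)
g3 = rng.normal(69, 10, size=30)

# One-way ANOVA: a single omnibus test across all three group means
f_stat, p_value = stats.f_oneway(g1, g2, g3)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")

# Levene's test checks the homogeneity-of-variance assumption
print(stats.levene(g1, g2, g3))
```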

F-Tests

  • Compares variances between groups—tests whether populations have equal spread, not just equal centers
  • Underlies ANOVA and regression—the F-statistic in ANOVA is an F-test; regression uses it to test overall model significance
  • Sensitive to non-normality—variance-comparison F-tests are distorted by departures from normality more than t-tests on means are
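
A variance-ratio F-test sketch built directly from the F distribution in SciPy; the samples are synthetic, and the two-sided p-value is formed by doubling the smaller tail:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
a = rng.normal(0, 1.0, size=25)    # illustrative samples
b = rng.normal(0, 1.5, size=30)

# Variance-ratio F-test: H0 says the two population variances are equal
f_stat = np.var(a, ddof=1) / np.var(b, ddof=1)
df1, df2 = len(a) - 1, len(b) - 1
tail = stats.f.cdf(f_stat, df1, df2) if f_stat < 1 else stats.f.sf(f_stat, df1, df2)
p_value = min(2 * tail, 1.0)       # two-sided p-value
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```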

Compare: t-Test vs. ANOVA—both compare means, but t-tests handle exactly two groups while ANOVA handles three or more. Running multiple t-tests inflates your false positive rate; ANOVA controls this with a single omnibus test. Follow-up with post-hoc tests (Tukey, Bonferroni) to identify which groups differ.


Categorical Data and Associations

Not all data are continuous. When you're working with counts, categories, or proportions, these methods test for relationships and goodness of fit. The key insight: compare what you observed to what you'd expect if variables were independent.

Chi-Square Tests

  • Tests association between categorical variables—compares observed cell counts to expected counts under independence
  • Test statistic $\chi^2 = \sum \frac{(O - E)^2}{E}$—large values indicate observed data deviate substantially from expectation
  • Two main uses—goodness-of-fit (does data match a distribution?) and test of independence (are variables related?)
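
A test-of-independence sketch with SciPy on an illustrative contingency table (the counts are invented for the example):

```python
import numpy as np
from scipy import stats

# Illustrative 2x3 contingency table: rows = group A/B, columns = outcome category
observed = np.array([[30, 45, 25],
                     [35, 30, 35]])

chi2, p_value, dof, expected = stats.chi2_contingency(observed)
print(f"chi-square = {chi2:.2f}, df = {dof}, p = {p_value:.4f}")

# Goodness-of-fit version: stats.chisquare(observed_counts, f_exp=expected_counts)
```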

Compare: Chi-Square vs. t-Test—chi-square tests work with categorical data and frequencies, while t-tests require continuous outcomes. If your dependent variable is "pass/fail" or "category A/B/C," reach for chi-square. If it's a measurement like height or score, use t-tests.


Modeling Relationships

Regression methods go beyond "is there a difference?" to ask "how are variables related, and can we predict outcomes?" These techniques model the functional form of relationships and quantify how predictors influence responses.

Regression Analysis

  • Models $Y = f(X) + \epsilon$—linear regression assumes $Y = \beta_0 + \beta_1 X + \epsilon$; logistic regression models log-odds for binary outcomes
  • Coefficients have interpretations—in linear regression, $\beta_1$ is the expected change in $Y$ for a one-unit increase in $X$
  • Inference on coefficients—t-tests assess whether individual $\beta_j \neq 0$; F-tests assess overall model significance
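
An ordinary least squares sketch, assuming statsmodels is available, on synthetic data; the summary output includes the coefficient t-tests and overall F-test described above:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
x = rng.uniform(0, 10, size=100)                  # illustrative predictor
y = 2.0 + 1.5 * x + rng.normal(0, 2, size=100)    # Y = beta0 + beta1*X + noise

X = sm.add_constant(x)          # adds the intercept column for beta0
model = sm.OLS(y, X).fit()
print(model.params)             # estimated beta0, beta1
print(model.summary())          # t-tests on coefficients, overall F-test, R^2
```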

Compare: ANOVA vs. Regression—mathematically equivalent for comparing group means (ANOVA is regression with dummy variables), but regression extends to continuous predictors and complex models. Use ANOVA language for experimental designs, regression language for predictive modeling.


Alternative Inference Frameworks

Classical frequentist methods aren't the only game in town. These approaches either relax distributional assumptions or incorporate prior knowledge, expanding your inferential toolkit.

Bayesian Inference

  • Updates beliefs with data—uses Bayes' theorem: $P(\theta \mid \text{data}) \propto P(\text{data} \mid \theta) \cdot P(\theta)$
  • Prior + likelihood = posterior—your prior beliefs about parameters combine with observed evidence to produce updated probabilities
  • Naturally quantifies uncertainty—posterior distributions give full probability statements about parameters, not just point estimates
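
A conjugate Beta-Binomial sketch with SciPy for an unknown proportion; the prior and the counts are illustrative assumptions:

```python
from scipy import stats

# Prior: Beta(2, 2), a mild belief that theta is near 0.5; data: 18 successes in 25 trials
alpha_prior, beta_prior = 2, 2
successes, trials = 18, 25

# Posterior is Beta(alpha + successes, beta + failures) by conjugacy
alpha_post = alpha_prior + successes
beta_post = beta_prior + trials - successes
posterior = stats.beta(alpha_post, beta_post)

print(f"posterior mean: {posterior.mean():.3f}")
# 95% credible interval: a direct probability statement about theta
print(f"95% credible interval: {posterior.interval(0.95)}")
```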

Non-parametric Tests

  • No distributional assumptions—don't require normality; work with ranks rather than raw values
  • Key examples—Mann-Whitney U (alternative to independent t-test), Wilcoxon signed-rank (paired t-test), Kruskal-Wallis (ANOVA)
  • Robust but less powerful—sacrifice some ability to detect true effects in exchange for validity under violated assumptions
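
A Mann-Whitney U sketch with SciPy on skewed synthetic data, as a rank-based stand-in for the independent t-test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
a = rng.exponential(scale=2.0, size=25)   # skewed, non-normal illustrative data
b = rng.exponential(scale=3.0, size=25)

# Rank-based alternative to the independent t-test
u_stat, p_value = stats.mannwhitneyu(a, b, alternative="two-sided")
print(f"U = {u_stat:.1f}, p = {p_value:.4f}")

# Paired analogue: stats.wilcoxon(x, y); 3+ groups: stats.kruskal(g1, g2, g3)
```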

Bootstrapping

  • Resamples with replacement—creates thousands of "new" datasets from your original sample to approximate the sampling distribution
  • Estimates SEs and CIs without formulas—particularly valuable when theoretical distributions are unknown or complex
  • Handles unusual estimators—works for medians, ratios, or any statistic where closed-form standard errors don't exist
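
A percentile-bootstrap sketch for a median with NumPy; the resample count and the data are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(9)
sample = rng.exponential(scale=3.0, size=50)   # illustrative skewed sample

# Resample with replacement many times and recompute the median each time
boot_medians = np.array([
    np.median(rng.choice(sample, size=len(sample), replace=True))
    for _ in range(5000)
])

# Bootstrap standard error and percentile 95% CI for the median
ci_low, ci_high = np.percentile(boot_medians, [2.5, 97.5])
print(f"bootstrap SE: {boot_medians.std(ddof=1):.3f}")
print(f"95% percentile CI for the median: ({ci_low:.2f}, {ci_high:.2f})")
```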

Compare: Frequentist vs. Bayesian—frequentists treat parameters as fixed and data as random; Bayesians treat parameters as random variables with distributions. Frequentist CIs say "95% of intervals constructed this way contain the true value"; Bayesian credible intervals say "there's a 95% probability the parameter lies here." Know which framework a problem assumes.


Quick Reference Table

Concept | Best Examples
Point vs. Interval Estimation | Point Estimation, Confidence Intervals, MLE
Testing Framework | Hypothesis Testing, Power Analysis, Effect Size
Comparing Two Means | t-Tests, z-Tests
Comparing 3+ Means | ANOVA, F-Tests
Categorical Data | Chi-Square Tests
Modeling Relationships | Regression Analysis
Distribution-Free Methods | Non-parametric Tests, Bootstrapping
Incorporating Prior Knowledge | Bayesian Inference

Self-Check Questions

  1. What's the key difference between a confidence interval and a Bayesian credible interval, and when would you prefer one over the other?

  2. You're comparing customer satisfaction scores across four different product versions. Which test should you use, and why would running six separate t-tests be problematic?

  3. Compare and contrast: How do maximum likelihood estimation and Bayesian inference differ in their treatment of parameters? What additional input does Bayesian inference require?

  4. Your data are heavily skewed and your sample size is only 15. Which methods from this guide would still be valid, and which would you avoid?

  5. A study reports $p = 0.03$ but Cohen's $d = 0.1$. Explain what this combination tells you about the results and what you'd want to know about the study design.