🎲 Data Science Statistics

Key Concepts in Inferential Statistics


Why This Matters

Inferential statistics is the bridge between your sample data and the broader population you actually care about. Every time you see a poll's "margin of error," read about a drug trial's results, or encounter a claim that one algorithm outperforms another, you're seeing inferential statistics at work. In data science and statistics courses, you're constantly being tested on your ability to quantify uncertainty, compare groups rigorously, and draw defensible conclusions from limited data.

These concepts form an interconnected toolkit, not a list of isolated techniques. Hypothesis testing and confidence intervals are two sides of the same coin. Point estimation feeds into maximum likelihood, which connects to Bayesian inference. Understanding these relationships is what separates memorization from mastery. For each method, know when to use it, what assumptions it requires, and how it relates to the others.


Estimation: From Sample to Population

Before you can test anything, you need to estimate population parameters from your sample. These methods answer a fundamental question: given what I observed, what can I infer about the true underlying value?

Point Estimation

A point estimate is your single "best guess" for a population parameter. For example, the sample mean \bar{x} estimates the population mean \mu, and the sample proportion \hat{p} estimates the population proportion p.

  • Common estimators include the sample mean, sample proportion, and sample variance, each targeting its population counterpart
  • Desirable properties of estimators: unbiasedness (the expected value of the estimator equals the parameter), consistency (accuracy improves as n grows), and efficiency (smallest variance among unbiased estimators)
  • Limitation: a point estimate alone tells you nothing about how confident you should be in that number
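These estimators are easy to see in action. A minimal sketch using only Python's standard library; the sample values are hypothetical:

```python
import statistics

# Hypothetical sample of 8 measurements (illustrative data only)
sample = [12.1, 11.8, 13.0, 12.5, 11.9, 12.7, 12.2, 12.4]

x_bar = statistics.mean(sample)     # point estimate of the population mean
s2 = statistics.variance(sample)    # sample variance (n - 1 denominator, so it's unbiased)

# Sample proportion: e.g., 3 successes observed in 10 hypothetical trials
p_hat = 3 / 10

print(x_bar, s2, p_hat)
```

Note that `statistics.variance` divides by n - 1 rather than n, which is exactly the unbiasedness property listed above.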

Confidence Intervals

A confidence interval (CI) provides a range of plausible values for a parameter, built around a point estimate. The general form is:

\text{point estimate} \pm (\text{critical value}) \times (\text{standard error})

  • Correct interpretation: a 95% CI means that if you repeated the sampling process many times, about 95% of the resulting intervals would contain the true parameter. The parameter is fixed; it's the interval that varies across samples.
  • Width reflects precision. Narrower intervals come from larger sample sizes, lower variability, or a lower confidence level (e.g., 90% vs. 99%).
  • Common mistake: saying "there's a 95% probability the parameter is in this interval." That's a Bayesian credible interval interpretation, not a frequentist CI.
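The general form above translates directly into code. A sketch with hypothetical data; 1.96 is the approximate z critical value for 95% confidence (for a sample this small, a t critical value would give a slightly wider interval):

```python
import math
import statistics

# Hypothetical sample of 10 measurements
sample = [5.1, 4.9, 5.4, 5.0, 5.3, 4.8, 5.2, 5.1, 5.0, 5.2]

x_bar = statistics.mean(sample)
se = statistics.stdev(sample) / math.sqrt(len(sample))  # standard error of the mean

# point estimate ± (critical value) × (standard error)
lo, hi = x_bar - 1.96 * se, x_bar + 1.96 * se
print(f"95% CI: ({lo:.3f}, {hi:.3f})")
```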

Maximum Likelihood Estimation

Maximum likelihood estimation (MLE) finds the parameter values that make your observed data most probable. You write a likelihood function L(\theta \mid \text{data}) and find the \theta that maximizes it (often by maximizing the log-likelihood instead, since it's easier to work with).

  • Asymptotically optimal: for large samples, MLEs are consistent, efficient, and approximately normally distributed
  • Foundation for advanced models: logistic regression, generalized linear models (GLMs), and survival analysis all rely on MLE under the hood
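A minimal illustration of the idea with hypothetical coin-flip data: the Bernoulli log-likelihood is maximized at the sample proportion, which even a crude grid search recovers (a real implementation would use calculus or a numerical optimizer):

```python
import math

# Hypothetical data: 7 heads in 10 flips; the MLE should be 7/10 = 0.7
heads, n = 7, 10

def log_likelihood(theta):
    # Bernoulli log-likelihood: heads * log(theta) + tails * log(1 - theta)
    return heads * math.log(theta) + (n - heads) * math.log(1 - theta)

# Maximize over a fine grid of candidate values in (0, 1)
grid = [i / 1000 for i in range(1, 1000)]
theta_hat = max(grid, key=log_likelihood)
print(theta_hat)  # the grid point where the log-likelihood peaks
```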

Compare: Point Estimation vs. Confidence Intervals: both estimate population parameters, but point estimates give a single value while CIs quantify the uncertainty around that estimate. If you're asked to interpret a CI, remember it's about the procedure's reliability across repeated samples, not the probability that the parameter falls in one specific interval.


Hypothesis Testing Framework

Hypothesis testing formalizes the process of using data to make decisions. The core logic: assume nothing interesting is happening (null hypothesis), then ask whether your data are surprising enough to reject that assumption.

Hypothesis Testing

The process follows a consistent structure:

  1. State hypotheses. H_0 represents "no effect" or "no difference"; H_a is what you're trying to find evidence for.
  2. Choose a significance level \alpha (typically 0.05). This is the maximum Type I error rate (false positive) you'll tolerate.
  3. Compute a test statistic from your data and find the corresponding p-value.
  4. Make a decision. If p \leq \alpha, reject H_0. If p > \alpha, fail to reject H_0.

The p-value is the probability of observing data this extreme or more extreme, assuming H_0 is true. A small p-value means your data would be unlikely under the null, which is evidence against it. A large p-value does not prove H_0 is true; it just means you lack sufficient evidence to reject it.

Two types of errors to track:

  • Type I error (\alpha): rejecting H_0 when it's actually true (false positive)
  • Type II error (\beta): failing to reject H_0 when it's actually false (false negative)
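The four-step procedure can be walked through numerically. A sketch of a one-sample z-test with hypothetical numbers (H_0: \mu = 100, \sigma assumed known at 15, n = 36, observed mean 105):

```python
import math

def norm_cdf(z):
    # Standard normal CDF via the error function
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Hypothetical setup: H0: mu = 100, sigma = 15 (assumed known), n = 36, x_bar = 105
z = (105 - 100) / (15 / math.sqrt(36))   # step 3: test statistic
p_value = 2 * (1 - norm_cdf(z))          # two-sided p-value
alpha = 0.05                             # step 2: significance level

decision = "reject H0" if p_value <= alpha else "fail to reject H0"
print(round(z, 2), round(p_value, 4), decision)  # 2.0 0.0455 reject H0
```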

Power Analysis

Statistical power (1 - \beta) is the probability of correctly rejecting H_0 when a real effect exists. Power analysis is typically done before collecting data to determine how large your sample needs to be.

  • Four linked quantities: effect size, \alpha, sample size n, and power. Fix any three and the fourth is determined.
  • Typical target: power of 0.80 (80% chance of detecting a true effect)
  • Why it matters: an underpowered study wastes resources because it's unlikely to detect the effect you're looking for, even if it's real
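Under a normal approximation, the required sample size per group for a two-sided two-sample test can be sketched as follows (1.96 and 0.8416 are the standard normal values for \alpha = 0.05 and power = 0.80; exact t-based calculations give slightly larger answers):

```python
import math

def sample_size_per_group(effect_size, z_alpha=1.96, z_beta=0.8416):
    # n per group ≈ 2 * ((z_{alpha/2} + z_beta) / d)^2, rounded up
    return math.ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)

print(sample_size_per_group(0.5))  # medium effect: 63 per group
print(sample_size_per_group(0.2))  # small effect: 393 per group
```

Halving the effect size roughly quadruples the required n, which is why detecting small effects is so expensive.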

Effect Size Estimation

Statistical significance alone doesn't tell you whether an effect is meaningful. Effect size quantifies the magnitude of a difference or relationship.

  • Cohen's d measures the standardized mean difference: d = \frac{\bar{x}_1 - \bar{x}_2}{s_{\text{pooled}}}. Benchmarks: 0.2 = small, 0.5 = medium, 0.8 = large.
  • r^2 (coefficient of determination) measures the proportion of variance explained by a model.
  • \eta^2 serves a similar role in ANOVA contexts.

A tiny effect can be statistically significant with a large enough n. Always report and consider effect sizes alongside p-values.
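Cohen's d is straightforward to compute from its definition; the two groups below are hypothetical:

```python
import math
import statistics

# Two hypothetical groups of scores
group1 = [23, 25, 28, 22, 26, 24, 27, 25]
group2 = [20, 22, 21, 23, 19, 22, 21, 20]

def cohens_d(a, b):
    na, nb = len(a), len(b)
    # Pooled standard deviation weights each group's variance by its df
    s_pooled = math.sqrt(((na - 1) * statistics.variance(a) +
                          (nb - 1) * statistics.variance(b)) / (na + nb - 2))
    return (statistics.mean(a) - statistics.mean(b)) / s_pooled

d = cohens_d(group1, group2)
print(round(d, 2))  # well past the 0.8 "large" benchmark
```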

Compare: p-values vs. Effect Sizes: p-values tell you whether an effect is likely real, while effect sizes tell you whether it's practically meaningful. A study with n = 100,000 might find p < 0.001 for a negligible difference. When interpreting results, address both.


Comparing Groups: Tests for Means and Variances

When you need to determine whether groups differ, these tests provide the statistical machinery. The choice of test depends on how many groups you're comparing, whether data are paired, and what assumptions you can justify.

t-Tests

The t-test compares means when the population standard deviation is unknown (which is almost always the case in practice).

  • Independent-samples t-test: compares means of two separate groups (e.g., treatment vs. control)
  • Paired t-test: compares means from matched or repeated measurements on the same subjects (e.g., before vs. after)
  • Test statistic: t = \frac{\bar{x}_1 - \bar{x}_2}{SE}, which measures how many standard errors apart the two means are
  • Assumptions: independence of observations, approximate normality (relaxed for large nn by the Central Limit Theorem), and equal variances for the standard version. Welch's t-test is the safer default because it doesn't assume equal variances.
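Welch's statistic and its degrees of freedom follow directly from the definitions. A sketch with hypothetical groups:

```python
import math
import statistics

# Hypothetical treatment and control scores
treatment = [88, 92, 85, 91, 87, 90]
control = [82, 80, 84, 79, 83, 81]

def welch_t(a, b):
    # Per-group variance of the mean; equal variances are NOT assumed
    va, vb = statistics.variance(a) / len(a), statistics.variance(b) / len(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

t_stat, df = welch_t(treatment, control)
print(round(t_stat, 2), round(df, 1))
```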

z-Tests

The z-test is used when the population standard deviation \sigma is known or when n is large enough that the normal approximation applies (common for proportions).

  • Test statistic: z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}
  • Most common use: testing hypotheses about proportions, where SE = \sqrt{\frac{p_0(1 - p_0)}{n}}
  • As n \to \infty, the t-distribution approaches the standard normal, so t-tests and z-tests converge for large samples
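A one-proportion z-test sketch with hypothetical numbers (H_0: p = 0.5, 58 successes observed in 100 trials):

```python
import math

def norm_cdf(z):
    # Standard normal CDF via the error function
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

p0, n = 0.5, 100
p_hat = 58 / n                         # observed proportion (hypothetical data)
se = math.sqrt(p0 * (1 - p0) / n)      # standard error computed under the null
z = (p_hat - p0) / se
p_value = 2 * (1 - norm_cdf(abs(z)))   # two-sided

print(round(z, 2), round(p_value, 4))
```

Here the p-value exceeds 0.05, so at \alpha = 0.05 you would fail to reject H_0 even though 58% looks different from 50%.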

Analysis of Variance (ANOVA)

ANOVA compares means across three or more groups simultaneously, avoiding the inflated Type I error that comes from running multiple t-tests.

  • F-statistic: F = \frac{MS_{\text{between}}}{MS_{\text{within}}}, the ratio of between-group variance to within-group variance. A large F means groups differ more than you'd expect from random variation alone.
  • Assumptions: normality within groups, independence, and homogeneity of variances (check with Levene's test)
  • ANOVA only tells you that at least one group differs. To find which groups differ, use post-hoc tests like Tukey's HSD or Bonferroni correction.
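The F-statistic can be computed from its definition. A sketch with three hypothetical groups:

```python
import statistics

# Three hypothetical groups with visibly different means
groups = [
    [4, 5, 6, 5],
    [7, 8, 6, 7],
    [10, 9, 11, 10],
]

def one_way_f(groups):
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    # Between-group variability: group means around the grand mean
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group variability: observations around their own group mean
    ss_within = sum((x - statistics.mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

print(round(one_way_f(groups), 2))  # F = MS_between / MS_within
```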

F-Tests

The F-test compares two variances by taking their ratio. It's the engine behind ANOVA and also tests overall significance in regression.

  • In ANOVA: the F-statistic tests whether any group means differ
  • In regression: the F-test checks whether the model as a whole explains significant variance in the outcome
  • Sensitive to non-normality: violations of the normality assumption affect F-tests more than they affect t-tests

Compare: t-Test vs. ANOVA: both compare means, but t-tests handle exactly two groups while ANOVA handles three or more. Running multiple t-tests (e.g., six pairwise comparisons for four groups) inflates your false positive rate well beyond \alpha = 0.05. ANOVA controls this with a single omnibus test, then you follow up with post-hoc tests to pinpoint which groups differ.


Categorical Data and Associations

Not all data are continuous. When you're working with counts, categories, or proportions, chi-square methods test for relationships and goodness of fit. The core idea: compare what you observed to what you'd expect if variables were independent.

Chi-Square Tests

The chi-square test compares observed frequencies to expected frequencies using the statistic:

\chi^2 = \sum \frac{(O - E)^2}{E}

Large values mean the observed data deviate substantially from what's expected.

Two main applications:

  • Goodness-of-fit test: does your data match a hypothesized distribution? (e.g., are die rolls uniformly distributed?)
  • Test of independence: are two categorical variables related? (e.g., is there an association between gender and product preference?)

Assumptions: observations are independent, and expected cell counts should generally be at least 5. For small expected counts, consider Fisher's exact test instead.
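The goodness-of-fit version of the statistic is nearly a one-liner; the die-roll counts below are hypothetical:

```python
# Hypothetical counts for faces 1-6 over 60 die rolls
observed = [8, 12, 9, 11, 6, 14]
expected = [60 / 6] * 6          # uniform die: 10 expected per face

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi2, 2))
# df = 6 - 1 = 5; the 5% critical value is about 11.07, so 4.2 is not surprising
```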

Compare: Chi-Square vs. t-Test: chi-square tests work with categorical data and frequencies, while t-tests require a continuous outcome variable. If your dependent variable is "pass/fail" or "category A/B/C," use chi-square. If it's a measurement like height or test score, use a t-test.


Modeling Relationships

Regression methods go beyond "is there a difference?" to ask "how are variables related, and can we predict outcomes?" These techniques model the functional form of relationships and quantify how predictors influence responses.

Regression Analysis

The simple linear regression model is Y = \beta_0 + \beta_1 X + \epsilon, where \epsilon represents random error.

  • Coefficient interpretation: \beta_1 is the expected change in Y for a one-unit increase in X, holding other predictors constant (in multiple regression)
  • Inference on coefficients: t-tests assess whether individual \beta_j \neq 0; the overall F-test assesses whether the model as a whole is significant
  • Logistic regression extends this framework to binary outcomes by modeling the log-odds: \ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X
  • Key assumptions for linear regression: linearity, independence of errors, constant variance of errors (homoscedasticity), and normality of errors
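For simple linear regression, the least-squares coefficients have closed forms. A sketch with hypothetical (x, y) pairs:

```python
import statistics

# Hypothetical predictor/response pairs
x = [1, 2, 3, 4, 5]
y = [2.1, 4.3, 5.9, 8.2, 9.8]

x_bar, y_bar = statistics.mean(x), statistics.mean(y)

# Closed-form least-squares estimates for Y = b0 + b1 * X + error
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar

print(round(b0, 2), round(b1, 2))  # intercept and slope
```

Interpretation: each one-unit increase in X is associated with an expected increase of about 1.93 in Y for this sample.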

Compare: ANOVA vs. Regression: these are mathematically equivalent for comparing group means (ANOVA is just regression with dummy-coded categorical predictors). Regression extends naturally to continuous predictors and more complex models. Use ANOVA language for experimental designs with categorical factors, and regression language for predictive modeling with continuous or mixed predictors.


Alternative Inference Frameworks

Classical frequentist methods aren't the only approach. These alternatives either relax distributional assumptions or incorporate prior knowledge, expanding your inferential toolkit.

Bayesian Inference

Bayesian inference updates prior beliefs with observed data using Bayes' theorem:

P(\theta \mid \text{data}) \propto P(\text{data} \mid \theta) \cdot P(\theta)

  • Prior P(\theta) encodes what you believed about the parameter before seeing data
  • Likelihood P(\text{data} \mid \theta) is the same likelihood function used in MLE
  • Posterior P(\theta \mid \text{data}) is your updated belief after incorporating the data
  • The posterior distribution gives full probability statements about parameters, not just point estimates. A 95% credible interval means there's a 95% probability the parameter lies within that range (contrast this with the frequentist CI interpretation).
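With a conjugate Beta prior for a success probability, the prior-to-posterior update is just arithmetic; the prior and data below are hypothetical:

```python
# Beta(a, b) prior for a success probability; observing k successes in n trials
# gives a Beta(a + k, b + n - k) posterior (Beta is conjugate to the binomial).
a_prior, b_prior = 2, 2      # weak prior centered at 0.5
k, n = 8, 10                 # hypothetical data: 8 successes in 10 trials

a_post, b_post = a_prior + k, b_prior + (n - k)
posterior_mean = a_post / (a_post + b_post)

print(a_post, b_post, round(posterior_mean, 3))
```

The posterior mean (about 0.714) sits between the prior mean (0.5) and the MLE (0.8): the data pull the estimate away from the prior, and more data would pull it further.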

Non-parametric Tests

Non-parametric tests don't assume a specific distribution (like normality) for the data. They typically work with ranks rather than raw values.

  • Mann-Whitney U: alternative to the independent-samples t-test
  • Wilcoxon signed-rank: alternative to the paired t-test
  • Kruskal-Wallis: alternative to one-way ANOVA
  • Trade-off: these tests are robust under violated assumptions but have less statistical power than their parametric counterparts when the assumptions are met

Bootstrapping

Bootstrapping estimates the sampling distribution of a statistic by resampling with replacement from your original data, typically thousands of times.

  1. Draw a sample of size n (with replacement) from your original data.
  2. Compute the statistic of interest (mean, median, ratio, etc.) for that resample.
  3. Repeat steps 1-2 many times (e.g., 10,000 iterations).
  4. Use the distribution of resampled statistics to estimate standard errors and construct confidence intervals.

This approach is especially valuable when theoretical formulas for standard errors don't exist or when the sampling distribution is hard to derive analytically (e.g., for medians, correlation coefficients, or custom statistics).
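The four steps map directly to a few lines of code. Here the statistic is the median, which has no simple standard-error formula; the data and seed are arbitrary:

```python
import random
import statistics

random.seed(42)  # arbitrary seed, for reproducible resamples

# Hypothetical right-skewed data
data = [1.2, 1.5, 1.9, 2.1, 2.4, 3.0, 3.8, 5.5, 7.2, 12.9]

boot_medians = []
for _ in range(10_000):                               # step 3: many iterations
    resample = random.choices(data, k=len(data))      # step 1: sample n with replacement
    boot_medians.append(statistics.median(resample))  # step 2: recompute the statistic

# Step 4: percentile 95% CI from the middle 95% of the bootstrap distribution
boot_medians.sort()
lo, hi = boot_medians[249], boot_medians[9749]
print(f"bootstrap 95% CI for the median: ({lo}, {hi})")
```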

Compare: Frequentist vs. Bayesian: frequentists treat parameters as fixed and data as random; Bayesians treat parameters as random variables with distributions. A frequentist 95% CI says "95% of intervals constructed this way would contain the true value." A Bayesian 95% credible interval says "there's a 95% probability the parameter lies in this range." Know which framework a problem assumes, because the interpretation changes.


Quick Reference Table

Concept | Best Examples
Point vs. Interval Estimation | Point Estimation, Confidence Intervals, MLE
Testing Framework | Hypothesis Testing, Power Analysis, Effect Size
Comparing Two Means | t-Tests, z-Tests
Comparing 3+ Means | ANOVA, F-Tests
Categorical Data | Chi-Square Tests
Modeling Relationships | Regression Analysis
Distribution-Free Methods | Non-parametric Tests, Bootstrapping
Incorporating Prior Knowledge | Bayesian Inference

Self-Check Questions

  1. What's the key difference between a confidence interval and a Bayesian credible interval, and when would you prefer one over the other?

  2. You're comparing customer satisfaction scores across four different product versions. Which test should you use, and why would running six separate t-tests be problematic?

  3. How do maximum likelihood estimation and Bayesian inference differ in their treatment of parameters? What additional input does Bayesian inference require?

  4. Your data are heavily skewed and your sample size is only 15. Which methods from this guide would still be valid, and which would you avoid?

  5. A study reports p = 0.03 but Cohen's d = 0.1. Explain what this combination tells you about the results and what you'd want to know about the study design.