🎲 Data Science Statistics

Key Concepts in Inferential Statistics


Why This Matters

Inferential statistics is the bridge between your sample data and the broader population you actually care about. Every time you see a poll's "margin of error," read about a drug trial's results, or encounter a claim that one algorithm outperforms another, you're seeing inferential statistics at work. In data science and statistics courses, you're constantly being tested on your ability to quantify uncertainty, compare groups rigorously, and draw defensible conclusions from limited data.

These concepts form an interconnected toolkit, not a list of isolated techniques. Hypothesis testing and confidence intervals are two sides of the same coin. Point estimation feeds into maximum likelihood, which connects to Bayesian inference. Understanding these relationships is what separates memorization from mastery. For each method, know when to use it, what assumptions it requires, and how it relates to the others.


Estimation: From Sample to Population

Before you can test anything, you need to estimate population parameters from your sample. These methods answer a fundamental question: given what I observed, what can I infer about the true underlying value?

Point Estimation

A point estimate is your single "best guess" for a population parameter. For example, the sample mean \bar{x} estimates the population mean \mu, and the sample proportion \hat{p} estimates the population proportion p.

  • Common estimators include the sample mean, sample proportion, and sample variance, each targeting its population counterpart
  • Desirable properties of estimators: unbiasedness (the expected value of the estimator equals the parameter), consistency (accuracy improves as n grows), and efficiency (smallest variance among unbiased estimators)
  • Limitation: a point estimate alone tells you nothing about how confident you should be in that number
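These estimators are easy to see in action. A minimal sketch using only Python's standard library; the sample values are hypothetical:

```python
import statistics

# Hypothetical sample of 8 measurements (illustrative data only)
sample = [12.1, 11.8, 13.0, 12.5, 11.9, 12.7, 12.2, 12.4]

x_bar = statistics.mean(sample)     # point estimate of the population mean
s2 = statistics.variance(sample)    # sample variance (n - 1 denominator, so it's unbiased)

# Sample proportion: e.g., 3 successes observed in 10 hypothetical trials
p_hat = 3 / 10

print(x_bar, s2, p_hat)
```

Note that `statistics.variance` divides by n - 1 rather than n, which is exactly the unbiasedness property listed above.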

Confidence Intervals

A confidence interval (CI) provides a range of plausible values for a parameter, built around a point estimate. The general form is:

\text{point estimate} \pm (\text{critical value}) \times (\text{standard error})

  • Correct interpretation: a 95% CI means that if you repeated the sampling process many times, about 95% of the resulting intervals would contain the true parameter. The parameter is fixed; it's the interval that varies across samples.
  • Width reflects precision. Narrower intervals come from larger sample sizes, lower variability, or a lower confidence level (e.g., 90% vs. 99%).
  • Common mistake: saying "there's a 95% probability the parameter is in this interval." That's a Bayesian credible interval interpretation, not a frequentist CI.
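The general form above translates directly into code. A sketch with hypothetical data; 1.96 is the approximate z critical value for 95% confidence (for a sample this small, a t critical value would give a slightly wider interval):

```python
import math
import statistics

# Hypothetical sample of 10 measurements
sample = [5.1, 4.9, 5.4, 5.0, 5.3, 4.8, 5.2, 5.1, 5.0, 5.2]

x_bar = statistics.mean(sample)
se = statistics.stdev(sample) / math.sqrt(len(sample))  # standard error of the mean

# point estimate ± (critical value) × (standard error)
lo, hi = x_bar - 1.96 * se, x_bar + 1.96 * se
print(f"95% CI: ({lo:.3f}, {hi:.3f})")
```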

Maximum Likelihood Estimation

Maximum likelihood estimation (MLE) finds the parameter values that make your observed data most probable. You write a likelihood function L(\theta \mid \text{data}) and find the \theta that maximizes it (often by maximizing the log-likelihood instead, since it's easier to work with).

  • Asymptotically optimal: for large samples, MLEs are consistent, efficient, and approximately normally distributed
  • Foundation for advanced models: logistic regression, generalized linear models (GLMs), and survival analysis all rely on MLE under the hood
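A minimal illustration of the idea with hypothetical coin-flip data: the Bernoulli log-likelihood is maximized at the sample proportion, which even a crude grid search recovers (a real implementation would use calculus or a numerical optimizer):

```python
import math

# Hypothetical data: 7 heads in 10 flips; the MLE should be 7/10 = 0.7
heads, n = 7, 10

def log_likelihood(theta):
    # Bernoulli log-likelihood: heads * log(theta) + tails * log(1 - theta)
    return heads * math.log(theta) + (n - heads) * math.log(1 - theta)

# Maximize over a fine grid of candidate values in (0, 1)
grid = [i / 1000 for i in range(1, 1000)]
theta_hat = max(grid, key=log_likelihood)
print(theta_hat)  # the grid point where the log-likelihood peaks
```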

Compare: Point Estimation vs. Confidence Intervals: both estimate population parameters, but point estimates give a single value while CIs quantify the uncertainty around that estimate. If you're asked to interpret a CI, remember it's about the procedure's reliability across repeated samples, not the probability that the parameter falls in one specific interval.


Hypothesis Testing Framework

Hypothesis testing formalizes the process of using data to make decisions. The core logic: assume nothing interesting is happening (null hypothesis), then ask whether your data are surprising enough to reject that assumption.

Hypothesis Testing

The process follows a consistent structure:

  1. State hypotheses. H_0 represents "no effect" or "no difference"; H_a is what you're trying to find evidence for.
  2. Choose a significance level \alpha (typically 0.05). This is the maximum Type I error rate (false positive) you'll tolerate.
  3. Compute a test statistic from your data and find the corresponding p-value.
  4. Make a decision. If p \leq \alpha, reject H_0. If p > \alpha, fail to reject H_0.

The p-value is the probability of observing data this extreme or more extreme, assuming H_0 is true. A small p-value means your data would be unlikely under the null, which is evidence against it. A large p-value does not prove H_0 is true; it just means you lack sufficient evidence to reject it.

Two types of errors to track:

  • Type I error (\alpha): rejecting H_0 when it's actually true (false positive)
  • Type II error (\beta): failing to reject H_0 when it's actually false (false negative)
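The four-step procedure can be walked through numerically. A sketch of a one-sample z-test with hypothetical numbers (H_0: \mu = 100, \sigma assumed known at 15, n = 36, observed mean 105):

```python
import math

def norm_cdf(z):
    # Standard normal CDF via the error function
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Hypothetical setup: H0: mu = 100, sigma = 15 (assumed known), n = 36, x_bar = 105
z = (105 - 100) / (15 / math.sqrt(36))   # step 3: test statistic
p_value = 2 * (1 - norm_cdf(z))          # two-sided p-value
alpha = 0.05                             # step 2: significance level

decision = "reject H0" if p_value <= alpha else "fail to reject H0"
print(round(z, 2), round(p_value, 4), decision)  # 2.0 0.0455 reject H0
```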

Power Analysis

Statistical power (1 - \beta) is the probability of correctly rejecting H_0 when a real effect exists. Power analysis is typically done before collecting data to determine how large your sample needs to be.

  • Four linked quantities: effect size, \alpha, sample size n, and power. Fix any three and the fourth is determined.
  • Typical target: power of 0.80 (80% chance of detecting a true effect)
  • Why it matters: an underpowered study wastes resources because it's unlikely to detect the effect you're looking for, even if it's real
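Under a normal approximation, the required sample size per group for a two-sided two-sample test can be sketched as follows (1.96 and 0.8416 are the standard normal values for \alpha = 0.05 and power = 0.80; exact t-based calculations give slightly larger answers):

```python
import math

def sample_size_per_group(effect_size, z_alpha=1.96, z_beta=0.8416):
    # n per group ≈ 2 * ((z_{alpha/2} + z_beta) / d)^2, rounded up
    return math.ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)

print(sample_size_per_group(0.5))  # medium effect: 63 per group
print(sample_size_per_group(0.2))  # small effect: 393 per group
```

Halving the effect size roughly quadruples the required n, which is why detecting small effects is so expensive.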

Effect Size Estimation

Statistical significance alone doesn't tell you whether an effect is meaningful. Effect size quantifies the magnitude of a difference or relationship.

  • Cohen's d measures the standardized mean difference: d = \frac{\bar{x}_1 - \bar{x}_2}{s_{\text{pooled}}}. Benchmarks: 0.2 = small, 0.5 = medium, 0.8 = large.
  • r^2 (coefficient of determination) measures the proportion of variance explained by a model.
  • \eta^2 serves a similar role in ANOVA contexts.

A tiny effect can be statistically significant with a large enough n. Always report and consider effect sizes alongside p-values.
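Cohen's d is straightforward to compute from its definition; the two groups below are hypothetical:

```python
import math
import statistics

# Two hypothetical groups of scores
group1 = [23, 25, 28, 22, 26, 24, 27, 25]
group2 = [20, 22, 21, 23, 19, 22, 21, 20]

def cohens_d(a, b):
    na, nb = len(a), len(b)
    # Pooled standard deviation weights each group's variance by its df
    s_pooled = math.sqrt(((na - 1) * statistics.variance(a) +
                          (nb - 1) * statistics.variance(b)) / (na + nb - 2))
    return (statistics.mean(a) - statistics.mean(b)) / s_pooled

d = cohens_d(group1, group2)
print(round(d, 2))  # well past the 0.8 "large" benchmark
```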

Compare: p-values vs. Effect Sizes: p-values tell you whether an effect is likely real, while effect sizes tell you whether it's practically meaningful. A study with n = 100,000 might find p < 0.001 for a negligible difference. When interpreting results, address both.


Comparing Groups: Tests for Means and Variances

When you need to determine whether groups differ, these tests provide the statistical machinery. The choice of test depends on how many groups you're comparing, whether data are paired, and what assumptions you can justify.

t-Tests

The t-test compares means when the population standard deviation is unknown (which is almost always the case in practice).

  • Independent-samples t-test: compares means of two separate groups (e.g., treatment vs. control)
  • Paired t-test: compares means from matched or repeated measurements on the same subjects (e.g., before vs. after)
  • Test statistic: t = \frac{\bar{x}_1 - \bar{x}_2}{SE}, which measures how many standard errors apart the two means are
  • Assumptions: independence of observations, approximate normality (relaxed for large nn by the Central Limit Theorem), and equal variances for the standard version. Welch's t-test is the safer default because it doesn't assume equal variances.
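Welch's statistic and its degrees of freedom follow directly from the definitions. A sketch with hypothetical groups:

```python
import math
import statistics

# Hypothetical treatment and control scores
treatment = [88, 92, 85, 91, 87, 90]
control = [82, 80, 84, 79, 83, 81]

def welch_t(a, b):
    # Per-group variance of the mean; equal variances are NOT assumed
    va, vb = statistics.variance(a) / len(a), statistics.variance(b) / len(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

t_stat, df = welch_t(treatment, control)
print(round(t_stat, 2), round(df, 1))
```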

z-Tests

The z-test is used when the population standard deviation \sigma is known or when n is large enough that the normal approximation applies (common for proportions).

  • Test statistic: z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}
  • Most common use: testing hypotheses about proportions, where SE = \sqrt{\frac{p_0(1 - p_0)}{n}}
  • As n \to \infty, the t-distribution approaches the standard normal, so t-tests and z-tests converge for large samples
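A one-proportion z-test sketch with hypothetical numbers (H_0: p = 0.5, 58 successes observed in 100 trials):

```python
import math

def norm_cdf(z):
    # Standard normal CDF via the error function
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

p0, n = 0.5, 100
p_hat = 58 / n                         # observed proportion (hypothetical data)
se = math.sqrt(p0 * (1 - p0) / n)      # standard error computed under the null
z = (p_hat - p0) / se
p_value = 2 * (1 - norm_cdf(abs(z)))   # two-sided

print(round(z, 2), round(p_value, 4))
```

Here the p-value exceeds 0.05, so at \alpha = 0.05 you would fail to reject H_0 even though 58% looks different from 50%.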

Analysis of Variance (ANOVA)

ANOVA compares means across three or more groups simultaneously, avoiding the inflated Type I error that comes from running multiple t-tests.

  • F-statistic: F = \frac{MS_{\text{between}}}{MS_{\text{within}}}, the ratio of between-group variance to within-group variance. A large F means groups differ more than you'd expect from random variation alone.
  • Assumptions: normality within groups, independence, and homogeneity of variances (check with Levene's test)
  • ANOVA only tells you that at least one group differs. To find which groups differ, use post-hoc tests like Tukey's HSD or Bonferroni correction.
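The F-statistic can be computed from its definition. A sketch with three hypothetical groups:

```python
import statistics

# Three hypothetical groups with visibly different means
groups = [
    [4, 5, 6, 5],
    [7, 8, 6, 7],
    [10, 9, 11, 10],
]

def one_way_f(groups):
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    # Between-group variability: group means around the grand mean
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group variability: observations around their own group mean
    ss_within = sum((x - statistics.mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

print(round(one_way_f(groups), 2))  # F = MS_between / MS_within
```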

F-Tests

The F-test compares two variances by taking their ratio. It's the engine behind ANOVA and also tests overall significance in regression.

  • In ANOVA: the F-statistic tests whether any group means differ
  • In regression: the F-test checks whether the model as a whole explains significant variance in the outcome
  • Sensitive to non-normality: violations of the normality assumption affect F-tests more than they affect t-tests

Compare: t-Test vs. ANOVA: both compare means, but t-tests handle exactly two groups while ANOVA handles three or more. Running multiple t-tests (e.g., six pairwise comparisons for four groups) inflates your false positive rate well beyond \alpha = 0.05. ANOVA controls this with a single omnibus test, then you follow up with post-hoc tests to pinpoint which groups differ.


Categorical Data and Associations

Not all data are continuous. When you're working with counts, categories, or proportions, chi-square methods test for relationships and goodness of fit. The core idea: compare what you observed to what you'd expect if variables were independent.

Chi-Square Tests

The chi-square test compares observed frequencies to expected frequencies using the statistic:

\chi^2 = \sum \frac{(O - E)^2}{E}

Large values mean the observed data deviate substantially from what's expected.

Two main applications:

  • Goodness-of-fit test: does your data match a hypothesized distribution? (e.g., are die rolls uniformly distributed?)
  • Test of independence: are two categorical variables related? (e.g., is there an association between gender and product preference?)

Assumptions: observations are independent, and expected cell counts should generally be at least 5. For small expected counts, consider Fisher's exact test instead.
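The goodness-of-fit version of the statistic is nearly a one-liner; the die-roll counts below are hypothetical:

```python
# Hypothetical counts for faces 1-6 over 60 die rolls
observed = [8, 12, 9, 11, 6, 14]
expected = [60 / 6] * 6          # uniform die: 10 expected per face

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi2, 2))
# df = 6 - 1 = 5; the 5% critical value is about 11.07, so 4.2 is not surprising
```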

Compare: Chi-Square vs. t-Test: chi-square tests work with categorical data and frequencies, while t-tests require a continuous outcome variable. If your dependent variable is "pass/fail" or "category A/B/C," use chi-square. If it's a measurement like height or test score, use a t-test.


Modeling Relationships

Regression methods go beyond "is there a difference?" to ask "how are variables related, and can we predict outcomes?" These techniques model the functional form of relationships and quantify how predictors influence responses.

Regression Analysis

The simple linear regression model is Y = \beta_0 + \beta_1 X + \epsilon, where \epsilon represents random error.

  • Coefficient interpretation: \beta_1 is the expected change in Y for a one-unit increase in X, holding other predictors constant (in multiple regression)
  • Inference on coefficients: t-tests assess whether individual \beta_j \neq 0; the overall F-test assesses whether the model as a whole is significant
  • Logistic regression extends this framework to binary outcomes by modeling the log-odds: \ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X
  • Key assumptions for linear regression: linearity, independence of errors, constant variance of errors (homoscedasticity), and normality of errors
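For simple linear regression, the least-squares coefficients have closed forms. A sketch with hypothetical (x, y) pairs:

```python
import statistics

# Hypothetical predictor/response pairs
x = [1, 2, 3, 4, 5]
y = [2.1, 4.3, 5.9, 8.2, 9.8]

x_bar, y_bar = statistics.mean(x), statistics.mean(y)

# Closed-form least-squares estimates for Y = b0 + b1 * X + error
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar

print(round(b0, 2), round(b1, 2))  # intercept and slope
```

Interpretation: each one-unit increase in X is associated with an expected increase of about 1.93 in Y for this sample.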

Compare: ANOVA vs. Regression: these are mathematically equivalent for comparing group means (ANOVA is just regression with dummy-coded categorical predictors). Regression extends naturally to continuous predictors and more complex models. Use ANOVA language for experimental designs with categorical factors, and regression language for predictive modeling with continuous or mixed predictors.


Alternative Inference Frameworks

Classical frequentist methods aren't the only approach. These alternatives either relax distributional assumptions or incorporate prior knowledge, expanding your inferential toolkit.

Bayesian Inference

Bayesian inference updates prior beliefs with observed data using Bayes' theorem:

P(\theta \mid \text{data}) \propto P(\text{data} \mid \theta) \cdot P(\theta)

  • Prior P(\theta) encodes what you believed about the parameter before seeing data
  • Likelihood P(\text{data} \mid \theta) is the same likelihood function used in MLE
  • Posterior P(\theta \mid \text{data}) is your updated belief after incorporating the data
  • The posterior distribution gives full probability statements about parameters, not just point estimates. A 95% credible interval means there's a 95% probability the parameter lies within that range (contrast this with the frequentist CI interpretation).
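With a conjugate Beta prior for a success probability, the prior-to-posterior update is just arithmetic; the prior and data below are hypothetical:

```python
# Beta(a, b) prior for a success probability; observing k successes in n trials
# gives a Beta(a + k, b + n - k) posterior (Beta is conjugate to the binomial).
a_prior, b_prior = 2, 2      # weak prior centered at 0.5
k, n = 8, 10                 # hypothetical data: 8 successes in 10 trials

a_post, b_post = a_prior + k, b_prior + (n - k)
posterior_mean = a_post / (a_post + b_post)

print(a_post, b_post, round(posterior_mean, 3))
```

The posterior mean (about 0.714) sits between the prior mean (0.5) and the MLE (0.8): the data pull the estimate away from the prior, and more data would pull it further.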

Non-parametric Tests

Non-parametric tests don't assume a specific distribution (like normality) for the data. They typically work with ranks rather than raw values.

  • Mann-Whitney U: alternative to the independent-samples t-test
  • Wilcoxon signed-rank: alternative to the paired t-test
  • Kruskal-Wallis: alternative to one-way ANOVA
  • Trade-off: these tests are robust under violated assumptions but have less statistical power than their parametric counterparts when the assumptions are met

Bootstrapping

Bootstrapping estimates the sampling distribution of a statistic by resampling with replacement from your original data, typically thousands of times.

  1. Draw a sample of size n (with replacement) from your original data.
  2. Compute the statistic of interest (mean, median, ratio, etc.) for that resample.
  3. Repeat steps 1-2 many times (e.g., 10,000 iterations).
  4. Use the distribution of resampled statistics to estimate standard errors and construct confidence intervals.

This approach is especially valuable when theoretical formulas for standard errors don't exist or when the sampling distribution is hard to derive analytically (e.g., for medians, correlation coefficients, or custom statistics).
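The four steps map directly to a few lines of code. Here the statistic is the median, which has no simple standard-error formula; the data and seed are arbitrary:

```python
import random
import statistics

random.seed(42)  # arbitrary seed, for reproducible resamples

# Hypothetical right-skewed data
data = [1.2, 1.5, 1.9, 2.1, 2.4, 3.0, 3.8, 5.5, 7.2, 12.9]

boot_medians = []
for _ in range(10_000):                               # step 3: many iterations
    resample = random.choices(data, k=len(data))      # step 1: sample n with replacement
    boot_medians.append(statistics.median(resample))  # step 2: recompute the statistic

# Step 4: percentile 95% CI from the middle 95% of the bootstrap distribution
boot_medians.sort()
lo, hi = boot_medians[249], boot_medians[9749]
print(f"bootstrap 95% CI for the median: ({lo}, {hi})")
```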

Compare: Frequentist vs. Bayesian: frequentists treat parameters as fixed and data as random; Bayesians treat parameters as random variables with distributions. A frequentist 95% CI says "95% of intervals constructed this way would contain the true value." A Bayesian 95% credible interval says "there's a 95% probability the parameter lies in this range." Know which framework a problem assumes, because the interpretation changes.


Quick Reference Table

Concept | Best Examples
Point vs. Interval Estimation | Point Estimation, Confidence Intervals, MLE
Testing Framework | Hypothesis Testing, Power Analysis, Effect Size
Comparing Two Means | t-Tests, z-Tests
Comparing 3+ Means | ANOVA, F-Tests
Categorical Data | Chi-Square Tests
Modeling Relationships | Regression Analysis
Distribution-Free Methods | Non-parametric Tests, Bootstrapping
Incorporating Prior Knowledge | Bayesian Inference

Self-Check Questions

  1. What's the key difference between a confidence interval and a Bayesian credible interval, and when would you prefer one over the other?

  2. You're comparing customer satisfaction scores across four different product versions. Which test should you use, and why would running six separate t-tests be problematic?

  3. How do maximum likelihood estimation and Bayesian inference differ in their treatment of parameters? What additional input does Bayesian inference require?

  4. Your data are heavily skewed and your sample size is only 15. Which methods from this guide would still be valid, and which would you avoid?

  5. A study reports p = 0.03 but Cohen's d = 0.1. Explain what this combination tells you about the results and what you'd want to know about the study design.