Fiveable

🧠Thinking Like a Mathematician Unit 6 Review

6.6 Inferential statistics

Written by the Fiveable Content Team • Last updated August 2025

Inferential statistics gives you the tools to draw conclusions about entire populations using only sample data. It covers how to estimate unknown parameters, test hypotheses, and measure how confident you should be in your results.

Foundations of inferential statistics

Inferential statistics is how you go from "here's what the data in my sample looks like" to "here's what's probably true about the whole population." The core idea is that a well-chosen sample can tell you a lot about a much larger group, as long as you account for uncertainty.

Population vs sample

A population is every individual or item you're interested in studying. A sample is a smaller subset you actually collect data from. The quality of your inference depends heavily on how you select that sample.

  • Random sampling gives each member of the population an equal chance of being selected, which helps avoid bias.
  • Stratified sampling divides the population into subgroups first (like age brackets or income levels), then samples from each subgroup. This ensures representation across key categories.

Parameters vs statistics

  • Parameters describe the whole population. You'll see μ for the population mean and σ for the population standard deviation.
  • Statistics are calculated from your sample and serve as estimates of those parameters. The sample mean is x̄ and the sample standard deviation is s.

The sampling distribution is the bridge between these two: it describes how a statistic (like x̄) would vary if you took many different samples from the same population. The standard error measures that variability, telling you how much your sample statistic is likely to bounce around from sample to sample.

Sampling methods

  • Simple random sampling gives every member an equal chance of selection.
  • Systematic sampling picks every nth item from a list (for example, every 10th name on a roster).
  • Cluster sampling divides the population into clusters, then randomly selects entire clusters to study.
  • Convenience sampling uses whoever is easiest to reach. It's quick but often introduces bias, so treat results with caution.
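To make the contrast concrete, here's a minimal Python sketch of simple random and systematic sampling. The population of 100 roster IDs and the sample sizes are made up for illustration:

```python
import random

population = list(range(1, 101))  # hypothetical roster of 100 member IDs

# Simple random sampling: every member has an equal chance of selection
srs = random.sample(population, 10)

# Systematic sampling: every 10th member after a random starting point
start = random.randrange(10)
systematic = population[start::10]

print(sorted(srs))
print(systematic)
```

Note that systematic sampling is only as good as the list ordering: if the roster has a hidden pattern that lines up with the step size, the sample can be biased.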

Probability distributions

Probability distributions model how likely different outcomes are in a random process. They're the mathematical foundation for nearly every inferential technique you'll use.

Normal distribution

The normal distribution is the classic bell-shaped, symmetric curve, defined by its mean μ and standard deviation σ. The 68-95-99.7 rule is worth memorizing:

  • About 68% of data falls within 1 standard deviation of the mean
  • About 95% within 2 standard deviations
  • About 99.7% within 3 standard deviations

Z-scores let you standardize any normal distribution so you can compare values across different scales. A z-score tells you how many standard deviations a value is from the mean: z = (x − μ) / σ.
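The formula translates directly into code. The test score, mean, and standard deviation below are hypothetical:

```python
def z_score(x, mu, sigma):
    """How many standard deviations x lies from the mean."""
    return (x - mu) / sigma

# Hypothetical score of 91 in a distribution with mean 75, SD 8
print(z_score(91, 75, 8))  # → 2.0
```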

The Central Limit Theorem (CLT) is one of the most important results in statistics: no matter what the original population looks like, the distribution of sample means will approximate a normal distribution as sample size grows. This is why so many tests rely on the normal distribution.

t-distribution

The t-distribution looks like the normal distribution but has heavier tails, meaning extreme values are more likely. You use it when your sample size is small or when you don't know the population standard deviation (which is most of the time in practice).

Its shape depends on degrees of freedom (usually n − 1 for a single sample). As degrees of freedom increase, the t-distribution gets closer and closer to the normal distribution.

Chi-square distribution

The chi-square distribution is always positive and right-skewed. It's used for goodness-of-fit tests and tests of independence involving categorical data. Like the t-distribution, its shape depends on degrees of freedom, and it approaches a normal shape as degrees of freedom increase.

Confidence intervals

A confidence interval gives you a range of plausible values for a population parameter, rather than a single point estimate. It's a way of saying "the true value is probably somewhere in here."

Margin of error

The margin of error is the "plus or minus" part of a confidence interval. It equals the critical value multiplied by the standard error:

Margin of Error = z* × SE

A larger sample size shrinks the standard error, which shrinks the margin of error, which gives you a narrower (more precise) interval.
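Putting the pieces together, a 95% interval for a mean can be sketched as follows. The sample numbers are hypothetical, and using z* = 1.96 assumes the sample is large enough for the normal approximation to hold:

```python
import math

def confidence_interval(xbar, s, n, z_star=1.96):
    """Mean ± (critical value × standard error); z_star=1.96 gives ~95%."""
    se = s / math.sqrt(n)
    moe = z_star * se
    return (xbar - moe, xbar + moe)

# Hypothetical sample: mean 50, SD 10, n = 100, so SE = 1 and MOE ≈ 1.96
low, high = confidence_interval(50, 10, 100)
print(round(low, 2), round(high, 2))  # ≈ 48.04 51.96
```

Re-running with n = 400 shrinks the margin of error to about 0.98, illustrating the sample-size effect described above.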

Confidence level

The confidence level (commonly 90%, 95%, or 99%) tells you how confident you are that the interval captures the true parameter. A 95% confidence level means that if you repeated the study many times, about 95% of the resulting intervals would contain the true value.

There's a trade-off here: higher confidence means a wider interval. A 99% confidence interval is wider than a 95% interval from the same data. You gain confidence but lose precision.

Sample size considerations

Larger samples generally produce narrower confidence intervals, giving you more precise estimates. But bigger samples cost more time and money. Power analysis helps you figure out the minimum sample size needed to achieve a desired level of precision before you start collecting data.

Hypothesis testing

Hypothesis testing is a formal procedure for deciding whether sample data provides enough evidence to reject a claim about a population. Here's the general process:

  1. State your null hypothesis (H₀) and alternative hypothesis (Hₐ).
  2. Choose a significance level (α).
  3. Collect data and calculate a test statistic.
  4. Find the p-value or compare the test statistic to a critical value.
  5. Decide whether to reject or fail to reject H₀.
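The steps above can be sketched as a two-tailed one-sample z-test, a simplified case that assumes the population standard deviation is known. The sample numbers are hypothetical:

```python
import math

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def one_sample_z_test(xbar, mu0, sigma, n, alpha=0.05):
    """Two-tailed z-test of H0: population mean = mu0 (sigma known)."""
    se = sigma / math.sqrt(n)
    z = (xbar - mu0) / se                # step 3: test statistic
    p = 2 * (1 - normal_cdf(abs(z)))     # step 4: two-tailed p-value
    return z, p, p < alpha               # step 5: reject H0?

# Hypothetical data: sample mean 77 vs. H0 mean 75, sigma 8, n = 64
z, p, reject = one_sample_z_test(77, 75, 8, 64)
print(round(z, 2), round(p, 4), reject)  # z = 2.0, p ≈ 0.0455, reject at α = 0.05
```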

Null vs alternative hypotheses

  • The null hypothesis (H₀) assumes no effect or no difference. It's the default position.
  • The alternative hypothesis (Hₐ) is what you're trying to find evidence for.

A one-tailed test looks for an effect in a specific direction (e.g., "the new drug lowers blood pressure"). A two-tailed test looks for any difference in either direction (e.g., "the new drug changes blood pressure").


Type I and Type II errors

These are the two ways a hypothesis test can go wrong:

  • Type I error (false positive): You reject H₀ when it's actually true. The probability of this is α, your significance level.
  • Type II error (false negative): You fail to reject H₀ when it's actually false. The probability of this is β.

Lowering α (making it harder to reject H₀) reduces Type I errors but increases the risk of Type II errors. There's always a trade-off.

p-values and significance levels

The p-value is the probability of getting results as extreme as (or more extreme than) what you observed, assuming H₀ is true. A small p-value means your data would be unlikely under the null hypothesis.

You reject H₀ when the p-value is less than your chosen significance level α. Common thresholds are α = 0.05 and α = 0.01. A p-value of 0.03 with α = 0.05 leads to rejection; the same p-value with α = 0.01 does not.

Statistical tests

Different research questions call for different tests. The choice depends on what you're comparing, what type of data you have, and whether your data meets certain assumptions.

t-tests

t-tests compare means. There are three common versions:

  • One-sample t-test: Compares a sample mean to a known or hypothesized value. (Example: "Is the average test score in this class different from 75?")
  • Independent samples t-test: Compares means between two separate groups. (Example: "Do students who study with music score differently than those who study in silence?")
  • Paired samples t-test: Compares means from the same group at two different times. (Example: "Did patients' blood pressure change after treatment?")
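As an illustration, the one-sample version reduces to a short calculation. The class scores and the claimed mean of 75 are hypothetical; comparing t to the critical value for n − 1 degrees of freedom finishes the test:

```python
import math
import statistics

def one_sample_t(data, mu0):
    """t statistic for H0: population mean equals mu0."""
    n = len(data)
    xbar = statistics.mean(data)
    s = statistics.stdev(data)  # sample SD (n - 1 in the denominator)
    return (xbar - mu0) / (s / math.sqrt(n))

# Hypothetical class scores, tested against a claimed mean of 75
scores = [72, 78, 81, 74, 77, 79, 73, 80]
t = one_sample_t(scores, 75)
print(round(t, 3))  # ≈ 1.469
```

Here |t| ≈ 1.47 is below the two-tailed critical value of about 2.365 for 7 degrees of freedom at α = 0.05, so you would fail to reject H₀.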

ANOVA

Analysis of Variance (ANOVA) extends the logic of t-tests to three or more groups. Instead of comparing just two means, ANOVA tests whether any of the group means differ significantly.

  • One-way ANOVA examines the effect of one independent variable.
  • Two-way ANOVA examines the effects of two independent variables and their interaction.

ANOVA uses the F-statistic to assess whether the variation between groups is large enough relative to the variation within groups to be considered significant.
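The F-statistic can be computed directly from sums of squares. The three groups of scores below are hypothetical:

```python
import statistics

def one_way_anova_f(groups):
    """F = mean square between groups / mean square within groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = statistics.mean(x for g in groups for x in g)

    # Between-group variation: how far each group mean sits from the grand mean
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group variation: spread of observations around their own group mean
    ss_within = sum(sum((x - statistics.mean(g)) ** 2 for x in g) for g in groups)

    ms_between = ss_between / (k - 1)       # df between = k - 1
    ms_within = ss_within / (n_total - k)   # df within = N - k
    return ms_between / ms_within

# Hypothetical scores under three study methods
f = one_way_anova_f([[80, 85, 90], [70, 75, 80], [60, 65, 70]])
print(f)  # → 12.0
```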

Chi-square tests

Chi-square tests work with categorical (count) data rather than means.

  • Goodness-of-fit test: Checks whether observed frequencies match expected frequencies. (Example: "Are the six sides of this die equally likely?")
  • Test of independence: Checks whether two categorical variables are related. (Example: "Is there a relationship between gender and preferred study method?")

Key assumptions: observations must be independent, and expected frequencies should be at least 5 in each cell.
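The goodness-of-fit statistic is just a sum over categories, as in this sketch with hypothetical die-roll counts:

```python
def chi_square_stat(observed, expected):
    """Sum of (O - E)^2 / E across all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical experiment: 60 rolls of a die, expecting 10 per face if fair
observed = [8, 12, 9, 11, 13, 7]
expected = [10] * 6
chi2 = chi_square_stat(observed, expected)
print(round(chi2, 2))  # ≈ 2.8; compare to the critical value with 5 df
```

With 5 degrees of freedom the α = 0.05 critical value is about 11.07, so 2.8 gives no evidence the die is unfair. Note the expected count of 10 per cell satisfies the at-least-5 assumption.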

Regression analysis

Regression models the relationship between variables, letting you predict one variable based on others and quantify how strong that relationship is.

Simple linear regression

Simple linear regression models the relationship between one independent variable (X) and one dependent variable (Y):

Y = β₀ + β₁X + ε

Here, β₀ is the y-intercept, β₁ is the slope (how much Y changes for a one-unit increase in X), and ε is the error term. The least squares method finds the line that minimizes the sum of squared residuals (the distances between observed and predicted values).

R² (R-squared) tells you the proportion of variance in Y that's explained by X. An R² of 0.72 means 72% of the variation in Y is accounted for by the model.
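The least squares fit and R² can be computed by hand in a few lines. The hours-studied vs. score data below are made up for illustration:

```python
import statistics

def simple_linear_regression(xs, ys):
    """Least squares fit y = b0 + b1*x, plus R^2."""
    xbar, ybar = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    b1 = sxy / sxx               # slope
    b0 = ybar - b1 * xbar        # intercept
    ss_res = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - ybar) ** 2 for y in ys)
    return b0, b1, 1 - ss_res / ss_tot

# Hypothetical data: hours studied vs. exam score
xs = [1, 2, 3, 4, 5]
ys = [52, 55, 61, 64, 68]
b0, b1, r2 = simple_linear_regression(xs, ys)
print(round(b0, 2), round(b1, 2), round(r2, 3))  # ≈ 47.7, 4.1, 0.989
```

Here R² ≈ 0.99 means the fitted line accounts for almost all the variation in scores, and the slope says each extra hour is associated with about 4.1 more points.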

Multiple regression

Multiple regression adds more independent variables:

Y = β₀ + β₁X₁ + β₂X₂ + ⋯ + βₖXₖ + ε

Each partial regression coefficient (βᵢ) represents the effect of that variable while holding all other variables constant. Adjusted R² is preferred over regular R² here because it penalizes you for adding predictors that don't genuinely improve the model.

Correlation coefficients

  • Pearson's r measures the strength and direction of a linear relationship between two continuous variables. It ranges from −1 (perfect negative) to +1 (perfect positive), with 0 meaning no linear relationship.
  • Spearman's ρ is used for ordinal data or when the relationship isn't linear.
  • Point-biserial correlation applies when one variable is continuous and the other is dichotomous (two categories).

Correlation does not imply causation. Two variables can be strongly correlated without one causing the other.
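Pearson's r follows directly from its definition, the covariance divided by the product of the standard deviations. A minimal sketch:

```python
import math
import statistics

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two sequences."""
    xbar, ybar = statistics.mean(xs), statistics.mean(ys)
    num = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - xbar) ** 2 for x in xs)
                    * sum((y - ybar) ** 2 for y in ys))
    return num / den

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))    # → 1.0 (perfect positive)
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))    # → -1.0 (perfect negative)
```

(Python 3.10+ also ships statistics.correlation, which computes the same quantity.)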

Bayesian inference

Bayesian inference is an alternative framework that incorporates prior knowledge into statistical analysis. Instead of asking "how likely is this data given the hypothesis?" (the frequentist question), Bayesian inference asks "how likely is the hypothesis given this data?"

Bayes' theorem

The foundation of Bayesian statistics:

P(A|B) = P(B|A) · P(A) / P(B)

This formula lets you update the probability of a hypothesis (A) after observing evidence (B).
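A classic worked example is updating the probability of a condition after a positive test, expanding P(B) with the law of total probability. The prevalence and test-accuracy numbers below are hypothetical:

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(condition | positive test) via Bayes' theorem.

    P(B) is expanded as P(B|A)P(A) + P(B|not A)P(not A).
    """
    p_positive = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / p_positive

# Hypothetical: 1% prevalence, 95% sensitivity, 5% false-positive rate
print(round(posterior(0.01, 0.95, 0.05), 3))  # ≈ 0.161
```

Even with an accurate-sounding test, the low prior (1% prevalence) keeps the posterior around 16%, which is exactly the kind of counterintuitive result Bayes' theorem makes explicit.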

Prior vs posterior probabilities

  • The prior probability is your belief about a parameter before seeing any data.
  • The likelihood describes how probable the observed data is under different parameter values.
  • The posterior probability combines the prior and the likelihood to give you an updated belief after seeing the data.

This updating process can be repeated: today's posterior becomes tomorrow's prior when new data arrives.


Bayesian vs frequentist approaches

Frequentist: Parameters are fixed but unknown. Data is random. You ask, "If I repeated this experiment many times, what would happen?"

Bayesian: Parameters are treated as random variables with probability distributions. Data is fixed (it's what you observed). You ask, "Given what I observed, what do I believe about the parameter?"

Neither approach is universally "better." Frequentist methods dominate traditional statistics courses, while Bayesian methods are increasingly used in machine learning and fields where incorporating prior knowledge is valuable.

Sampling distributions

Sampling distributions describe how a statistic (like the sample mean) varies across many hypothetical samples from the same population. They're the theoretical backbone of confidence intervals and hypothesis tests.

Central limit theorem

The Central Limit Theorem (CLT) states that the sampling distribution of x̄ approaches a normal distribution as sample size increases, regardless of the shape of the original population distribution. The rule of thumb is that n ≥ 30 is usually sufficient, though populations that are already roughly normal need smaller samples.

This is why the normal distribution shows up everywhere in inferential statistics: even if individual data points aren't normally distributed, sample means tend to be.
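You can watch this happen in a short simulation: single draws from a uniform distribution are flat, but means of samples of size 30 pile up around the population mean. The seed and replication counts below are arbitrary:

```python
import random
import statistics

random.seed(42)

# Population: uniform on [0, 1) -- not at all bell-shaped for single draws
def sample_mean(n):
    return statistics.mean(random.random() for _ in range(n))

means = [sample_mean(30) for _ in range(2000)]

# The sample means center near the population mean (0.5), and their spread
# should be close to sigma / sqrt(n) = (1/sqrt(12)) / sqrt(30) ≈ 0.053
print(round(statistics.mean(means), 2), round(statistics.stdev(means), 3))
```

Plotting `means` as a histogram would show the familiar bell shape emerging even though no individual draw is normally distributed.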

Standard error

The standard error (SE) is the standard deviation of the sampling distribution. For a sample mean:

SE = s / √n

Notice that SE decreases as n increases. Quadrupling your sample size cuts the standard error in half.
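A quick check of that claim (the sample SD of 12 is arbitrary):

```python
import math

def standard_error(s, n):
    """Standard error of the mean: s / sqrt(n)."""
    return s / math.sqrt(n)

# Quadrupling n from 25 to 100 halves the standard error
print(standard_error(12.0, 25), standard_error(12.0, 100))  # → 2.4 1.2
```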

Sampling variability

Different samples from the same population will give you different statistics. That's sampling variability, and it's completely normal. Larger samples reduce this variability, which is why bigger samples give you more reliable estimates. Understanding that your sample statistic won't exactly equal the population parameter is fundamental to all of inferential statistics.

Effect size and power

A result can be statistically significant without being practically meaningful. Effect size and power help you evaluate whether a result actually matters and whether your study was designed well enough to detect it.

Cohen's d

Cohen's d is a standardized measure of the difference between two group means:

d = (x̄₁ − x̄₂) / sₚ

where sₚ is the pooled standard deviation. General benchmarks:

  • Small: d = 0.2
  • Medium: d = 0.5
  • Large: d = 0.8

These benchmarks give you a rough sense of magnitude, but what counts as "meaningful" depends on the context.
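Cohen's d with a pooled SD takes only a few lines of stdlib Python. Both groups of scores below are hypothetical:

```python
import math
import statistics

def cohens_d(group1, group2):
    """Standardized mean difference using the pooled standard deviation."""
    n1, n2 = len(group1), len(group2)
    v1 = statistics.variance(group1)  # sample variances (n - 1 denominators)
    v2 = statistics.variance(group2)
    s_p = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return (statistics.mean(group1) - statistics.mean(group2)) / s_p

# Hypothetical treatment vs. control scores
d = cohens_d([85, 88, 90, 86, 91], [80, 83, 84, 81, 82])
print(round(d, 2))  # ≈ 2.83, far past the "large" benchmark
```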

Statistical power

Power is the probability of correctly rejecting a false null hypothesis, calculated as 1 − β. Three main factors influence power:

  • Effect size: Larger effects are easier to detect.
  • Sample size: More data means more power.
  • Significance level (α): A more lenient threshold (e.g., 0.05 vs. 0.01) gives more power but increases Type I error risk.

A power of 0.80 (80%) is the conventional minimum target, meaning you have an 80% chance of detecting a real effect if one exists.

Sample size determination

A priori power analysis (done before collecting data) estimates the sample size you need to detect a given effect size with a desired level of power. This is the preferred approach because it helps you plan efficiently.

Post hoc power analysis (done after the study) calculates the power you actually achieved. It's less useful for decision-making but can help explain non-significant results.

The balancing act: you want enough participants for adequate power, but every additional participant costs time and resources.

Advanced inferential techniques

These methods go beyond the standard toolkit and are used for more complex research questions or situations where traditional assumptions don't hold.

Bootstrapping

Bootstrapping is a resampling technique. Instead of relying on theoretical formulas for sampling distributions, you:

  1. Take your original sample.
  2. Draw many new samples (thousands) from it with replacement.
  3. Calculate your statistic of interest for each resampled dataset.
  4. Use the distribution of those statistics to estimate standard errors and confidence intervals.

Bootstrapping is especially useful when you can't assume normality or when no standard formula exists for your statistic.
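The four steps above can be sketched as a percentile-bootstrap confidence interval for the mean, using nothing but the standard library. The data, replication count, and seed are illustrative:

```python
import random
import statistics

random.seed(0)

def bootstrap_ci(data, stat=statistics.mean, reps=5000, level=0.95):
    """Percentile bootstrap CI: resample with replacement, take quantiles."""
    # Steps 2-3: resample with replacement, compute the statistic each time
    stats = sorted(stat(random.choices(data, k=len(data))) for _ in range(reps))
    # Step 4: read off the middle `level` fraction of the bootstrap distribution
    lo = int(((1 - level) / 2) * reps)
    hi = int((1 - (1 - level) / 2) * reps) - 1
    return stats[lo], stats[hi]

# Hypothetical small, right-skewed sample where normal-theory formulas are shaky
data = [2, 3, 3, 4, 5, 5, 6, 9, 14, 21]
low, high = bootstrap_ci(data)
print(round(low, 1), round(high, 1))
```

Because it resamples the observed data rather than assuming a distribution, the same function works unchanged for the median or any other statistic: pass `stat=statistics.median`.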

Meta-analysis

Meta-analysis statistically combines results from multiple independent studies on the same question. By pooling data across studies, it increases statistical power and provides more precise effect size estimates. Good meta-analyses also account for between-study variability and publication bias (the tendency for studies with significant results to be published more often).

Multivariate analysis

Multivariate techniques analyze relationships among multiple variables simultaneously. Examples include:

  • MANOVA: Extends ANOVA to multiple dependent variables at once.
  • Factor analysis: Identifies underlying latent variables that explain patterns in observed data.
  • Discriminant analysis: Predicts group membership based on multiple predictors.

These methods account for correlations among variables, giving you a more complete picture than analyzing each variable separately.