Fundamentals of Hypothesis Testing
Hypothesis testing gives you a structured way to answer questions about a population using sample data. Instead of guessing whether a treatment works or a difference is real, you follow a formal process that controls how often you'd be wrong. This section covers the core building blocks: hypotheses, errors, and significance levels.
Null vs. Alternative Hypotheses
Every hypothesis test starts with two competing claims about a population parameter:
- The null hypothesis (H₀) represents the default position: no effect, no difference, nothing going on. It typically includes an equality sign (=, ≤, or ≥).
- The alternative hypothesis (H₁ or Hₐ) is what you're trying to find evidence for. It proposes that something is different, and it uses an inequality (≠, <, or >).
These two hypotheses must be mutually exclusive and exhaustive, meaning exactly one of them is true. For example, if you're testing whether a coin is fair:
- H₀: p = 0.5 (the coin is fair)
- H₁: p ≠ 0.5 (the coin is not fair)
You never "prove" the null hypothesis. You either reject it (the evidence is strong enough) or fail to reject it (it isn't).
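The coin example can be made concrete with an exact binomial test. Here is a minimal sketch using only Python's standard library; the 60-heads-in-100-flips data point is made up for illustration:

```python
from math import comb

def binom_two_sided_p(k, n, p0=0.5):
    """Exact two-sided binomial p-value: sum the probabilities of every
    outcome at most as likely as the observed count, assuming H0: p = p0."""
    probs = [comb(n, i) * p0**i * (1 - p0)**(n - i) for i in range(n + 1)]
    observed = probs[k]
    return sum(pr for pr in probs if pr <= observed)

# 60 heads in 100 flips: suggestive, but not significant at alpha = 0.05
p = binom_two_sided_p(60, 100)
print(f"p = {p:.4f}")  # ~0.057, so we fail to reject H0: p = 0.5
```

Note the conclusion's wording: we fail to reject H₀; we have not shown the coin is fair.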
Types of Errors
Because you're making decisions from incomplete data, mistakes are possible. There are exactly two kinds:
- Type I error (α): You reject H₀ when it's actually true. Think of this as a false alarm. If you conclude a drug works when it doesn't, that's a Type I error.
- Type II error (β): You fail to reject H₀ when it's actually false. This is a missed detection. The drug really does work, but your test didn't catch it.
These two errors have an inverse relationship: making it harder to commit one type generally makes it easier to commit the other. The power of a test, defined as 1 − β, measures your ability to correctly detect a real effect. Higher power means fewer missed detections.
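You can watch the Type I error rate emerge by simulation. The sketch below (standard library only, simulation sizes chosen arbitrarily) runs many z-tests on data where H₀ is true and counts the false alarms:

```python
import random
from statistics import NormalDist, mean

random.seed(1)
alpha = 0.05
n, sims = 30, 2000
z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for a two-tailed test

# Draw samples from N(0, 1), so H0 (mu = 0) is true, and count rejections.
false_alarms = 0
for _ in range(sims):
    sample = [random.gauss(0, 1) for _ in range(n)]
    z = mean(sample) * n**0.5  # z = (x_bar - 0) / (sigma / sqrt(n)), sigma = 1
    if abs(z) > z_crit:
        false_alarms += 1

print(f"Type I error rate: {false_alarms / sims:.3f}")  # hovers near 0.05
```

The observed rejection rate lands close to α, exactly as the definition promises.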
Significance Levels
The significance level (α) is the threshold you set before collecting data for how much Type I error risk you'll tolerate.
- Common choices are 0.05 (5%), 0.01 (1%), and 0.001 (0.1%)
- A significance level of 0.05 means you accept a 5% chance of rejecting a true null hypothesis
- Smaller values make your test more conservative: harder to reject H₀, fewer false alarms, but also lower power
The significance level defines the critical region of the sampling distribution. If your test statistic lands in that region, you reject H₀.
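For a z-test, the critical region's boundary follows directly from α. A small sketch with the standard library:

```python
from statistics import NormalDist

def z_critical(alpha, two_tailed=True):
    """Critical z value bounding the rejection region of a z-test."""
    if two_tailed:
        return NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().inv_cdf(1 - alpha)

print(round(z_critical(0.05), 2))        # 1.96: reject H0 when |z| > 1.96
print(round(z_critical(0.05, False), 2)) # 1.64 for a one-tailed test
```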
Statistical Test Selection
Picking the right test matters. Using the wrong one can give you misleading results, even if your data are perfectly good. The choice depends on your research question, the type of data you have, and how your study was designed.
Parametric vs. Non-Parametric Tests
Parametric tests assume your data come from a specific distribution (usually normal). They tend to be more powerful when their assumptions are met. Examples include t-tests, ANOVA, and linear regression.
Non-parametric tests make fewer assumptions about the underlying population. They work with ranked or ordinal data and are more robust when normality is violated. Examples include the Mann-Whitney U test (compares two groups) and the Kruskal-Wallis test (compares three or more groups). The trade-off is that non-parametric tests are generally less powerful, meaning they need more data to detect the same effect.
One-Tailed vs. Two-Tailed Tests
- A one-tailed test checks for an effect in a specific direction. For example: "Does the new drug lower blood pressure?" (H₁: μ < μ₀). It has more power to detect that specific direction of effect.
- A two-tailed test checks for any difference, regardless of direction: "Does the new drug change blood pressure?" (H₁: μ ≠ μ₀). It's more conservative but catches effects in either direction.
Use a one-tailed test only when you have strong prior reason to expect a specific direction and you genuinely don't care about effects in the opposite direction.
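The power difference between the two choices shows up directly in the p-value. A short standard-library sketch, with the z statistic of −1.8 made up for illustration:

```python
from statistics import NormalDist

def p_value(z, tail="two"):
    """p-value for a z statistic: 'left', 'right', or 'two'-tailed."""
    cdf = NormalDist().cdf
    if tail == "left":
        return cdf(z)
    if tail == "right":
        return 1 - cdf(z)
    return 2 * (1 - cdf(abs(z)))

z = -1.8  # e.g., blood pressure dropped by 1.8 standard errors
print(f"one-tailed: {p_value(z, 'left'):.4f}")  # 0.0359: significant at 0.05
print(f"two-tailed: {p_value(z):.4f}")          # 0.0719: not significant
```

The same data cross the 0.05 threshold one-tailed but not two-tailed, which is exactly why the directional choice must be made before looking at the data.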
Sample Size Considerations
Larger samples give you more statistical power and more precise estimates. A sample of 500 can detect smaller effects than a sample of 30.
Power analysis lets you calculate the sample size needed before you start collecting data. It takes into account your chosen α, your desired power (commonly 0.80), and the smallest effect size you want to detect. There's always a practical trade-off between the ideal sample size and available time, money, and participants.
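The standard closed-form approximation for a two-group comparison of means can be sketched in a few lines. This uses the normal approximation, so it slightly understates the t-based answer:

```python
from math import ceil
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate per-group n to detect a standardized mean difference
    (Cohen's d) in a two-sample, two-tailed test; normal approximation."""
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2)
    z_power = z(power)
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)

print(n_per_group(0.5))  # ~63 per group for a medium effect (d = 0.5)
print(n_per_group(0.2))  # small effects demand far larger samples
```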
Steps in Hypothesis Testing
Hypothesis testing follows a consistent sequence. Sticking to this order keeps your reasoning honest and your results defensible.
- Formulate hypotheses. State H₀ and H₁ clearly, specifying the population parameter of interest. Base them on your research question and existing theory. Make sure they're mutually exclusive and leave no gaps.
- Choose a test statistic. Select the statistic that fits your data type and hypotheses. Common options include the z-score, t-statistic, chi-square statistic, and F-ratio. Each one follows a known probability distribution, which is what makes the math work.
- Set the significance level. Decide on α before looking at the data. Consider what's standard in your field and how serious a false positive would be. If you're running multiple tests, apply a correction like the Bonferroni correction (divide α by the number of tests).
- Calculate the p-value. The p-value is the probability of getting results as extreme as (or more extreme than) what you observed, assuming H₀ is true. Compute it using statistical software or reference tables.
- Make a decision.
  - If p ≤ α: reject H₀. The data provide sufficient evidence for H₁.
  - If p > α: fail to reject H₀. The data don't provide enough evidence against it.
After deciding, interpret your result in context. A p-value is not the probability that H₀ is true, and it doesn't tell you how large or important the effect is.
Common Hypothesis Tests
Different scenarios call for different tests. Here are the ones you'll encounter most often.

Z-Test
Use a z-test when the population standard deviation (σ) is known and you want to compare a sample mean to a hypothesized population mean. It also requires a large sample or a normally distributed population.
z = (x̄ − μ) / (σ / √n)
Where x̄ is the sample mean, μ is the hypothesized population mean, σ is the population standard deviation, and n is the sample size.
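The formula translates directly into code. A minimal standard-library sketch, with the 25-measurement sample invented for illustration:

```python
from math import sqrt
from statistics import NormalDist, mean

def z_test(sample, mu0, sigma):
    """Two-tailed one-sample z-test; sigma is the known population SD."""
    n = len(sample)
    z = (mean(sample) - mu0) / (sigma / sqrt(n))
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

# Hypothetical data: 25 measurements averaging 52, tested against H0: mu = 50
z, p = z_test([52.0] * 25, mu0=50, sigma=10)
print(f"z = {z:.2f}, p = {p:.4f}")  # z = 1.00, p = 0.3173: fail to reject H0
```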
T-Test
The t-test is far more common in practice because you rarely know σ. It uses the sample standard deviation (s) instead.
There are three main versions:
- One-sample t-test: compares a sample mean to a known value
- Independent samples t-test: compares means between two separate groups
- Paired samples t-test: compares means from the same group measured twice (before/after)
The t-distribution has heavier tails than the normal distribution, especially with small samples, which accounts for the extra uncertainty.
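In practice you'd reach for a library here. A sketch assuming SciPy is available, with made-up before/after blood-pressure readings, contrasting the paired and independent versions:

```python
from scipy import stats

before = [120, 135, 128, 142, 131, 138, 125, 133]
after  = [115, 130, 126, 136, 128, 132, 120, 129]

# Paired samples t-test: the same subjects measured twice
t_paired, p_paired = stats.ttest_rel(before, after)
print(f"paired:      t = {t_paired:.2f}, p = {p_paired:.4f}")

# Independent samples t-test wrongly treats the lists as separate groups
t_ind, p_ind = stats.ttest_ind(before, after)
print(f"independent: t = {t_ind:.2f}, p = {p_ind:.4f}")
```

The paired test is decisive while the independent test is not, because pairing removes the large subject-to-subject variation; matching the test to the design matters.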
Chi-Square Test
The chi-square test works with categorical data (counts and frequencies), not means.
χ² = Σ (O − E)² / E
Where O is the observed frequency and E is the expected frequency. Two common versions:
- Goodness-of-fit test: checks whether observed frequencies match an expected distribution
- Test of independence: checks whether two categorical variables are related
One key assumption: expected frequencies in each category should be at least 5.
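Both versions are one call in SciPy (assuming it's available); the die rolls and the 2×2 table below are fabricated for illustration:

```python
from scipy import stats

# Goodness of fit: are 120 die rolls consistent with a fair die?
observed = [22, 17, 19, 26, 18, 18]
chi2_fit, p_fit = stats.chisquare(observed)  # expected defaults to uniform
print(f"fit:          chi2 = {chi2_fit:.2f}, p = {p_fit:.4f}")  # no bias found

# Test of independence: treatment outcome vs. group in a 2x2 table
table = [[30, 10],
         [20, 40]]
chi2_ind, p_ind, dof, expected = stats.chi2_contingency(table)
print(f"independence: chi2 = {chi2_ind:.2f}, p = {p_ind:.4f}, dof = {dof}")
```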
ANOVA
Analysis of Variance extends the t-test to compare means across three or more groups simultaneously. Instead of running multiple t-tests (which inflates Type I error), ANOVA uses a single F-test.
The F-ratio compares the variance between groups to the variance within groups. A large F-ratio suggests the group means aren't all equal.
- One-way ANOVA: one independent variable with multiple levels
- Two-way ANOVA: two independent variables (can test for interaction effects)
- Repeated measures ANOVA: same subjects measured under multiple conditions
Assumptions: normality, homogeneity of variance across groups, and independence of observations.
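A one-way ANOVA is a single call in SciPy (assuming it's available); the three small groups below are contrived so the arithmetic is easy to check by hand:

```python
from scipy import stats

group_a = [1, 2, 3]
group_b = [2, 3, 4]
group_c = [10, 11, 12]   # clearly shifted relative to the other two

f_ratio, p = stats.f_oneway(group_a, group_b, group_c)
print(f"F = {f_ratio:.1f}, p = {p:.6f}")  # F = 73.0: between-group variance dwarfs within-group
```

A significant F only says the means aren't all equal; a post hoc test (e.g., Tukey's HSD) is needed to say which groups differ.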
Assumptions and Limitations
Every statistical test rests on assumptions. Violating them can produce unreliable results, so you should always check before running a test.
Normality Assumption
Many parametric tests assume the data (or residuals) follow a normal distribution. You can assess this with:
- Visual methods: Q-Q plots, histograms
- Formal tests: Shapiro-Wilk test
The good news: most parametric tests are fairly robust to mild violations of normality, especially with larger samples. The Central Limit Theorem helps here, since sample means tend toward normality as n increases regardless of the population shape. For severe violations, consider transforming the data (e.g., log transform) or switching to a non-parametric test.
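A sketch of the check-then-transform workflow, assuming SciPy is available and using synthetic right-skewed (lognormal) data:

```python
import math
import random
from scipy import stats

random.seed(42)
skewed = [random.lognormvariate(0, 1) for _ in range(200)]  # right-skewed

_, p_raw = stats.shapiro(skewed)
_, p_log = stats.shapiro([math.log(x) for x in skewed])

print(f"raw data:        p = {p_raw:.2e}")  # tiny: normality clearly violated
print(f"log-transformed: p = {p_log:.3f}")  # much larger: log scale looks normal
```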
Independence Assumption
Observations must be independent of one another. If one data point influences another, standard error estimates become unreliable.
This assumption is violated in repeated measures designs, clustered data (students within classrooms), or time-series data. Specialized methods like mixed-effects models can handle these situations. Proper random sampling and experimental design are the best ways to ensure independence from the start.
Homogeneity of Variance
Many tests (independent samples t-test, ANOVA) assume equal variances across groups. You can check this with Levene's test or by visually comparing spread in boxplots.
When variances are unequal:
- For two groups: use Welch's t-test, which adjusts the degrees of freedom
- For multiple groups: use Welch's ANOVA or apply a variance-stabilizing transformation
Violations of this assumption tend to inflate the Type I error rate, meaning you might reject more often than you should.
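In SciPy (assuming it's available), Welch's version is just a flag on the ordinary t-test; the two unequal-spread groups are made up:

```python
from scipy import stats

calm  = [10, 12, 11, 13, 12]   # low-variance group
noisy = [20, 30, 25, 35, 28]   # high-variance group

# equal_var=False selects Welch's t-test, which adjusts degrees of freedom
t, p = stats.ttest_ind(calm, noisy, equal_var=False)
print(f"Welch's t = {t:.2f}, p = {p:.4f}")
```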
Interpreting Results
Getting a p-value is only half the job. Interpreting what it means in context is where the real thinking happens.

Statistical vs. Practical Significance
A result can be statistically significant but practically meaningless. With a large enough sample, even a tiny, trivial difference can produce p < 0.05. For example, a study with 100,000 participants might find that a new teaching method improves test scores by 0.1 points on a 100-point scale. That's statistically significant but not useful.
Always ask: Is this effect large enough to matter in the real world?
Confidence Intervals
A confidence interval gives a range of plausible values for the population parameter. A 95% confidence interval means that if you repeated the study many times, about 95% of the intervals you'd construct would contain the true parameter.
Confidence intervals complement p-values by showing the precision of your estimate. A narrow interval means your estimate is precise; a wide one means there's a lot of uncertainty. Reporting confidence intervals alongside p-values gives a much fuller picture than p-values alone.
Effect Size
Effect size measures the magnitude of an effect, independent of sample size. Common measures include:
- Cohen's d: standardized difference between two means. Small ≈ 0.2, medium ≈ 0.5, large ≈ 0.8.
- Pearson's r: correlation coefficient. Ranges from -1 to 1.
- Odds ratio: used in categorical data to compare the odds of an event between groups.
Effect sizes allow meaningful comparison across different studies and are essential for meta-analyses.
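Cohen's d is simple enough to compute by hand. A standard-library sketch with toy groups chosen so the arithmetic is transparent:

```python
from math import sqrt
from statistics import mean, stdev

def cohens_d(a, b):
    """Standardized mean difference using the pooled sample SD."""
    na, nb = len(a), len(b)
    pooled = sqrt(((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2)
                  / (na + nb - 2))
    return (mean(b) - mean(a)) / pooled

d = cohens_d([2, 4, 6], [5, 7, 9])
print(f"d = {d:.1f}")  # 1.5: a large effect by Cohen's benchmarks
```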
Advanced Concepts
These topics extend the basics and address some of the known limitations of standard hypothesis testing.
Multiple Comparisons Problem
Running many tests on the same dataset inflates your overall Type I error rate. If you run 20 tests at α = 0.05, you'd expect about 1 false positive even if nothing is going on.
Two ways to think about the problem:
- Family-wise error rate (FWER): the probability of making at least one Type I error across all tests
- False discovery rate (FDR): the expected proportion of rejected hypotheses that are false positives
Correction methods, from most to least conservative:
- Bonferroni: divide α by the number of tests. Simple but very conservative.
- Holm-Bonferroni: a step-down procedure that's more powerful than Bonferroni while still controlling FWER.
- Benjamini-Hochberg: controls FDR instead of FWER. More powerful, widely used in fields with many simultaneous tests.
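Bonferroni and Benjamini-Hochberg are each only a few lines of pure Python. A sketch with an invented set of p-values showing how differently they behave:

```python
def bonferroni(pvals, alpha=0.05):
    """Reject p_i if p_i <= alpha / m (controls FWER)."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]

def benjamini_hochberg(pvals, alpha=0.05):
    """Reject the k smallest p-values, where k is the largest rank i
    with p_(i) <= (i / m) * alpha (controls FDR)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * alpha:
            k = rank
    reject = [False] * m
    for i in order[:k]:
        reject[i] = True
    return reject

pvals = [0.01, 0.02, 0.03, 0.04, 0.50]
print(sum(bonferroni(pvals)))          # 1: only p = 0.01 survives alpha/5
print(sum(benjamini_hochberg(pvals)))  # 4: FDR control is less conservative
```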
Power Analysis
Power analysis helps you plan studies that have a realistic chance of detecting the effect you're looking for.
The four quantities involved are interconnected: if you know any three, you can solve for the fourth.
- Significance level (α)
- Power (1 − β, commonly set at 0.80)
- Effect size (how large the true effect is)
- Sample size (n)
A priori power analysis (done before data collection) determines the sample size you need. Post hoc power analysis (done after) is sometimes used to interpret non-significant results, though its usefulness is debated. The biggest challenge is estimating a realistic effect size, since overly optimistic estimates lead to underpowered studies.
Bayesian Hypothesis Testing
The Bayesian approach offers an alternative to the frequentist framework covered above. Instead of asking "How likely are these data if H₀ is true?", it asks "How likely is H₀ given these data?"
Key ideas:
- You start with prior probabilities reflecting what you believed before seeing the data
- You update those beliefs using Bayes' theorem to get posterior probabilities
- Bayes factors quantify how much the data support one hypothesis over another (e.g., a Bayes factor of 10 means the data are 10 times more likely under H₁ than under H₀)
This framework allows you to accumulate evidence across studies and avoids some of the interpretive pitfalls of p-values. However, the choice of prior can influence results, which is both a strength (you can incorporate existing knowledge) and a criticism (priors can be subjective).
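Revisiting the coin example makes the contrast concrete. The sketch below computes a Bayes factor under one hypothetical prior choice (p uniform on [0, 1] under H₁); the closed form 1/(n + 1) for the uniform-prior marginal likelihood is a standard Beta-integral result:

```python
from math import comb

def bayes_factor_coin(heads, n):
    """BF10 for H1: p ~ Uniform(0, 1) vs H0: p = 0.5 (hypothetical prior).
    Under a uniform prior the marginal likelihood integrates to 1 / (n + 1)."""
    marginal_h1 = 1 / (n + 1)
    likelihood_h0 = comb(n, heads) * 0.5 ** n
    return marginal_h1 / likelihood_h0

bf = bayes_factor_coin(60, 100)
print(f"BF10 = {bf:.2f}")  # ~0.91: the data slightly favor the fair coin
```

Notably, the same 60-heads-in-100-flips data that yield p ≈ 0.057 produce a Bayes factor near 1, illustrating how the two frameworks can read the same evidence differently.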
Applications in Research
Hypothesis testing doesn't exist in a vacuum. How you design your study, collect your data, and report your findings all affect whether your statistical conclusions are trustworthy.
Experimental Design
- Randomization assigns participants to groups by chance, controlling for confounding variables
- Blinding (single or double) reduces bias in data collection and interpretation
- Factorial designs let you examine multiple factors and their interactions simultaneously
- Use power analysis to determine sample size before starting
- Account for ethical constraints and practical limitations in your design
Data Collection Methods
- Follow standardized protocols so measurements are consistent
- Plan ahead for how you'll handle missing data and outliers
- Document every step of your data collection process for reproducibility
- Consider measurement error and how it might affect your analyses
Reporting Results
Clear reporting makes your work reproducible and credible. Best practices include:
- Report effect sizes and confidence intervals alongside p-values
- Describe the statistical methods and assumptions you used
- Be transparent about multiple comparisons and any subgroup analyses
- Acknowledge limitations and potential sources of bias
- Follow field-specific reporting guidelines (APA for psychology, CONSORT for clinical trials, STROBE for observational studies)