Fundamentals of Hypothesis Testing
Hypothesis testing gives you a structured way to answer questions about a population using sample data. Instead of guessing whether a treatment works or a difference is real, you follow a formal process that controls how often you'd be wrong. This section covers the core building blocks: hypotheses, errors, and significance levels.
Null vs. Alternative Hypotheses
Every hypothesis test starts with two competing claims about a population parameter:
- The null hypothesis (H₀) represents the default position: no effect, no difference, nothing going on. It typically includes an equality sign (=, ≤, or ≥).
- The alternative hypothesis (H₁ or Hₐ) is what you're trying to find evidence for. It proposes that something is different, and it uses an inequality (≠, <, or >).
These two hypotheses must be mutually exclusive and exhaustive, meaning exactly one of them is true. For example, if you're testing whether a coin is fair:
- H₀: p = 0.5 (the coin is fair)
- H₁: p ≠ 0.5 (the coin is not fair)
You never "prove" the null hypothesis. You either reject it (the evidence is strong enough) or fail to reject it (it isn't).
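The coin example can be made concrete with an exact binomial test. Here is a minimal sketch using only Python's standard library; the 60-heads-in-100-flips data point is made up for illustration:

```python
from math import comb

def binom_two_sided_p(k, n, p0=0.5):
    """Exact two-sided binomial p-value: sum the probabilities of every
    outcome at most as likely as the observed count, assuming H0: p = p0."""
    probs = [comb(n, i) * p0**i * (1 - p0)**(n - i) for i in range(n + 1)]
    observed = probs[k]
    return sum(pr for pr in probs if pr <= observed)

# 60 heads in 100 flips: suggestive, but not significant at alpha = 0.05
p = binom_two_sided_p(60, 100)
print(f"p = {p:.4f}")  # ~0.057, so we fail to reject H0: p = 0.5
```

Note the conclusion's wording: we fail to reject H₀; we have not shown the coin is fair.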
Types of Errors
Because you're making decisions from incomplete data, mistakes are possible. There are exactly two kinds:
- Type I error (α): You reject H₀ when it's actually true. Think of this as a false alarm. If you conclude a drug works when it doesn't, that's a Type I error.
- Type II error (β): You fail to reject H₀ when it's actually false. This is a missed detection. The drug really does work, but your test didn't catch it.
These two errors have an inverse relationship: making it harder to commit one type generally makes it easier to commit the other. The power of a test, defined as 1 − β, measures your ability to correctly detect a real effect. Higher power means fewer missed detections.
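You can watch the Type I error rate emerge by simulation. The sketch below (standard library only, simulation sizes chosen arbitrarily) runs many z-tests on data where H₀ is true and counts the false alarms:

```python
import random
from statistics import NormalDist, mean

random.seed(1)
alpha = 0.05
n, sims = 30, 2000
z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for a two-tailed test

# Draw samples from N(0, 1), so H0 (mu = 0) is true, and count rejections.
false_alarms = 0
for _ in range(sims):
    sample = [random.gauss(0, 1) for _ in range(n)]
    z = mean(sample) * n**0.5  # z = (x_bar - 0) / (sigma / sqrt(n)), sigma = 1
    if abs(z) > z_crit:
        false_alarms += 1

print(f"Type I error rate: {false_alarms / sims:.3f}")  # hovers near 0.05
```

The observed rejection rate lands close to α, exactly as the definition promises.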
Significance Levels
The significance level (α) is the threshold you set before collecting data for how much Type I error risk you'll tolerate.
- Common choices are 0.05 (5%), 0.01 (1%), and 0.001 (0.1%)
- A significance level of 0.05 means you accept a 5% chance of rejecting a true null hypothesis
- Smaller values make your test more conservative: harder to reject H₀, fewer false alarms, but also lower power
The significance level defines the critical region of the sampling distribution. If your test statistic lands in that region, you reject H₀.
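For a z-test, the critical region's boundary follows directly from α. A small sketch with the standard library:

```python
from statistics import NormalDist

def z_critical(alpha, two_tailed=True):
    """Critical z value bounding the rejection region of a z-test."""
    if two_tailed:
        return NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().inv_cdf(1 - alpha)

print(round(z_critical(0.05), 2))        # 1.96: reject H0 when |z| > 1.96
print(round(z_critical(0.05, False), 2)) # 1.64 for a one-tailed test
```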
Statistical Test Selection
Picking the right test matters. Using the wrong one can give you misleading results, even if your data are perfectly good. The choice depends on your research question, the type of data you have, and how your study was designed.
Parametric vs. Non-Parametric Tests
Parametric tests assume your data come from a specific distribution (usually normal). They tend to be more powerful when their assumptions are met. Examples include t-tests, ANOVA, and linear regression.
Non-parametric tests make fewer assumptions about the underlying population. They work with ranked or ordinal data and are more robust when normality is violated. Examples include the Mann-Whitney U test (compares two groups) and the Kruskal-Wallis test (compares three or more groups). The trade-off is that non-parametric tests are generally less powerful, meaning they need more data to detect the same effect.
One-Tailed vs. Two-Tailed Tests
- A one-tailed test checks for an effect in a specific direction. For example: "Does the new drug lower blood pressure?" (H₁: μ < μ₀). It has more power to detect that specific direction of effect.
- A two-tailed test checks for any difference, regardless of direction: "Does the new drug change blood pressure?" (H₁: μ ≠ μ₀). It's more conservative but catches effects in either direction.
Use a one-tailed test only when you have strong prior reason to expect a specific direction and you genuinely don't care about effects in the opposite direction.
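The power difference between the two choices shows up directly in the p-value. A short standard-library sketch, with the z statistic of −1.8 made up for illustration:

```python
from statistics import NormalDist

def p_value(z, tail="two"):
    """p-value for a z statistic: 'left', 'right', or 'two'-tailed."""
    cdf = NormalDist().cdf
    if tail == "left":
        return cdf(z)
    if tail == "right":
        return 1 - cdf(z)
    return 2 * (1 - cdf(abs(z)))

z = -1.8  # e.g., blood pressure dropped by 1.8 standard errors
print(f"one-tailed: {p_value(z, 'left'):.4f}")  # 0.0359: significant at 0.05
print(f"two-tailed: {p_value(z):.4f}")          # 0.0719: not significant
```

The same data cross the 0.05 threshold one-tailed but not two-tailed, which is exactly why the directional choice must be made before looking at the data.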
Sample Size Considerations
Larger samples give you more statistical power and more precise estimates. A sample of 500 can detect smaller effects than a sample of 30.
Power analysis lets you calculate the sample size needed before you start collecting data. It takes into account your chosen α, your desired power (commonly 0.80), and the smallest effect size you want to detect. There's always a practical trade-off between the ideal sample size and available time, money, and participants.
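The standard closed-form approximation for a two-group comparison of means can be sketched in a few lines. This uses the normal approximation, so it slightly understates the t-based answer:

```python
from math import ceil
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate per-group n to detect a standardized mean difference
    (Cohen's d) in a two-sample, two-tailed test; normal approximation."""
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2)
    z_power = z(power)
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)

print(n_per_group(0.5))  # ~63 per group for a medium effect (d = 0.5)
print(n_per_group(0.2))  # small effects demand far larger samples
```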
Steps in Hypothesis Testing
Hypothesis testing follows a consistent sequence. Sticking to this order keeps your reasoning honest and your results defensible.
- Formulate hypotheses. State H₀ and H₁ clearly, specifying the population parameter of interest. Base them on your research question and existing theory. Make sure they're mutually exclusive and leave no gaps.
- Choose a test statistic. Select the statistic that fits your data type and hypotheses. Common options include the z-score, t-statistic, chi-square statistic, and F-ratio. Each one follows a known probability distribution, which is what makes the math work.
- Set the significance level. Decide on α before looking at the data. Consider what's standard in your field and how serious a false positive would be. If you're running multiple tests, apply a correction like the Bonferroni correction (divide α by the number of tests).
- Calculate the p-value. The p-value is the probability of getting results as extreme as (or more extreme than) what you observed, assuming H₀ is true. Compute it using statistical software or reference tables.
- Make a decision.
  - If p ≤ α: reject H₀. The data provide sufficient evidence for H₁.
  - If p > α: fail to reject H₀. The data don't provide enough evidence against it.
After deciding, interpret your result in context. A p-value is not the probability that H₀ is true, and it doesn't tell you how large or important the effect is.
Common Hypothesis Tests
Different scenarios call for different tests. Here are the ones you'll encounter most often.

Z-Test
Use a z-test when the population standard deviation (σ) is known and you want to compare a sample mean to a hypothesized population mean. It also requires a large sample or a normally distributed population.
z = (x̄ − μ) / (σ / √n)
Where x̄ is the sample mean, μ is the hypothesized population mean, σ is the population standard deviation, and n is the sample size.
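The formula translates directly into code. A minimal standard-library sketch, with the 25-measurement sample invented for illustration:

```python
from math import sqrt
from statistics import NormalDist, mean

def z_test(sample, mu0, sigma):
    """Two-tailed one-sample z-test; sigma is the known population SD."""
    n = len(sample)
    z = (mean(sample) - mu0) / (sigma / sqrt(n))
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

# Hypothetical data: 25 measurements averaging 52, tested against H0: mu = 50
z, p = z_test([52.0] * 25, mu0=50, sigma=10)
print(f"z = {z:.2f}, p = {p:.4f}")  # z = 1.00, p = 0.3173: fail to reject H0
```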
T-Test
The t-test is far more common in practice because you rarely know σ. It uses the sample standard deviation (s) instead.
There are three main versions:
- One-sample t-test: compares a sample mean to a known value
- Independent samples t-test: compares means between two separate groups
- Paired samples t-test: compares means from the same group measured twice (before/after)
The t-distribution has heavier tails than the normal distribution, especially with small samples, which accounts for the extra uncertainty.
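In practice you'd reach for a library here. A sketch assuming SciPy is available, with made-up before/after blood-pressure readings, contrasting the paired and independent versions:

```python
from scipy import stats

before = [120, 135, 128, 142, 131, 138, 125, 133]
after  = [115, 130, 126, 136, 128, 132, 120, 129]

# Paired samples t-test: the same subjects measured twice
t_paired, p_paired = stats.ttest_rel(before, after)
print(f"paired:      t = {t_paired:.2f}, p = {p_paired:.4f}")

# Independent samples t-test wrongly treats the lists as separate groups
t_ind, p_ind = stats.ttest_ind(before, after)
print(f"independent: t = {t_ind:.2f}, p = {p_ind:.4f}")
```

The paired test is decisive while the independent test is not, because pairing removes the large subject-to-subject variation; matching the test to the design matters.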
Chi-Square Test
The chi-square test works with categorical data (counts and frequencies), not means.
χ² = Σ (O − E)² / E
Where O is the observed frequency and E is the expected frequency. Two common versions:
- Goodness-of-fit test: checks whether observed frequencies match an expected distribution
- Test of independence: checks whether two categorical variables are related
One key assumption: expected frequencies in each category should be at least 5.
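Both versions are one call in SciPy (assuming it's available); the die rolls and the 2×2 table below are fabricated for illustration:

```python
from scipy import stats

# Goodness of fit: are 120 die rolls consistent with a fair die?
observed = [22, 17, 19, 26, 18, 18]
chi2_fit, p_fit = stats.chisquare(observed)  # expected defaults to uniform
print(f"fit:          chi2 = {chi2_fit:.2f}, p = {p_fit:.4f}")  # no bias found

# Test of independence: treatment outcome vs. group in a 2x2 table
table = [[30, 10],
         [20, 40]]
chi2_ind, p_ind, dof, expected = stats.chi2_contingency(table)
print(f"independence: chi2 = {chi2_ind:.2f}, p = {p_ind:.4f}, dof = {dof}")
```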
ANOVA
Analysis of Variance extends the t-test to compare means across three or more groups simultaneously. Instead of running multiple t-tests (which inflates Type I error), ANOVA uses a single F-test.
The F-ratio compares the variance between groups to the variance within groups. A large F-ratio suggests the group means aren't all equal.
- One-way ANOVA: one independent variable with multiple levels
- Two-way ANOVA: two independent variables (can test for interaction effects)
- Repeated measures ANOVA: same subjects measured under multiple conditions
Assumptions: normality, homogeneity of variance across groups, and independence of observations.
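A one-way ANOVA is a single call in SciPy (assuming it's available); the three small groups below are contrived so the arithmetic is easy to check by hand:

```python
from scipy import stats

group_a = [1, 2, 3]
group_b = [2, 3, 4]
group_c = [10, 11, 12]   # clearly shifted relative to the other two

f_ratio, p = stats.f_oneway(group_a, group_b, group_c)
print(f"F = {f_ratio:.1f}, p = {p:.6f}")  # F = 73.0: between-group variance dwarfs within-group
```

A significant F only says the means aren't all equal; a post hoc test (e.g., Tukey's HSD) is needed to say which groups differ.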
Assumptions and Limitations
Every statistical test rests on assumptions. Violating them can produce unreliable results, so you should always check before running a test.
Normality Assumption
Many parametric tests assume the data (or residuals) follow a normal distribution. You can assess this with:
- Visual methods: Q-Q plots, histograms
- Formal tests: Shapiro-Wilk test
The good news: most parametric tests are fairly robust to mild violations of normality, especially with larger samples. The Central Limit Theorem helps here, since sample means tend toward normality as n increases regardless of the population shape. For severe violations, consider transforming the data (e.g., log transform) or switching to a non-parametric test.
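A sketch of the check-then-transform workflow, assuming SciPy is available and using synthetic right-skewed (lognormal) data:

```python
import math
import random
from scipy import stats

random.seed(42)
skewed = [random.lognormvariate(0, 1) for _ in range(200)]  # right-skewed

_, p_raw = stats.shapiro(skewed)
_, p_log = stats.shapiro([math.log(x) for x in skewed])

print(f"raw data:        p = {p_raw:.2e}")  # tiny: normality clearly violated
print(f"log-transformed: p = {p_log:.3f}")  # much larger: log scale looks normal
```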
Independence Assumption
Observations must be independent of one another. If one data point influences another, standard error estimates become unreliable.
This assumption is violated in repeated measures designs, clustered data (students within classrooms), or time-series data. Specialized methods like mixed-effects models can handle these situations. Proper random sampling and experimental design are the best ways to ensure independence from the start.
Homogeneity of Variance
Many tests (independent samples t-test, ANOVA) assume equal variances across groups. You can check this with Levene's test or by visually comparing spread in boxplots.
When variances are unequal:
- For two groups: use Welch's t-test, which adjusts the degrees of freedom
- For multiple groups: use Welch's ANOVA or apply a variance-stabilizing transformation
Violations of this assumption tend to inflate the Type I error rate, meaning you might reject more often than you should.
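In SciPy (assuming it's available), Welch's version is just a flag on the ordinary t-test; the two unequal-spread groups are made up:

```python
from scipy import stats

calm  = [10, 12, 11, 13, 12]   # low-variance group
noisy = [20, 30, 25, 35, 28]   # high-variance group

# equal_var=False selects Welch's t-test, which adjusts degrees of freedom
t, p = stats.ttest_ind(calm, noisy, equal_var=False)
print(f"Welch's t = {t:.2f}, p = {p:.4f}")
```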
Interpreting Results
Getting a p-value is only half the job. Interpreting what it means in context is where the real thinking happens.

Statistical vs. Practical Significance
A result can be statistically significant but practically meaningless. With a large enough sample, even a tiny, trivial difference can produce p < 0.05. For example, a study with 100,000 participants might find that a new teaching method improves test scores by 0.1 points on a 100-point scale. That's statistically significant but not useful.
Always ask: Is this effect large enough to matter in the real world?
Confidence Intervals
A confidence interval gives a range of plausible values for the population parameter. A 95% confidence interval means that if you repeated the study many times, about 95% of the intervals you'd construct would contain the true parameter.
Confidence intervals complement p-values by showing the precision of your estimate. A narrow interval means your estimate is precise; a wide one means there's a lot of uncertainty. Reporting confidence intervals alongside p-values gives a much fuller picture than p-values alone.
Effect Size
Effect size measures the magnitude of an effect, independent of sample size. Common measures include:
- Cohen's d: standardized difference between two means. Small ≈ 0.2, medium ≈ 0.5, large ≈ 0.8.
- Pearson's r: correlation coefficient. Ranges from -1 to 1.
- Odds ratio: used in categorical data to compare the odds of an event between groups.
Effect sizes allow meaningful comparison across different studies and are essential for meta-analyses.
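Cohen's d is simple enough to compute by hand. A standard-library sketch with toy groups chosen so the arithmetic is transparent:

```python
from math import sqrt
from statistics import mean, stdev

def cohens_d(a, b):
    """Standardized mean difference using the pooled sample SD."""
    na, nb = len(a), len(b)
    pooled = sqrt(((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2)
                  / (na + nb - 2))
    return (mean(b) - mean(a)) / pooled

d = cohens_d([2, 4, 6], [5, 7, 9])
print(f"d = {d:.1f}")  # 1.5: a large effect by Cohen's benchmarks
```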
Advanced Concepts
These topics extend the basics and address some of the known limitations of standard hypothesis testing.
Multiple Comparisons Problem
Running many tests on the same dataset inflates your overall Type I error rate. If you run 20 tests at α = 0.05, you'd expect about 1 false positive even if nothing is going on.
Two ways to think about the problem:
- Family-wise error rate (FWER): the probability of making at least one Type I error across all tests
- False discovery rate (FDR): the expected proportion of rejected hypotheses that are false positives
Correction methods, from most to least conservative:
- Bonferroni: divide α by the number of tests. Simple but very conservative.
- Holm-Bonferroni: a step-down procedure that's more powerful than Bonferroni while still controlling FWER.
- Benjamini-Hochberg: controls FDR instead of FWER. More powerful, widely used in fields with many simultaneous tests.
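Bonferroni and Benjamini-Hochberg are each only a few lines of pure Python. A sketch with an invented set of p-values showing how differently they behave:

```python
def bonferroni(pvals, alpha=0.05):
    """Reject p_i if p_i <= alpha / m (controls FWER)."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]

def benjamini_hochberg(pvals, alpha=0.05):
    """Reject the k smallest p-values, where k is the largest rank i
    with p_(i) <= (i / m) * alpha (controls FDR)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * alpha:
            k = rank
    reject = [False] * m
    for i in order[:k]:
        reject[i] = True
    return reject

pvals = [0.01, 0.02, 0.03, 0.04, 0.50]
print(sum(bonferroni(pvals)))          # 1: only p = 0.01 survives alpha/5
print(sum(benjamini_hochberg(pvals)))  # 4: FDR control is less conservative
```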
Power Analysis
Power analysis helps you plan studies that have a realistic chance of detecting the effect you're looking for.
The four quantities involved are interconnected: if you know any three, you can solve for the fourth.
- Significance level (α)
- Power (1 − β, commonly set at 0.80)
- Effect size (how large the true effect is)
- Sample size (n)
A priori power analysis (done before data collection) determines the sample size you need. Post hoc power analysis (done after) is sometimes used to interpret non-significant results, though its usefulness is debated. The biggest challenge is estimating a realistic effect size, since overly optimistic estimates lead to underpowered studies.
Bayesian Hypothesis Testing
The Bayesian approach offers an alternative to the frequentist framework covered above. Instead of asking "How likely are these data if H₀ is true?", it asks "How likely is H₀ given these data?"
Key ideas:
- You start with prior probabilities reflecting what you believed before seeing the data
- You update those beliefs using Bayes' theorem to get posterior probabilities
- Bayes factors quantify how much the data support one hypothesis over another (e.g., a Bayes factor of 10 means the data are 10 times more likely under H₁ than under H₀)
This framework allows you to accumulate evidence across studies and avoids some of the interpretive pitfalls of p-values. However, the choice of prior can influence results, which is both a strength (you can incorporate existing knowledge) and a criticism (priors can be subjective).
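Revisiting the coin example makes the contrast concrete. The sketch below computes a Bayes factor under one hypothetical prior choice (p uniform on [0, 1] under H₁); the closed form 1/(n + 1) for the uniform-prior marginal likelihood is a standard Beta-integral result:

```python
from math import comb

def bayes_factor_coin(heads, n):
    """BF10 for H1: p ~ Uniform(0, 1) vs H0: p = 0.5 (hypothetical prior).
    Under a uniform prior the marginal likelihood integrates to 1 / (n + 1)."""
    marginal_h1 = 1 / (n + 1)
    likelihood_h0 = comb(n, heads) * 0.5 ** n
    return marginal_h1 / likelihood_h0

bf = bayes_factor_coin(60, 100)
print(f"BF10 = {bf:.2f}")  # ~0.91: the data slightly favor the fair coin
```

Notably, the same 60-heads-in-100-flips data that yield p ≈ 0.057 produce a Bayes factor near 1, illustrating how the two frameworks can read the same evidence differently.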
Applications in Research
Hypothesis testing doesn't exist in a vacuum. How you design your study, collect your data, and report your findings all affect whether your statistical conclusions are trustworthy.
Experimental Design
- Randomization assigns participants to groups by chance, controlling for confounding variables
- Blinding (single or double) reduces bias in data collection and interpretation
- Factorial designs let you examine multiple factors and their interactions simultaneously
- Use power analysis to determine sample size before starting
- Account for ethical constraints and practical limitations in your design
Data Collection Methods
- Follow standardized protocols so measurements are consistent
- Plan ahead for how you'll handle missing data and outliers
- Document every step of your data collection process for reproducibility
- Consider measurement error and how it might affect your analyses
Reporting Results
Clear reporting makes your work reproducible and credible. Best practices include:
- Report effect sizes and confidence intervals alongside p-values
- Describe the statistical methods and assumptions you used
- Be transparent about multiple comparisons and any subgroup analyses
- Acknowledge limitations and potential sources of bias
- Follow field-specific reporting guidelines (APA for psychology, CONSORT for clinical trials, STROBE for observational studies)