Hypothesis Testing Errors and Outcomes
Every hypothesis test ends with a decision: reject or fail to reject the null hypothesis. Since you're working with sample data (not the entire population), there's always a chance your decision is wrong. This section covers the two types of errors you can make, how their probabilities relate to each other, and what "power" actually means.
A helpful way to organize all four possible outcomes:
| | H₀ is actually TRUE | H₀ is actually FALSE |
|---|---|---|
| Reject H₀ | Type I Error (false positive) | Correct decision (true positive) |
| Fail to reject H₀ | Correct decision (true negative) | Type II Error (false negative) |
Keep this table in your head. Every hypothesis test lands in exactly one of these four cells.

Type I vs Type II Errors
Type I Error (False Positive) occurs when you reject the null hypothesis even though it's actually true. You're concluding that an effect exists when it doesn't.
- Denoted by the Greek letter alpha (α)
- Example: A drug trial concludes a new medication is effective, but it actually has no effect. Patients receive unnecessary treatment based on a false finding.
Type II Error (False Negative) occurs when you fail to reject the null hypothesis even though it's actually false. You're missing a real effect.
- Denoted by the Greek letter beta (β)
- Example: A screening test fails to detect cancer that is actually present. The patient goes untreated because the test missed the disease.
Correct decisions are the other two cells in the table:
- Rejecting H₀ when it really is false (true positive): correctly identifying that a drug works
- Failing to reject H₀ when it really is true (true negative): correctly concluding a healthy patient doesn't have a disease
The key distinction: Type I is a false alarm. Type II is a missed detection. Which one matters more depends on the context. In criminal trials, a Type I error means convicting an innocent person. In medical screening, a Type II error means missing a serious illness.
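Both error types show up clearly in simulation. The sketch below (all parameter values are illustrative) repeatedly runs a one-sample z-test on data where H₀ is actually true, so every rejection is a Type I error; the observed false-positive rate should land near the chosen α of 0.05.

```python
# Hypothetical simulation: estimate the Type I error rate of a one-sample
# z-test when H0 (mu = 0) is actually true. Parameter values are illustrative.
import random
import statistics

random.seed(42)
Z_CRIT = 1.96          # two-sided critical value for alpha = 0.05
N = 30                 # sample size per experiment
TRIALS = 10_000

false_positives = 0
for _ in range(TRIALS):
    sample = [random.gauss(0, 1) for _ in range(N)]  # H0 is true: mu = 0
    z = statistics.mean(sample) / (1 / N**0.5)       # known sigma = 1
    if abs(z) > Z_CRIT:
        false_positives += 1

print(false_positives / TRIALS)  # should land near alpha = 0.05
```

Because H₀ really is true in every trial, the long-run rejection rate is exactly what α is supposed to control.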
Probabilities of Hypothesis Testing Errors
Alpha (α) is the probability of making a Type I error. You choose this value before running the test. Common choices are α = 0.05 and α = 0.01. A smaller alpha means you're requiring stronger evidence to reject H₀, so false positives become less likely.
Beta (β) is the probability of making a Type II error. Unlike alpha, you don't directly set beta. It depends on several factors: your sample size, the true effect size, and the alpha level you chose.
Here's the tradeoff that trips people up: decreasing alpha generally increases beta (assuming everything else stays the same). Think of it this way. If you make your rejection criterion stricter (smaller α), you'll reject less often. That means fewer false positives, but you'll also miss more real effects, so β goes up.
You can't minimize both errors simultaneously just by adjusting α. The way to reduce both errors at once is to increase your sample size.
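The tradeoff can be computed directly for a one-sided z-test. In this sketch the true mean (0.5), σ = 1, and n = 25 are assumed values chosen for illustration; the critical values 1.645 and 2.326 are the standard one-sided z cutoffs for α = 0.05 and α = 0.01.

```python
# Sketch (stdlib only): beta grows when alpha shrinks, for a one-sided
# z-test of H0: mu = 0 against an assumed true mean of 0.5.
from math import erf, sqrt

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def beta(z_crit, effect, n, sigma=1.0):
    # P(fail to reject | true mean = effect)
    return phi(z_crit - effect * sqrt(n) / sigma)

n, effect = 25, 0.5
beta_05 = beta(1.645, effect, n)   # alpha = 0.05
beta_01 = beta(2.326, effect, n)   # alpha = 0.01

print(beta_05, beta_01)  # beta is larger at the stricter alpha
```

Tightening α from 0.05 to 0.01 roughly doubles β here: fewer false alarms, more missed detections.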

Power of the Test
Power is the probability of correctly rejecting a false null hypothesis. In other words, it's your test's ability to detect a real effect when one exists.
If β = 0.20, then power = 1 − β = 0.80, meaning there's an 80% chance you'll detect the effect. Researchers generally aim for power of at least 0.80.
Three main factors affect power:
- Sample size: Larger samples reduce sampling variability, making it easier to distinguish a real effect from random noise. This is the factor researchers have the most control over.
- Effect size: Larger true differences between the null value and reality are easier to detect. A drug that lowers blood pressure by 20 mmHg is much easier to detect than one that lowers it by 2 mmHg.
- Alpha level: A higher alpha (e.g., 0.05 vs. 0.01) gives you more power because you're using a less strict rejection threshold. But this comes at the cost of more Type I error risk.
Since power and β are complements (power = 1 − β), increasing power directly decreases the probability of a Type II error. This is why researchers do power analyses before collecting data: to figure out how large a sample they need to have a reasonable chance of detecting the effect they're looking for.
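A rough power analysis along these lines can be done analytically for a one-sided z-test. The effect size of 0.3 standard deviations and the sample sizes below are assumed values for illustration.

```python
# Illustrative power curve: power rises with sample size for a one-sided
# z-test (H0: mu = 0, assumed true effect = 0.3 sd, alpha = 0.05).
from math import erf, sqrt

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def power(n, effect=0.3, z_crit=1.645):   # z_crit for one-sided alpha = 0.05
    return 1 - phi(z_crit - effect * sqrt(n))

for n in (20, 50, 100):
    print(n, round(power(n), 3))
# Power climbs with n; for these assumptions it crosses the usual
# 0.80 target at a sample size of roughly 70.
```

Running the calculation backward (find the smallest n with power ≥ 0.80) is exactly what a pre-study power analysis does.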
Statistical Decision Making
- Test statistic: A value calculated from your sample data (like a z-score or t-score) that measures how far your sample result is from what H₀ predicts
- Critical value: The cutoff point on the distribution determined by your chosen α. It marks the boundary of the rejection region.
- Decision rule: If the test statistic falls in the rejection region (beyond the critical value), you reject H₀. Otherwise, you fail to reject.
- Statistical significance: When the test statistic exceeds the critical value, you have statistically significant evidence against H₀. This means your result is unlikely to have occurred by chance alone, assuming H₀ is true.
- Confidence interval: A range of plausible values for the true population parameter. If the null hypothesis value falls outside your confidence interval, that's consistent with rejecting H₀ at the corresponding significance level.
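A worked sketch tying these pieces together, using made-up sample numbers: compute the test statistic, compare it to the critical value, and check that the confidence interval tells the same story.

```python
# Two-sided one-sample z-test at alpha = 0.05, known sigma.
# All sample values (mean, mu0, sigma, n) are made up for illustration.
from math import sqrt

sample_mean, mu0, sigma, n = 52.3, 50.0, 6.0, 36
z = (sample_mean - mu0) / (sigma / sqrt(n))      # test statistic
z_crit = 1.96                                    # critical value, alpha = 0.05

reject = abs(z) > z_crit                         # decision rule
margin = z_crit * sigma / sqrt(n)
ci = (sample_mean - margin, sample_mean + margin)  # 95% confidence interval

print(z, reject, ci)
```

Here z = 2.3 exceeds 1.96, so the result is statistically significant, and consistently, μ₀ = 50 falls outside the 95% confidence interval.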