Hypothesis Testing and Rare Events
Hypothesis testing gives you a structured way to decide whether sample data provides enough evidence to reject a claim about a population. The core logic works like this: assume nothing interesting is happening (the null hypothesis), then ask how likely your observed data would be under that assumption. If the data would be very unlikely, that's a rare event, and it gives you reason to doubt the null hypothesis.

Rare Events in Hypothesis Testing
A rare event is an outcome that has a low probability of occurring if the null hypothesis is true. This idea drives the entire decision-making process in hypothesis testing.
- The null hypothesis (H₀) is the default assumption that there is no significant effect or difference. If H₀ is true, you'd expect your sample data to look consistent with it.
- The alternative hypothesis (H₁ or Hₐ) is the claim that contradicts H₀. When you observe data that would be rare under H₀, that evidence points toward Hₐ.
- The p-value quantifies how rare your result is. It's the probability of observing your sample data (or something more extreme) assuming H₀ is true. A small p-value means your observed result would be unusual if H₀ were correct, which suggests H₀ may be false.
Think of it this way: if you flip a coin 100 times and get 92 heads, that result would be extremely rare if the coin were fair. The low probability of that outcome (the small p-value) is what makes you suspect the coin isn't fair.
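The coin example can be checked directly. This is a minimal sketch using only the standard library: it sums exact binomial probabilities to get the two-sided p-value for 92 heads in 100 flips of a fair coin.

```python
from math import comb

def two_sided_p(heads, n):
    """Two-sided p-value for `heads` in `n` flips of a fair coin.

    Sums the binomial probabilities of every outcome at least as far
    from n/2 as the observed count (symmetric because p = 0.5).
    """
    hi = max(heads, n - heads)  # boundary of the more extreme tail
    tail = sum(comb(n, k) for k in range(hi, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

print(two_sided_p(92, 100))  # astronomically small: doubt the fair coin
```

The result is far below any conventional significance level, which is exactly why 92 heads makes you suspect the coin.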

P-Values and Significance Levels
The significance level (α) is the threshold you set before collecting data. It defines how rare a result needs to be before you'll reject H₀. Common choices are 0.01, 0.05, and 0.10.
Calculating and using the p-value:
- Compute the test statistic from your sample data (the specific formula depends on the type of test).
- Find the p-value: the probability of getting a test statistic as extreme as (or more extreme than) what you observed, assuming H₀ is true.
- Compare the p-value to α:
- If p-value ≤ α, reject H₀.
- If p-value > α, fail to reject H₀.
The significance level must be chosen before you look at the data. Changing α after seeing your p-value undermines the entire framework.
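The three steps above can be sketched with a one-sample two-sided z-test. This is an illustrative example, not a general-purpose implementation: the sample values are made up, and the population standard deviation is assumed known.

```python
from math import erf, sqrt

def z_test(sample_mean, mu0, sigma, n):
    """Two-sided one-sample z-test (population sigma assumed known)."""
    # Step 1: test statistic — how many standard errors the sample
    # mean lies from the hypothesized mean mu0.
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    # Step 2: p-value from the standard normal CDF (via erf).
    cdf = 0.5 * (1 + erf(z / sqrt(2)))
    p_value = 2 * min(cdf, 1 - cdf)
    return z, p_value

alpha = 0.05  # chosen before seeing the data
z, p = z_test(sample_mean=52.3, mu0=50, sigma=8, n=40)
# Step 3: compare the p-value to alpha.
decision = "reject H0" if p <= alpha else "fail to reject H0"
print(f"z = {z:.2f}, p = {p:.3f}: {decision}")
```

With these particular numbers the p-value lands just above 0.05, so the test fails to reject H₀ at that level (but would reject at α = 0.10).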

Interpretation of Hypothesis Test Results
Getting the decision right is only half the job. You also need to state what the decision means in context.
When you reject H₀:
- You conclude there is sufficient evidence to support Hₐ.
- The observed result is statistically significant at the chosen α level.
- This does not prove Hₐ is true. It means the data would be unlikely if H₀ were true, so you act as though H₀ is false.
When you fail to reject H₀:
- You conclude there is insufficient evidence to support Hₐ.
- The observed result could plausibly be due to chance or sampling variability.
- This does not prove H₀ is true. You simply don't have strong enough evidence to reject it. The phrasing matters: never say you "accept" H₀.
Context and practical significance:
Statistical significance doesn't automatically mean the result matters in the real world. A drug trial might find a statistically significant blood pressure reduction of 0.5 mmHg, but that's too small to be clinically meaningful. Always consider the effect size (the magnitude of the difference or relationship) alongside the p-value when drawing conclusions.
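The gap between statistical and practical significance is easy to demonstrate. In this sketch (the 0.5 mmHg effect is from the example above; a population standard deviation of 10 mmHg is an assumption for illustration), the effect size never changes, yet the p-value shrinks as the sample grows.

```python
from math import erf, sqrt

def two_sided_p(effect, sigma, n):
    # p-value for an observed mean difference `effect`, via a z-test.
    z = effect / (sigma / sqrt(n))
    cdf = 0.5 * (1 + erf(z / sqrt(2)))
    return 2 * min(cdf, 1 - cdf)

# Same clinically trivial 0.5 mmHg reduction, ever-larger trials:
for n in (100, 10_000, 1_000_000):
    print(f"n = {n:>9,}  p = {two_sided_p(0.5, 10, n):.2e}")
```

With enough patients, even a meaningless effect becomes "significant", which is why the effect size must be reported alongside the p-value.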

Sampling Distribution and Confidence Intervals
- The sampling distribution is the theoretical distribution of a statistic (like x̄) you'd get from taking many repeated samples of the same size from the same population. It's what allows you to calculate p-values in the first place.
- A confidence interval gives a range of plausible values for the true population parameter. For example, a 95% confidence interval means that if you repeated the sampling process many times, about 95% of the intervals constructed would contain the true parameter. Confidence intervals and hypothesis tests are closely related: if a 95% confidence interval for a mean does not contain the hypothesized value in H₀, you would reject H₀ at α = 0.05.
- Power is the probability of correctly rejecting a false H₀. Higher power means you're less likely to miss a real effect. Power increases with larger sample sizes, larger true effect sizes, and higher α levels. A power analysis before collecting data helps you determine the sample size needed to detect an effect of a given size.
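The duality between confidence intervals and tests can be seen with a quick sketch. The numbers here are illustrative (sample mean 52.3, σ assumed known at 8, n = 40), and 1.96 is the standard normal critical value for 95% confidence.

```python
from math import sqrt

def ci_95(sample_mean, sigma, n):
    # 95% CI for a population mean with known sigma; 1.96 leaves
    # 2.5% of the standard normal in each tail.
    margin = 1.96 * sigma / sqrt(n)
    return sample_mean - margin, sample_mean + margin

lo, hi = ci_95(sample_mean=52.3, sigma=8, n=40)
mu0 = 50
# Duality: mu0 inside the 95% CI  <=>  a two-sided test at
# alpha = 0.05 fails to reject H0: mu = mu0.
print(f"CI = ({lo:.2f}, {hi:.2f}), contains mu0: {lo <= mu0 <= hi}")
```

Here the interval contains 50, so a two-sided test of H₀: μ = 50 at α = 0.05 would fail to reject, as the duality predicts.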
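The effect of sample size on power can also be computed directly. This is a sketch for a two-sided z-test at α = 0.05 (critical value 1.96), with an assumed true mean of 52.3 against H₀: μ = 50 and σ = 8.

```python
from math import erf, sqrt

def phi(x):
    # Standard normal CDF.
    return 0.5 * (1 + erf(x / sqrt(2)))

def power(mu_true, mu0, sigma, n, z_crit=1.96):
    # Probability the test statistic lands in the rejection region
    # when the true mean is mu_true (i.e., H0 is false).
    shift = (mu_true - mu0) / (sigma / sqrt(n))
    return (1 - phi(z_crit - shift)) + phi(-z_crit - shift)

# Same true effect, growing sample size -> growing power.
for n in (20, 40, 80):
    print(f"n = {n}: power = {power(52.3, 50, 8, n):.3f}")
```

Doubling the sample size repeatedly raises the power, which is the quantitative version of "larger samples are less likely to miss a real effect".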