Hypothesis testing is the backbone of statistical inference in engineering—it's how you move from "I collected some data" to "I can make a defensible decision." Whether you're determining if a manufacturing process meets specifications, comparing two designs, or validating a model, you're running hypothesis tests. The core tension you're being tested on is the tradeoff between Type I errors, Type II errors, power, and sample size—these aren't isolated concepts but interconnected pieces of the same puzzle.
Don't just memorize definitions here. Every exam question about hypothesis testing is really asking: Do you understand the logic of statistical evidence? That means knowing when to use which test, how changing α ripples through your entire analysis, and why a p-value of 0.049 doesn't mean you've "proven" anything. Master the conceptual framework, and the formulas become tools rather than obstacles.
The Logical Framework: Hypotheses and Decisions
Before running any test, you need to set up the decision structure correctly. Hypothesis testing is fundamentally about quantifying evidence against a default assumption.
Null and Alternative Hypotheses
The null hypothesis (H0) represents no effect, no difference, or the status quo; this is what you assume true until evidence suggests otherwise
The alternative hypothesis (H1 or Ha) states what you're trying to demonstrate—the presence of an effect, difference, or relationship
The burden of proof falls on rejecting H0; you never "accept" the null, only fail to reject it when evidence is insufficient
One-Tailed and Two-Tailed Tests
One-tailed tests concentrate all of α in one direction—use when you only care if a parameter is greater than or less than a value, not both
Two-tailed tests split α between both tails—appropriate when any difference from H0 matters, regardless of direction
Choosing incorrectly affects your critical values and p-value interpretation; FRQs often test whether you can justify your choice
Compare: One-tailed vs. two-tailed tests—both use the same test statistic, but one-tailed tests have more power to detect effects in the specified direction at the cost of ignoring effects in the opposite direction. If an FRQ gives you a directional research question ("Is the new process faster?"), that's your cue for one-tailed.
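To make the distinction concrete, here is a minimal sketch in Python with hypothetical data; it assumes SciPy's ttest_1samp and its alternative argument (available in recent SciPy versions). The same sample gives different p-values depending on whether the alternative is directional.

```python
# Hypothetical example: is the new process faster (mean cycle time below 10)?
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
times = rng.normal(loc=9.6, scale=1.0, size=25)   # simulated cycle times
mu0 = 10.0                                        # current-process mean under H0

two_sided = stats.ttest_1samp(times, popmean=mu0, alternative="two-sided")  # H1: mu != 10
one_sided = stats.ttest_1samp(times, popmean=mu0, alternative="less")       # H1: mu < 10

print(f"two-tailed p = {two_sided.pvalue:.4f}")
print(f"one-tailed p = {one_sided.pvalue:.4f}")   # smaller when the effect lies in the stated direction
```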
Error Types and Tradeoffs
The heart of hypothesis testing is managing uncertainty. Every decision carries risk, and your job is to control which risks you're willing to accept.
Type I and Type II Errors
Type I error (α)—rejecting H0 when it's actually true; a "false positive" that leads you to claim an effect that doesn't exist
Type II error (β)—failing to reject H0 when it's actually false; a "false negative" that causes you to miss a real effect
The fundamental tradeoff means reducing α increases β for a fixed sample size—you can't minimize both without collecting more data
Significance Level (α)
The significance level is your pre-set threshold for rejecting H0, typically α=0.05 or α=0.01 in engineering applications
It equals the probability of committing a Type I error—if H0 is true, you'll incorrectly reject it α×100% of the time
Choosing α depends on consequences; use smaller values when false positives are costly (e.g., approving a faulty safety system)
Power of a Test
Power equals 1−β—the probability of correctly rejecting H0 when the alternative is true; higher power means better detection
Three levers increase power: larger sample size n, larger true effect size, and higher significance level α
Target power of 0.80 or higher is standard; underpowered studies waste resources by being unlikely to detect real effects
Compare: Type I vs. Type II errors—both are mistakes, but Type I means acting on a false signal while Type II means missing a true signal. In quality control, Type I might mean halting production unnecessarily; Type II might mean shipping defective products. Know which matters more for your context.
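The interplay between α, β, n, and effect size is easy to see numerically. The sketch below (hypothetical effect size and σ, assuming a one-sided one-sample Z-test with known σ) computes power directly from the normal distribution: lowering α reduces power at a fixed n, and raising n recovers it.

```python
# Power of a one-sided one-sample Z-test with known sigma (hypothetical numbers).
import numpy as np
from scipy import stats

def z_test_power(effect, sigma, n, alpha):
    """P(reject H0) when the true mean exceeds mu0 by `effect` (one-sided test)."""
    z_crit = stats.norm.ppf(1 - alpha)           # rejection threshold on the Z scale
    shift = effect / (sigma / np.sqrt(n))        # true shift measured in standard errors
    return 1 - stats.norm.cdf(z_crit - shift)

sigma, effect = 2.0, 0.5
for n in (20, 50, 100):
    for alpha in (0.01, 0.05):
        print(f"n={n:3d}, alpha={alpha:.2f} -> power = {z_test_power(effect, sigma, n, alpha):.3f}")
# Smaller alpha (fewer Type I errors) means lower power (more Type II errors) at fixed n.
```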
Measuring Evidence: Test Statistics and P-Values
Once your framework is set, you need to quantify how surprising your data is. The test statistic transforms raw data into a standardized measure of evidence.
Test Statistic
A test statistic is a standardized value (like Z, T, χ2, or F) that measures how far your sample result falls from what H0 predicts
The formula depends on your test but always compares observed data to expected values under the null, scaled by variability
Larger absolute values indicate stronger evidence against H0; the distribution of the statistic determines what "large" means
Critical Region
The critical region (or rejection region) contains test statistic values that lead to rejecting H0—determined by α and the test's distribution
For a two-tailed test at α=0.05, the critical region is the most extreme 2.5% in each tail
Decision rule: if your calculated test statistic falls in the critical region, reject H0; otherwise, fail to reject
P-Value
The p-value is the probability of observing a test statistic as extreme as (or more extreme than) yours, assuming H0 is true
Compare to α: if p<α, reject H0; if p≥α, fail to reject—this is equivalent to the critical region approach
Critical misconception: the p-value is not the probability that H0 is true; it's a measure of data compatibility with H0
Compare: Critical region approach vs. p-value approach—both give identical decisions, but p-values provide more granular information about evidence strength. Exams may ask you to use either method; know that p<α is equivalent to the test statistic falling in the critical region.
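The equivalence is worth seeing once with numbers. A small sketch with hypothetical summary statistics for a two-tailed Z-test: the statistic lands in the critical region exactly when p < α.

```python
# Critical-region decision vs. p-value decision for a two-tailed Z-test (hypothetical data).
import numpy as np
from scipy import stats

xbar, mu0, sigma, n, alpha = 10.4, 10.0, 1.2, 40, 0.05

z = (xbar - mu0) / (sigma / np.sqrt(n))       # test statistic
z_crit = stats.norm.ppf(1 - alpha / 2)        # two-tailed critical value (about 1.96)
p_value = 2 * stats.norm.sf(abs(z))           # two-tailed p-value

print(f"z = {z:.3f}, critical value = {z_crit:.3f}, p = {p_value:.4f}")
print("reject via critical region:", abs(z) > z_crit)
print("reject via p-value:        ", p_value < alpha)   # always the same decision
```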
Choosing the Right Test: Parameters and Distributions
Different parameters require different test statistics. The choice depends on what you're testing, what you know, and what assumptions hold.
Z-Test
Use when population variance σ2 is known or when n>30 (Central Limit Theorem justifies normal approximation)
Test statistic: Z = (X̄ − μ0) / (σ/√n) follows the standard normal distribution under H0
Most common application is testing a population mean against a hypothesized value with large samples
T-Test
Use when population variance is unknown and you're estimating it with sample variance s2, especially for small samples (n≤30)
Test statistic: T = (X̄ − μ0) / (s/√n) follows a t-distribution with n−1 degrees of freedom
Variants include one-sample (one mean vs. value), independent two-sample (comparing two means), and paired (matched observations)
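For reference, the three variants map onto three different SciPy calls. The sketch below uses made-up data; equal_var=False selects the Welch version of the two-sample test, a common choice when the two variances may differ.

```python
# The three t-test variants (hypothetical data throughout).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(5.2, 1.0, size=15)                 # one sample
y = rng.normal(4.8, 1.0, size=15)                 # an independent second sample
before = rng.normal(100.0, 5.0, size=12)
after = before - rng.normal(2.0, 1.0, size=12)    # paired measurements on the same units

print(stats.ttest_1samp(x, popmean=5.0))          # one mean vs. a hypothesized value
print(stats.ttest_ind(x, y, equal_var=False))     # two independent means (Welch's t-test)
print(stats.ttest_rel(before, after))             # paired / matched observations
```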
Chi-Square Test
Use for categorical data (goodness-of-fit, independence) or for testing population variance
Test statistic: χ² = Σ (Oi − Ei)² / Ei for categorical tests; χ² = (n−1)s² / σ0² for variance tests
Assumes normality for variance testing; for categorical tests, expected frequencies should generally be ≥5
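A goodness-of-fit sketch with hypothetical counts (are defects spread evenly across four lines?), using SciPy's chisquare; note that every expected count is at least 5.

```python
# Chi-square goodness-of-fit on hypothetical defect counts from four production lines.
from scipy import stats

observed = [18, 25, 30, 27]                   # defects per line
expected = [sum(observed) / 4] * 4            # H0: equal rates, so 25 expected per line

chi2_stat, p = stats.chisquare(observed, f_exp=expected)   # df = k - 1 = 3
print(f"chi-square = {chi2_stat:.2f}, p = {p:.3f}")
```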
F-Test
Use to compare two population variances or in ANOVA to compare means across multiple groups
Test statistic: F = s1² / s2² follows an F-distribution with (n1−1, n2−1) degrees of freedom
Always place the larger variance in the numerator for a right-tailed test; highly sensitive to non-normality
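A minimal variance-ratio sketch with hypothetical samples; it follows the convention above of putting the larger sample variance in the numerator and reading the right tail of the F distribution.

```python
# Right-tailed F-test for equality of two variances (hypothetical measurements).
import numpy as np
from scipy import stats

a = np.array([4.1, 3.9, 4.5, 4.0, 4.3, 3.8, 4.6, 4.2])
b = np.array([4.0, 4.1, 3.9, 4.0, 4.2, 4.1, 4.0, 3.9])

s_a, s_b = np.var(a, ddof=1), np.var(b, ddof=1)
if s_a < s_b:                                  # larger variance goes in the numerator
    a, b = b, a
    s_a, s_b = s_b, s_a

F = s_a / s_b
p = stats.f.sf(F, len(a) - 1, len(b) - 1)      # right-tail probability
print(f"F = {F:.2f}, p = {p:.3f}")
```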
Compare: Z-test vs. T-test—both test means, but Z requires known σ while T estimates it from data. As n→∞, the t-distribution approaches the standard normal, so they converge for large samples. When in doubt with unknown variance, use T.
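One quick way to see the convergence: compare two-tailed critical values as n grows (α = 0.05).

```python
# t critical values approach the z critical value as the sample size grows.
from scipy import stats

print(f"z critical: {stats.norm.ppf(0.975):.3f}")            # 1.960
for n in (5, 15, 30, 100, 1000):
    print(f"n = {n:4d}: t critical = {stats.t.ppf(0.975, df=n - 1):.3f}")
```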
Testing Specific Parameters
Each type of parameter has its own testing procedure. Match the parameter to the appropriate test and verify assumptions.
Hypothesis Testing for Population Mean
Setup: H0: μ = μ0 vs. H1: μ ≠ μ0 (or one-tailed alternatives μ > μ0 or μ < μ0)
Use Z-test if σ known or n large; use T-test if σ unknown and n small
Assumption: data are normally distributed, or n is large enough for CLT to apply (typically n≥30)
Hypothesis Testing for Population Proportion
Setup: H0: p = p0 vs. appropriate alternative; used for binary outcomes (success/failure)
Test statistic: Z = (p̂ − p0) / √(p0(1−p0)/n), where p̂ is the sample proportion
Validity requires np0 ≥ 10 and n(1−p0) ≥ 10 for the normal approximation to the binomial
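Putting the pieces together, a sketch with hypothetical counts; it checks the validity conditions before computing the statistic.

```python
# One-sample Z-test for a proportion (hypothetical defect counts).
import numpy as np
from scipy import stats

n, defects, p0 = 200, 28, 0.10                # 28 defective parts out of 200; H0: p = 0.10
p_hat = defects / n

assert n * p0 >= 10 and n * (1 - p0) >= 10    # normal-approximation conditions

z = (p_hat - p0) / np.sqrt(p0 * (1 - p0) / n)
p_value = 2 * stats.norm.sf(abs(z))           # two-tailed
print(f"p_hat = {p_hat:.3f}, z = {z:.2f}, p = {p_value:.3f}")
```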
Hypothesis Testing for Population Variance
Setup: H0: σ² = σ0² vs. alternatives about variance being different, larger, or smaller
Test statistic: χ² = (n−1)s² / σ0² follows a chi-square distribution with n−1 degrees of freedom
Strongly assumes normality—this test is not robust to departures, unlike mean tests with large samples
Compare: Testing means vs. testing variances—mean tests (Z, T) are robust to mild non-normality for large n, but variance tests (χ2) are highly sensitive to the normality assumption. Always check distributional assumptions more carefully for variance tests.
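For completeness, a variance-test sketch with hypothetical data; the two-tailed p-value doubles the smaller tail probability of the chi-square distribution, and the normality caveat above applies in full.

```python
# Chi-square test for a population variance (hypothetical measurements).
import numpy as np
from scipy import stats

x = np.array([10.2, 9.8, 10.5, 10.1, 9.6, 10.4, 10.0, 9.9, 10.3, 10.2])
sigma0_sq = 0.04                               # H0: sigma^2 = 0.04

n = len(x)
s_sq = np.var(x, ddof=1)
chi2_stat = (n - 1) * s_sq / sigma0_sq         # chi-square with n - 1 df under H0
p = 2 * min(stats.chi2.cdf(chi2_stat, n - 1), stats.chi2.sf(chi2_stat, n - 1))
print(f"s^2 = {s_sq:.4f}, chi-square = {chi2_stat:.2f}, p = {p:.3f}")
```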
Confidence Intervals and Sample Size
Confidence intervals and hypothesis tests are two sides of the same coin. A 95% CI contains all parameter values you wouldn't reject at α=0.05.
Confidence Intervals
A (1−α)×100% confidence interval provides a range of plausible values for the parameter based on sample data
Interpretation: if you repeated the sampling process many times, (1−α)×100% of constructed intervals would contain the true parameter
Connection to testing: if μ0 falls outside a 95% CI for μ, you'd reject H0:μ=μ0 at α=0.05
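The duality is easy to verify directly. A sketch with hypothetical data: build the 95% t-interval by hand, then check that μ0 lies outside it exactly when the one-sample t-test rejects at α = 0.05.

```python
# 95% confidence interval for a mean and its link to the t-test (hypothetical data).
import numpy as np
from scipy import stats

x = np.array([12.1, 11.8, 12.6, 12.3, 11.9, 12.4, 12.2, 12.5])
mu0, alpha = 11.5, 0.05

n, xbar, s = len(x), x.mean(), x.std(ddof=1)
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
half_width = t_crit * s / np.sqrt(n)
ci = (xbar - half_width, xbar + half_width)

p = stats.ttest_1samp(x, popmean=mu0).pvalue
print(f"95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
print("mu0 inside CI:", ci[0] <= mu0 <= ci[1], "| reject H0 (p < alpha):", p < alpha)
```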
Sample Size Determination
Formula for means: n = (zα/2 · σ / E)², where E is the desired margin of error
For specified power: sample size depends on α, desired power 1−β, and the effect size you want to detect
Larger n always helps—reduces standard error, narrows confidence intervals, and increases power
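A short computation of the margin-of-error formula with hypothetical σ and E; round up, since a fractional observation is not an option.

```python
# Sample size to estimate a mean within margin of error E (hypothetical sigma and E).
import math
from scipy import stats

sigma, E, alpha = 2.5, 0.5, 0.05
z = stats.norm.ppf(1 - alpha / 2)              # z_{alpha/2}
n = math.ceil((z * sigma / E) ** 2)            # n = (z * sigma / E)^2, rounded up
print(f"required n = {n}")                     # about 97 for these inputs
```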
Compare: Confidence intervals vs. hypothesis tests—CIs give a range of plausible values while tests give a binary decision. CIs are often more informative because they show effect magnitude, not just statistical significance. Many journals now require both.
Advanced Testing Methods
Some situations require more sophisticated approaches. These methods handle complex models, multiple comparisons, and distribution-free inference.
Likelihood Ratio Test
Compares nested models by computing Λ = L(θ̂0) / L(θ̂), the ratio of maximized likelihoods under H0 and H1
Test statistic: −2 ln Λ asymptotically follows a χ² distribution with degrees of freedom equal to the difference in the number of free parameters
Powerful and general—works for complex models where simpler tests don't apply
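As a concrete (hypothetical) instance, the sketch below tests a fixed exponential failure rate against an unrestricted one; the full model has one extra free parameter, so −2 ln Λ is compared to a χ² distribution with 1 degree of freedom.

```python
# Likelihood ratio test: exponential lifetimes, H0: rate = 0.5 vs. an unrestricted rate.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.exponential(scale=1 / 0.8, size=60)    # simulated lifetimes (true rate 0.8)

def loglik(rate, data):
    return np.sum(stats.expon.logpdf(data, scale=1 / rate))

rate0 = 0.5                                    # restricted model (H0)
rate_hat = len(x) / x.sum()                    # MLE of the rate under the full model
lr_stat = -2 * (loglik(rate0, x) - loglik(rate_hat, x))
p = stats.chi2.sf(lr_stat, df=1)               # 1 df: one extra free parameter
print(f"-2 ln(Lambda) = {lr_stat:.2f}, p = {p:.4f}")
```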
Multiple Hypothesis Testing
Problem: testing m hypotheses at α=0.05 each gives a family-wise error rate ≈ 1 − (1 − 0.05)^m, which grows quickly
Bonferroni correction uses α/m for each test—simple but conservative, especially for large m
Alternative methods include Holm's procedure (less conservative) and false discovery rate control (for genomics, imaging)
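A short numeric illustration with hypothetical p-values: the family-wise error rate grows quickly with m, and Bonferroni simply compares each p-value to α/m.

```python
# Family-wise error rate inflation and the Bonferroni correction (hypothetical p-values).
m, alpha = 5, 0.05
fwer = 1 - (1 - alpha) ** m                      # assumes independent tests, all nulls true
print(f"chance of at least one false positive across {m} tests: {fwer:.3f}")   # about 0.23

p_values = [0.004, 0.019, 0.030, 0.041, 0.210]
rejected = [p < alpha / m for p in p_values]     # compare each p to alpha / m = 0.01
print("rejected after Bonferroni:", rejected)    # only the smallest p-value survives
```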
Non-Parametric Tests
Use when normality assumptions fail and transformations don't help, or when data are ordinal rather than interval
Examples: Mann-Whitney U (compares two groups), Kruskal-Wallis (compares multiple groups), Wilcoxon signed-rank (paired data)
Trade-off: fewer assumptions but generally less power than parametric tests when parametric assumptions hold
Compare: Parametric vs. non-parametric tests—parametric tests (Z, T, F) assume specific distributions and are more powerful when assumptions hold. Non-parametric tests are safer when assumptions are violated but sacrifice power. Check assumptions first; don't default to non-parametric out of laziness.
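A side-by-side sketch with skewed hypothetical data: Welch's t-test and the Mann-Whitney U test applied to the same two samples. With heavy skew and small n, the rank-based test's weaker assumptions are the safer bet.

```python
# Parametric vs. non-parametric comparison of two groups (hypothetical skewed data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
a = rng.lognormal(mean=0.0, sigma=0.8, size=20)    # heavily skewed samples
b = rng.lognormal(mean=0.5, sigma=0.8, size=20)

print(stats.ttest_ind(a, b, equal_var=False))               # Welch's t-test (parametric)
print(stats.mannwhitneyu(a, b, alternative="two-sided"))     # rank-based, no normality assumption
```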
Quick Reference Table
Error tradeoffs: Type I/II errors, significance level, power
Evidence quantification: Test statistic, p-value, critical region
Testing means: Z-test, T-test, hypothesis testing for population mean
Testing other parameters: Chi-square (variance), Z-test for proportions, F-test (comparing variances)
Test selection criteria: One-tailed vs. two-tailed, sample size, known vs. unknown variance
Interval estimation: Confidence intervals, sample size determination
Advanced methods: Likelihood ratio test, multiple testing corrections, non-parametric tests
Self-Check Questions
You're testing whether a new algorithm is faster than the current one. Should you use a one-tailed or two-tailed test, and why does this choice affect your power?
A colleague reports p=0.03 and concludes there's a 3% chance the null hypothesis is true. What's wrong with this interpretation, and what does the p-value actually measure?
Compare the Z-test and T-test: under what conditions do they give nearly identical results, and when does the choice between them matter most?
If you decrease α from 0.05 to 0.01 while keeping sample size fixed, what happens to (a) Type I error probability, (b) Type II error probability, and (c) power? Explain the tradeoff.
You need to compare failure rates across five different manufacturing processes. Why would using five separate pairwise tests at α=0.05 be problematic, and what correction would you apply?