Hypothesis testing is the backbone of statistical inference in engineering—it's how you move from "I collected some data" to "I can make a defensible decision." Whether you're determining if a manufacturing process meets specifications, comparing two designs, or validating a model, you're running hypothesis tests. The core tension you're being tested on is the tradeoff between Type I errors, Type II errors, power, and sample size—these aren't isolated concepts but interconnected pieces of the same puzzle.
Don't just memorize definitions here. Every exam question about hypothesis testing is really asking: Do you understand the logic of statistical evidence? That means knowing when to use which test, how changing α ripples through your entire analysis, and why a p-value of 0.049 doesn't mean you've "proven" anything. Master the conceptual framework, and the formulas become tools rather than obstacles.
The Logical Framework: Hypotheses and Decisions
Before running any test, you need to set up the decision structure correctly. Hypothesis testing is fundamentally about quantifying evidence against a default assumption.
Null and Alternative Hypotheses
The null hypothesis (H0) represents no effect, no difference, or the status quo; this is what you assume true until evidence suggests otherwise
The alternative hypothesis (H1 or Ha) states what you're trying to demonstrate—the presence of an effect, difference, or relationship
The burden of proof falls on rejecting H0; you never "accept" the null, only fail to reject it when evidence is insufficient
One-Tailed and Two-Tailed Tests
One-tailed tests concentrate all of α in one direction—use when you only care if a parameter is greater than or less than a value, not both
Two-tailed tests split α between both tails—appropriate when any difference from H0 matters, regardless of direction
Choosing incorrectly affects your critical values and p-value interpretation; FRQs often test whether you can justify your choice
Compare: One-tailed vs. two-tailed tests—both use the same test statistic, but one-tailed tests have more power to detect effects in the specified direction at the cost of ignoring effects in the opposite direction. If an FRQ gives you a directional research question ("Is the new process faster?"), that's your cue for one-tailed.
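To make the distinction concrete, here is a minimal sketch in Python with hypothetical data; it assumes SciPy's ttest_1samp and its alternative argument (available in recent SciPy versions). The same sample gives different p-values depending on whether the alternative is directional.

```python
# Hypothetical example: is the new process faster (mean cycle time below 10)?
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
times = rng.normal(loc=9.6, scale=1.0, size=25)   # simulated cycle times
mu0 = 10.0                                        # current-process mean under H0

two_sided = stats.ttest_1samp(times, popmean=mu0, alternative="two-sided")  # H1: mu != 10
one_sided = stats.ttest_1samp(times, popmean=mu0, alternative="less")       # H1: mu < 10

print(f"two-tailed p = {two_sided.pvalue:.4f}")
print(f"one-tailed p = {one_sided.pvalue:.4f}")   # smaller when the effect lies in the stated direction
```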
Error Types and Tradeoffs
The heart of hypothesis testing is managing uncertainty. Every decision carries risk, and your job is to control which risks you're willing to accept.
Type I and Type II Errors
Type I error (α)—rejecting H0 when it's actually true; a "false positive" that leads you to claim an effect that doesn't exist
Type II error (β)—failing to reject H0 when it's actually false; a "false negative" that causes you to miss a real effect
The fundamental tradeoff means reducing α increases β for a fixed sample size—you can't minimize both without collecting more data
Significance Level (α)
The significance level is your pre-set threshold for rejecting H0, typically α=0.05 or α=0.01 in engineering applications
It equals the probability of committing a Type I error—if H0 is true, you'll incorrectly reject it α×100% of the time
Choosing α depends on consequences; use smaller values when false positives are costly (e.g., approving a faulty safety system)
Power of a Test
Power equals 1−β—the probability of correctly rejecting H0 when the alternative is true; higher power means better detection
Three levers increase power: larger sample size n, larger true effect size, and higher significance level α
Target power of 0.80 or higher is standard; underpowered studies waste resources by being unlikely to detect real effects
Compare: Type I vs. Type II errors—both are mistakes, but Type I means acting on a false signal while Type II means missing a true signal. In quality control, Type I might mean halting production unnecessarily; Type II might mean shipping defective products. Know which matters more for your context.
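The interplay between α, β, n, and effect size is easy to see numerically. The sketch below (hypothetical effect size and σ, assuming a one-sided one-sample Z-test with known σ) computes power directly from the normal distribution: lowering α reduces power at a fixed n, and raising n recovers it.

```python
# Power of a one-sided one-sample Z-test with known sigma (hypothetical numbers).
import numpy as np
from scipy import stats

def z_test_power(effect, sigma, n, alpha):
    """P(reject H0) when the true mean exceeds mu0 by `effect` (one-sided test)."""
    z_crit = stats.norm.ppf(1 - alpha)           # rejection threshold on the Z scale
    shift = effect / (sigma / np.sqrt(n))        # true shift measured in standard errors
    return 1 - stats.norm.cdf(z_crit - shift)

sigma, effect = 2.0, 0.5
for n in (20, 50, 100):
    for alpha in (0.01, 0.05):
        print(f"n={n:3d}, alpha={alpha:.2f} -> power = {z_test_power(effect, sigma, n, alpha):.3f}")
# Smaller alpha (fewer Type I errors) means lower power (more Type II errors) at fixed n.
```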
Measuring Evidence: Test Statistics and P-Values
Once your framework is set, you need to quantify how surprising your data is. The test statistic transforms raw data into a standardized measure of evidence.
Test Statistic
A test statistic is a standardized value (like Z, T, χ2, or F) that measures how far your sample result falls from what H0 predicts
The formula depends on your test but always compares observed data to expected values under the null, scaled by variability
Larger absolute values indicate stronger evidence against H0; the distribution of the statistic determines what "large" means
Critical Region
The critical region (or rejection region) contains test statistic values that lead to rejecting H0—determined by α and the test's distribution
For a two-tailed test at α=0.05, the critical region is the most extreme 2.5% in each tail
Decision rule: if your calculated test statistic falls in the critical region, reject H0; otherwise, fail to reject
P-Value
The p-value is the probability of observing a test statistic as extreme as (or more extreme than) yours, assuming H0 is true
Compare to α: if p<α, reject H0; if p≥α, fail to reject—this is equivalent to the critical region approach
Critical misconception: the p-value is not the probability that H0 is true; it's a measure of data compatibility with H0
Compare: Critical region approach vs. p-value approach—both give identical decisions, but p-values provide more granular information about evidence strength. Exams may ask you to use either method; know that p<α is equivalent to the test statistic falling in the critical region.
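The equivalence is worth seeing once with numbers. A small sketch with hypothetical summary statistics for a two-tailed Z-test: the statistic lands in the critical region exactly when p < α.

```python
# Critical-region decision vs. p-value decision for a two-tailed Z-test (hypothetical data).
import numpy as np
from scipy import stats

xbar, mu0, sigma, n, alpha = 10.4, 10.0, 1.2, 40, 0.05

z = (xbar - mu0) / (sigma / np.sqrt(n))       # test statistic
z_crit = stats.norm.ppf(1 - alpha / 2)        # two-tailed critical value (about 1.96)
p_value = 2 * stats.norm.sf(abs(z))           # two-tailed p-value

print(f"z = {z:.3f}, critical value = {z_crit:.3f}, p = {p_value:.4f}")
print("reject via critical region:", abs(z) > z_crit)
print("reject via p-value:        ", p_value < alpha)   # always the same decision
```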
Choosing the Right Test: Parameters and Distributions
Different parameters require different test statistics. The choice depends on what you're testing, what you know, and what assumptions hold.
Z-Test
Use when population variance σ2 is known or when n>30 (Central Limit Theorem justifies normal approximation)
Test statistic: Z = (X̄ − μ0) / (σ/√n) follows the standard normal distribution under H0
Most common application is testing a population mean against a hypothesized value with large samples
T-Test
Use when population variance is unknown and you're estimating it with sample variance s2, especially for small samples (n≤30)
Test statistic: T = (X̄ − μ0) / (s/√n) follows a t-distribution with n−1 degrees of freedom
Variants include one-sample (one mean vs. value), independent two-sample (comparing two means), and paired (matched observations)
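For reference, the three variants map onto three different SciPy calls. The sketch below uses made-up data; equal_var=False selects the Welch version of the two-sample test, a common choice when the two variances may differ.

```python
# The three t-test variants (hypothetical data throughout).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(5.2, 1.0, size=15)                 # one sample
y = rng.normal(4.8, 1.0, size=15)                 # an independent second sample
before = rng.normal(100.0, 5.0, size=12)
after = before - rng.normal(2.0, 1.0, size=12)    # paired measurements on the same units

print(stats.ttest_1samp(x, popmean=5.0))          # one mean vs. a hypothesized value
print(stats.ttest_ind(x, y, equal_var=False))     # two independent means (Welch's t-test)
print(stats.ttest_rel(before, after))             # paired / matched observations
```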
Chi-Square Test
Use for categorical data (goodness-of-fit, independence) or for testing population variance
Test statistic: χ² = Σ (Oi − Ei)² / Ei for categorical tests; χ² = (n−1)s² / σ0² for variance tests
Assumes normality for variance testing; for categorical tests, expected frequencies should generally be ≥5
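A goodness-of-fit sketch with hypothetical counts (are defects spread evenly across four lines?), using SciPy's chisquare; note that every expected count is at least 5.

```python
# Chi-square goodness-of-fit on hypothetical defect counts from four production lines.
from scipy import stats

observed = [18, 25, 30, 27]                   # defects per line
expected = [sum(observed) / 4] * 4            # H0: equal rates, so 25 expected per line

chi2_stat, p = stats.chisquare(observed, f_exp=expected)   # df = k - 1 = 3
print(f"chi-square = {chi2_stat:.2f}, p = {p:.3f}")
```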
F-Test
Use to compare two population variances or in ANOVA to compare means across multiple groups
Test statistic: F = s1² / s2² follows an F-distribution with (n1−1, n2−1) degrees of freedom
Always place the larger variance in the numerator for a right-tailed test; highly sensitive to non-normality
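A minimal variance-ratio sketch with hypothetical samples; it follows the convention above of putting the larger sample variance in the numerator and reading the right tail of the F distribution.

```python
# Right-tailed F-test for equality of two variances (hypothetical measurements).
import numpy as np
from scipy import stats

a = np.array([4.1, 3.9, 4.5, 4.0, 4.3, 3.8, 4.6, 4.2])
b = np.array([4.0, 4.1, 3.9, 4.0, 4.2, 4.1, 4.0, 3.9])

s_a, s_b = np.var(a, ddof=1), np.var(b, ddof=1)
if s_a < s_b:                                  # larger variance goes in the numerator
    a, b = b, a
    s_a, s_b = s_b, s_a

F = s_a / s_b
p = stats.f.sf(F, len(a) - 1, len(b) - 1)      # right-tail probability
print(f"F = {F:.2f}, p = {p:.3f}")
```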
Compare: Z-test vs. T-test—both test means, but Z requires known σ while T estimates it from data. As n→∞, the t-distribution approaches the standard normal, so they converge for large samples. When in doubt with unknown variance, use T.
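One quick way to see the convergence: compare two-tailed critical values as n grows (α = 0.05).

```python
# t critical values approach the z critical value as the sample size grows.
from scipy import stats

print(f"z critical: {stats.norm.ppf(0.975):.3f}")            # 1.960
for n in (5, 15, 30, 100, 1000):
    print(f"n = {n:4d}: t critical = {stats.t.ppf(0.975, df=n - 1):.3f}")
```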
Testing Specific Parameters
Each type of parameter has its own testing procedure. Match the parameter to the appropriate test and verify assumptions.
Hypothesis Testing for Population Mean
Setup: H0: μ = μ0 vs. H1: μ ≠ μ0 (or one-tailed alternatives μ > μ0 or μ < μ0)
Use Z-test if σ known or n large; use T-test if σ unknown and n small
Assumption: data are normally distributed, or n is large enough for CLT to apply (typically n≥30)
Hypothesis Testing for Population Proportion
Setup: H0: p = p0 vs. appropriate alternative; used for binary outcomes (success/failure)
Test statistic: Z = (p̂ − p0) / √(p0(1−p0)/n), where p̂ is the sample proportion
Validity requires np0 ≥ 10 and n(1−p0) ≥ 10 for the normal approximation to the binomial
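Putting the pieces together, a sketch with hypothetical counts; it checks the validity conditions before computing the statistic.

```python
# One-sample Z-test for a proportion (hypothetical defect counts).
import numpy as np
from scipy import stats

n, defects, p0 = 200, 28, 0.10                # 28 defective parts out of 200; H0: p = 0.10
p_hat = defects / n

assert n * p0 >= 10 and n * (1 - p0) >= 10    # normal-approximation conditions

z = (p_hat - p0) / np.sqrt(p0 * (1 - p0) / n)
p_value = 2 * stats.norm.sf(abs(z))           # two-tailed
print(f"p_hat = {p_hat:.3f}, z = {z:.2f}, p = {p_value:.3f}")
```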
Hypothesis Testing for Population Variance
Setup: H0: σ² = σ0² vs. alternatives about variance being different, larger, or smaller
Test statistic: χ² = (n−1)s² / σ0² follows a chi-square distribution with n−1 degrees of freedom
Strongly assumes normality—this test is not robust to departures, unlike mean tests with large samples
Compare: Testing means vs. testing variances—mean tests (Z, T) are robust to mild non-normality for large n, but variance tests (χ2) are highly sensitive to the normality assumption. Always check distributional assumptions more carefully for variance tests.
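For completeness, a variance-test sketch with hypothetical data; the two-tailed p-value doubles the smaller tail probability of the chi-square distribution, and the normality caveat above applies in full.

```python
# Chi-square test for a population variance (hypothetical measurements).
import numpy as np
from scipy import stats

x = np.array([10.2, 9.8, 10.5, 10.1, 9.6, 10.4, 10.0, 9.9, 10.3, 10.2])
sigma0_sq = 0.04                               # H0: sigma^2 = 0.04

n = len(x)
s_sq = np.var(x, ddof=1)
chi2_stat = (n - 1) * s_sq / sigma0_sq         # chi-square with n - 1 df under H0
p = 2 * min(stats.chi2.cdf(chi2_stat, n - 1), stats.chi2.sf(chi2_stat, n - 1))
print(f"s^2 = {s_sq:.4f}, chi-square = {chi2_stat:.2f}, p = {p:.3f}")
```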
Confidence Intervals and Sample Size
Confidence intervals and hypothesis tests are two sides of the same coin. A 95% CI contains all parameter values you wouldn't reject at α=0.05.
Confidence Intervals
A (1−α)×100% confidence interval provides a range of plausible values for the parameter based on sample data
Interpretation: if you repeated the sampling process many times, (1−α)×100% of constructed intervals would contain the true parameter
Connection to testing: if μ0 falls outside a 95% CI for μ, you'd reject H0:μ=μ0 at α=0.05
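The duality is easy to verify directly. A sketch with hypothetical data: build the 95% t-interval by hand, then check that μ0 lies outside it exactly when the one-sample t-test rejects at α = 0.05.

```python
# 95% confidence interval for a mean and its link to the t-test (hypothetical data).
import numpy as np
from scipy import stats

x = np.array([12.1, 11.8, 12.6, 12.3, 11.9, 12.4, 12.2, 12.5])
mu0, alpha = 11.5, 0.05

n, xbar, s = len(x), x.mean(), x.std(ddof=1)
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
half_width = t_crit * s / np.sqrt(n)
ci = (xbar - half_width, xbar + half_width)

p = stats.ttest_1samp(x, popmean=mu0).pvalue
print(f"95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
print("mu0 inside CI:", ci[0] <= mu0 <= ci[1], "| reject H0 (p < alpha):", p < alpha)
```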
Sample Size Determination
Formula for means: n = (zα/2 · σ / E)², where E is the desired margin of error
For specified power: sample size depends on α, desired power 1−β, and the effect size you want to detect
Larger n always helps—reduces standard error, narrows confidence intervals, and increases power
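A short computation of the margin-of-error formula with hypothetical σ and E; round up, since a fractional observation is not an option.

```python
# Sample size to estimate a mean within margin of error E (hypothetical sigma and E).
import math
from scipy import stats

sigma, E, alpha = 2.5, 0.5, 0.05
z = stats.norm.ppf(1 - alpha / 2)              # z_{alpha/2}
n = math.ceil((z * sigma / E) ** 2)            # n = (z * sigma / E)^2, rounded up
print(f"required n = {n}")                     # about 97 for these inputs
```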
Compare: Confidence intervals vs. hypothesis tests—CIs give a range of plausible values while tests give a binary decision. CIs are often more informative because they show effect magnitude, not just statistical significance. Many journals now require both.
Advanced Testing Methods
Some situations require more sophisticated approaches. These methods handle complex models, multiple comparisons, and distribution-free inference.
Likelihood Ratio Test
Compares nested models by computing Λ = L(θ̂0) / L(θ̂), the ratio of maximized likelihoods under H0 and H1
Test statistic: −2 ln Λ asymptotically follows a χ² distribution with degrees of freedom equal to the difference in the number of free parameters
Powerful and general—works for complex models where simpler tests don't apply
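As a concrete (hypothetical) instance, the sketch below tests a fixed exponential failure rate against an unrestricted one; the full model has one extra free parameter, so −2 ln Λ is compared to a χ² distribution with 1 degree of freedom.

```python
# Likelihood ratio test: exponential lifetimes, H0: rate = 0.5 vs. an unrestricted rate.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.exponential(scale=1 / 0.8, size=60)    # simulated lifetimes (true rate 0.8)

def loglik(rate, data):
    return np.sum(stats.expon.logpdf(data, scale=1 / rate))

rate0 = 0.5                                    # restricted model (H0)
rate_hat = len(x) / x.sum()                    # MLE of the rate under the full model
lr_stat = -2 * (loglik(rate0, x) - loglik(rate_hat, x))
p = stats.chi2.sf(lr_stat, df=1)               # 1 df: one extra free parameter
print(f"-2 ln(Lambda) = {lr_stat:.2f}, p = {p:.4f}")
```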
Multiple Hypothesis Testing
Problem: testing m hypotheses at α=0.05 each gives a family-wise error rate ≈ 1 − (1 − 0.05)^m, which grows quickly
Bonferroni correction uses α/m for each test—simple but conservative, especially for large m
Alternative methods include Holm's procedure (less conservative) and false discovery rate control (for genomics, imaging)
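A short numeric illustration with hypothetical p-values: the family-wise error rate grows quickly with m, and Bonferroni simply compares each p-value to α/m.

```python
# Family-wise error rate inflation and the Bonferroni correction (hypothetical p-values).
m, alpha = 5, 0.05
fwer = 1 - (1 - alpha) ** m                      # assumes independent tests, all nulls true
print(f"chance of at least one false positive across {m} tests: {fwer:.3f}")   # about 0.23

p_values = [0.004, 0.019, 0.030, 0.041, 0.210]
rejected = [p < alpha / m for p in p_values]     # compare each p to alpha / m = 0.01
print("rejected after Bonferroni:", rejected)    # only the smallest p-value survives
```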
Non-Parametric Tests
Use when normality assumptions fail and transformations don't help, or when data are ordinal rather than interval
Examples: Mann-Whitney U (compares two groups), Kruskal-Wallis (compares multiple groups), Wilcoxon signed-rank (paired data)
Trade-off: fewer assumptions but generally less power than parametric tests when parametric assumptions hold
Compare: Parametric vs. non-parametric tests—parametric tests (Z, T, F) assume specific distributions and are more powerful when assumptions hold. Non-parametric tests are safer when assumptions are violated but sacrifice power. Check assumptions first; don't default to non-parametric out of laziness.
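A side-by-side sketch with skewed hypothetical data: Welch's t-test and the Mann-Whitney U test applied to the same two samples. With heavy skew and small n, the rank-based test's weaker assumptions are the safer bet.

```python
# Parametric vs. non-parametric comparison of two groups (hypothetical skewed data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
a = rng.lognormal(mean=0.0, sigma=0.8, size=20)    # heavily skewed samples
b = rng.lognormal(mean=0.5, sigma=0.8, size=20)

print(stats.ttest_ind(a, b, equal_var=False))               # Welch's t-test (parametric)
print(stats.mannwhitneyu(a, b, alternative="two-sided"))     # rank-based, no normality assumption
```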
Quick Reference Table
Error tradeoffs: Type I/II errors, significance level, power
Evidence quantification: Test statistic, p-value, critical region
Testing means: Z-test, T-test, hypothesis testing for population mean
Testing other parameters: Chi-square (variance), Z-test for proportions, F-test (comparing variances)
Test selection criteria: One-tailed vs. two-tailed, sample size, known vs. unknown variance
Interval estimation: Confidence intervals, sample size determination
Advanced methods: Likelihood ratio test, multiple testing corrections, non-parametric tests
Self-Check Questions
You're testing whether a new algorithm is faster than the current one. Should you use a one-tailed or two-tailed test, and why does this choice affect your power?
A colleague reports p=0.03 and concludes there's a 3% chance the null hypothesis is true. What's wrong with this interpretation, and what does the p-value actually measure?
Compare the Z-test and T-test: under what conditions do they give nearly identical results, and when does the choice between them matter most?
If you decrease α from 0.05 to 0.01 while keeping sample size fixed, what happens to (a) Type I error probability, (b) Type II error probability, and (c) power? Explain the tradeoff.
You need to compare failure rates across five different manufacturing processes. Why would using five separate pairwise tests at α=0.05 be problematic, and what correction would you apply?