
🃏Engineering Probability

Key Concepts in Hypothesis Testing


Why This Matters

Hypothesis testing is the backbone of statistical inference in engineering—it's how you move from "I collected some data" to "I can make a defensible decision." Whether you're determining if a manufacturing process meets specifications, comparing two designs, or validating a model, you're running hypothesis tests. The core tension you're being tested on is the tradeoff between Type I errors, Type II errors, power, and sample size—these aren't isolated concepts but interconnected pieces of the same puzzle.

Don't just memorize definitions here. Every exam question about hypothesis testing is really asking: Do you understand the logic of statistical evidence? That means knowing when to use which test, how changing $\alpha$ ripples through your entire analysis, and why a p-value of 0.049 doesn't mean you've "proven" anything. Master the conceptual framework, and the formulas become tools rather than obstacles.


The Logical Framework: Hypotheses and Decisions

Before running any test, you need to set up the decision structure correctly. Hypothesis testing is fundamentally about quantifying evidence against a default assumption.

Null and Alternative Hypotheses

  • The null hypothesis ($H_0$) represents no effect, no difference, or the status quo; this is what you assume true until evidence suggests otherwise
  • The alternative hypothesis ($H_1$ or $H_a$) states what you're trying to demonstrate—the presence of an effect, difference, or relationship
  • The burden of proof falls on rejecting $H_0$; you never "accept" the null, only fail to reject it when evidence is insufficient

One-Tailed and Two-Tailed Tests

  • One-tailed tests concentrate all of $\alpha$ in one direction—use when you only care if a parameter is greater than or less than a value, not both
  • Two-tailed tests split $\alpha$ between both tails—appropriate when any difference from $H_0$ matters, regardless of direction
  • Choosing incorrectly affects your critical values and p-value interpretation; FRQs often test whether you can justify your choice

Compare: One-tailed vs. two-tailed tests—both use the same test statistic, but one-tailed tests have more power to detect effects in the specified direction at the cost of ignoring effects in the opposite direction. If an FRQ gives you a directional research question ("Is the new process faster?"), that's your cue for one-tailed.
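
To make the difference concrete, here is a minimal sketch in Python (using SciPy) that evaluates the same Z statistic both ways; the value z = 1.80 and $\alpha = 0.05$ are assumed purely for illustration.

```python
# Hypothetical example: the same Z statistic evaluated one- and two-tailed.
from scipy.stats import norm

z = 1.80                              # illustrative test statistic (assumed)
alpha = 0.05

p_one_tailed = norm.sf(z)             # P(Z >= z): all of alpha in the upper tail
p_two_tailed = 2 * norm.sf(abs(z))    # alpha split between both tails

print(f"one-tailed p = {p_one_tailed:.4f} -> reject H0? {p_one_tailed < alpha}")
print(f"two-tailed p = {p_two_tailed:.4f} -> reject H0? {p_two_tailed < alpha}")
# One-tailed p is about 0.036 (reject); two-tailed p is about 0.072 (fail to
# reject): the directional test has more power in the specified direction.
```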


Error Types and Tradeoffs

The heart of hypothesis testing is managing uncertainty. Every decision carries risk, and your job is to control which risks you're willing to accept.

Type I and Type II Errors

  • Type I error ($\alpha$)—rejecting $H_0$ when it's actually true; a "false positive" that leads you to claim an effect that doesn't exist
  • Type II error ($\beta$)—failing to reject $H_0$ when it's actually false; a "false negative" that causes you to miss a real effect
  • The fundamental tradeoff means reducing $\alpha$ increases $\beta$ for a fixed sample size—you can't minimize both without collecting more data

Significance Level ($\alpha$)

  • The significance level is your pre-set threshold for rejecting $H_0$, typically $\alpha = 0.05$ or $\alpha = 0.01$ in engineering applications
  • It equals the probability of committing a Type I error—if $H_0$ is true, you'll incorrectly reject it $\alpha \times 100\%$ of the time
  • Choosing $\alpha$ depends on consequences; use smaller values when false positives are costly (e.g., approving a faulty safety system)

Power of a Test

  • Power equals $1 - \beta$—the probability of correctly rejecting $H_0$ when the alternative is true; higher power means better detection
  • Three levers increase power: larger sample size $n$, larger true effect size, and higher significance level $\alpha$
  • Target power of 0.80 or higher is standard; underpowered studies waste resources by being unlikely to detect real effects

Compare: Type I vs. Type II errors—both are mistakes, but Type I means acting on a false signal while Type II means missing a true signal. In quality control, Type I might mean halting production unnecessarily; Type II might mean shipping defective products. Know which matters more for your context.
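
The three levers and the tradeoff between $\alpha$ and $\beta$ can be computed directly. Below is a minimal sketch for an upper-tailed one-sample Z-test; the null mean, true mean, and $\sigma$ are assumed values chosen only to show how power moves with $n$ and $\alpha$.

```python
# Power of an upper-tailed one-sample Z-test: P(reject H0 | true mean = mu1).
import numpy as np
from scipy.stats import norm

mu0, mu1, sigma = 50.0, 52.0, 5.0     # assumed null mean, true mean, known sigma

def power(n, alpha):
    se = sigma / np.sqrt(n)
    z_crit = norm.isf(alpha)          # critical value for the upper tail
    # Reject when Xbar > mu0 + z_crit*se; evaluate that probability under mu1.
    return norm.sf(z_crit - (mu1 - mu0) / se)

for n in (10, 30, 100):
    for alpha in (0.05, 0.01):
        pw = power(n, alpha)
        print(f"n={n:3d}, alpha={alpha:.2f}: power={pw:.3f}, beta={1 - pw:.3f}")
# Larger n or larger alpha raises power; shrinking alpha to 0.01 raises beta.
```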


Measuring Evidence: Test Statistics and P-Values

Once your framework is set, you need to quantify how surprising your data is. The test statistic transforms raw data into a standardized measure of evidence.

Test Statistic

  • A test statistic is a standardized value (like $Z$, $T$, $\chi^2$, or $F$) that measures how far your sample result falls from what $H_0$ predicts
  • The formula depends on your test but always compares observed data to expected values under the null, scaled by variability
  • Larger absolute values indicate stronger evidence against $H_0$; the distribution of the statistic determines what "large" means

Critical Region

  • The critical region (or rejection region) contains test statistic values that lead to rejecting $H_0$—determined by $\alpha$ and the test's distribution
  • For a two-tailed test at $\alpha = 0.05$, the critical region is the most extreme 2.5% in each tail
  • Decision rule: if your calculated test statistic falls in the critical region, reject $H_0$; otherwise, fail to reject

P-Value

  • The p-value is the probability of observing a test statistic as extreme as (or more extreme than) yours, assuming $H_0$ is true
  • Compare to $\alpha$: if $p < \alpha$, reject $H_0$; if $p \geq \alpha$, fail to reject—this is equivalent to the critical region approach
  • Critical misconception: the p-value is not the probability that $H_0$ is true; it's a measure of data compatibility with $H_0$

Compare: Critical region approach vs. p-value approach—both give identical decisions, but p-values provide more granular information about evidence strength. Exams may ask you to use either method; know that $p < \alpha$ is equivalent to the test statistic falling in the critical region.
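
Here is a minimal sketch of that equivalence on one set of assumed summary numbers (the sample mean, known $\sigma$, and $n$ are invented for illustration): both approaches reach the same decision for a two-tailed Z-test.

```python
# Two-tailed Z-test for a mean with assumed summary numbers: the critical-region
# and p-value approaches produce the same decision.
import numpy as np
from scipy.stats import norm

xbar, mu0, sigma, n = 10.4, 10.0, 1.2, 36     # illustrative values
alpha = 0.05

z = (xbar - mu0) / (sigma / np.sqrt(n))       # test statistic
z_crit = norm.isf(alpha / 2)                  # two-tailed critical value (about 1.96)
p_value = 2 * norm.sf(abs(z))

print(f"z = {z:.3f}; critical region |z| > {z_crit:.3f} -> reject? {abs(z) > z_crit}")
print(f"p = {p_value:.4f}; p < alpha -> reject? {p_value < alpha}")
# Both lines report the same decision; the p-value additionally grades the evidence.
```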


Choosing the Right Test: Parameters and Distributions

Different parameters require different test statistics. The choice depends on what you're testing, what you know, and what assumptions hold.

Z-Test

  • Use when population variance $\sigma^2$ is known or when $n > 30$ (Central Limit Theorem justifies normal approximation)
  • Test statistic: $Z = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}}$ follows the standard normal distribution under $H_0$
  • Most common application is testing a population mean against a hypothesized value with large samples

T-Test

  • Use when population variance is unknown and you're estimating it with sample variance $s^2$, especially for small samples ($n \leq 30$)
  • Test statistic: $T = \frac{\bar{X} - \mu_0}{s / \sqrt{n}}$ follows a t-distribution with $n - 1$ degrees of freedom
  • Variants include one-sample (one mean vs. value), independent two-sample (comparing two means), and paired (matched observations)

Chi-Square Test

  • Use for categorical data (goodness-of-fit, independence) or for testing population variance
  • Test statistic: $\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$ for categorical tests; $\chi^2 = \frac{(n-1)s^2}{\sigma_0^2}$ for variance tests
  • Assumes normality for variance testing; for categorical tests, expected frequencies should generally be $\geq 5$ (see the goodness-of-fit sketch below)
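
As a concrete but hypothetical illustration of the goodness-of-fit version, the sketch below tests whether a six-sided die is fair; the observed counts are made up for the example.

```python
# Chi-square goodness-of-fit with hypothetical counts (a fair six-sided die).
from scipy.stats import chisquare

observed = [18, 22, 25, 15, 30, 10]           # assumed observed frequencies
expected = [sum(observed) / 6] * 6            # H0: all faces equally likely

stat, p = chisquare(f_obs=observed, f_exp=expected)
print(f"chi-square = {stat:.2f}, df = 5, p = {p:.4f}")
# Every expected count is 20 (>= 5), so the chi-square approximation is reasonable.
```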

F-Test

  • Use to compare two population variances or in ANOVA to compare means across multiple groups
  • Test statistic: $F = \frac{s_1^2}{s_2^2}$ follows an F-distribution with $(n_1 - 1, n_2 - 1)$ degrees of freedom
  • Always place the larger variance in the numerator for a right-tailed test; highly sensitive to non-normality

Compare: Z-test vs. T-test—both test means, but Z requires known $\sigma$ while T estimates it from data. As $n \to \infty$, the t-distribution approaches the standard normal, so they converge for large samples. When in doubt with unknown variance, use T.
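
A quick way to see that convergence is to compare two-tailed critical values; the sketch below uses SciPy's t and standard normal distributions.

```python
# Two-tailed critical values at alpha = 0.05: the t distribution approaches
# the standard normal as the degrees of freedom grow.
from scipy.stats import norm, t

alpha = 0.05
z_crit = norm.isf(alpha / 2)
for df in (5, 10, 30, 100, 1000):
    print(f"df = {df:4d}: t_crit = {t.isf(alpha / 2, df):.3f} (z_crit = {z_crit:.3f})")
# df = 5 gives about 2.571 vs 1.960; by df = 1000 the two are nearly identical,
# which is why Z and T tests agree closely for large samples.
```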


Testing Specific Parameters

Each type of parameter has its own testing procedure. Match the parameter to the appropriate test and verify assumptions.

Hypothesis Testing for Population Mean

  • Setup: $H_0: \mu = \mu_0$ vs. $H_1: \mu \neq \mu_0$ (or one-tailed alternatives $\mu > \mu_0$ or $\mu < \mu_0$)
  • Use Z-test if $\sigma$ known or $n$ large; use T-test if $\sigma$ unknown and $n$ small
  • Assumption: data are normally distributed, or $n$ is large enough for CLT to apply (typically $n \geq 30$)

Hypothesis Testing for Population Proportion

  • Setup: $H_0: p = p_0$ vs. appropriate alternative; used for binary outcomes (success/failure)
  • Test statistic: $Z = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}}$ where $\hat{p}$ is the sample proportion
  • Validity requires $np_0 \geq 10$ and $n(1-p_0) \geq 10$ for the normal approximation to the binomial (see the sketch below)
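
A minimal sketch of the proportion test follows; the defect count, sample size, and hypothesized rate are assumed values chosen only for illustration.

```python
# One-proportion Z-test with hypothetical numbers: H0: p = 0.10 (defect rate)
# against H1: p > 0.10, using the normal approximation to the binomial.
import numpy as np
from scipy.stats import norm

n, defects, p0 = 200, 29, 0.10                # assumed sample size and count
assert n * p0 >= 10 and n * (1 - p0) >= 10    # normal-approximation check

p_hat = defects / n
z = (p_hat - p0) / np.sqrt(p0 * (1 - p0) / n)
p_value = norm.sf(z)                          # upper-tailed alternative
print(f"p_hat = {p_hat:.3f}, z = {z:.3f}, p = {p_value:.4f}")
```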

Hypothesis Testing for Population Variance

  • Setup: $H_0: \sigma^2 = \sigma_0^2$ vs. alternatives about variance being different, larger, or smaller
  • Test statistic: $\chi^2 = \frac{(n-1)s^2}{\sigma_0^2}$ follows a chi-square distribution with $n-1$ degrees of freedom (see the sketch below)
  • Strongly assumes normality—this test is not robust to departures, unlike mean tests with large samples
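
A minimal sketch of the variance test with assumed numbers (a hypothetical tolerance spec of $\sigma_0^2 = 0.25$ and an invented sample variance):

```python
# Chi-square test for a variance with assumed numbers:
# H0: sigma^2 = 0.25 vs H1: sigma^2 > 0.25 (e.g., a tolerance spec).
from scipy.stats import chi2

n, s2, sigma0_sq = 20, 0.36, 0.25             # illustrative sample variance
stat = (n - 1) * s2 / sigma0_sq               # chi-square statistic, df = n - 1
p_value = chi2.sf(stat, df=n - 1)             # upper-tailed alternative
print(f"chi2 = {stat:.2f}, df = {n - 1}, p = {p_value:.4f}")
# Remember that this test leans heavily on normality of the underlying data.
```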

Compare: Testing means vs. testing variances—mean tests (Z, T) are robust to mild non-normality for large $n$, but variance tests ($\chi^2$) are highly sensitive to the normality assumption. Always check distributional assumptions more carefully for variance tests.


Confidence Intervals and Sample Size

Confidence intervals and hypothesis tests are two sides of the same coin. A 95% CI contains all parameter values you wouldn't reject at $\alpha = 0.05$.

Confidence Intervals

  • A $(1-\alpha) \times 100\%$ confidence interval provides a range of plausible values for the parameter based on sample data
  • Interpretation: if you repeated the sampling process many times, $(1-\alpha) \times 100\%$ of constructed intervals would contain the true parameter
  • Connection to testing: if $\mu_0$ falls outside a 95% CI for $\mu$, you'd reject $H_0: \mu = \mu_0$ at $\alpha = 0.05$ (illustrated in the sketch below)
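
That duality is easy to verify numerically. The sketch below reuses the same assumed summary numbers as the earlier p-value example; the 95% CI excludes $\mu_0$ exactly when the two-tailed test rejects.

```python
# Duality between a 95% CI and a two-tailed Z-test at alpha = 0.05,
# using the same assumed summary numbers for both.
import numpy as np
from scipy.stats import norm

xbar, sigma, n, mu0 = 10.4, 1.2, 36, 10.0     # illustrative values
alpha = 0.05
se = sigma / np.sqrt(n)
z_crit = norm.isf(alpha / 2)

ci = (xbar - z_crit * se, xbar + z_crit * se)
z = (xbar - mu0) / se

print(f"95% CI = ({ci[0]:.3f}, {ci[1]:.3f}); mu0 inside? {ci[0] <= mu0 <= ci[1]}")
print(f"|z| = {abs(z):.3f} > {z_crit:.3f}? reject H0: {abs(z) > z_crit}")
# mu0 lies outside the interval exactly when the test rejects H0.
```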

Sample Size Determination

  • Formula for means: $n = \left(\frac{z_{\alpha/2} \cdot \sigma}{E}\right)^2$ where $E$ is the desired margin of error (see the sketch below)
  • For specified power: sample size depends on $\alpha$, desired power $1-\beta$, and the effect size you want to detect
  • Larger $n$ always helps—reduces standard error, narrows confidence intervals, and increases power
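
A minimal sketch of the margin-of-error calculation; $\sigma$ and $E$ are assumed values for illustration.

```python
# Sample size for a desired margin of error E in a mean estimate,
# assuming a known (or pilot-estimated) sigma.
import math
from scipy.stats import norm

sigma, E, alpha = 5.0, 1.0, 0.05              # assumed values for illustration
z = norm.isf(alpha / 2)                       # z_{alpha/2}, about 1.96
n = math.ceil((z * sigma / E) ** 2)           # always round up to the next integer
print(f"required n = {n}")                    # about 97 with these inputs
```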

Compare: Confidence intervals vs. hypothesis tests—CIs give a range of plausible values while tests give a binary decision. CIs are often more informative because they show the magnitude of the effect, not just statistical significance. Many journals now require both.


Advanced Testing Methods

Some situations require more sophisticated approaches. These methods handle complex models, multiple comparisons, and distribution-free inference.

Likelihood Ratio Test

  • Compares nested models by computing $\Lambda = \frac{L(\hat{\theta}_0)}{L(\hat{\theta})}$, the ratio of maximized likelihoods under $H_0$ and $H_1$
  • Test statistic: $-2 \ln \Lambda$ asymptotically follows $\chi^2$ with degrees of freedom equal to the difference in parameters (see the sketch below)
  • Powerful and general—works for complex models where simpler tests don't apply
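
As a hedged illustration, the sketch below runs a likelihood ratio test on the rate of an exponential distribution using simulated data; the hypothesized rate, true rate, and sample are invented for the example, and the restricted model fixes one parameter, so df = 1.

```python
# Likelihood ratio test sketch: exponential lifetimes, H0: rate lambda = 0.5
# against the unrestricted alternative (one extra free parameter, so df = 1).
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
x = rng.exponential(scale=1 / 0.4, size=40)   # simulated sample (true rate 0.4)
lam0 = 0.5                                    # hypothesized rate under H0

def loglik(lam, x):
    # Exponential log-likelihood: n*ln(lam) - lam*sum(x)
    return len(x) * np.log(lam) - lam * x.sum()

lam_hat = 1 / x.mean()                        # unrestricted MLE of the rate
stat = -2 * (loglik(lam0, x) - loglik(lam_hat, x))   # -2 ln(Lambda)
p_value = chi2.sf(stat, df=1)
print(f"lambda_hat = {lam_hat:.3f}, -2 ln Lambda = {stat:.3f}, p = {p_value:.4f}")
```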

Multiple Hypothesis Testing

  • Problem: testing $m$ hypotheses at $\alpha = 0.05$ each gives a family-wise error rate $\approx 1 - (1-0.05)^m$, which grows quickly
  • Bonferroni correction uses $\alpha/m$ for each test—simple but conservative, especially for large $m$ (see the sketch below)
  • Alternative methods include Holm's procedure (less conservative) and false discovery rate control (for genomics, imaging)
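
The sketch below computes how quickly the family-wise error rate grows for independent tests and the corresponding Bonferroni per-test threshold.

```python
# Family-wise error rate when running m independent tests at alpha = 0.05,
# and the Bonferroni-corrected per-test threshold that restores control.
alpha = 0.05
for m in (1, 5, 10, 20, 100):
    fwer = 1 - (1 - alpha) ** m               # P(at least one false positive)
    print(f"m = {m:3d}: FWER = {fwer:.3f}, Bonferroni per-test alpha = {alpha / m:.5f}")
# With m = 20 the uncorrected FWER is already about 0.64.
```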

Non-Parametric Tests

  • Use when normality assumptions fail and transformations don't help, or when data are ordinal rather than interval
  • Examples: Mann-Whitney U (compares two groups), Kruskal-Wallis (compares multiple groups), Wilcoxon signed-rank (paired data)
  • Trade-off: fewer assumptions but generally less power than parametric tests when parametric assumptions hold

Compare: Parametric vs. non-parametric tests—parametric tests (Z, T, F) assume specific distributions and are more powerful when assumptions hold. Non-parametric tests are safer when assumptions are violated but sacrifice power. Check assumptions first; don't default to non-parametric out of laziness.
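
To see the tradeoff, the sketch below runs both a Welch two-sample t-test and a Mann-Whitney U test on simulated, skewed samples; the lognormal parameters are arbitrary choices for illustration.

```python
# Parametric vs. non-parametric on the same (simulated, skewed) samples:
# Welch's t-test compares means, Mann-Whitney U compares distributions by rank.
import numpy as np
from scipy.stats import ttest_ind, mannwhitneyu

rng = np.random.default_rng(1)
a = rng.lognormal(mean=0.0, sigma=0.8, size=25)   # skewed group A (simulated)
b = rng.lognormal(mean=0.4, sigma=0.8, size=25)   # skewed group B (simulated)

t_stat, t_p = ttest_ind(a, b, equal_var=False)              # parametric (Welch)
u_stat, u_p = mannwhitneyu(a, b, alternative="two-sided")   # rank-based

print(f"Welch t-test:   t = {t_stat:.2f}, p = {t_p:.4f}")
print(f"Mann-Whitney U: U = {u_stat:.1f}, p = {u_p:.4f}")
# With heavy skew, the rank-based comparison often behaves more reliably.
```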


Quick Reference Table

| Concept | Best Examples |
| --- | --- |
| Error tradeoffs | Type I/II errors, significance level, power |
| Evidence quantification | Test statistic, p-value, critical region |
| Testing means | Z-test, T-test, hypothesis testing for population mean |
| Testing other parameters | Chi-square (variance), Z-test for proportions, F-test (comparing variances) |
| Test selection criteria | One-tailed vs. two-tailed, sample size, known vs. unknown variance |
| Interval estimation | Confidence intervals, sample size determination |
| Advanced methods | Likelihood ratio test, multiple testing corrections, non-parametric tests |

Self-Check Questions

  1. You're testing whether a new algorithm is faster than the current one. Should you use a one-tailed or two-tailed test, and why does this choice affect your power?

  2. A colleague reports $p = 0.03$ and concludes there's a 3% chance the null hypothesis is true. What's wrong with this interpretation, and what does the p-value actually measure?

  3. Compare the Z-test and T-test: under what conditions do they give nearly identical results, and when does the choice between them matter most?

  4. If you decrease $\alpha$ from 0.05 to 0.01 while keeping sample size fixed, what happens to (a) Type I error probability, (b) Type II error probability, and (c) power? Explain the tradeoff.

  5. You need to compare failure rates across five different manufacturing processes. Why would using five separate pairwise tests at $\alpha = 0.05$ be problematic, and what correction would you apply?