Statistical fallacies are the traps that turn good data into bad conclusions. These errors show up everywhere: in research studies, news headlines, business decisions, and on your exams. Understanding them isn't just about avoiding mistakes. It's about demonstrating mastery of core statistical principles like independence, sampling theory, conditional probability, and the distinction between correlation and causation.
Each fallacy represents a violation of a specific statistical principle. When you encounter a fallacy question, you're really being asked to identify which principle was broken. Don't just memorize the names. Know what concept each fallacy illustrates and why the reasoning fails.
These fallacies involve misunderstanding how variables relate to each other. The core principle: association between variables tells you nothing about the direction or mechanism of influence without proper experimental design.
Two correlated variables may have no causal relationship. Correlation can arise from coincidence, confounding variables, or reverse causation (where the presumed effect actually causes the presumed cause).
A trend visible in every subgroup can reverse when the data is aggregated. This happens when a lurking variable affects both the grouping and the outcome, and the subgroups have very different sizes.
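The reversal is easy to reproduce with small counts. The sketch below uses hypothetical treatment data (the treatment names, severity split, and all numbers are invented for illustration): treatment A wins in every subgroup, yet loses overall because it was mostly tried on the harder cases.

```python
# Illustrative success counts (successes, trials) for two hypothetical
# treatments A and B, split by case severity (the lurking variable).
groups = {
    "mild":   {"A": (81, 87),   "B": (234, 270)},
    "severe": {"A": (192, 263), "B": (55, 80)},
}

def rate(successes, trials):
    return successes / trials

# Within each subgroup, A has the higher success rate.
for severity, g in groups.items():
    print(severity, rate(*g["A"]), rate(*g["B"]))

# Aggregated over both subgroups, B looks better -- because A was
# mostly assigned the harder (severe) cases, and severe cases have
# lower success rates regardless of treatment.
total = {t: tuple(map(sum, zip(*(groups[g][t] for g in groups))))
         for t in "AB"}
print("overall", rate(*total["A"]), rate(*total["B"]))
```

Note how the subgroup sizes differ sharply (A: 87 mild vs. 263 severe), which is exactly the condition described above.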
Extreme measurements naturally move closer to the average on subsequent measurements. This is a mathematical inevitability driven by random variation, not a real change in underlying performance.
Compare: Correlation ≠ Causation vs. Simpson's Paradox: both involve misreading relationships between variables, but correlation errors ignore confounders while Simpson's Paradox involves confounders that reverse apparent effects when data is stratified. If a problem presents aggregated vs. disaggregated data showing opposite trends, that's Simpson's Paradox.
These fallacies stem from misunderstanding how probability works, especially regarding independence and conditional probability. The principle: past outcomes of independent events provide zero information about future outcomes.
Independent events have no memory. The probability of heads on a fair coin is always 1/2, regardless of previous flips.
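The no-memory property is easy to check empirically. This minimal simulation looks at every flip that immediately follows a streak of five heads: if the gambler's intuition were right, tails would be "due," but the observed frequency stays at about 1/2.

```python
import random

random.seed(1)

# One million fair coin flips (True = heads).
flips = [random.random() < 0.5 for _ in range(1_000_000)]

# Collect the flip immediately after every run of 5 heads in a row.
after_streak = [flips[i + 5] for i in range(len(flips) - 5)
                if all(flips[i:i + 5])]

# The coin has no memory: still ~50% heads after a 5-heads streak.
p = sum(after_streak) / len(after_streak)
print(round(p, 3))
```

The streaks themselves are common (tens of thousands in a million flips), so the estimate is stable; it is the conditional probability that refuses to move.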
Ignoring the prior probability (base rate) of an event leads to wildly incorrect conclusions about conditional probabilities. Bayes' Theorem corrects this: P(A | B) = P(B | A) · P(A) / P(B).
Medical testing illustrates this well. Suppose a disease affects 1 in 10,000 people and a test has 99% sensitivity and 99% specificity. In a population of 10,000, about 1 person truly has the disease (and tests positive), but about 100 healthy people also test positive (1% false positive rate × 9,999). So a positive result means roughly a 1-in-101 chance of actually having the disease. The base rate of the disease is so low that false positives vastly outnumber true positives.
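The arithmetic in this example can be verified directly with Bayes' Theorem. A minimal sketch using the numbers above:

```python
# Numbers from the medical-testing example: prevalence 1 in 10,000,
# sensitivity and specificity both 99%.
prevalence  = 1 / 10_000
sensitivity = 0.99   # P(positive | disease)
specificity = 0.99   # P(negative | no disease)

# Total probability of a positive test: true positives + false positives.
p_positive = (sensitivity * prevalence
              + (1 - specificity) * (1 - prevalence))

# Bayes' Theorem: P(disease | positive).
posterior = sensitivity * prevalence / p_positive
print(round(posterior, 4))  # -> 0.0098, roughly 1 in 100
```

The prior (1/10,000) dominates the 99% accuracy figure, which is why the posterior lands near 1% rather than near 99%.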
Compare: Gambler's Fallacy vs. Base Rate Fallacy: both involve probability errors, but the gambler's fallacy misunderstands independence while the base rate fallacy misunderstands conditional probability. The gambler ignores that events are independent; the base rate ignorer fails to weight prior probabilities correctly using Bayes' Theorem.
These fallacies occur when the data you analyze doesn't represent the population you care about. The principle: conclusions are only valid for the population from which you properly sampled.
Analyzing only successful cases systematically ignores failures, creating false optimism about success rates. The missing data is invisible by definition.
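A hypothetical fund-database simulation (all numbers invented) makes the skew concrete: the full population of funds averages roughly zero return, but once the money-losing funds shut down and disappear from the records, the survivors' average looks impressively positive.

```python
import random

random.seed(3)

# 1,000 hypothetical funds, each with a true annual return ~ N(0%, 10%).
returns = [random.gauss(0.0, 0.10) for _ in range(1_000)]

# Funds that lose money shut down and vanish from the database:
# an analyst later sees only the survivors.
survivors = [r for r in returns if r > 0]

all_mean = sum(returns) / len(returns)
surv_mean = sum(survivors) / len(survivors)
print(round(all_mean, 3), round(surv_mean, 3))  # survivors look far better
```

The failures are invisible by definition, so the surviving sample can only skew in one direction: toward success.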
Non-representative samples invalidate inference. Conclusions drawn from biased samples don't generalize to the population.
Group-level patterns don't necessarily apply to individuals. Aggregate statistics describe averages, not specific cases.
Compare: Survivorship Bias vs. Sampling Bias: both produce unrepresentative data, but survivorship bias specifically excludes failures or non-survivors, always skewing results toward success. Sampling bias can skew in any direction depending on the selection mechanism. Survivorship bias is a specific type of selection bias with a predictable direction.
These fallacies involve how we handle and interpret data after collection. The principle: honest analysis requires considering all relevant evidence and building models that generalize, not just fit the data you already have.
Selective reporting means highlighting only data that supports a predetermined conclusion while suppressing contradictory evidence.
Models that are too complex capture noise rather than signal. They fit the training data perfectly but fail on new data.
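A minimal synthetic sketch of the idea: the maximally flexible "model" below simply memorizes every training point (1-nearest-neighbor, an extreme of model complexity), while the simple model is just the underlying line. The flexible model achieves zero error on its training data but loses to the simple model on fresh data, because what it memorized was largely noise.

```python
import random

random.seed(2)

# True relationship: y = 2x + noise.
def sample(n):
    return [(x, 2 * x + random.gauss(0, 1))
            for x in (random.uniform(0, 10) for _ in range(n))]

train, test = sample(200), sample(200)

# Overfit "model": memorize every training point exactly (zero training
# error), predicting the y of the nearest memorized x for new inputs.
def overfit(x):
    return min(train, key=lambda p: abs(p[0] - x))[1]

# Simple model: the low-complexity line that captures the signal.
def simple(x):
    return 2 * x

def mse(model, data):
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

print(mse(overfit, train), mse(simple, train))  # overfit wins on training data
print(mse(overfit, test), mse(simple, test))    # simple wins on new data
```

Perfect training fit is precisely the warning sign: the memorizer's test error includes the noise it faithfully reproduced plus the noise in the new data.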
Compare: Cherry-Picking vs. Overfitting: both lead to conclusions that won't replicate, but cherry-picking is a data selection problem while overfitting is a model complexity problem. Cherry-picking manipulates which data enters the analysis; overfitting manipulates how flexibly the model conforms to that data.
| Concept | Best Examples |
|---|---|
| Causation errors | Correlation ≠ Causation, Simpson's Paradox |
| Probability misunderstanding | Gambler's Fallacy, Base Rate Fallacy |
| Selection/sampling problems | Survivorship Bias, Sampling Bias, Ecological Fallacy |
| Natural variation | Regression to the Mean |
| Data manipulation | Cherry-Picking |
| Model complexity | Overfitting |
| Aggregation problems | Simpson's Paradox, Ecological Fallacy |
| Independence violations | Gambler's Fallacy |
A company notices that employees who attend training sessions have higher performance reviews. They conclude the training is effective. Which two fallacies might be at play, and how would you design a study to establish causation?
Compare and contrast survivorship bias and sampling bias. Both involve unrepresentative data. What distinguishes when each applies?
A basketball player makes 10 free throws in a row. Her coach benches her for the next game, and she only makes 6 of 10. The coach claims the rest helped her "come back to earth." What fallacy explains the decline without invoking the coach's theory?
A rare disease affects 1 in 10,000 people. A test is 99% accurate (both sensitivity and specificity). If someone tests positive, why might they still probably not have the disease? Which fallacy does ignoring this represent?
An analyst builds a model with 50 predictor variables that explains 98% of variance in historical stock returns but performs terribly on new data. Identify the fallacy and explain what statistical principle was violated.