🎲Data Science Statistics

Types of Sampling Methods

Why This Matters

Sampling is the backbone of statistical inference—and you're being tested on your ability to choose the right method for a given scenario, not just define terms. Every dataset you analyze in data science starts with how the data was collected, and flawed sampling leads to biased estimates, invalid confidence intervals, and misleading conclusions. The exam will push you to understand when each method works, why it reduces (or introduces) bias, and how sampling design affects the validity of your inferences.

The key concepts here revolve around randomization, representativeness, and practical constraints. You'll need to distinguish between probability and non-probability methods, recognize trade-offs between precision and cost, and identify when a sampling approach threatens external validity. Don't just memorize the list—know what statistical principle each method leverages and what can go wrong when assumptions are violated.


Probability Sampling Methods

These methods ensure every member of the population has a known, non-zero probability of selection. This property is what allows us to calculate standard errors, construct confidence intervals, and make valid inferences about the population.

Simple Random Sampling (SRS)

  • Every unit has an equal selection probability, and every possible sample of size n is equally likely; this is the gold standard baseline that other methods are compared against
  • Implementation uses random number generators or lottery methods; requires a complete sampling frame listing all population members
  • Eliminates selection bias but may miss rare subgroups by chance, especially in smaller samples
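
To make the SRS bullets concrete, here is a minimal sketch using Python's standard library. The frame of 1,000 ID numbers, the sample size, and the function name `simple_random_sample` are all made up for illustration.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n units without replacement; every unit is equally likely to be chosen."""
    rng = random.Random(seed)      # seeded generator so the draw can be reproduced
    return rng.sample(frame, n)    # sampling without replacement

# Hypothetical sampling frame: a complete list of 1,000 unit IDs.
frame = list(range(1, 1001))
srs = simple_random_sample(frame, n=50, seed=42)

print(len(srs), "units drawn,", len(set(srs)), "distinct")
```

Notice that the function cannot run without `frame`, which mirrors the practical requirement above: no complete list of the population, no SRS.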

Stratified Sampling

  • Population divided into homogeneous strata before sampling—strata are subgroups sharing a characteristic relevant to the outcome variable
  • Samples drawn independently from each stratum, guaranteeing representation of all key subgroups (e.g., age brackets, income levels)
  • Reduces variance in estimates compared to SRS when strata have different means; enables subgroup analysis with adequate sample sizes
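
A short sketch of proportional stratified sampling, assuming a hypothetical population of 1,000 people tagged with an age bracket. The function name, the 10% sampling fraction, and the bracket labels are illustrative choices, not part of any standard API.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(units, stratum_of, fraction, seed=None):
    """Split units into strata, then draw an independent SRS from every stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for u in units:
        strata[stratum_of(u)].append(u)
    sample = []
    for label in sorted(strata):
        members = strata[label]
        k = max(1, round(fraction * len(members)))  # proportional allocation, at least 1
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population: (id, age_bracket) pairs with unequal stratum sizes.
population = (
    [(i, "18-34") for i in range(600)]
    + [(i, "35-64") for i in range(600, 900)]
    + [(i, "65+") for i in range(900, 1000)]
)
strat = stratified_sample(population, stratum_of=lambda u: u[1], fraction=0.1, seed=0)
counts = Counter(bracket for _, bracket in strat)
print(counts)
```

Every bracket is guaranteed to appear, including the small "65+" group that an SRS of the same size could under-represent by chance.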

Cluster Sampling

  • Entire clusters (not individuals) are randomly selected; clusters are naturally occurring groups like schools, hospitals, or geographic areas
  • Cost-effective for geographically dispersed populations since you only need a list of clusters and access to the selected ones, not a sampling frame of every individual
  • Increases sampling variance when units within clusters are similar to each other; the design effect quantifies this efficiency loss
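
The sketch below shows one-stage cluster sampling on a hypothetical set of 20 schools with 30 students each. The `design_effect` helper uses a commonly taught approximation for equal-sized clusters, deff ≈ 1 + (m − 1)ρ, where m is the cluster size and ρ is the intraclass correlation; the value ρ = 0.05 is an assumed number chosen only for illustration.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """One-stage cluster sample: pick whole clusters at random, keep every unit in them."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    units = [u for name in chosen for u in clusters[name]]
    return chosen, units

def design_effect(cluster_size, icc):
    """Approximate design effect for equal-sized clusters: 1 + (m - 1) * rho."""
    return 1 + (cluster_size - 1) * icc

# Hypothetical population: 20 schools, 30 students per school.
clusters = {f"school_{i:02d}": [f"s{i:02d}_{j}" for j in range(30)] for i in range(20)}
chosen, units = cluster_sample(clusters, n_clusters=4, seed=1)

deff = design_effect(cluster_size=30, icc=0.05)  # roughly 2.45
print(chosen, len(units), round(deff, 2))
```

Under these assumed numbers, the 120 sampled students carry about as much information as 120 / 2.45 ≈ 49 students drawn by SRS, which is the efficiency loss the last bullet describes.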

Compare: Stratified vs. Cluster Sampling—both divide populations into groups, but stratified sampling takes individuals from every stratum while cluster sampling takes all individuals from selected clusters. If an FRQ asks about reducing variance, stratified is your answer; if it asks about reducing costs for spread-out populations, think cluster.

Systematic Sampling

  • Select every $k$th unit after a random starting point, where $k = \frac{N}{n}$ (N = population size, n = desired sample size)
  • Easier to implement than SRS when you have an ordered list but no random number generator readily available
  • Risk of periodicity bias: if the list has a hidden pattern matching your interval $k$, estimates become severely biased
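
Here is a minimal sketch of systematic sampling, followed by a deliberately contrived demonstration of the periodicity problem. It assumes N is an exact multiple of n, and the "weekday" list is invented data with a 7-item cycle.

```python
import random

def systematic_sample(frame, n, seed=None):
    """Take every k-th unit after a random start, with k = N / n (N assumed divisible by n)."""
    rng = random.Random(seed)
    k = len(frame) // n           # sampling interval
    start = rng.randrange(k)      # random start among the first k positions
    return frame[start::k][:n]

# Ordinary case: an ordered list of 1,000 IDs, sample of 50, so k = 20.
frame = list(range(1000))
sys_sample = systematic_sample(frame, n=50, seed=5)

# Periodicity trap: 700 daily records labeled by weekday (0-6), sample of 100, so k = 7.
weekdays = [day % 7 for day in range(700)]
trapped = systematic_sample(weekdays, n=100, seed=5)

print(len(sys_sample), set(trapped))  # the second set contains a single weekday
```

Because the interval matches the weekly cycle, every sampled record falls on the same weekday, no matter which start is drawn. An SRS of the same size would mix weekdays.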

Multistage Sampling

  • Combines methods hierarchically—typically cluster sampling first, then SRS or stratified sampling within selected clusters
  • Essential for national surveys where no single sampling frame exists; each stage has its own selection probabilities
  • Requires complex variance estimation since each stage contributes to overall sampling error; standard formulas don't apply directly
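
A two-stage version can be sketched as follows: randomly choose clusters first, then draw an SRS inside each chosen cluster. The counties, resident IDs, and sample sizes are hypothetical, and the inclusion-probability helper assumes equal-sized clusters with SRS at both stages.

```python
import random

def two_stage_sample(clusters, n_clusters, n_per_cluster, seed=None):
    """Stage 1: SRS of clusters. Stage 2: SRS of units within each selected cluster."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: rng.sample(clusters[name], n_per_cluster) for name in chosen}

def inclusion_probability(total_clusters, n_clusters, cluster_size, n_per_cluster):
    """P(unit is sampled) = P(its cluster is chosen) * P(unit is chosen within the cluster)."""
    return (n_clusters / total_clusters) * (n_per_cluster / cluster_size)

# Hypothetical population: 20 counties with 500 residents each.
counties = {f"county_{i:02d}": [f"c{i:02d}_r{j}" for j in range(500)] for i in range(20)}
sample = two_stage_sample(counties, n_clusters=5, n_per_cluster=40, seed=3)

p = inclusion_probability(total_clusters=20, n_clusters=5, cluster_size=500, n_per_cluster=40)
print(len(sample), "counties,", sum(len(v) for v in sample.values()), "residents, p =", round(p, 4))
```

The product in `inclusion_probability` is why each stage needs its own selection probabilities: the overall chance of being sampled (about 0.02 here) is built from both stages, and so is the sampling error.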

Compare: Systematic vs. Simple Random Sampling—both aim for equal probability selection, but systematic is operationally simpler. The catch: SRS is always unbiased, while systematic sampling can be biased if population ordering has periodicity. When in doubt on an exam, SRS is the safer theoretical baseline.


Non-Probability Sampling Methods

These methods do not give all population members a known chance of selection. Statistical inference becomes problematic because you cannot calculate valid standard errors or confidence intervals—results describe only your sample, not the population.

Convenience Sampling

  • Samples whoever is easiest to reach—mall intercepts, online volunteers, students in your class
  • Fast and cheap but introduces severe selection bias since accessible individuals differ systematically from the population
  • Cannot support generalization; useful only for pilot testing instruments or generating hypotheses, never for final inference
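
The toy comparison below illustrates the selection-bias point. It rests on a strong, made-up assumption: incomes are the integers 1 to 1,000, and higher-income people are easier to reach, so the convenience sample is simply the 100 most reachable people.

```python
import random
from statistics import mean

# Hypothetical population: incomes 1..1000 in arbitrary units.
incomes = list(range(1, 1001))

# Assumed for illustration: reachability rises with income, so the convenience
# sample consists of the 100 highest incomes.
convenience = sorted(incomes, reverse=True)[:100]

# An SRS of the same size for comparison.
rng = random.Random(7)
srs = rng.sample(incomes, 100)

pop_mean = mean(incomes)        # 500.5
conv_mean = mean(convenience)   # 950.5
srs_mean = mean(srs)            # varies with the seed, but centers on the population mean

print(pop_mean, conv_mean, srs_mean)
```

Collecting more convenience data would not fix the gap, because the error comes from who is reachable, not from sample size.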

Quota Sampling

  • Researcher sets target numbers for subgroups (e.g., 50 men, 50 women) and fills quotas through non-random selection
  • Mimics stratified sampling's structure but lacks random selection within quotas, so selection bias persists
  • Common in market research where speed matters more than statistical validity; results are descriptive, not inferential
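
Quota filling can be sketched as a loop over people in the order they happen to arrive. The arrival stream, group labels, and quota sizes are invented for illustration; the point is that nothing in the loop is random.

```python
def quota_sample(arrivals, quotas, group_of):
    """Accept people in arrival order until every subgroup quota is filled."""
    remaining = dict(quotas)
    sample = []
    for person in arrivals:
        group = group_of(person)
        if remaining.get(group, 0) > 0:
            sample.append(person)
            remaining[group] -= 1
        if not any(remaining.values()):   # all quotas met, stop interviewing
            break
    return sample

# Hypothetical stream of 300 passersby: every third person is labeled "F", the rest "M".
arrivals = [(i, "F" if i % 3 == 0 else "M") for i in range(300)]
quota = quota_sample(arrivals, quotas={"M": 50, "F": 50}, group_of=lambda p: p[1])

n_men = sum(1 for _, g in quota if g == "M")
n_women = sum(1 for _, g in quota if g == "F")
last_id = max(i for i, _ in quota)
print(n_men, n_women, last_id)
```

The quotas are met exactly, yet everyone in the sample arrived in the first half of the stream, so anyone who shows up later has zero chance of selection.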

Purposive (Judgment) Sampling

  • Researcher deliberately selects "typical" or "information-rich" cases; relies entirely on expert judgment about who belongs in the sample
  • Valuable for qualitative research and exploratory studies where depth matters more than breadth
  • Generalizability is impossible since selection criteria are subjective; never appropriate when population-level estimates are needed

Compare: Quota vs. Stratified Sampling—both ensure subgroup representation, but stratified uses random selection within strata while quota uses researcher judgment. Exam tip: if a question describes "ensuring 30% of respondents are from each region" without mentioning random selection, it's quota sampling, and you should flag the bias risk.


Choosing the Right Method

The choice between methods depends on research goals, available resources, and acceptable trade-offs. Probability methods support inference; non-probability methods sacrifice validity for practicality.

| Consideration | Probability Methods | Non-Probability Methods |
|---|---|---|
| Valid inference | Yes: standard errors calculable | No: cannot generalize |
| Bias control | Randomization eliminates selection bias | Selection bias likely |
| Cost/time | Higher (need sampling frame, random selection) | Lower (grab who's available) |
| Best use case | Confirmatory research, policy decisions | Exploratory research, pilot studies |

Compare: Probability vs. Non-Probability Sampling—the fundamental distinction is whether you can calculate the probability that any given unit enters your sample. If yes, you can do inference. If no, your results are descriptive only. FRQs often present a scenario and ask you to identify the sampling method and evaluate whether conclusions are valid—this distinction is your key.


Quick Reference Table

| Concept | Best Examples |
|---|---|
| Equal probability selection | Simple Random Sampling, Systematic Sampling |
| Variance reduction through homogeneity | Stratified Sampling |
| Cost reduction for dispersed populations | Cluster Sampling, Multistage Sampling |
| Hierarchical population structure | Multistage Sampling |
| Speed over validity | Convenience Sampling, Quota Sampling |
| Qualitative/exploratory research | Purposive Sampling |
| Valid statistical inference | All probability methods (SRS, Stratified, Cluster, Systematic, Multistage) |
| Selection bias risk | All non-probability methods (Convenience, Quota, Purposive) |

Self-Check Questions

  1. A researcher wants to estimate average household income in a city but only has resources to visit 10 neighborhoods. She randomly selects 10 neighborhoods and surveys every household in each. What sampling method is this, and what is its main disadvantage compared to SRS?

  2. Which two sampling methods both divide the population into groups but differ in which units ultimately get sampled? Explain the key distinction and when you'd prefer each.

  3. A polling company ensures their sample includes 40% Democrats, 40% Republicans, and 20% Independents by interviewing people at a shopping center until they hit those numbers. Identify the sampling method and explain why confidence intervals from this data would be invalid.

  4. Compare systematic sampling and simple random sampling: under what specific condition does systematic sampling produce biased estimates while SRS would not?

  5. An FRQ describes a study where researchers first randomly selected 5 states, then randomly selected 3 counties within each state, then surveyed 100 randomly chosen residents per county. Name the sampling method, identify how many stages it has, and explain why standard variance formulas cannot be directly applied.