
📊Principles of Data Science

Data Sampling Techniques


Why This Matters

Sampling is the backbone of statistical inference—it's how we draw conclusions about millions of data points by examining just a fraction of them. When you're working with massive datasets or surveying populations, you can't analyze everything, so the method you choose to select your sample determines whether your findings are valid, generalizable, and unbiased. This topic connects directly to core concepts like bias-variance tradeoffs, statistical inference, and experimental design.

You're being tested on more than just definitions here. Exam questions will ask you to identify which sampling method fits a given scenario, explain why one technique introduces bias while another doesn't, and evaluate the tradeoffs between cost, precision, and representativeness. Don't just memorize the names—know what problem each technique solves and when it fails.


Probability Sampling Methods

These techniques give every member of the population a known, non-zero chance of being selected. This mathematical foundation is what allows us to make valid statistical inferences and calculate margins of error.

Simple Random Sampling

  • Every individual has an equal selection probability—this is the gold standard for eliminating selection bias
  • Implementation uses random number generators or lottery methods; requires a complete sampling frame (a list of all population members)
  • Best for homogeneous populations where subgroup representation isn't a concern; forms the theoretical basis for most statistical tests
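A minimal sketch of simple random sampling in Python (the frame and sample size here are illustrative, not from any real study):

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n members without replacement; every member has equal probability."""
    rng = random.Random(seed)
    return rng.sample(population, n)

# Sampling frame: a complete list of all population members (required for SRS)
frame = list(range(1000))
sample = simple_random_sample(frame, 50, seed=42)
```

Note that `rng.sample` draws without replacement, so no individual can appear twice; the seed is only there to make the draw reproducible.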

Stratified Sampling

  • Divides population into non-overlapping strata based on known characteristics (age, income, region) before sampling within each group
  • Guarantees representation of all subgroups—critical when minority groups might be missed by pure random sampling
  • Reduces variance and increases precision compared to simple random sampling of the same size; requires prior knowledge of stratifying variables
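A sketch of proportional stratified sampling (the toy population of `(id, region)` tuples is invented for illustration):

```python
import random
from collections import defaultdict

def stratified_sample(population, stratum_of, frac, seed=None):
    """Group members into strata, then draw the same fraction randomly from each."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for member in population:
        strata[stratum_of(member)].append(member)
    sample = []
    for members in strata.values():
        k = max(1, round(frac * len(members)))  # at least one per stratum
        sample.extend(rng.sample(members, k))
    return sample

# Toy population: (id, region) tuples; region is the known stratifying variable
pop = [(i, "urban" if i % 4 else "rural") for i in range(200)]
s = stratified_sample(pop, stratum_of=lambda m: m[1], frac=0.1, seed=0)
```

The key property: every stratum is guaranteed to appear in the sample, and selection within each stratum is still random, which is what keeps this a probability method.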

Systematic Sampling

  • Selects every kth element after a random starting point, where k = N/n (population size divided by desired sample size)
  • Simpler to execute than simple random sampling—no need for random number generation after the initial selection
  • Vulnerable to periodicity bias if the list has a hidden pattern that aligns with your sampling interval; always verify list ordering is arbitrary
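The interval rule above can be sketched directly (frame and sample size are illustrative):

```python
import random

def systematic_sample(frame, n, seed=None):
    """Pick a random start in [0, k), then take every k-th element (k = N // n)."""
    k = len(frame) // n                       # sampling interval
    start = random.Random(seed).randrange(k)  # random starting point
    return [frame[start + i * k] for i in range(n)]

frame = list(range(100))
s = systematic_sample(frame, 10, seed=1)
```

Only the starting point is random; everything after it is deterministic, which is exactly why a periodic pattern in the list (say, every 10th record being a manager) can bias the result.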

Cluster Sampling

  • Randomly selects entire groups (clusters) rather than individuals—often geographic units like schools, city blocks, or hospitals
  • Dramatically reduces costs when population members are physically dispersed; doesn't require a complete list of all individuals
  • Trades precision for practicality—sampling error increases if clusters differ significantly from each other; works best when clusters are internally heterogeneous
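A sketch of one-stage cluster sampling (the `schools` dictionary is a made-up frame of clusters):

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly pick whole clusters, then keep every member of each chosen cluster."""
    rng = random.Random(seed)
    chosen = rng.sample(list(clusters), n_clusters)
    return [member for c in chosen for member in clusters[c]]

# Frame of clusters (e.g. schools); no individual-level list is needed up front
schools = {f"school_{i}": [f"s{i}_{j}" for j in range(30)] for i in range(20)}
s = cluster_sample(schools, 3, seed=7)
```

Contrast with stratified sampling: here randomness operates at the group level, and individuals within a chosen cluster are taken wholesale.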

Compare: Stratified vs. Cluster Sampling—both divide populations into groups, but stratified sampling takes individuals from every stratum while cluster sampling takes all individuals from selected clusters only. If an FRQ asks about reducing variance, stratified is your answer; if it asks about cost efficiency for geographically dispersed populations, go with cluster.

Multi-stage Sampling

  • Combines sampling methods hierarchically—typically clusters first, then random or stratified sampling within selected clusters
  • Balances representativeness with feasibility for large-scale studies like national surveys or census operations
  • Requires careful variance estimation since error compounds at each stage; standard formulas must account for the multi-level design
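A two-stage sketch combining the two previous ideas (the `blocks` frame is illustrative): clusters are drawn first, then a simple random sample is taken within each.

```python
import random

def two_stage_sample(clusters, n_clusters, n_per_cluster, seed=None):
    """Stage 1: randomly select clusters. Stage 2: SRS within each chosen cluster."""
    rng = random.Random(seed)
    chosen = rng.sample(list(clusters), n_clusters)
    return {c: rng.sample(clusters[c], n_per_cluster) for c in chosen}

# Toy frame: city blocks, each holding 50 residents
blocks = {f"block_{i}": list(range(i * 100, i * 100 + 50)) for i in range(10)}
s = two_stage_sample(blocks, n_clusters=3, n_per_cluster=5, seed=3)
```

Because randomness enters at both stages, variance estimates must account for both levels; treating the result as a simple random sample of 15 individuals would understate the error.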

Compare: Simple Random vs. Multi-stage Sampling—simple random is theoretically optimal but often impractical for large populations. Multi-stage sacrifices some precision for massive gains in cost and logistics. Know when practical constraints justify this tradeoff.


Non-Probability Sampling Methods

These techniques don't give every population member a known chance of selection. They're faster and cheaper but limit your ability to generalize findings or calculate true confidence intervals.

Convenience Sampling

  • Selects whoever is easiest to reach—students in your class, people walking by, users who opt in
  • High risk of selection bias since accessible individuals often differ systematically from the broader population
  • Appropriate only for pilot studies or exploratory research where generalizability isn't the goal; never use for final inference

Quota Sampling

  • Sets target numbers for demographic categories (e.g., 50 men, 50 women) but uses non-random selection within each quota
  • Ensures demographic diversity without the logistical demands of true stratified sampling
  • Selection bias persists within quotas—the researcher chooses which 50 men, introducing subjectivity; common in market research
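The within-quota bias is easy to see in code: this sketch (with an invented stream of arrivals) fills each quota with the first qualifying people encountered, with no randomness inside the groups.

```python
def quota_sample(stream, quota_of, quotas):
    """Fill each quota with the FIRST qualifying respondents -- non-random within groups."""
    counts = {g: 0 for g in quotas}
    sample = []
    for person in stream:
        g = quota_of(person)
        if g in counts and counts[g] < quotas[g]:
            sample.append(person)
            counts[g] += 1
        if counts == quotas:
            break
    return sample

# Whoever shows up first gets picked -- this ordering is the source of the bias
arrivals = [(i, "M" if i % 2 else "F") for i in range(100)]
s = quota_sample(arrivals, lambda p: p[1], {"M": 5, "F": 5})
```

Swapping the `for person in stream` loop for a random draw within each group would turn this into stratified sampling; that one change is the whole probability/non-probability distinction here.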

Compare: Stratified vs. Quota Sampling—both aim for subgroup representation, but stratified uses random selection within strata (probability method) while quota lets researchers pick non-randomly (non-probability). This distinction determines whether you can calculate valid confidence intervals.

Purposive Sampling

  • Researcher deliberately selects cases that fit specific criteria or represent particular phenomena of interest
  • Maximizes information for qualitative research—choosing "typical" cases, extreme cases, or expert informants
  • Cannot support statistical generalization since selection is based on judgment, not probability; findings apply only to cases studied

Snowball Sampling

  • Participants recruit other participants through their social networks—each subject refers others who qualify
  • Essential for hidden or hard-to-reach populations—undocumented immigrants, people with rare diseases, underground communities
  • Sample clusters around initial contacts, creating network-based bias; representativeness depends entirely on starting points and network structure
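A toy simulation of the referral process (the `network` dictionary is invented): recruitment spreads outward from the seed participants, so the sample can only reach people connected to them.

```python
import random
from collections import deque

def snowball_sample(referrals, seeds, max_n, rng_seed=None):
    """Start from seed participants; each recruit refers their own contacts."""
    rng = random.Random(rng_seed)
    seen, queue = set(seeds), deque(seeds)
    sample = []
    while queue and len(sample) < max_n:
        person = queue.popleft()
        sample.append(person)
        contacts = [c for c in referrals.get(person, []) if c not in seen]
        rng.shuffle(contacts)   # referral order is arbitrary
        seen.update(contacts)
        queue.extend(contacts)
    return sample

# Hypothetical referral network: anyone not connected to "a" is unreachable
network = {"a": ["b", "c"], "b": ["d"], "c": ["e"], "d": [], "e": ["f"]}
s = snowball_sample(network, seeds=["a"], max_n=4, rng_seed=0)
```

The network-based bias is visible directly: people outside the seeds' social graph have zero chance of selection, which is why no valid selection probabilities (and hence no valid confidence intervals) exist.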

Compare: Convenience vs. Snowball Sampling—both are non-probability methods, but convenience samples whoever's available while snowball specifically leverages social connections. Use snowball when your target population has no sampling frame; use convenience only when you need quick preliminary data.


The Probability vs. Non-Probability Distinction

This isn't just a category—it's the fundamental divide that determines what statistical claims you can make.

Why This Distinction Matters

  • Probability methods enable statistical inference—you can calculate standard errors, confidence intervals, and p-values because selection probabilities are known
  • Non-probability methods support exploration, not confirmation—useful for generating hypotheses, understanding mechanisms, or accessing difficult populations
  • Choosing incorrectly invalidates your analysis—applying inferential statistics to a convenience sample produces meaningless confidence intervals, even if the math runs

Quick Reference Table

| Concept | Best Examples |
| --- | --- |
| Equal selection probability | Simple Random Sampling |
| Guaranteed subgroup representation | Stratified Sampling, Quota Sampling |
| Cost-effective for dispersed populations | Cluster Sampling, Multi-stage Sampling |
| Requires complete sampling frame | Simple Random, Systematic, Stratified |
| Hidden/hard-to-reach populations | Snowball Sampling |
| Exploratory research only | Convenience Sampling, Purposive Sampling |
| Vulnerable to periodicity | Systematic Sampling |
| Valid for statistical inference | All probability methods (Simple Random, Stratified, Cluster, Systematic, Multi-stage) |

Self-Check Questions

  1. A researcher wants to survey voters across 50 states but can only afford to visit 10 states. Within those states, she'll randomly select precincts, then randomly select voters within precincts. Which sampling method is this, and why might it introduce more error than simple random sampling?

  2. Compare stratified sampling and quota sampling: what key procedural difference determines whether you can calculate a valid margin of error?

  3. You're studying individuals with a rare genetic condition that has no registry or public list. Which sampling technique would you use, and what bias should you acknowledge in your findings?

  4. A dataset was collected by surveying people who responded to an online ad. A colleague wants to report 95% confidence intervals for population parameters. What's wrong with this approach?

  5. When would systematic sampling produce a biased sample even though it's technically a probability method? Give a specific example of how list ordering could cause this problem.