
📊Principles of Data Science

Data Sampling Techniques


Why This Matters

Sampling is the backbone of statistical inference—it's how we draw conclusions about millions of data points by examining just a fraction of them. When you're working with massive datasets or surveying populations, you can't analyze everything, so the method you choose to select your sample determines whether your findings are valid, generalizable, and unbiased. This topic connects directly to core concepts like bias-variance tradeoffs, statistical inference, and experimental design.

You're being tested on more than just definitions here. Exam questions will ask you to identify which sampling method fits a given scenario, explain why one technique introduces bias while another doesn't, and evaluate the tradeoffs between cost, precision, and representativeness. Don't just memorize the names—know what problem each technique solves and when it fails.


Probability Sampling Methods

These techniques give every member of the population a known, non-zero chance of being selected. This mathematical foundation is what allows us to make valid statistical inferences and calculate margins of error.

Simple Random Sampling

  • Every individual has an equal selection probability—this is the gold standard for eliminating selection bias
  • Implementation uses random number generators or lottery methods; requires a complete sampling frame (a list of all population members)
  • Best for homogeneous populations where subgroup representation isn't a concern; forms the theoretical basis for most statistical tests
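A minimal sketch of simple random sampling in Python (the frame and sample size here are illustrative, not from any real study):

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n members without replacement; every member has equal probability."""
    rng = random.Random(seed)
    return rng.sample(population, n)

# Sampling frame: a complete list of all population members (required for SRS)
frame = list(range(1000))
sample = simple_random_sample(frame, 50, seed=42)
```

Note that `rng.sample` draws without replacement, so no individual can appear twice; the seed is only there to make the draw reproducible.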

Stratified Sampling

  • Divides population into non-overlapping strata based on known characteristics (age, income, region) before sampling within each group
  • Guarantees representation of all subgroups—critical when minority groups might be missed by pure random sampling
  • Reduces variance and increases precision compared to simple random sampling of the same size; requires prior knowledge of stratifying variables
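A sketch of proportional stratified sampling (the toy population of `(id, region)` tuples is invented for illustration):

```python
import random
from collections import defaultdict

def stratified_sample(population, stratum_of, frac, seed=None):
    """Group members into strata, then draw the same fraction randomly from each."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for member in population:
        strata[stratum_of(member)].append(member)
    sample = []
    for members in strata.values():
        k = max(1, round(frac * len(members)))  # at least one per stratum
        sample.extend(rng.sample(members, k))
    return sample

# Toy population: (id, region) tuples; region is the known stratifying variable
pop = [(i, "urban" if i % 4 else "rural") for i in range(200)]
s = stratified_sample(pop, stratum_of=lambda m: m[1], frac=0.1, seed=0)
```

The key property: every stratum is guaranteed to appear in the sample, and selection within each stratum is still random, which is what keeps this a probability method.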

Systematic Sampling

  • Selects every kth element after a random starting point, where k = N/n (population size divided by desired sample size)
  • Simpler to execute than simple random sampling—no need for random number generation after the initial selection
  • Vulnerable to periodicity bias if the list has a hidden pattern that aligns with your sampling interval; always verify list ordering is arbitrary
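The interval rule above can be sketched directly (frame and sample size are illustrative):

```python
import random

def systematic_sample(frame, n, seed=None):
    """Pick a random start in [0, k), then take every k-th element (k = N // n)."""
    k = len(frame) // n                       # sampling interval
    start = random.Random(seed).randrange(k)  # random starting point
    return [frame[start + i * k] for i in range(n)]

frame = list(range(100))
s = systematic_sample(frame, 10, seed=1)
```

Only the starting point is random; everything after it is deterministic, which is exactly why a periodic pattern in the list (say, every 10th record being a manager) can bias the result.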

Cluster Sampling

  • Randomly selects entire groups (clusters) rather than individuals—often geographic units like schools, city blocks, or hospitals
  • Dramatically reduces costs when population members are physically dispersed; doesn't require a complete list of all individuals
  • Trades precision for practicality—sampling error increases if clusters differ significantly from each other; works best when clusters are internally heterogeneous
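A sketch of one-stage cluster sampling (the `schools` dictionary is a made-up frame of clusters):

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly pick whole clusters, then keep every member of each chosen cluster."""
    rng = random.Random(seed)
    chosen = rng.sample(list(clusters), n_clusters)
    return [member for c in chosen for member in clusters[c]]

# Frame of clusters (e.g. schools); no individual-level list is needed up front
schools = {f"school_{i}": [f"s{i}_{j}" for j in range(30)] for i in range(20)}
s = cluster_sample(schools, 3, seed=7)
```

Contrast with stratified sampling: here randomness operates at the group level, and individuals within a chosen cluster are taken wholesale.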

Compare: Stratified vs. Cluster Sampling—both divide populations into groups, but stratified sampling takes individuals from every stratum while cluster sampling takes all individuals from selected clusters only. If an FRQ asks about reducing variance, stratified is your answer; if it asks about cost efficiency for geographically dispersed populations, go with cluster.

Multi-stage Sampling

  • Combines sampling methods hierarchically—typically clusters first, then random or stratified sampling within selected clusters
  • Balances representativeness with feasibility for large-scale studies like national surveys or census operations
  • Requires careful variance estimation since error compounds at each stage; standard formulas must account for the multi-level design
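A two-stage sketch combining the two previous ideas (the `blocks` frame is illustrative): clusters are drawn first, then a simple random sample is taken within each.

```python
import random

def two_stage_sample(clusters, n_clusters, n_per_cluster, seed=None):
    """Stage 1: randomly select clusters. Stage 2: SRS within each chosen cluster."""
    rng = random.Random(seed)
    chosen = rng.sample(list(clusters), n_clusters)
    return {c: rng.sample(clusters[c], n_per_cluster) for c in chosen}

# Toy frame: city blocks, each holding 50 residents
blocks = {f"block_{i}": list(range(i * 100, i * 100 + 50)) for i in range(10)}
s = two_stage_sample(blocks, n_clusters=3, n_per_cluster=5, seed=3)
```

Because randomness enters at both stages, variance estimates must account for both levels; treating the result as a simple random sample of 15 individuals would understate the error.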

Compare: Simple Random vs. Multi-stage Sampling—simple random is theoretically optimal but often impractical for large populations. Multi-stage sacrifices some precision for massive gains in cost and logistics. Know when practical constraints justify this tradeoff.


Non-Probability Sampling Methods

These techniques don't give every population member a known chance of selection. They're faster and cheaper but limit your ability to generalize findings or calculate true confidence intervals.

Convenience Sampling

  • Selects whoever is easiest to reach—students in your class, people walking by, users who opt in
  • High risk of selection bias since accessible individuals often differ systematically from the broader population
  • Appropriate only for pilot studies or exploratory research where generalizability isn't the goal; never use for final inference

Quota Sampling

  • Sets target numbers for demographic categories (e.g., 50 men, 50 women) but uses non-random selection within each quota
  • Ensures demographic diversity without the logistical demands of true stratified sampling
  • Selection bias persists within quotas—the researcher chooses which 50 men, introducing subjectivity; common in market research
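The within-quota bias is easy to see in code: this sketch (with an invented stream of arrivals) fills each quota with the first qualifying people encountered, with no randomness inside the groups.

```python
def quota_sample(stream, quota_of, quotas):
    """Fill each quota with the FIRST qualifying respondents -- non-random within groups."""
    counts = {g: 0 for g in quotas}
    sample = []
    for person in stream:
        g = quota_of(person)
        if g in counts and counts[g] < quotas[g]:
            sample.append(person)
            counts[g] += 1
        if counts == quotas:
            break
    return sample

# Whoever shows up first gets picked -- this ordering is the source of the bias
arrivals = [(i, "M" if i % 2 else "F") for i in range(100)]
s = quota_sample(arrivals, lambda p: p[1], {"M": 5, "F": 5})
```

Swapping the `for person in stream` loop for a random draw within each group would turn this into stratified sampling; that one change is the whole probability/non-probability distinction here.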

Compare: Stratified vs. Quota Sampling—both aim for subgroup representation, but stratified uses random selection within strata (probability method) while quota lets researchers pick non-randomly (non-probability). This distinction determines whether you can calculate valid confidence intervals.

Purposive Sampling

  • Researcher deliberately selects cases that fit specific criteria or represent particular phenomena of interest
  • Maximizes information for qualitative research—choosing "typical" cases, extreme cases, or expert informants
  • Cannot support statistical generalization since selection is based on judgment, not probability; findings apply only to cases studied

Snowball Sampling

  • Participants recruit other participants through their social networks—each subject refers others who qualify
  • Essential for hidden or hard-to-reach populations—undocumented immigrants, people with rare diseases, underground communities
  • Sample clusters around initial contacts, creating network-based bias; representativeness depends entirely on starting points and network structure
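A toy simulation of the referral process (the `network` dictionary is invented): recruitment spreads outward from the seed participants, so the sample can only reach people connected to them.

```python
import random
from collections import deque

def snowball_sample(referrals, seeds, max_n, rng_seed=None):
    """Start from seed participants; each recruit refers their own contacts."""
    rng = random.Random(rng_seed)
    seen, queue = set(seeds), deque(seeds)
    sample = []
    while queue and len(sample) < max_n:
        person = queue.popleft()
        sample.append(person)
        contacts = [c for c in referrals.get(person, []) if c not in seen]
        rng.shuffle(contacts)   # referral order is arbitrary
        seen.update(contacts)
        queue.extend(contacts)
    return sample

# Hypothetical referral network: anyone not connected to "a" is unreachable
network = {"a": ["b", "c"], "b": ["d"], "c": ["e"], "d": [], "e": ["f"]}
s = snowball_sample(network, seeds=["a"], max_n=4, rng_seed=0)
```

The network-based bias is visible directly: people outside the seeds' social graph have zero chance of selection, which is why no valid selection probabilities (and hence no valid confidence intervals) exist.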

Compare: Convenience vs. Snowball Sampling—both are non-probability methods, but convenience samples whoever's available while snowball specifically leverages social connections. Use snowball when your target population has no sampling frame; use convenience only when you need quick preliminary data.


The Probability vs. Non-Probability Distinction

This isn't just a category—it's the fundamental divide that determines what statistical claims you can make.

Why This Distinction Matters

  • Probability methods enable statistical inference—you can calculate standard errors, confidence intervals, and p-values because selection probabilities are known
  • Non-probability methods support exploration, not confirmation—useful for generating hypotheses, understanding mechanisms, or accessing difficult populations
  • Choosing incorrectly invalidates your analysis—applying inferential statistics to a convenience sample produces meaningless confidence intervals, even if the math runs

Quick Reference Table

| Concept | Best Examples |
| --- | --- |
| Equal selection probability | Simple Random Sampling |
| Guaranteed subgroup representation | Stratified Sampling, Quota Sampling |
| Cost-effective for dispersed populations | Cluster Sampling, Multi-stage Sampling |
| Requires complete sampling frame | Simple Random, Systematic, Stratified |
| Hidden/hard-to-reach populations | Snowball Sampling |
| Exploratory research only | Convenience Sampling, Purposive Sampling |
| Vulnerable to periodicity | Systematic Sampling |
| Valid for statistical inference | All probability methods (Simple Random, Stratified, Cluster, Systematic, Multi-stage) |

Self-Check Questions

  1. A researcher wants to survey voters across 50 states but can only afford to visit 10 states. Within those states, she'll randomly select precincts, then randomly select voters within precincts. Which sampling method is this, and why might it introduce more error than simple random sampling?

  2. Compare stratified sampling and quota sampling: what key procedural difference determines whether you can calculate a valid margin of error?

  3. You're studying individuals with a rare genetic condition that has no registry or public list. Which sampling technique would you use, and what bias should you acknowledge in your findings?

  4. A dataset was collected by surveying people who responded to an online ad. A colleague wants to report 95% confidence intervals for population parameters. What's wrong with this approach?

  5. When would systematic sampling produce a biased sample even though it's technically a probability method? Give a specific example of how list ordering could cause this problem.