Simple random sampling
Definition of simple random sampling
Simple random sampling (SRS) means selecting a subset of individuals from a population so that every individual has an equal probability of being chosen. You use random selection methods (random number generators, lottery systems) to ensure unbiased selection. This requires a complete list of all members of the population, known as a sampling frame.
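In code, SRS from a sampling frame is a single draw without replacement. A minimal sketch (the frame of 1,000 hypothetical people is made up for illustration):

```python
import random

# Hypothetical sampling frame: a complete list of population members.
frame = [f"person_{i}" for i in range(1, 1001)]

random.seed(42)  # for reproducibility
# random.sample draws without replacement; every member has an equal
# probability of inclusion, which is the defining property of SRS.
srs = random.sample(frame, k=50)

print(len(srs))       # 50
print(len(set(srs)))  # 50 distinct members (no replacement)
```

Because `random.sample` draws without replacement, no member can appear twice in the sample.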
Advantages of simple random sampling
- Eliminates selection bias by giving every member of the population an equal chance of inclusion
- Allows you to use standard statistical methods to analyze results and estimate population parameters
- Produces a representative sample, enabling generalizations about the larger population
Disadvantages of simple random sampling
- Can be time-consuming and expensive, especially for large or geographically dispersed populations
- Requires a complete and accurate sampling frame, which isn't always available
- May not adequately represent small subgroups if the overall sample size is too small
Stratified sampling
Definition of stratified sampling
Stratified sampling divides the population into distinct subgroups called strata based on known characteristics (age, gender, income level). You then draw an independent random sample from each stratum. This guarantees that each subgroup appears in the final sample.
Advantages of stratified sampling
- Guarantees representation of all important subgroups within the population
- Increases precision by reducing sampling variability within each stratum
- Allows direct comparison across subgroups
Disadvantages of stratified sampling
- Requires prior knowledge of the population's characteristics to define appropriate strata
- More complex and time-consuming than SRS, since you draw multiple separate samples
- Offers little advantage over SRS if variability within strata is similar to variability between strata
Proportional vs disproportional allocation
- Proportional allocation assigns sample sizes to each stratum in proportion to that stratum's share of the population. If women make up 60% of the population, 60% of your sample comes from the women stratum.
- Disproportional allocation assigns larger samples to strata with greater internal variability, regardless of their population share.
- Optimal allocation (Neyman allocation) is a specific form of disproportional allocation that minimizes the variance of the overall estimate for a fixed total sample size.
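The two allocation rules can be compared directly. In this sketch, the stratum sizes and within-stratum standard deviations are made-up values chosen so the rules give visibly different answers:

```python
# Hypothetical strata: name -> (population size N_h, within-stratum std dev S_h).
strata = {"women": (600, 4.0), "men": (400, 12.0)}
n_total = 100

N = sum(Nh for Nh, _ in strata.values())

# Proportional allocation: n_h = n * N_h / N
proportional = {h: round(n_total * Nh / N) for h, (Nh, _) in strata.items()}

# Neyman (optimal) allocation: n_h = n * N_h * S_h / sum(N_j * S_j)
weight = sum(Nh * Sh for Nh, Sh in strata.values())
neyman = {h: round(n_total * Nh * Sh / weight)
          for h, (Nh, Sh) in strata.items()}
# (rounding can make the allocations sum to slightly more or less than n_total)

print(proportional)  # {'women': 60, 'men': 40}
print(neyman)        # {'women': 33, 'men': 67}
```

Here the men's stratum is three times as variable, so Neyman allocation shifts most of the sample toward it despite its smaller population share.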
Cluster sampling
Definition of cluster sampling
Cluster sampling divides the population into naturally occurring groups called clusters (schools, city blocks, hospitals). You randomly select some clusters, then sample from within those clusters. This is especially useful when a complete list of individuals doesn't exist but clusters can be easily identified.
Advantages of cluster sampling
- Reduces cost and time by focusing data collection on selected clusters rather than scattered individuals
- Eliminates the need for a complete sampling frame of every individual
- Allows study of naturally occurring groups and within-group relationships
Disadvantages of cluster sampling
- Clusters may not be representative of the entire population, increasing sampling error
- Requires a larger total sample size than SRS to achieve the same precision
- The clustering effect (individuals within a cluster tend to be more similar to each other than to those in other clusters) reduces effective sample size
Single-stage vs multi-stage clustering
- Single-stage: you randomly select clusters and include all members within the chosen clusters.
- Multi-stage: you randomly select clusters, then randomly select individuals within each chosen cluster. This is more practical for large or geographically dispersed populations because you don't need to survey every person in a selected cluster.
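A multi-stage draw can be sketched in two explicit stages (the school rosters here are made-up data):

```python
import random

random.seed(7)

# Hypothetical clusters: schools mapped to their student rosters.
schools = {f"school_{c}": [f"s{c}_{i}" for i in range(30)] for c in range(20)}

# Stage 1: randomly select clusters (schools).
chosen_schools = random.sample(sorted(schools), k=5)

# Stage 2: randomly select individuals within each chosen cluster.
sample = []
for school in chosen_schools:
    sample.extend(random.sample(schools[school], k=10))

print(len(sample))  # 5 clusters x 10 students = 50
```

For single-stage cluster sampling, you would replace stage 2 with `sample.extend(schools[school])`, taking every student in each chosen school.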
Systematic sampling
Definition of systematic sampling
Systematic sampling selects every k-th element from a population list, starting with a randomly chosen element between 1 and k. The sampling interval k is calculated by dividing the population size N by the desired sample size n (k = N / n). The list needs to be arranged in some order (alphabetical, chronological, etc.).
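A sketch of the procedure, assuming a population of N = 1000 and a desired sample of n = 100 (so the interval is 10):

```python
import random

random.seed(1)

population = list(range(1, 1001))  # an ordered list of N = 1000 members
n = 100
k = len(population) // n           # sampling interval k = N / n = 10

start = random.randint(1, k)       # random start between 1 and k (1-indexed)
# take every k-th element beginning at the random start
sample = population[start - 1::k]

print(k, len(sample))  # 10 100
```

Only `start` is random; once it is fixed, the rest of the sample is fully determined by the interval.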
Advantages of systematic sampling
- Simple to implement: only the first element is chosen randomly, and the rest follow a fixed interval
- Ensures even coverage across the entire population list
- Can be more efficient than SRS for large populations

Disadvantages of systematic sampling
- Periodicity in the list can introduce bias. If the sampling interval happens to align with a repeating pattern in the data, your sample will be skewed.
- Requires a complete, ordered list of the population
- Estimating sampling error and constructing confidence intervals is more complex than with SRS
Sample size determination
Factors influencing sample size
- Desired precision: smaller margins of error require larger samples
- Population variability: more variable populations need larger samples
- Confidence level: higher confidence (99% vs. 95%) requires larger samples
- Available resources: budget, time, and personnel may constrain feasible sample size
Sample size for estimating means
The required sample size depends on the population standard deviation, desired margin of error, and confidence level:
n = (z × σ / E)²
where n is the sample size, z is the critical value for the desired confidence level, σ is the population standard deviation, and E is the margin of error.
For example, to estimate a population mean within E = 1 unit at 95% confidence with σ = 5: n = (1.96 × 5 / 1)² = 96.04, so you'd need at least 97 observations.
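A small helper for this calculation, always rounding up to a whole observation (σ = 5 and a margin of error of 1 unit are illustrative inputs that yield the 97 quoted above):

```python
import math

def sample_size_mean(sigma, margin, z=1.96):
    """Minimum n to estimate a mean within `margin` at the given z."""
    return math.ceil((z * sigma / margin) ** 2)

print(sample_size_mean(5, 1))  # 97
```

Note the use of `math.ceil`: you always round up, since rounding down would miss the target precision.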
Sample size for estimating proportions
n = z² × p(1 − p) / E²
where p is the estimated population proportion. If you have no prior estimate for p, use p = 0.5 because that maximizes p(1 − p) and gives the most conservative (largest) sample size.
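The same helper for proportions, defaulting to the conservative p = 0.5 (the 3-point margin in the example is an arbitrary choice):

```python
import math

def sample_size_proportion(margin, p=0.5, z=1.96):
    """Minimum n to estimate a proportion within `margin` at the given z."""
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)

# Conservative planning value p = 0.5, margin of 3 points, 95% confidence:
print(sample_size_proportion(0.03))  # 1068
```

Any prior estimate other than 0.5 shrinks p(1 − p) and therefore the required sample size.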
Sample size for comparing means
This depends on the difference in means you want to detect, the population standard deviations, desired statistical power, and significance level. You'll need to specify the null and alternative hypotheses and whether the test is one-tailed or two-tailed.
Sample size for comparing proportions
Similarly, this depends on the difference in proportions to detect, the population proportions, desired power, and significance level. Again, you specify hypotheses and test direction.
Sampling error and bias
Definition of sampling error
Sampling error is the difference between a sample statistic and the corresponding population parameter. It arises from the random variation inherent in drawing a sample rather than measuring the entire population. You can reduce it by increasing sample size or using more efficient sampling methods (like stratified sampling).
Definition of sampling bias
Sampling bias is a systematic error that occurs when the sample is not representative of the population. Unlike sampling error, bias causes estimates to consistently deviate from the true value in one direction. Increasing sample size does not fix bias, because the problem is in the sampling method itself.
Types of sampling bias
- Selection bias: the sampling method favors certain members over others
- Voluntary response bias: people who choose to participate often have stronger opinions than those who don't
- Undercoverage bias: certain population members have a lower probability of being included (e.g., a phone survey that misses people without phones)
- Non-response bias: people who respond to a survey differ systematically from those who don't
- Measurement bias: the data collection process itself distorts responses
- Leading question bias: question wording pushes respondents toward a particular answer
- Social desirability bias: respondents answer in ways that make them look good rather than answering honestly
Reducing sampling error and bias
- Use probability sampling methods (SRS, stratified sampling) to ensure representativeness
- Increase sample size to reduce sampling error (but remember this won't fix bias)
- Design questionnaires carefully to minimize measurement bias
- Implement strategies to boost response rates (incentives, follow-up contacts) to reduce non-response bias
- Compare your sample's key characteristics to known population values and apply weighting techniques to adjust for discrepancies
Point estimation
Definition of point estimation
Point estimation uses sample data to calculate a single value as the best guess for a population parameter. Common point estimates include the sample mean (for the population mean), sample proportion (for the population proportion), and sample variance (for the population variance). A point estimate is concise but tells you nothing about the uncertainty around it.
Properties of good estimators
- Unbiasedness: the expected value of the estimator equals the true parameter. On average, across many samples, the estimator hits the right target.
- Efficiency: among all unbiased estimators, the efficient one has the smallest variance. It gives you the tightest clustering around the true value.
- Consistency: as sample size grows, the estimator converges in probability to the true parameter.
- Sufficiency: the estimator captures all the information in the sample that's relevant to the parameter.
Maximum likelihood estimation
Maximum likelihood estimation (MLE) finds the parameter values that make the observed data most probable.
- Write down the likelihood function, which is the joint probability of observing your sample data given the parameter values.
- Take the natural log to get the log-likelihood function (this simplifies the math).
- Take the derivative of the log-likelihood with respect to the parameter, set it equal to zero, and solve.
MLE estimators are consistent and asymptotically efficient, which makes them widely used in practice.
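The three steps above can be sketched for an assumed Exponential(λ) model, where they yield the closed form λ̂ = n / Σxᵢ, the reciprocal of the sample mean (the data points are made up):

```python
import math

# Hypothetical observations assumed to come from an Exponential(lam) model.
data = [0.8, 1.2, 0.5, 2.0, 1.5]
n = len(data)

def log_likelihood(lam):
    # log L(lam) = n * ln(lam) - lam * sum(x_i)
    return n * math.log(lam) - lam * sum(data)

# Setting the derivative n/lam - sum(x_i) to zero gives lam_hat = n / sum(x_i).
lam_hat = n / sum(data)

# Sanity check: the closed-form MLE beats nearby candidate values.
assert all(log_likelihood(lam_hat) >= log_likelihood(lam)
           for lam in (0.5, 1.0, 1.5, 2.0))
print(round(lam_hat, 4))  # 0.8333
```

For models without a closed-form solution, you would maximize the log-likelihood numerically instead of solving the derivative by hand.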

Method of moments estimation
Method of moments (MoM) works by setting sample moments equal to their population counterparts and solving for the parameters.
- Express population moments (mean, variance, etc.) in terms of the unknown parameters.
- Replace population moments with the corresponding sample moments.
- Solve the resulting system of equations for the parameter estimates.
MoM is generally less efficient than MLE but can be easier to compute, especially when the likelihood function is hard to work with.
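A minimal sketch for an assumed Uniform(0, θ) model: the first population moment is E[X] = θ/2, so equating it to the sample mean gives θ̂ = 2x̄ (the sample values are made up):

```python
import statistics

# Hypothetical sample assumed to come from Uniform(0, theta).
data = [1.1, 3.4, 0.7, 2.9, 4.2, 2.0]

# Step 1: population first moment E[X] = theta / 2.
# Step 2: replace E[X] with the sample mean x_bar.
# Step 3: solve theta = 2 * x_bar.
theta_mom = 2 * statistics.mean(data)

# For comparison, the MLE for Uniform(0, theta) is the sample maximum.
theta_mle = max(data)

print(round(theta_mom, 2), theta_mle)  # 4.77 4.2
```

This is also a case where the two methods disagree: MoM uses only the mean, while the MLE uses the most informative statistic (the maximum).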
Interval estimation
Definition of interval estimation
Interval estimation produces a range of plausible values for a population parameter rather than a single point estimate. These intervals, called confidence intervals, incorporate sample variability and a chosen confidence level to account for uncertainty.
Confidence intervals
A confidence interval is a range of values centered around the point estimate. It's constructed using three ingredients:
- The point estimate (e.g., sample mean)
- The standard error of the estimate
- A critical value from the appropriate distribution (z-distribution or t-distribution)
The confidence level (typically 90%, 95%, or 99%) specifies how often the procedure produces intervals that contain the true parameter.
Interpreting confidence intervals
This is one of the most commonly misunderstood concepts in statistics. A 95% confidence interval does not mean there's a 95% probability the true parameter lies within that specific interval. The true parameter is fixed; it's either in the interval or it isn't.
The correct interpretation: if you repeated the sampling process many times and built a 95% CI each time, about 95% of those intervals would contain the true parameter.
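This repeated-sampling interpretation can be checked by simulation. The population parameters, sample size, and trial count below are arbitrary choices, and the interval uses a known σ for simplicity:

```python
import random
import statistics

random.seed(0)

MU, SIGMA, N, TRIALS = 50.0, 10.0, 40, 2000
z = 1.96
covered = 0

for _ in range(TRIALS):
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    x_bar = statistics.mean(sample)
    half_width = z * SIGMA / N ** 0.5  # known-sigma 95% interval
    if x_bar - half_width <= MU <= x_bar + half_width:
        covered += 1

coverage = covered / TRIALS
print(round(coverage, 3))  # close to 0.95
```

Each individual interval either contains MU or it doesn't; it's the long-run fraction of intervals that hits 95%.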
Factors affecting confidence interval width
- Sample size: larger samples produce narrower intervals
- Population variability: more variable populations produce wider intervals
- Confidence level: higher confidence (99% vs. 95%) produces wider intervals because you need a bigger net to be more certain
- Sampling method: more efficient methods like stratified sampling can produce narrower intervals than SRS for the same sample size
Estimating population means
Sample mean as an estimator
The sample mean x̄ is an unbiased and consistent estimator of the population mean μ. It's calculated as:
x̄ = (x₁ + x₂ + … + xₙ) / n
Unbiasedness means E(x̄) = μ: if you could take every possible sample and average all the sample means, you'd get exactly μ.
Standard error of the mean
The standard error measures how much x̄ varies from sample to sample:
SE(x̄) = σ / √n
Notice the √n in the denominator. To cut the standard error in half, you need to quadruple your sample size. When σ is unknown (which is almost always the case), substitute the sample standard deviation s.
Confidence intervals for population means
For a 95% confidence interval:
x̄ ± z × σ / √n
with z = 1.96 for 95% confidence. If σ is unknown and you use s, replace the z-value with the appropriate t-value.
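A worked sketch with made-up measurements. Since σ is unknown and the sample is small, it uses s and a t critical value (2.365 for df = 7, taken as an assumption from a t-table):

```python
import statistics

# Hypothetical measurements; sigma unknown, so use the sample std dev s.
data = [12.1, 11.8, 12.5, 12.0, 11.6, 12.3, 12.2, 11.9]

n = len(data)
x_bar = statistics.mean(data)
s = statistics.stdev(data)
se = s / n ** 0.5

t = 2.365  # two-sided 95% critical value for df = n - 1 = 7
lower, upper = x_bar - t * se, x_bar + t * se

print(round(lower, 2), round(upper, 2))  # 11.81 12.29
```

With a larger sample the t critical value would shrink toward 1.96 and the interval would narrow accordingly.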
t-distribution for small samples
When the sample size is small (typically n < 30) and σ is unknown, you use the t-distribution instead of the standard normal. The t-distribution has heavier tails, which produces wider confidence intervals. This accounts for the extra uncertainty introduced by estimating σ with s from a small sample.
The degrees of freedom are df = n − 1. As n grows large, the t-distribution converges to the standard normal distribution.
Estimating population proportions
Sample proportion as an estimator
The sample proportion p̂ is an unbiased and consistent estimator of the population proportion p:
p̂ = x / n
where x is the number of "successes" (individuals with the characteristic of interest) and n is the sample size. Its expected value is E(p̂) = p.
Standard error of the proportion
SE(p̂) = √(p(1 − p) / n)
When p is unknown, substitute p̂ from your sample.
Confidence intervals for population proportions
For a 95% confidence interval:
p̂ ± z × √(p̂(1 − p̂) / n)
where z = 1.96 for 95% confidence.
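A sketch with assumed survey counts (130 "yes" answers out of 250 respondents), including a common large-sample check that n p̂ and n(1 − p̂) are both at least 10:

```python
# Hypothetical survey: 130 of 250 respondents say "yes".
x, n = 130, 250
p_hat = x / n

# Large-sample rule-of-thumb check before using the normal approximation.
assert n * p_hat >= 10 and n * (1 - p_hat) >= 10

z = 1.96
se = (p_hat * (1 - p_hat) / n) ** 0.5
lower, upper = p_hat - z * se, p_hat + z * se

print(round(p_hat, 2), round(lower, 3), round(upper, 3))  # 0.52 0.458 0.582
```

If the check fails (very small n or a proportion near 0 or 1), an exact binomial interval is the safer choice.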
Normal approximation for large samples
The sampling distribution of p̂ is technically binomial, but for large samples you can approximate it with a normal distribution. The standard rule of thumb is that both np ≥ 10 and n(1 − p) ≥ 10 should hold.
This approximation comes from the Central Limit Theorem: as n increases, the distribution of p̂ approaches a normal distribution regardless of the population's shape. If these conditions aren't met, you should use exact binomial methods instead.