
1.3 Sampling and estimation

Written by the Fiveable Content Team • Last updated August 2025

Simple random sampling

Definition of simple random sampling

Simple random sampling (SRS) means selecting a subset of individuals from a population so that every individual has an equal probability of being chosen. You use random selection methods (random number generators, lottery systems) to ensure unbiased selection. This requires a complete list of all members of the population, known as a sampling frame.
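Under these conditions, SRS is straightforward to implement. The sketch below (with a hypothetical frame of numeric IDs) uses Python's `random.sample`, which selects without replacement so that every subset of size n is equally likely:

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw a simple random sample of size n from a sampling frame."""
    # random.sample selects without replacement, so every subset of
    # size n is equally likely -- the defining property of SRS.
    rng = random.Random(seed)
    return rng.sample(frame, n)

# Hypothetical sampling frame: population member IDs 1..1000
frame = list(range(1, 1001))
sample = simple_random_sample(frame, 50, seed=42)
print(len(sample))  # 50 distinct members of the frame
```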

Advantages of simple random sampling

  • Eliminates selection bias by giving every member of the population an equal chance of inclusion
  • Allows you to use standard statistical methods to analyze results and estimate population parameters
  • Produces a representative sample, enabling generalizations about the larger population

Disadvantages of simple random sampling

  • Can be time-consuming and expensive, especially for large or geographically dispersed populations
  • Requires a complete and accurate sampling frame, which isn't always available
  • May not adequately represent small subgroups if the overall sample size is too small

Stratified sampling

Definition of stratified sampling

Stratified sampling divides the population into distinct subgroups called strata based on known characteristics (age, gender, income level). You then draw an independent random sample from each stratum. This guarantees that each subgroup appears in the final sample.

Advantages of stratified sampling

  • Guarantees representation of all important subgroups within the population
  • Increases precision by reducing sampling variability within each stratum
  • Allows direct comparison across subgroups

Disadvantages of stratified sampling

  • Requires prior knowledge of the population's characteristics to define appropriate strata
  • More complex and time-consuming than SRS, since you draw multiple separate samples
  • Offers little advantage over SRS if the strata resemble one another, so that most of the population's variability lies within strata rather than between them

Proportional vs disproportional allocation

  • Proportional allocation assigns sample sizes to each stratum in proportion to that stratum's share of the population. If women make up 60% of the population, 60% of your sample comes from the women stratum.
  • Disproportional allocation assigns larger samples to strata with greater internal variability, regardless of their population share.
    • Optimal allocation (Neyman allocation) is a specific form of disproportional allocation that minimizes the variance of the overall estimate for a fixed total sample size.
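Proportional allocation is easy to sketch in code. The function below (hypothetical stratum sizes; largest-remainder rounding keeps the allocations summing to the total) illustrates the idea:

```python
def proportional_allocation(strata_sizes, total_n):
    """Allocate total_n across strata in proportion to stratum size."""
    pop = sum(strata_sizes.values())
    raw = {s: total_n * size / pop for s, size in strata_sizes.items()}
    alloc = {s: int(r) for s, r in raw.items()}
    # Hand out leftover units to the strata with the largest remainders
    leftover = total_n - sum(alloc.values())
    for s in sorted(raw, key=lambda s: raw[s] - alloc[s], reverse=True)[:leftover]:
        alloc[s] += 1
    return alloc

# Hypothetical population: 60% women, 40% men; total sample of 200
alloc = proportional_allocation({"women": 600, "men": 400}, 200)
print(alloc)  # {'women': 120, 'men': 80}
```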

Cluster sampling

Definition of cluster sampling

Cluster sampling divides the population into naturally occurring groups called clusters (schools, city blocks, hospitals). You randomly select some clusters, then sample from within those clusters. This is especially useful when a complete list of individuals doesn't exist but clusters can be easily identified.

Advantages of cluster sampling

  • Reduces cost and time by focusing data collection on selected clusters rather than scattered individuals
  • Eliminates the need for a complete sampling frame of every individual
  • Allows study of naturally occurring groups and within-group relationships

Disadvantages of cluster sampling

  • Clusters may not be representative of the entire population, increasing sampling error
  • Requires a larger total sample size than SRS to achieve the same precision
  • The clustering effect (individuals within a cluster tend to be more similar to each other than to those in other clusters) reduces effective sample size

Single-stage vs multi-stage clustering

  • Single-stage: you randomly select clusters and include all members within the chosen clusters.
  • Multi-stage: you randomly select clusters, then randomly select individuals within each chosen cluster. This is more practical for large or geographically dispersed populations because you don't need to survey every person in a selected cluster.
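A minimal two-stage sketch, assuming hypothetical school clusters of students:

```python
import random

def two_stage_cluster_sample(clusters, n_clusters, n_per_cluster, seed=None):
    """Two-stage cluster sampling: randomly pick clusters, then
    randomly pick individuals within each chosen cluster."""
    rng = random.Random(seed)
    chosen = rng.sample(list(clusters), n_clusters)  # stage 1: pick clusters
    return {c: rng.sample(clusters[c], min(n_per_cluster, len(clusters[c])))
            for c in chosen}                         # stage 2: pick individuals

# Hypothetical clusters: 10 schools with 30 students each
schools = {f"school_{i}": [f"s{i}_{j}" for j in range(30)] for i in range(10)}
sample = two_stage_cluster_sample(schools, n_clusters=3, n_per_cluster=5, seed=1)
print(sum(len(v) for v in sample.values()))  # 15 students from 3 schools
```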

Systematic sampling

Definition of systematic sampling

Systematic sampling selects every k-th element from a population list, starting from a randomly chosen element between 1 and k. The sampling interval k is calculated by dividing the population size by the desired sample size. The list needs to be arranged in some order (alphabetical, chronological, etc.).
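The procedure is short enough to sketch directly (a hypothetical ordered roster of 100 people; integer division sets the interval):

```python
import random

def systematic_sample(population, n, seed=None):
    """Systematic sampling: compute interval k = N // n, pick a random
    start within the first interval, then take every k-th element."""
    k = len(population) // n                    # sampling interval
    start = random.Random(seed).randrange(k)    # random start in [0, k)
    return population[start::k][:n]

roster = list(range(1, 101))  # hypothetical ordered list of 100 people
sample = systematic_sample(roster, 10, seed=0)
print(len(sample))  # 10
```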

Advantages of systematic sampling

  • Simple to implement: only the first element is chosen randomly, and the rest follow a fixed interval
  • Ensures even coverage across the entire population list
  • Can be more efficient than SRS for large populations

Disadvantages of systematic sampling

  • Periodicity in the list can introduce bias. If the sampling interval happens to align with a repeating pattern in the data, your sample will be skewed.
  • Requires a complete, ordered list of the population
  • Estimating sampling error and constructing confidence intervals is more complex than with SRS

Sample size determination

Factors influencing sample size

  • Desired precision: smaller margins of error require larger samples
  • Population variability: more variable populations need larger samples
  • Confidence level: higher confidence (99% vs. 95%) requires larger samples
  • Available resources: budget, time, and personnel may constrain feasible sample size

Sample size for estimating means

The required sample size depends on the population standard deviation, desired margin of error, and confidence level:

n = \left(\frac{z_{\alpha/2} \cdot \sigma}{E}\right)^2

where n is the sample size, z_{α/2} is the critical value for the desired confidence level, σ is the population standard deviation, and E is the margin of error.

For example, to estimate a population mean within E = 2 units at 95% confidence with σ = 10: n = (1.96 × 10 / 2)² = 96.04, so you'd need at least 97 observations.
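This calculation can be sketched with Python's standard library (`NormalDist` supplies the critical value, and `math.ceil` rounds up to the next whole observation):

```python
import math
from statistics import NormalDist

def n_for_mean(sigma, E, conf=0.95):
    """Required n to estimate a mean within margin E at given confidence."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)  # z_{alpha/2}
    return math.ceil((z * sigma / E) ** 2)

print(n_for_mean(sigma=10, E=2))  # 97, matching the worked example
```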

Sample size for estimating proportions

n = \frac{z_{\alpha/2}^2 \cdot p(1-p)}{E^2}

where p is the estimated population proportion. If you have no prior estimate for p, use p = 0.5 because that maximizes p(1 − p) and gives the most conservative (largest) sample size.
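The same pattern works for proportions, with the conservative default p = 0.5 built in (the margin E = 0.03 below is an illustrative choice):

```python
import math
from statistics import NormalDist

def n_for_proportion(E, p=0.5, conf=0.95):
    """Required n to estimate a proportion within margin E.

    p = 0.5 is the conservative default: it maximizes p(1-p).
    """
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    return math.ceil(z ** 2 * p * (1 - p) / E ** 2)

print(n_for_proportion(E=0.03))  # conservative n for a ±3-point margin
```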

Sample size for comparing means

This depends on the difference in means you want to detect, the population standard deviations, desired statistical power, and significance level. You'll need to specify the null and alternative hypotheses and whether the test is one-tailed or two-tailed.

Sample size for comparing proportions

Similarly, this depends on the difference in proportions to detect, the population proportions, desired power, and significance level. Again, you specify hypotheses and test direction.

Sampling error and bias

Definition of sampling error

Sampling error is the difference between a sample statistic and the corresponding population parameter. It arises from the random variation inherent in drawing a sample rather than measuring the entire population. You can reduce it by increasing sample size or using more efficient sampling methods (like stratified sampling).

Definition of sampling bias

Sampling bias is a systematic error that occurs when the sample is not representative of the population. Unlike sampling error, bias causes estimates to consistently deviate from the true value in one direction. Increasing sample size does not fix bias, because the problem is in the sampling method itself.

Types of sampling bias

  • Selection bias: the sampling method favors certain members over others
    • Voluntary response bias: people who choose to participate often have stronger opinions than those who don't
    • Undercoverage bias: certain population members have a lower probability of being included (e.g., a phone survey that misses people without phones)
  • Non-response bias: people who respond to a survey differ systematically from those who don't
  • Measurement bias: the data collection process itself distorts responses
    • Leading question bias: question wording pushes respondents toward a particular answer
    • Social desirability bias: respondents answer in ways that make them look good rather than answering honestly

Reducing sampling error and bias

  • Use probability sampling methods (SRS, stratified sampling) to ensure representativeness
  • Increase sample size to reduce sampling error (but remember this won't fix bias)
  • Design questionnaires carefully to minimize measurement bias
  • Implement strategies to boost response rates (incentives, follow-up contacts) to reduce non-response bias
  • Compare your sample's key characteristics to known population values and apply weighting techniques to adjust for discrepancies

Point estimation

Definition of point estimation

Point estimation uses sample data to calculate a single value as the best guess for a population parameter. Common point estimates include the sample mean (for the population mean), sample proportion (for the population proportion), and sample variance (for the population variance). A point estimate is concise but tells you nothing about the uncertainty around it.

Properties of good estimators

  • Unbiasedness: the expected value of the estimator equals the true parameter. On average, across many samples, the estimator hits the right target.
  • Efficiency: among all unbiased estimators, the efficient one has the smallest variance. It gives you the tightest clustering around the true value.
  • Consistency: as sample size grows, the estimator converges in probability to the true parameter.
  • Sufficiency: the estimator captures all the information in the sample that's relevant to the parameter.

Maximum likelihood estimation

Maximum likelihood estimation (MLE) finds the parameter values that make the observed data most probable.

  1. Write down the likelihood function, which is the joint probability of observing your sample data given the parameter values.
  2. Take the natural log to get the log-likelihood function (this simplifies the math).
  3. Take the derivative of the log-likelihood with respect to the parameter, set it equal to zero, and solve.

MLE estimators are consistent and asymptotically efficient, which makes them widely used in practice.
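As a concrete illustration of the three steps, consider Bernoulli data: solving step 3 in closed form gives p̂ = (number of successes)/n, and a grid search over the log-likelihood (hypothetical data) confirms the peak lands in the same place:

```python
import math

def bernoulli_log_likelihood(p, data):
    """Log-likelihood of Bernoulli data: sum of log P(x_i | p)."""
    return sum(math.log(p) if x else math.log(1 - p) for x in data)

data = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]  # 7 successes in 10 trials

# Step 3 in closed form: d/dp log L = 0 gives p_hat = successes / n
p_hat = sum(data) / len(data)

# Numerical check: the log-likelihood peaks at p_hat on a fine grid
grid = [i / 1000 for i in range(1, 1000)]
p_best = max(grid, key=lambda p: bernoulli_log_likelihood(p, data))
print(p_hat, p_best)  # 0.7 0.7
```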


Method of moments estimation

Method of moments (MoM) works by setting sample moments equal to their population counterparts and solving for the parameters.

  1. Express population moments (mean, variance, etc.) in terms of the unknown parameters.
  2. Replace population moments with the corresponding sample moments.
  3. Solve the resulting system of equations for the parameter estimates.

MoM is generally less efficient than MLE but can be easier to compute, especially when the likelihood function is hard to work with.
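A small sketch for an exponential distribution, where the first population moment is mean = 1/rate: setting it equal to the sample mean and solving gives rate_hat = 1/x̄ (data simulated here with a known rate):

```python
import random

# Method of moments for an Exponential(rate) distribution:
# population mean = 1/rate, so matching it to the sample mean
# and solving yields rate_hat = 1 / sample_mean.
rng = random.Random(0)
rate_true = 2.0
data = [rng.expovariate(rate_true) for _ in range(10_000)]

sample_mean = sum(data) / len(data)
rate_hat = 1 / sample_mean
print(round(rate_hat, 1))  # close to the true rate of 2.0
```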

Interval estimation

Definition of interval estimation

Interval estimation produces a range of plausible values for a population parameter rather than a single point estimate. These intervals, called confidence intervals, incorporate sample variability and a chosen confidence level to account for uncertainty.

Confidence intervals

A confidence interval is a range of values centered around the point estimate. It's constructed using three ingredients:

  • The point estimate (e.g., sample mean)
  • The standard error of the estimate
  • A critical value from the appropriate distribution (z-distribution or t-distribution)

The confidence level (typically 90%, 95%, or 99%) specifies how often the procedure produces intervals that contain the true parameter.

Interpreting confidence intervals

This is one of the most commonly misunderstood concepts in statistics. A 95% confidence interval does not mean there's a 95% probability the true parameter lies within that specific interval. The true parameter is fixed; it's either in the interval or it isn't.

The correct interpretation: if you repeated the sampling process many times and built a 95% CI each time, about 95% of those intervals would contain the true parameter.
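A quick simulation makes this concrete: draw many samples from a population with a known mean, build a 95% z-interval from each, and count how often the intervals cover the true mean (the population values below are arbitrary):

```python
import random
from statistics import NormalDist, mean

rng = random.Random(42)
mu, sigma, n = 50.0, 10.0, 100
z = NormalDist().inv_cdf(0.975)  # 95% critical value

covered = 0
trials = 2000
for _ in range(trials):
    xbar = mean(rng.gauss(mu, sigma) for _ in range(n))
    half_width = z * sigma / n ** 0.5
    if xbar - half_width <= mu <= xbar + half_width:
        covered += 1

print(covered / trials)  # close to 0.95
```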

Factors affecting confidence interval width

  • Sample size: larger samples produce narrower intervals
  • Population variability: more variable populations produce wider intervals
  • Confidence level: higher confidence (99% vs. 95%) produces wider intervals because you need a bigger net to be more certain
  • Sampling method: more efficient methods like stratified sampling can produce narrower intervals than SRS for the same sample size

Estimating population means

Sample mean as an estimator

The sample mean x̄ is an unbiased and consistent estimator of the population mean μ. It's calculated as:

\bar{x} = \frac{\sum_{i=1}^n x_i}{n}

Unbiasedness means E(x̄) = μ: if you could take every possible sample and average all the sample means, you'd get exactly μ.

Standard error of the mean

The standard error measures how much x̄ varies from sample to sample:

SE(\bar{x}) = \frac{\sigma}{\sqrt{n}}

Notice the √n in the denominator: to cut the standard error in half, you need to quadruple your sample size. When σ is unknown (which is almost always the case), substitute the sample standard deviation s.

Confidence intervals for population means

For a 95% confidence interval:

\bar{x} \pm z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}}

where z_{α/2} = 1.96 for 95% confidence. If σ is unknown and you use s, replace the z-value with the appropriate t-value.

t-distribution for small samples

When the sample size is small (typically n < 30) and σ is unknown, you use the t-distribution instead of the standard normal. The t-distribution has heavier tails, which produces wider confidence intervals. This accounts for the extra uncertainty introduced by estimating σ with s from a small sample.

The degrees of freedom are df = n − 1. As n grows large, the t-distribution converges to the standard normal distribution.
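A small-sample t-interval can be sketched with only the standard library; the critical value t_{0.025, df=14} = 2.145 comes from a standard t table, and the measurements below are hypothetical:

```python
from statistics import mean, stdev

# Hypothetical measurements, n = 15 (small sample, sigma unknown)
data = [12.1, 11.8, 12.5, 12.0, 11.6, 12.3, 12.2, 11.9,
        12.4, 12.1, 11.7, 12.0, 12.6, 11.8, 12.2]
n = len(data)
xbar, s = mean(data), stdev(data)  # stdev uses the n-1 denominator
t_crit = 2.145                     # t table value for df = n - 1 = 14
half_width = t_crit * s / n ** 0.5
print(f"{xbar:.2f} ± {half_width:.2f}")
```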

Estimating population proportions

Sample proportion as an estimator

The sample proportion p̂ is an unbiased and consistent estimator of the population proportion p:

\hat{p} = \frac{x}{n}

where x is the number of "successes" (individuals with the characteristic of interest) and n is the sample size. Its expected value is E(p̂) = p.

Standard error of the proportion

SE(\hat{p}) = \sqrt{\frac{p(1-p)}{n}}

When p is unknown, substitute p̂ from your sample.

Confidence intervals for population proportions

For a 95% confidence interval:

\hat{p} \pm z_{\alpha/2} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}

where z_{α/2} = 1.96 for 95% confidence.
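This interval (often called the Wald interval) can be computed directly; the poll numbers below are hypothetical:

```python
from statistics import NormalDist

def proportion_ci(x, n, conf=0.95):
    """Wald confidence interval for a population proportion."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    p_hat = x / n
    se = (p_hat * (1 - p_hat) / n) ** 0.5  # standard error of p_hat
    return p_hat - z * se, p_hat + z * se

# Hypothetical poll: 520 of 1000 respondents say yes
lo, hi = proportion_ci(520, 1000)
print(f"({lo:.3f}, {hi:.3f})")  # (0.489, 0.551)
```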

Normal approximation for large samples

The sampling distribution of p̂ is technically binomial, but for large samples you can approximate it with a normal distribution. The standard rule of thumb is that both np̂ ≥ 5 and n(1 − p̂) ≥ 5 should hold.

This approximation comes from the Central Limit Theorem: as n increases, the distribution of p̂ approaches a normal distribution regardless of the population's shape. If these conditions aren't met, you should use exact binomial methods instead.