Types of Data and Sampling Methods
Data types and sampling methods form the foundation of statistical analysis. Knowing what kind of data you're working with determines which tools you can use, and knowing how to sample properly determines whether your results actually mean anything.

Qualitative vs quantitative data
Qualitative data (also called categorical data) describes attributes or categories rather than numerical measurements. You analyze it using frequencies, proportions, or percentages. Examples: hair color (blonde, brown, black, red), movie genres (action, comedy, drama), or phone brands (Apple, Samsung, Google).
Quantitative data is numeric and can be measured or counted. It splits into two subtypes:
- Discrete quantitative data has countable values, often whole numbers. There are gaps between possible values. Think: number of pets owned (0, 1, 2, 3), number of languages spoken (1, 2, 3, 4), or number of bedrooms in a house (1, 2, 3, 4, 5).
- Continuous quantitative data can take on any value within a range, including decimals. There are infinitely many possible values between any two points. Think: body temperature (98.6°F, 99.2°F, 100.5°F), distance traveled (5.2 miles, 10.7 miles), or time spent studying (1.5 hours, 2.75 hours).
A quick test: if you can ask "how many?" it's likely discrete. If you ask "how much?" or "how long?" it's likely continuous.
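The distinction matters in practice because the two data types call for different summaries. A minimal sketch, using hypothetical data and only the Python standard library, where categorical data gets frequencies and proportions while quantitative data gets numeric summaries like the mean:

```python
from collections import Counter
from statistics import mean

# Qualitative (categorical): summarize with frequencies and proportions.
hair_colors = ["brown", "blonde", "brown", "black", "red", "brown"]
counts = Counter(hair_colors)
proportions = {color: k / len(hair_colors) for color, k in counts.items()}

# Discrete quantitative: countable values with gaps ("how many?").
pets_owned = [0, 1, 2, 1, 3, 0]

# Continuous quantitative: any value in a range ("how much?" / "how long?").
study_hours = [1.5, 2.75, 0.5, 3.25]

print(counts["brown"], proportions["brown"])  # frequency and proportion
print(mean(pets_owned), mean(study_hours))    # numeric summaries
```

Note that taking the mean of `hair_colors` would be meaningless; that is exactly the "which tools you can use" point from above.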
Interpretation of two-way tables
Two-way tables (also called contingency tables) display counts of data organized by two categorical variables. Rows represent levels of one variable, columns represent levels of the other, and each cell shows the count for that specific combination.
Marginal distributions describe a single variable while ignoring the other. You find them by summing across an entire row or column.
These totals appear in the "margins" of the table (the rightmost column and bottom row), which is where the name comes from.
Conditional distributions show the distribution of one variable given a specific level of the other variable. You calculate them by dividing a cell value by its corresponding row or column total.
- P(A | B), read as "the probability of A given B"
- P(B | A), read as "the probability of B given A"
The key distinction: marginal distributions look at one variable across the whole table, while conditional distributions zoom in on one variable within a specific group of the other variable. When a question says "given that" or "among those who," it's asking for a conditional distribution.
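The marginal-versus-conditional distinction can be made concrete with a small two-way table. A sketch using hypothetical counts (rows = grade level, columns = preferred sport), where row and column sums give the marginal totals and a cell divided by its row total gives a conditional proportion:

```python
# Hypothetical two-way table: rows = grade level, columns = preferred sport.
table = {
    "freshman":  {"soccer": 30, "basketball": 20},
    "sophomore": {"soccer": 25, "basketball": 25},
}

# Marginal totals for grade level: sum across each row.
row_totals = {grade: sum(cells.values()) for grade, cells in table.items()}

# Marginal totals for sport: sum down each column.
sports = ["soccer", "basketball"]
col_totals = {s: sum(table[g][s] for g in table) for s in sports}

grand_total = sum(row_totals.values())

# Conditional: P(soccer | freshman) = cell count / row total.
p_soccer_given_freshman = table["freshman"]["soccer"] / row_totals["freshman"]

# Marginal: P(soccer) = column total / grand total, ignoring grade level.
p_soccer = col_totals["soccer"] / grand_total
```

Here "among freshmen, what fraction prefer soccer?" (conditional, 30/50) and "what fraction of everyone prefers soccer?" (marginal, 55/100) are different questions with different answers.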

Methods of random sampling
Each sampling method has a specific use case. Choosing the right one depends on your population structure, available resources, and what kind of representation you need.
- Simple random sampling (SRS): Every member of the population has an equal chance of being selected. You randomly choose n individuals from a population of size N, often using a random number generator. Example: randomly selecting 100 students from a school of 1,000 students. SRS is the gold standard, but it can be impractical for large or geographically spread-out populations.
- Stratified sampling: Divide the population into homogeneous subgroups called strata based on a shared characteristic, then perform SRS within each stratum. This guarantees representation from every subgroup. Example: dividing a city's population by income level (low, medium, high) and randomly sampling from each level. Use this when you know certain subgroups matter for your research question.
- Cluster sampling: Divide the population into naturally occurring groups called clusters (like city blocks, classrooms, or hospitals). Randomly select some clusters, then survey everyone within those chosen clusters. Example: randomly selecting 10 city blocks and surveying all households in those blocks. This is practical when you don't have a complete list of every individual in the population, but be aware that clusters may not be internally diverse.
- Systematic sampling: Randomly pick a starting point, then select every kth element from the list, where k = N/n (population size divided by sample size). Example: from a list of 1,000 people with a desired sample of 100, you'd pick every 10th person starting from a randomly chosen point. This spreads the sample evenly across the list, but watch out for bias if the list has a repeating pattern that aligns with your interval.
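The four methods above can be sketched with Python's `random` module. A toy example under stated assumptions: the population is 1,000 numbered IDs, the strata boundaries and cluster sizes are invented for illustration, and the target sample size is 100:

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

population = list(range(1000))  # e.g. 1,000 student IDs
n = 100                         # desired sample size

# Simple random sampling: every member equally likely to be chosen.
srs = random.sample(population, n)

# Stratified sampling: SRS within each (hypothetical) income stratum,
# sampling proportionally to stratum size.
strata = {"low": population[:300], "medium": population[300:700],
          "high": population[700:]}
stratified = [person
              for group in strata.values()
              for person in random.sample(group, n * len(group) // len(population))]

# Cluster sampling: split into 100 "blocks" of 10, randomly choose
# 10 blocks, then survey everyone in the chosen blocks.
clusters = [population[i:i + 10] for i in range(0, len(population), 10)]
chosen_blocks = random.sample(clusters, 10)
cluster_sample = [person for block in chosen_blocks for person in block]

# Systematic sampling: every kth element, k = N/n, from a random start.
k = len(population) // n        # k = 10
start = random.randrange(k)
systematic = population[start::k]
```

All four produce samples of size 100 here, but they differ in which combinations of individuals are possible: the cluster sample, for instance, can only ever contain whole blocks.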
Measures of variability and statistical inference
- Variance quantifies the average squared deviation from the mean. Squaring the deviations ensures that values above and below the mean don't cancel each other out. A larger variance means the data points are more spread out.
- Standard deviation is the square root of the variance. It brings the measure of spread back into the original units of the data, making it more interpretable. If the standard deviation of exam scores is 8 points, that tells you roughly how far a typical score falls from the mean.
- Sampling distribution is the distribution of a statistic (like the sample mean) across all possible samples of a given size from a population. It tells you how much that statistic varies from sample to sample.
- Central Limit Theorem (CLT): As the sample size increases, the sampling distribution of the sample mean approaches a normal distribution, regardless of the shape of the original population distribution. This is what allows us to use normal distribution properties for inference even when the population itself isn't normal.
- Confidence interval is a range of values constructed from sample data that is likely to contain the true population parameter. It's built from the sample statistic plus or minus a margin of error (which depends on the standard error). A wider interval means less precision; a narrower interval means more precision.
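These ideas can be illustrated together in a short simulation. A sketch using only the standard library, with hypothetical exam scores for variance/standard deviation, an exponential (skewed) population to show the CLT at work, and the normal critical value 1.96 as an approximation for a rough 95% confidence interval:

```python
import random
import statistics

random.seed(0)

# Variance and standard deviation of hypothetical exam scores.
scores = [72, 85, 90, 78, 95, 88, 70, 82]
var = statistics.variance(scores)  # sample variance (divides by n - 1)
sd = statistics.stdev(scores)      # square root of variance, in score units

# CLT sketch: the population is exponential (heavily skewed, mean = 1.0),
# yet the means of many samples of size 50 cluster tightly around 1.0
# in a roughly bell-shaped pile.
sample_means = [statistics.mean(random.expovariate(1.0) for _ in range(50))
                for _ in range(2000)]

# Rough 95% confidence interval for the mean of one sample:
# statistic +/- 1.96 * standard error.
sample = [random.expovariate(1.0) for _ in range(50)]
se = statistics.stdev(sample) / len(sample) ** 0.5  # standard error
m = statistics.mean(sample)
ci = (m - 1.96 * se, m + 1.96 * se)
```

A larger sample shrinks the standard error and therefore narrows the interval, which is the precision trade-off described above.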