Data in statistics comes in two main flavors: qualitative and quantitative. Qualitative data describes attributes using words or categories, while quantitative data uses numbers. Understanding these types helps you choose the right analysis methods.

Sampling is how you collect data without surveying an entire population. Random sampling methods, like simple random and stratified sampling, help ensure your sample actually represents the whole population. The method you choose and the errors you watch for directly affect whether your results are reliable.

Types of Data

Qualitative vs quantitative data

Qualitative data (also called categorical data) describes attributes, characteristics, or categories using non-numerical values. You can sort it into groups, but you can't do meaningful math with it.

- Examples: Eye color (blue, brown, green), marital status (single, married, divorced), political affiliation (Democrat, Republican, Independent)
- Typically analyzed by counting how often each category appears and looking for patterns across groups
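
As a rough illustration of "counting how often each category appears," here is a minimal Python sketch. The `eye_colors` list is made-up data, not from any real survey.

```python
from collections import Counter

# Hypothetical survey responses (made-up data for illustration).
eye_colors = ["brown", "blue", "brown", "green", "brown", "blue"]

counts = Counter(eye_colors)   # how often each category appears
total = len(eye_colors)
relative = {color: n / total for color, n in counts.items()}

# Print categories from most to least frequent.
for color, n in counts.most_common():
    print(f"{color}: {n} ({relative[color]:.0%})")
# brown: 3 (50%)
# blue: 2 (33%)
# green: 1 (17%)
```

Notice that the only arithmetic here is counting and dividing counts by the total; there is no meaningful "average eye color."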

Quantitative data represents numerical values that can be measured or counted. It splits into two subtypes:

Discrete quantitative data consists of countable values, often whole numbers. Think of things you can count one by one.
- Examples: Number of siblings (0, 1, 2, 3), number of cars owned by a household (1, 2, 3)
- Analyzed using frequency distributions and measures of central tendency
Continuous quantitative data can take on any value within a range, including decimals. Think of things you measure rather than count.
- Examples: Height (62.5 inches, 71.2 inches), weight (150.3 pounds, 175.8 pounds), temperature (98.6°F, 101.2°F)
- Analyzed using histograms, scatterplots, and measures of variability

A quick way to tell them apart: if the natural question is "how many?", the data is discrete. If it's "how much?" or "how long?", the data is likely continuous.

Sampling Methods

Random sampling methods

Simple random sampling gives each member of the population an equal chance of being selected, and every possible sample of a given size an equal chance of being the one drawn. You pull from the entire population at once with no grouping involved.

Example: Assigning every student at a 1,000-person school a number, then using a random number generator to pick 100 of them.
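
The school example might look like this in Python. This is a sketch, assuming students are numbered 1 through 1,000; the seed value is arbitrary and only there to make the draw repeatable.

```python
import random

rng = random.Random(42)  # arbitrary seed (assumption) so the draw is repeatable

student_ids = list(range(1, 1001))          # students numbered 1..1000
sample_ids = rng.sample(student_ids, 100)   # 100 distinct students, each equally likely

print(sorted(sample_ids)[:5])  # peek at a few of the selected IDs
```

`random.sample` draws without replacement, so no student can be picked twice.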

Stratified sampling divides the population into subgroups (called strata) based on a specific characteristic, then randomly samples from each stratum. This guarantees every subgroup is represented.

Example: Dividing a city's population by income level (low, middle, high) and randomly sampling from each income level.
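
A minimal sketch of the income-level example follows. The `stratified_sample` helper, the resident data, and the choice to take the same 10% from each stratum (proportional allocation) are all assumptions made for illustration; other allocation rules are possible.

```python
import random

def stratified_sample(population, stratum_of, fraction, rng):
    """Randomly sample the same fraction from every stratum (hypothetical helper)."""
    # Step 1: split the population into strata.
    strata = {}
    for member in population:
        strata.setdefault(stratum_of(member), []).append(member)
    # Step 2: draw a simple random sample inside each stratum.
    sample = []
    for members in strata.values():
        n = max(1, round(len(members) * fraction))  # at least one per stratum
        sample.extend(rng.sample(members, n))
    return sample

# Made-up city: (resident_id, income_level) pairs.
levels = ["low"] * 500 + ["middle"] * 400 + ["high"] * 100
residents = list(enumerate(levels))

rng = random.Random(0)  # arbitrary seed for repeatability
sample = stratified_sample(residents, lambda r: r[1], 0.10, rng)
```

With these made-up numbers the sample contains 50 low-income, 40 middle-income, and 10 high-income residents, so every income level is guaranteed a seat at the table.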

Cluster sampling divides the population into naturally occurring groups (called clusters), then randomly selects entire clusters and includes all members within those selected clusters.

Example: Randomly selecting 10 city blocks out of 200 and surveying every household on those 10 blocks.
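
The city-block example could be sketched as below. The block contents are invented (each block gets a random number of made-up household labels); only the "pick 10 of 200 blocks, then take everyone on them" logic comes from the example.

```python
import random

rng = random.Random(1)  # arbitrary seed for repeatability

# Hypothetical city: 200 blocks, each with 5 to 15 made-up households.
blocks = {
    b: [f"block{b}-house{h}" for h in range(rng.randint(5, 15))]
    for b in range(1, 201)
}

chosen_blocks = rng.sample(sorted(blocks), 10)               # pick 10 whole blocks
households = [h for b in chosen_blocks for h in blocks[b]]   # survey everyone on them
```

The randomness lives entirely in which blocks get chosen. Once a block is in, every household on it is surveyed.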

Systematic sampling starts at a randomly selected member, then picks every $k$th member after that. The interval $k$ is the population size divided by the desired sample size.

Example: You want 100 people from a list of 1,000. That gives $k = 1000 / 100 = 10$. Pick a random starting point, then select every 10th name.
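
Here is one way the 1,000-name example might be coded. The `systematic_sample` helper and the name list are hypothetical. The sketch also assumes the random start is drawn from the first $k$ positions, a common convention that lets a single pass through the list yield the full sample.

```python
import random

def systematic_sample(items, sample_size, rng):
    """Take every k-th item after a random start (hypothetical helper)."""
    k = len(items) // sample_size   # sampling interval
    start = rng.randrange(k)        # random start within the first k items
    return items[start::k][:sample_size]

names = [f"person_{i}" for i in range(1, 1001)]   # made-up list of 1,000 names
picked = systematic_sample(names, 100, random.Random(7))
```

One caution with this design: if the list has a repeating pattern that lines up with $k$, the sample can be unrepresentative even though the start was random.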

Stratified vs. Cluster: In stratified sampling, you sample within every group. In cluster sampling, you pick whole groups and skip the rest. Stratified gives you representation from all subgroups; cluster is often easier and cheaper to carry out.

Variation in Data and Sampling

No sample perfectly mirrors its population. Understanding where variation comes from helps you judge how trustworthy your results are.

Sources of data variation

Sampling errors come from the random nature of the sampling process itself. Because you're only looking at part of the population, your sample statistics will differ somewhat from the true population parameters.

Sampling variability is the natural variation you see between different samples drawn from the same population. If you surveyed 100 different groups of 50 students, you'd get slightly different averages each time. This is normal and expected.
Sampling error can be reduced by increasing the sample size. Larger samples tend to produce statistics closer to the true population parameter.
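
A small simulation can make both points concrete. This sketch uses an invented population of test scores and a hypothetical helper, `spread_of_sample_means`, that measures how much the sample mean bounces around from one sample to the next.

```python
import random
import statistics

rng = random.Random(123)  # arbitrary seed for repeatability

# Hypothetical population: 10,000 made-up test scores.
population = [rng.gauss(70, 10) for _ in range(10_000)]

def spread_of_sample_means(sample_size, n_samples=200):
    """Standard deviation of the means of repeated random samples."""
    means = [
        statistics.mean(rng.sample(population, sample_size))
        for _ in range(n_samples)
    ]
    return statistics.stdev(means)

spread_small = spread_of_sample_means(25)    # many samples of 25 scores
spread_large = spread_of_sample_means(400)   # many samples of 400 scores

print(f"n=25:  sample means vary by about {spread_small:.2f}")
print(f"n=400: sample means vary by about {spread_large:.2f}")
```

The means from the larger samples should cluster much more tightly around the population mean, which is the sense in which a bigger sample reduces sampling error.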

Nonsampling errors are problems that arise during data collection, processing, or analysis, not from the randomness of sampling.

Measurement errors: Inaccuracies from faulty instruments, unclear survey questions, or human mistakes during data collection.
Nonresponse bias: When people chosen for the sample don't respond or participate. If nonrespondents differ systematically from respondents, your results get skewed.
Voluntary response bias: When people self-select into a study (like online polls). Those with strong opinions are more likely to participate, so the sample won't represent the broader population.
Processing errors: Mistakes made during data entry, coding, or analysis.

Note: Sampling bias (using a method that systematically favors certain members of the population) is sometimes grouped under sampling error, but it's really a design flaw rather than random chance. A biased sampling method won't improve just by increasing sample size.
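
To see why a bigger sample doesn't rescue a biased design, consider this sketch. The population (weekly study hours for honors and non-honors students) and the flawed design (only honors students can be sampled) are invented for illustration.

```python
import random
import statistics

rng = random.Random(2024)  # arbitrary seed for repeatability

# Hypothetical population: honors students study more on average.
honors = [rng.gauss(15, 3) for _ in range(2_000)]
others = [rng.gauss(8, 3) for _ in range(8_000)]
true_mean = statistics.mean(honors + others)

def biased_estimate(n):
    # Flawed design: only honors students can end up in the sample.
    return statistics.mean(rng.sample(honors, n))

error_small = abs(biased_estimate(50) - true_mean)
error_large = abs(biased_estimate(1_500) - true_mean)

print(f"error with n=50:   {error_small:.2f} hours")
print(f"error with n=1500: {error_large:.2f} hours")
```

Both estimates should miss the true mean by roughly the same large amount. The extra 1,450 students only make the estimate more precise about the wrong group.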

Statistical Inference

Statistical inference is the process of using sample data to draw conclusions about a larger population. A few key terms tie this together:

Population: The entire group you want information about.
Sample: The subset of the population you actually study.
Parameter: A numerical characteristic of the population (e.g., the true population mean). You usually don't know this exactly.
Statistic: A numerical characteristic calculated from a sample (e.g., the sample mean). This is what you measure and use to estimate the parameter.
Probability: The likelihood of an event occurring, used to quantify uncertainty when making inferences from samples.
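Confidence interval: A range of values, calculated from sample data, that is likely to contain the true population parameter. "Likely" is made precise by a confidence level such as 95%, which describes how often the method captures the parameter across repeated samples.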
Hypothesis testing: A formal method for deciding whether sample data provides enough evidence to support a claim about a population parameter.

The core logic of inference: you can't study everyone, so you study a sample, calculate a statistic, and use it to estimate the population parameter. How much you trust that estimate depends on your sample size, sampling method, and the variation in your data.
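
The sample-to-statistic-to-estimate chain can be sketched in a few lines. The data below is simulated, and `mean_confidence_interval` is a hypothetical helper that uses a normal approximation with $z = 1.96$ for roughly 95% confidence. That approximation is reasonable for moderately large samples; for small samples a t-based interval is usually preferred.

```python
import math
import random
import statistics

def mean_confidence_interval(sample, z=1.96):
    """Approximate 95% confidence interval for the population mean."""
    n = len(sample)
    mean = statistics.mean(sample)                       # the statistic
    std_err = statistics.stdev(sample) / math.sqrt(n)    # its estimated variability
    return mean - z * std_err, mean + z * std_err

# Simulated sample of 50 heights in inches (made-up data).
rng = random.Random(5)
heights = [rng.gauss(67, 3) for _ in range(50)]

sample_mean = statistics.mean(heights)
low, high = mean_confidence_interval(heights)
print(f"sample mean: {sample_mean:.1f} in, 95% CI: ({low:.1f}, {high:.1f})")
```

The interval's width depends on the sample size and on the variation in the data, and it is only meaningful if the sampling method itself was sound.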