Statistics gives you a set of tools for collecting, organizing, and interpreting data so you can answer questions and make decisions. Whether you're reading a poll, designing an experiment, or just trying to figure out what a dataset is telling you, these foundational terms and concepts come up constantly throughout the course.

Descriptive vs. Inferential Statistics

There are two broad categories of statistics, and the distinction matters because they answer different questions.

Descriptive statistics summarize and describe the main features of a dataset without drawing conclusions beyond the data you have. The goal is to organize and present data in a way that's easy to understand.

Measures like mean, median, mode, and standard deviation
Visual tools like tables, charts, and graphs
Example: calculating the average GPA of 200 students in a class
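
As a minimal sketch of these measures, the snippet below summarizes a small list of GPAs with Python's standard `statistics` module. The eight GPA values are made up for illustration; they are not data from any real class.

```python
import statistics

# Hypothetical GPAs for a handful of students (illustrative values only)
gpas = [3.2, 3.8, 2.9, 3.5, 3.2, 3.9, 2.7, 3.2]

mean_gpa = statistics.mean(gpas)      # arithmetic average
median_gpa = statistics.median(gpas)  # middle value of the sorted data
mode_gpa = statistics.mode(gpas)      # most frequent value
stdev_gpa = statistics.stdev(gpas)    # sample standard deviation (divides by n - 1)

print(f"mean={mean_gpa:.2f}, median={median_gpa}, mode={mode_gpa}, sd={stdev_gpa:.2f}")
```

Every number here describes only the eight values in the list, which is what makes it descriptive. If the list were treated as the entire group of interest rather than a sample, `statistics.pstdev` would be the matching choice for the standard deviation.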

Inferential statistics use sample data to make generalizations, predictions, or conclusions about a larger population. You're going beyond the data you collected.

Techniques include hypothesis testing, confidence intervals, and regression analysis
Example: surveying 1,000 voters to predict the outcome of a national election, or testing a new drug on 500 patients to draw conclusions about its effectiveness for all patients
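
To make the voter example concrete, here is a rough sketch of a 95% confidence interval for a proportion. The count of 540 supporters is a hypothetical figure, and the interval uses the normal approximation with the critical value 1.96, which is one common method among several.

```python
import math

# Hypothetical poll result: 540 of 1,000 sampled voters favor a candidate
n = 1000
supporters = 540

p_hat = supporters / n                          # sample proportion (a statistic)
std_error = math.sqrt(p_hat * (1 - p_hat) / n)  # estimated standard error of p_hat
z = 1.96                                        # approximate critical value for 95% confidence

lower = p_hat - z * std_error
upper = p_hat + z * std_error

print(f"Estimated support: {p_hat:.3f}, 95% CI: ({lower:.3f}, {upper:.3f})")
```

The interval is a statement about all voters, not just the 1,000 who were surveyed, which is the step that moves this from description to inference.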

The key difference: descriptive statistics describe what you have, while inferential statistics use what you have to say something about what you don't have.


Key Terms in Statistical Studies

These terms show up everywhere in statistics, so it's worth getting them straight now.

Population is the entire group of individuals, objects, or events you're interested in studying. Populations are often too large to study completely. If you wanted to know the average income of all U.S. adults, you couldn't realistically collect data from every single person.

Sample is a subset of the population selected for study. A good sample is representative of the population it came from. For example, 1,500 randomly selected U.S. adults could serve as a sample for studying income patterns nationwide.

Parameter is a numerical value that describes a characteristic of a population. Parameters are usually unknown because you rarely have data on the entire population. You estimate them using sample data.

Population mean: $\mu$
Population standard deviation: $\sigma$

Statistic is a numerical value computed from sample data, used to estimate the corresponding parameter.

Sample mean: $\bar{x}$
Sample standard deviation: $s$

A quick way to keep these straight: parameter goes with population, statistic goes with sample.

Sampling is the process of selecting individuals from a population. How you sample matters a lot for whether your results are trustworthy, and you'll spend more time on sampling methods later in this unit.
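
The relationship between parameters and statistics can be seen in a small simulation. The sketch below builds a synthetic "population" of incomes (normally distributed purely for convenience; real incomes are not), computes its parameters, then draws a simple random sample of 1,500 and computes the matching statistics. All numbers are assumptions chosen for illustration.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

# Synthetic population of 20,000 incomes (illustrative, not real data)
population = [random.gauss(50_000, 15_000) for _ in range(20_000)]

# Parameters: computed from the whole population (usually unknown in practice)
mu = statistics.fmean(population)
sigma = statistics.pstdev(population)

# Statistics: computed from a simple random sample of 1,500 individuals
sample = random.sample(population, 1_500)
x_bar = statistics.fmean(sample)
s = statistics.stdev(sample)

print(f"mu    = {mu:,.0f}   sigma = {sigma:,.0f}")
print(f"x_bar = {x_bar:,.0f}   s     = {s:,.0f}")
```

The sample values land close to the population values without matching them exactly. In a real study only `x_bar` and `s` would be available, and they would serve as estimates of $\mu$ and $\sigma$.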


Numerical and Categorical Variables

A variable is any characteristic or quantity that can vary from one individual to another. Variables fall into two main types, and each type breaks down further.

Numerical (quantitative) variables take on numeric values where math operations like addition and averaging make sense.

Discrete variables have countable values, often whole numbers. Examples: number of siblings, number of classes you're taking this semester.
Continuous variables can take on any value within a range, including decimals. Examples: height (5.74 feet), weight (162.3 lbs), temperature.

Categorical (qualitative) variables place individuals into groups or categories. You can count how many fall into each category, but averaging the categories doesn't make sense.

Nominal variables have categories with no inherent order. Examples: blood type (A, B, AB, O), eye color, zip code.
Ordinal variables have categories with a meaningful order, but the distances between categories aren't necessarily equal. Examples: education level (high school, bachelor's, master's, doctorate), satisfaction ratings (poor, fair, good, excellent).

A common mistake is treating zip codes or jersey numbers as numerical variables just because they're made of digits. If you can't meaningfully add or average the values, the variable is categorical.
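
One way to see the difference in practice is to look at which summaries make sense for each type. The sketch below uses small made-up lists: it averages a numerical variable, counts categories for nominal variables, and sorts an ordinal variable by its category order.

```python
from collections import Counter
import statistics

# Hypothetical data for five individuals (illustrative values only)
siblings = [0, 2, 1, 3, 1]                                  # numerical, discrete
blood_types = ["A", "O", "B", "O", "AB"]                    # categorical, nominal
zip_codes = ["10001", "94103", "60614", "10001", "73301"]   # digits, but nominal
ratings = ["good", "poor", "excellent", "fair", "good"]     # categorical, ordinal

# Averaging makes sense for a numerical variable
mean_siblings = statistics.mean(siblings)

# For nominal variables, count how many individuals fall in each category
blood_counts = Counter(blood_types)
zip_counts = Counter(zip_codes)

# Ordinal categories can be ranked, even though the gaps between them are undefined
rating_order = ["poor", "fair", "good", "excellent"]
sorted_ratings = sorted(ratings, key=rating_order.index)

print(mean_siblings, blood_counts, zip_counts, sorted_ratings)
```

Storing zip codes as strings rather than integers is a deliberate choice here: it keeps anyone from averaging them by accident.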

A few more concepts round out the basic vocabulary. You won't dive deep into them yet, but they're worth defining now since they connect directly to inferential statistics later in the course.

Probability is a measure of how likely an event is to occur, expressed as a number between 0 (impossible) and 1 (certain). For example, the probability of flipping heads on a fair coin is $0.5$.
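
A simulation gives a feel for what a probability of 0.5 means. This sketch flips a simulated fair coin 10,000 times and reports the share of heads; the number of flips and the seed are arbitrary choices.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

n_flips = 10_000
# Each flip counts as heads when the random draw in [0, 1) falls below 0.5
heads = sum(random.random() < 0.5 for _ in range(n_flips))
proportion_heads = heads / n_flips

print(f"{heads} heads in {n_flips} flips, proportion = {proportion_heads:.4f}")
```

The proportion should come out near 0.5 but will rarely equal it exactly, and it tends to get closer as the number of flips grows.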

Distribution describes the pattern of how values in a dataset are spread out. Some distributions are symmetric, others are skewed, and the shape tells you a lot about the data.
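
A quick, informal check on shape is to compare the mean with the median. In the sketch below, both lists are invented for illustration; the second differs from the first only in having one very large value.

```python
import statistics

symmetric = [1, 2, 3, 4, 5]       # values balanced around the center
right_skewed = [1, 2, 3, 4, 50]   # one large value stretches the right tail

sym_mean = statistics.mean(symmetric)
sym_median = statistics.median(symmetric)
skew_mean = statistics.mean(right_skewed)
skew_median = statistics.median(right_skewed)

print(f"symmetric:    mean={sym_mean}, median={sym_median}")
print(f"right-skewed: mean={skew_mean}, median={skew_median}")
```

In the symmetric list the mean and median agree, while in the skewed list the single large value pulls the mean well above the median.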

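Variance measures how spread out data points are around the mean; it is the average of the squared deviations from the mean, and the standard deviation is its square root. Higher variance means the values are more scattered; lower variance means they cluster closer to the mean.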
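
The calculation can be sketched step by step. The data values below are made up for illustration, and the results are compared against the standard library's `statistics.pvariance` and `statistics.variance`.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative values only
n = len(data)

mean = sum(data) / n
squared_devs = [(x - mean) ** 2 for x in data]  # squared distance of each value from the mean

pop_var = sum(squared_devs) / n         # population variance: divide by n
samp_var = sum(squared_devs) / (n - 1)  # sample variance: divide by n - 1

print(f"mean={mean}, population variance={pop_var}, sample variance={samp_var:.3f}")
print(statistics.pvariance(data), statistics.variance(data))
```

Dividing by $n - 1$ instead of $n$ is the usual convention when the data are a sample being used to estimate the population variance.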

Correlation measures the strength and direction of the linear relationship between two variables, on a scale from $-1$ to $1$. A positive correlation means both variables tend to increase together; a negative correlation means one tends to decrease as the other increases.
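
One common way to put a number on correlation is the Pearson correlation coefficient. The sketch below computes it from scratch for hypothetical study-hours and exam-score data; the helper name `pearson_r` and the values are assumptions made for this example.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of paired deviations from each mean
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cross / math.sqrt(ss_x * ss_y)

# Hypothetical data: hours studied and exam scores for five students
hours = [1, 2, 3, 4, 5]
scores = [52, 60, 61, 70, 77]

r = pearson_r(hours, scores)                         # scores rise with hours
r_flipped = pearson_r(hours, [-v for v in scores])   # reversing one variable flips the sign

print(f"r = {r:.3f}, r with scores negated = {r_flipped:.3f}")
```

Values near $1$ or $-1$ indicate a strong linear pattern, and values near $0$ indicate a weak one. The function assumes neither list is constant, since a variable with no spread would make the denominator zero.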

Hypothesis testing is a formal procedure for using sample data to test a claim about a population. It involves two competing statements:

Null hypothesis ($H_0$): a statement of no effect or no difference
Alternative hypothesis ($H_a$): a statement that there is an effect or difference

You'll work through the mechanics of hypothesis testing in detail later, but the core idea is straightforward: you assume the null hypothesis is true and then check whether your sample data provides strong enough evidence to reject it.
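
As a preview of that logic, here is a minimal sketch of an exact binomial test for whether a coin is fair. The observed result (60 heads in 100 flips) and the 0.05 cutoff are assumptions for illustration, and the "at least as far from expected" rule for the two-sided p-value relies on the null distribution being symmetric, which holds when the null probability is 0.5.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Hypothetical experiment: 60 heads observed in 100 flips
n_flips = 100
observed_heads = 60

p_null = 0.5                  # H0: the coin is fair
expected = n_flips * p_null   # heads expected if H0 is true
distance = abs(observed_heads - expected)

# Two-sided p-value: total probability, assuming H0, of every outcome
# at least as far from the expected count as the one observed
p_value = sum(
    binom_pmf(k, n_flips, p_null)
    for k in range(n_flips + 1)
    if abs(k - expected) >= distance
)

alpha = 0.05                  # assumed significance level for this example
reject_null = p_value < alpha

print(f"p-value = {p_value:.4f}, reject H0: {reject_null}")
```

With these numbers the p-value comes out slightly above 0.05, so the data would not count as strong enough evidence to reject the null hypothesis at that cutoff, even though 60 heads might look lopsided.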