Descriptive Statistics
Descriptive statistics give you the tools to summarize a dataset with just a few numbers and visuals. Instead of staring at hundreds of raw data points, you can describe the center, spread, and shape of your data in ways that reveal patterns, outliers, and differences between groups.

Measures of Central Tendency and Variability
Central Tendency
Measures of central tendency boil a dataset down to a single "typical" value. Each measure has strengths depending on the situation.
- Mean is the arithmetic average: sum all values and divide by the count. It works well for roughly symmetric data but gets pulled toward extreme values. For example, a few very high incomes can drag the mean well above what most people actually earn.
- Median is the middle value when data are sorted from smallest to largest. If the dataset has an even number of values, you average the two middle ones. Because it ignores how extreme the tails are, the median is more resistant to outliers than the mean. That's why median household income is often reported instead of mean income.
- Mode is the most frequently occurring value. A dataset can have one mode, multiple modes, or no mode at all. It's the only measure of central tendency that applies to categorical data (e.g., the most common shoe size sold at a store).
A useful check: if the mean and median are close together, the distribution is likely roughly symmetric. If they differ noticeably, that's a sign of skewness.
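All three measures are one-liners with Python's standard-library `statistics` module; the small income-like dataset below is invented purely to show how one extreme value pulls the mean but not the median:

```python
import statistics

# Hypothetical incomes (in thousands); 250 is a deliberate extreme value
incomes = [32, 35, 38, 40, 41, 43, 45, 250]

mean = statistics.mean(incomes)      # dragged upward by the 250
median = statistics.median(incomes)  # average of the two middle values (even count)
shoe_sizes = [7, 8, 8, 9, 10, 8]
mode = statistics.mode(shoe_sizes)   # most frequent value; works for categories too

print(mean, median, mode)            # mean ≈ 65.5 vs. median 40.5: a skew signal
```

Here the mean (65.5) sits far above the median (40.5), which is exactly the mean-vs-median gap described above.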
Variability
Measures of variability tell you how spread out the values are around the center.
- Range is simply maximum minus minimum. It's easy to compute but sensitive to outliers since it depends on only two values.
- Variance measures the average squared deviation from the mean. Squaring ensures that deviations above and below the mean don't cancel out.
- Sample variance: s² = Σ(xᵢ − x̄)² / (n − 1)
- Population variance: σ² = Σ(xᵢ − μ)² / N. Notice the denominator difference. Sample variance divides by n − 1 (called Bessel's correction) because dividing by n would systematically underestimate the true population variance.
- Standard deviation is the square root of the variance, which brings the units back to the original scale. Sample standard deviation is s; population standard deviation is σ.
- Coefficient of variation (CV) equals the standard deviation divided by the mean, often expressed as a percentage. It lets you compare variability across datasets with different units or scales. For instance, comparing the spread of heights (in cm) to the spread of weights (in kg) only makes sense using CV, not raw standard deviations.
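All four spread measures can be computed with the `statistics` module on a small hypothetical dataset; note how `variance` (n − 1 denominator) and `pvariance` (n denominator) differ, which is Bessel's correction in action:

```python
import statistics

data = [10, 12, 14, 16, 18]  # hypothetical measurements

rng = max(data) - min(data)             # range: max minus min → 8
svar = statistics.variance(data)        # sample variance, divides by n - 1 → 10
pvar = statistics.pvariance(data)       # population variance, divides by n → 8
sd = statistics.stdev(data)             # sample standard deviation = sqrt(10)
cv = sd / statistics.mean(data) * 100   # coefficient of variation, in percent

print(rng, svar, pvar, round(cv, 1))
```

Because the sample variance divides the same sum of squared deviations by a smaller number, it is always at least as large as the population variance for the same data.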
Interpreting These Together
- Higher values of range, variance, and standard deviation signal greater spread. Think of income data in a country with high inequality versus manufacturing measurements held to tight tolerances.
- Comparing central tendency to variability reveals distribution shape. A dataset where the mean is much larger than the median, combined with a large standard deviation, likely has a right skew with some extreme high values.

Graphical Representations of Data Distributions
Histograms
A histogram divides a continuous variable's range into equal-width bins and shows the frequency (or relative frequency) of observations in each bin.
How to read one:
- Check the overall shape: symmetric, left-skewed, right-skewed, or bimodal.
- Identify where the center falls (the tallest cluster of bars).
- Note the spread and whether there are gaps or isolated bars that suggest outliers.
Bin width matters. Too few bins can hide important features; too many bins create noisy, hard-to-read plots. Most software chooses a reasonable default, but you should experiment if the shape looks odd.
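The binning step is simple enough to sketch by hand; this is a minimal illustration (with made-up data and hand-picked bin edges), not how plotting libraries implement it:

```python
def histogram_counts(values, bins, lo, hi):
    """Count values into `bins` equal-width bins covering [lo, hi]."""
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # Map the value to a bin index; clamp hi itself into the last bin
        i = min(int((v - lo) / width), bins - 1)
        counts[i] += 1
    return counts

data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]
print(histogram_counts(data, bins=4, lo=0, hi=10))   # 4 wide bins
print(histogram_counts(data, bins=10, lo=0, hi=10))  # narrower bins, noisier
```

Running it with 4 versus 10 bins on the same data shows the trade-off directly: the coarse version smooths the cluster around 3, while the fine version exposes the gap before the isolated value at 9.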
Box Plots
A box plot displays the five-number summary: minimum, first quartile (Q1), median, third quartile (Q3), and maximum.
- The box spans from Q1 to Q3, so its width represents the interquartile range (IQR = Q3 − Q1), which captures the middle 50% of the data.
- The line inside the box marks the median.
- Whiskers extend to the smallest and largest values that are not outliers.
- Outliers are plotted as individual points beyond the whiskers. The standard rule: any value more than 1.5 × IQR below Q1 or above Q3 is flagged as an outlier.
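The quartiles and the 1.5 × IQR fences can be computed with `statistics.quantiles`; the dataset below is hypothetical, with one value planted well outside the fences (note that different quantile methods give slightly different quartiles on small samples):

```python
import statistics

data = [2, 4, 4, 5, 6, 7, 8, 9, 30]  # 30 is a deliberate extreme value

q1, med, q3 = statistics.quantiles(data, n=4)  # default "exclusive" method
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, med, q3, outliers)  # 30 falls above the upper fence
```

Everything inside the fences would be covered by the whiskers; only the flagged points are drawn individually.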
Box plots are especially powerful for side-by-side comparisons. Placing box plots for different groups next to each other (e.g., salaries by education level) makes differences in center, spread, and outliers immediately visible.
What to Look For in Both
- Shape: symmetric, left-skewed (tail stretches left), right-skewed (tail stretches right), or bimodal (two peaks).
- Outliers: isolated points far from the bulk of the data. Housing price data, for example, often has a few extreme high values.
- Group comparisons: side-by-side histograms or box plots let you compare distributions across categories (e.g., male vs. female heights).

Descriptive Statistics for Data Comparison
Categorical Data
- Frequency tables show the count or percentage of observations in each category. These are the starting point for any categorical summary.
- Bar charts display category frequencies as bars. Unlike histograms, the bars don't touch because the categories are discrete.
- Pie charts show each category's proportion of the whole. They work for a small number of categories but become hard to read with many slices.
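A frequency table in both count and percentage form is a few lines with `collections.Counter`; the category labels here are hypothetical:

```python
from collections import Counter

# Hypothetical survey responses
colors = ["red", "blue", "red", "green", "blue", "red"]

freq = Counter(colors)                    # counts per category
total = sum(freq.values())
pct = {k: round(100 * v / total, 1) for k, v in freq.items()}

print(freq)  # Counter({'red': 3, 'blue': 2, 'green': 1})
print(pct)
```

The counts feed a bar chart directly, and the percentages are exactly the slice proportions a pie chart would show.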
Numerical Data
- Summarize with measures of central tendency and variability (mean, median, standard deviation).
- Visualize with histograms and box plots to see the full distribution, not just summary numbers.
Comparing Groups
- For categorical variables, use side-by-side bar charts or contingency tables. A contingency table cross-tabulates two categorical variables (e.g., political party affiliation broken down by age group) so you can spot patterns in the joint distribution.
- For numerical variables, use side-by-side box plots or overlaid histograms. Calculate summary statistics for each group separately and compare them directly. For example, comparing average test scores across school districts becomes much clearer when you also compare the standard deviations to see which district has more consistency.
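Computing per-group summaries side by side is straightforward; the district names and scores below are entirely made up to illustrate the "similar center, different spread" pattern:

```python
import statistics

# Hypothetical test scores by district
scores = {
    "district_a": [78, 82, 85, 88, 90],
    "district_b": [60, 75, 85, 95, 100],
}

summary = {
    name: {"mean": statistics.mean(vals), "stdev": statistics.stdev(vals)}
    for name, vals in scores.items()
}

for name, stats in summary.items():
    print(name, round(stats["mean"], 1), round(stats["stdev"], 1))
```

The two districts have nearly identical means, but district_b's standard deviation is more than three times larger, so its scores are far less consistent; comparing means alone would miss this.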
Population, Sample, and Distribution Characteristics
- Population is the entire group you want to draw conclusions about. A sample is a subset of that population. We use sample statistics (like x̄ and s) to estimate population parameters (like μ and σ).
- Distribution describes the pattern of values in a dataset. It can be visualized with a histogram or described numerically with center, spread, and shape.
- Z-scores standardize individual observations by measuring how many standard deviations a value falls from the mean: z = (x − μ) / σ.
A z-score of 2.0 means the value is 2 standard deviations above the mean. Z-scores let you compare values from different distributions on the same scale (e.g., comparing a student's score on two exams with different means and standard deviations).
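The two-exam comparison works like this in code (the exam means and standard deviations are invented for illustration):

```python
def z_score(x, mean, sd):
    """Standard deviations above (+) or below (-) the mean."""
    return (x - mean) / sd

# Hypothetical exams with different scales
z1 = z_score(85, mean=75, sd=5)     # exam 1: 2.0 SDs above the mean
z2 = z_score(560, mean=500, sd=40)  # exam 2: 1.5 SDs above the mean

print(z1, z2)
```

Even though 560 is a much larger raw number than 85, the first score is the stronger relative performance: 2.0 standard deviations above its mean versus 1.5.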
- Percentiles indicate the relative position of a value. The 25th percentile means 25% of the data fall at or below that value. The 50th percentile is the median, and the 25th and 75th percentiles correspond to Q1 and Q3.
- Correlation measures the strength and direction of the linear relationship between two numerical variables. Values range from −1 (perfect negative linear relationship) to +1 (perfect positive linear relationship), with 0 indicating no linear association. Correlation does not imply causation, and it won't capture nonlinear patterns.