Descriptive statistics and data analysis give you tools to summarize datasets, spot patterns, and draw conclusions from numbers. These techniques show up constantly in more advanced math and science courses, so building a strong foundation here pays off. This section covers measures of central tendency, dispersion, distribution shape, data visualization, and methods for comparing datasets.
Data Summarization and Interpretation

Measures of Central Tendency
Mean is the average value: sum all data points and divide by how many there are. The mean is sensitive to outliers, meaning a single unusually high or low value can pull it significantly up or down.
Median is the middle value when you arrange the data from least to greatest. If there's an even number of data points, you average the two middle values. Because the median isn't dragged around by extreme values, it's often a better measure of center for skewed distributions.
Mode is the most frequently occurring value. A dataset can have no mode (every value appears once), one mode (unimodal), or multiple modes (bimodal, multimodal). Mode is especially useful for categorical data where mean and median don't apply.
When the mean and median are close together, the data is roughly symmetric. When they differ noticeably, that's a sign of skewness. If the mean is greater than the median, expect a right skew; if the mean is less, expect a left skew.
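All three measures can be computed with Python's standard library. The exam scores below are a hypothetical dataset with one high outlier (98) to show the mean getting pulled upward:

```python
from statistics import mean, median, multimode

# Hypothetical exam scores; 98 is an unusually high outlier
scores = [70, 72, 75, 75, 78, 80, 98]

print(mean(scores))       # pulled above the median by the outlier
print(median(scores))     # middle of the 7 sorted values: 75
print(multimode(scores))  # most frequent value(s): [75]
```

Note that the mean exceeds the median here, which is the signature of right skew described above.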
Measures of Dispersion
These statistics tell you how spread out the data is.
- Range = maximum value − minimum value. It gives a quick sense of spread but is heavily influenced by outliers.
- Example: For {10, 15, 20, 25, 30}, the range is 30 − 10 = 20.
- Interquartile Range (IQR) captures the spread of the middle 50% of the data: IQR = Q3 − Q1. Because it ignores the extremes, it's much more resistant to outliers than the range.
- Example: For {10, 15, 20, 25, 30}, Q1 = 15 and Q3 = 25, so IQR = 25 − 15 = 10.
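The range and IQR examples above can be reproduced with `statistics.quantiles`. Quartile conventions differ for small samples; the `"inclusive"` method used here matches the worked example, but your textbook may use a slightly different rule:

```python
from statistics import quantiles

data = [10, 15, 20, 25, 30]

data_range = max(data) - min(data)  # 30 - 10 = 20
# "inclusive" quartiles: Q1 = 15, Q2 = 20, Q3 = 25 for this data
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1                       # 25 - 15 = 10
print(data_range, q1, q3, iqr)
```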
Variance measures the average squared deviation from the mean. The formula for sample variance is:
s² = Σ(xᵢ − x̄)² / (n − 1)
You divide by n − 1 (not n) when working with a sample, which corrects for the tendency of a sample to underestimate the population's variability. This is called Bessel's correction.
Standard deviation is the square root of the variance:
s = √( Σ(xᵢ − x̄)² / (n − 1) )
Standard deviation is more interpretable than variance because it's in the same units as the original data. A small standard deviation means data points cluster tightly around the mean; a large one means they're more spread out.
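A short sketch computing sample variance and standard deviation by hand, then cross-checking against the `statistics` module (both apply Bessel's n − 1 correction):

```python
import math
from statistics import stdev, variance

data = [10, 15, 20, 25, 30]
n = len(data)
xbar = sum(data) / n                               # mean = 20

s2 = sum((x - xbar) ** 2 for x in data) / (n - 1)  # divide by n-1, not n
s = math.sqrt(s2)                                  # same units as the data

print(s2, s)
```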
Distribution Shape
- Skewness measures asymmetry.
- Positive skew (right skew): the right tail is longer, and most data clusters to the left.
- Negative skew (left skew): the left tail is longer, and most data clusters to the right.
- A skewness value near zero suggests a roughly symmetric distribution.
- Kurtosis measures how peaked or flat a distribution is compared to a normal distribution.
- Leptokurtic (positive excess kurtosis): sharper peak, heavier tails.
- Platykurtic (negative excess kurtosis): flatter peak, lighter tails.
- A normal distribution has a kurtosis of 3 (or an excess kurtosis of 0, depending on which formula you're using). Watch for which convention your class uses.
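As a sketch, skewness and excess kurtosis can be computed from the simple moment-based population formulas below; libraries such as scipy offer bias-corrected variants that differ slightly for small samples:

```python
import math

def skewness(data):
    # Third standardized moment (population formula)
    n = len(data)
    m = sum(data) / n
    s = math.sqrt(sum((x - m) ** 2 for x in data) / n)
    return sum((x - m) ** 3 for x in data) / (n * s ** 3)

def excess_kurtosis(data):
    # Fourth standardized moment minus 3, so a normal gives ~0
    n = len(data)
    m = sum(data) / n
    s = math.sqrt(sum((x - m) ** 2 for x in data) / n)
    return sum((x - m) ** 4 for x in data) / (n * s ** 4) - 3

right_skewed = [1, 2, 2, 3, 3, 3, 10]  # long right tail
print(skewness(right_skewed))           # positive, as expected
```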
Data Visualization
Univariate Data
- Histograms show the distribution of a continuous variable. Data is grouped into bins (intervals), and the height of each bar represents the frequency or relative frequency in that bin. Use histograms to quickly assess shape, center, and spread.
- Bar charts compare frequencies or values across categories of a discrete variable. Unlike histograms, the bars don't touch because the categories aren't continuous.
- Pie charts display the proportion of data in each category as slices of a circle. They're best for showing composition of a whole but become hard to read with many categories.
- Stem-and-leaf plots split each data value into a "stem" (leading digits) and a "leaf" (trailing digit), giving a compact view of the distribution while preserving individual data values. You can read off the median, mode, and shape directly from the plot.
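A stem-and-leaf plot is simple enough to build by hand. This minimal sketch handles two-digit integers, using the tens digit as the stem and the ones digit as the leaf:

```python
from collections import defaultdict

def stem_and_leaf(data):
    # Group each value by its tens digit (stem); leaves are ones digits
    stems = defaultdict(list)
    for x in sorted(data):
        stems[x // 10].append(x % 10)
    lines = []
    for stem in sorted(stems):
        leaves = " ".join(str(leaf) for leaf in stems[stem])
        lines.append(f"{stem} | {leaves}")
    return "\n".join(lines)

scores = [61, 68, 72, 72, 75, 79, 83, 85, 91]  # hypothetical scores
print(stem_and_leaf(scores))
# 6 | 1 8
# 7 | 2 2 5 9
# 8 | 3 5
# 9 | 1
```

Reading the plot: the mode (72) appears as the repeated leaf in the 7-stem row, and the shape of the distribution is visible from the row lengths.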
Bivariate Data
- Scatter plots show the relationship between two continuous variables. Each point represents one observation, with one variable on the x-axis and the other on the y-axis. Look for direction (positive/negative), form (linear/curved), and strength (tight cluster vs. wide spread).
- Line graphs connect data points to emphasize trends over time or another ordered variable. They're the go-to choice for time series data.
Data Analysis and Comparison
Box Plots
Box plots (box-and-whisker plots) give a five-number summary at a glance: minimum, Q1, median, Q3, and maximum.
- The box spans from Q1 to Q3, representing the IQR.
- The line inside the box marks the median.
- The whiskers extend to the farthest data points that fall within 1.5 × IQR of the box edges.
- Any points beyond the whiskers are plotted individually as potential outliers.
Side-by-side box plots are powerful for comparing distributions. For example, placing box plots of test scores from three different classes next to each other lets you quickly compare their centers, spreads, and outliers.

Quantile-Quantile Plots
Q-Q plots compare two distributions by plotting their quantiles against each other. If the two distributions are similar, the points fall along a roughly straight line. Deviations from that line reveal differences in shape, spread, or skewness. The most common use in this course is checking whether a dataset follows a normal distribution by plotting its quantiles against theoretical normal quantiles.
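A numeric version of this normality check can be done with the standard library alone: pair each sorted data point with the theoretical normal quantile at the same plotting position, then eyeball whether the pairs are roughly equal (a plotting library would draw the actual picture). The dataset here is hypothetical:

```python
from statistics import NormalDist, mean, stdev

data = [4.2, 4.8, 5.0, 5.1, 5.3, 5.5, 5.9, 6.4]  # hypothetical sample
n = len(data)
dist = NormalDist(mean(data), stdev(data))

# Theoretical normal quantile at plotting position (i + 0.5) / n
pairs = [(dist.inv_cdf((i + 0.5) / n), x)
         for i, x in enumerate(sorted(data))]

for theo, obs in pairs:
    print(f"{theo:6.2f}  {obs:6.2f}")  # near-equal columns -> roughly normal
```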
Cumulative Frequency Plots
Cumulative frequency plots (ogives) show the running total of frequencies as you move through the data values. The y-axis represents the cumulative frequency (or cumulative relative frequency), and the x-axis represents the data values.
These are especially useful for finding percentiles. For example, to find the percentage of students who scored below 75 on an exam, you'd locate 75 on the x-axis and read the cumulative percentage off the y-axis.
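The same percentile lookup can be done directly from the raw data, as in this sketch with hypothetical exam scores:

```python
# What fraction of students scored below 75? (hypothetical scores)
scores = [62, 68, 71, 74, 75, 78, 81, 85, 90, 94]

below = sum(1 for s in scores if s < 75)
fraction_below = below / len(scores)
print(f"{fraction_below:.0%} scored below 75")  # 40% scored below 75
```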
Data-Driven Conclusions
Identifying Patterns and Relationships
Start by examining summary statistics and visualizations together. Look for:
- Consistent increases, decreases, or stability over time or across categories
- Clusters, gaps, or outliers in scatter plots
- Whether differences between groups appear meaningful
The correlation coefficient (r) quantifies the strength and direction of a linear relationship between two variables.
- r ranges from −1 to +1.
- Values close to +1 indicate a strong positive linear relationship (both variables increase together).
- Values close to −1 indicate a strong negative linear relationship (one increases as the other decreases).
- Values near 0 suggest no linear relationship. Note that r = 0 doesn't mean no relationship at all; there could still be a curved or nonlinear pattern.
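Pearson's r can be computed straight from its definition, as sketched below; `statistics.correlation` (Python 3.10+) gives the same number. The study-time data is hypothetical:

```python
import math

def pearson_r(xs, ys):
    # Covariance divided by the product of the deviation norms
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

hours = [1, 2, 3, 4, 5]
scores = [52, 58, 65, 71, 79]    # hypothetical study-time data
print(pearson_r(hours, scores))  # close to +1: strong positive, linear
```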
Limitations and Considerations
- Sample size matters. A small sample may not represent the larger population well, and statistics calculated from small samples are less reliable.
- Correlation ≠ causation. A strong correlation between two variables doesn't prove one causes the other. The classic example: ice cream sales and shark attacks are positively correlated, but neither causes the other. Both increase during hot weather, which is a confounding variable.
- Use careful language. Say "the data suggests" or "there is evidence to support" rather than making absolute claims. Acknowledging uncertainty is a sign of good statistical reasoning, not weakness.
- Practical vs. statistical significance. A result can be statistically significant (unlikely due to chance) but practically meaningless. If two groups' average test scores differ by 0.5 points on a 100-point exam, that difference probably doesn't matter in the real world, even if a statistical test flags it as significant.