Descriptive statistics form the foundation of data analysis. They give you tools to summarize datasets, spot patterns, and communicate findings clearly. From calculating means and standard deviations to building histograms and scatter plots, these techniques show up constantly in mathematics, research, and applied fields.
This topic covers measures of central tendency, dispersion, data distribution, visualization, and correlation.
Measures of central tendency
A measure of central tendency identifies the "typical" or central value in a dataset. These measures let you boil down a large collection of numbers into a single representative value, which makes it much easier to compare datasets and draw conclusions.
Mean vs median vs mode
Mean is the arithmetic average: sum all values and divide by the number of observations.
The mean is sensitive to outliers. For example, if five people earn $40K, $45K, $50K, $55K, and $500K, the mean salary is $138K, which doesn't represent the typical earner at all.
Median is the middle value when data is sorted in order. For the salary example above, the median is $50K, which better reflects the group. The median is the preferred measure for skewed distributions precisely because it resists the pull of extreme values.
Mode is the most frequently occurring value. It's the only measure of central tendency that works for categorical data (like "favorite color"). A dataset can be bimodal (two modes), multimodal (more than two), or have no mode if all values appear equally often.
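All three measures are one call away in Python's standard library. A quick sketch using the salary figures from above plus a hypothetical list of favorite colors:

```python
# Central tendency with Python's standard library.
import statistics

salaries = [40_000, 45_000, 50_000, 55_000, 500_000]

print(statistics.mean(salaries))    # 138000 — pulled up by the outlier
print(statistics.median(salaries))  # 50000 — robust to the outlier

colors = ["blue", "red", "blue", "green"]
print(statistics.mode(colors))      # 'blue' — mode works on categorical data
```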
Weighted averages
Sometimes not all data points are equally important. A weighted average assigns different levels of importance (weights) to each value.
A common example: your course grade might weight exams at 60% and homework at 40%. If you score 80 on exams and 95 on homework, your weighted average is 0.6 × 80 + 0.4 × 95 = 86, not the simple average of 87.5. Weighted averages also appear in portfolio returns and survey analysis.
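The grade calculation above can be sketched in a few lines (the scores and weights are the ones from the example):

```python
# Weighted average: exams weighted 60%, homework 40%.
scores = [80, 95]
weights = [0.6, 0.4]

weighted_avg = sum(s * w for s, w in zip(scores, weights)) / sum(weights)
print(weighted_avg)  # 86.0
```

Dividing by `sum(weights)` means the weights don't have to add up to 1 — the formula normalizes them for you.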
Geometric and harmonic means
These are specialized averages for specific types of data.
The geometric mean is the nth root of the product of n numbers:

GM = (x₁ · x₂ · ⋯ · xₙ)^(1/n)
Use it when data involves multiplicative growth. For instance, if an investment grows by 10%, then 20%, then -5% over three years, the geometric mean of the growth factors (1.10, 1.20, 0.95) gives the true average annual growth rate.
The harmonic mean is the reciprocal of the arithmetic mean of reciprocals:

HM = n / (1/x₁ + 1/x₂ + ⋯ + 1/xₙ)
It's the right choice for averaging rates. If you drive 60 km/h for one leg of a trip and 40 km/h for the return, the harmonic mean (48 km/h) gives the correct average speed, not the arithmetic mean of 50 km/h.
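Both examples check out numerically with the standard library (requires Python 3.8+ for `geometric_mean`):

```python
# Geometric and harmonic means via the statistics module.
import statistics

# Growth factors for +10%, +20%, -5% annual returns
factors = [1.10, 1.20, 0.95]
g = statistics.geometric_mean(factors)
print(round(g, 4))  # ≈ 1.0784, i.e. about 7.8% average annual growth

# Average speed over equal distances at 60 km/h and 40 km/h
h = statistics.harmonic_mean([60, 40])
print(h)  # 48.0, not the arithmetic mean of 50
```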
Measures of dispersion
Knowing the center of your data isn't enough. Two datasets can have the same mean but look completely different. Measures of dispersion tell you how spread out the data points are.
Range and interquartile range
Range is simply the maximum value minus the minimum value. It's fast to compute but easily distorted by a single outlier.
Interquartile range (IQR) captures the spread of the middle 50% of the data:

IQR = Q3 − Q1
Because it ignores the top and bottom 25%, the IQR is much more robust against outliers. It's also the basis for the "box" in a box plot.
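The contrast is easy to see on a small illustrative dataset with one outlier:

```python
# Range vs IQR on data containing one outlier.
import statistics

data = [2, 4, 4, 5, 6, 7, 8, 9, 100]  # 100 is an outlier

data_range = max(data) - min(data)
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print(data_range)  # 98 — dominated entirely by the outlier
print(iqr)         # 4.0 — middle-50% spread, barely affected by 100
```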
Variance and standard deviation
Variance measures the average squared deviation from the mean. For a sample:

s² = Σ(xᵢ − x̄)² / (n − 1)
You divide by n − 1 (not n) for sample variance to correct for bias in estimation. For a full population, you divide by n.
Standard deviation is the square root of variance:

s = √(Σ(xᵢ − x̄)² / (n − 1))
The key advantage of standard deviation over variance is that it's in the same units as your original data. If your data is in meters, the standard deviation is also in meters, while the variance would be in square meters.
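The standard library keeps the sample and population versions separate, which makes the n − 1 vs n distinction explicit. A short sketch with illustrative numbers:

```python
# Sample vs population variance and standard deviation.
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # mean is 5

sv = statistics.variance(data)   # sample variance, divides by n - 1
pv = statistics.pvariance(data)  # population variance, divides by n
sd = statistics.stdev(data)      # sample standard deviation, same units as the data

print(sv)  # 32 / 7 ≈ 4.571
print(pv)  # 32 / 8 = 4
print(sd)  # √(32/7) ≈ 2.138
```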
Coefficient of variation
The coefficient of variation (CV) expresses the standard deviation as a percentage of the mean:

CV = (s / x̄) × 100%
This lets you compare variability between datasets that have different units or vastly different scales. For example, comparing the variability of heights (measured in cm) to weights (measured in kg) only makes sense using CV, not raw standard deviations.
Data distribution
The distribution of a dataset describes the overall pattern of how values are spread. Understanding the shape of your distribution matters because it determines which statistical methods are appropriate.
Skewness and kurtosis
Skewness measures asymmetry in a distribution.
- Positive skew (right skew): the right tail is longer. Most values cluster on the left. Income distributions typically look like this.
- Negative skew (left skew): the left tail is longer. Most values cluster on the right.
- A perfectly symmetric distribution has a skewness of zero.
Kurtosis measures how heavy the tails are relative to a normal distribution.
- A normal distribution has a kurtosis of 3 (called mesokurtic).
- Leptokurtic distributions (kurtosis > 3) have heavier tails and a sharper peak, meaning more extreme values than you'd expect.
- Platykurtic distributions (kurtosis < 3) have lighter tails and a flatter peak.
Some textbooks report "excess kurtosis," which subtracts 3 so that a normal distribution has excess kurtosis of 0.
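The moment-based definitions can be computed directly. This is a minimal sketch of the population (uncorrected) formulas on illustrative data; statistical packages often apply sample-size corrections or report excess kurtosis, so their numbers may differ slightly:

```python
# Population skewness and kurtosis from their moment definitions.
def skewness(xs):
    n = len(xs)
    m = sum(xs) / n
    s = (sum((x - m) ** 2 for x in xs) / n) ** 0.5  # population std dev
    return sum(((x - m) / s) ** 3 for x in xs) / n

def kurtosis(xs):
    n = len(xs)
    m = sum(xs) / n
    s = (sum((x - m) ** 2 for x in xs) / n) ** 0.5
    return sum(((x - m) / s) ** 4 for x in xs) / n  # 3.0 for a normal distribution

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]
print(skewness(symmetric))     # ~0 — symmetric data
print(skewness(right_skewed))  # positive — long right tail
print(kurtosis(symmetric))     # < 3 — platykurtic (lighter tails than normal)
```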
Normal distribution
The normal (or Gaussian) distribution is the classic bell curve, symmetric around its mean. Its probability density function is:

f(x) = (1 / (σ√(2π))) · e^(−(x − μ)² / (2σ²))
The empirical rule (68-95-99.7 rule) is worth memorizing:
- About 68% of data falls within 1σ of the mean
- About 95% within 2σ
- About 99.7% within 3σ
The normal distribution shows up constantly in nature (heights, measurement errors) and forms the basis for many statistical tests.
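The three percentages aren't arbitrary: they come from the normal CDF, which can be evaluated exactly with the error function from the standard library:

```python
# The 68-95-99.7 rule, derived from the normal CDF via math.erf.
import math

def within_k_sigma(k):
    """P(|X - mu| < k * sigma) for any normal distribution."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"{k} sigma: {within_k_sigma(k):.4%}")
# 1 sigma: 68.2689%
# 2 sigma: 95.4500%
# 3 sigma: 99.7300%
```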

Probability density function
A probability density function (PDF) describes the relative likelihood of a continuous random variable taking on a given value. The PDF itself doesn't give you probabilities directly. Instead, you integrate the PDF over an interval to get the probability of falling within that interval.
Two key properties of any valid PDF:
- The function is never negative: f(x) ≥ 0 for all x
- The total area under the curve equals 1
Every continuous distribution (normal, exponential, uniform, etc.) has its own PDF. The normal distribution's PDF above is just one example.
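You can verify both the "integrate to get probability" idea and the total-area property numerically. This sketch uses a simple midpoint rule on the standard normal PDF:

```python
# Integrating the standard normal PDF recovers probabilities.
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    return math.exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

def integrate(f, a, b, steps=100_000):
    """Midpoint-rule numerical integration of f over [a, b]."""
    h = (b - a) / steps
    return sum(f(a + (i + 0.5) * h) for i in range(steps)) * h

print(round(integrate(normal_pdf, -1, 1), 4))   # ≈ 0.6827, matching the 68% rule
print(round(integrate(normal_pdf, -10, 10), 4)) # ≈ 1.0 — total area under the curve
```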
Data visualization
Graphs and charts turn raw numbers into something you can see and interpret quickly. Choosing the right visualization depends on what type of data you have and what relationship you're trying to show.
Histograms and bar charts
Histograms display the distribution of continuous data. The x-axis is divided into bins (ranges), and the height of each bar shows how many data points fall in that bin. Bars are adjacent with no gaps, reflecting the continuous nature of the data. Histograms are your go-to for visualizing shape, center, and spread.
Bar charts represent categorical data. Each bar corresponds to a category, and the height shows the count or value. Unlike histograms, bar charts have gaps between bars because the categories are distinct.
Box plots and whisker diagrams
A box plot packs five key statistics into one compact graphic:
- The median (line inside the box)
- Q1 and Q3 (the edges of the box, so the box itself represents the IQR)
- Whiskers extending to the smallest and largest non-outlier values (typically within 1.5 × IQR of Q1 and Q3)
- Outliers plotted as individual dots beyond the whiskers
Box plots are especially useful for comparing distributions across multiple groups side by side.
Scatter plots
Scatter plots show the relationship between two continuous variables. Each point represents one observation plotted at its coordinates. They're great for spotting:
- Linear or non-linear trends
- Clusters of data
- Outliers
- The strength and direction of a relationship
You can add extra dimensions by varying point size or color to represent additional variables.
Percentiles and quartiles
Percentiles and quartiles divide a dataset into equal parts, helping you understand where a particular value stands relative to the rest of the data.
Calculation methods
Percentiles divide data into 100 equal parts. The pth percentile is the value below which p% of the data falls. The median is the 50th percentile.
Quartiles divide data into four equal parts:
- Q1 = 25th percentile
- Q2 = 50th percentile (median)
- Q3 = 75th percentile
There are multiple methods for computing percentiles, including linear interpolation and the nearest-rank method. For small datasets, different methods can give slightly different results, so be aware of which method your software or textbook uses.
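Python's standard library exposes two of these conventions directly, which makes the discrepancy easy to demonstrate on a small dataset:

```python
# Two quartile conventions can disagree on small samples.
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

print(statistics.quantiles(data, n=4, method="exclusive"))  # [2.5, 5.0, 7.5]
print(statistics.quantiles(data, n=4, method="inclusive"))  # [3.0, 5.0, 7.0]
```

Both answers are "correct" under their respective definitions; the median agrees, but Q1 and Q3 differ, which is exactly the caveat noted above.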
Interpretation of percentiles
If you score at the 90th percentile on a test, 90% of test-takers scored below you. Percentiles give individual data points context within the full distribution.
They're widely used in standardized testing, income analysis, and growth charts (pediatricians track children's height and weight by percentile). Extreme percentiles (very high or very low) can also flag potential outliers.
Applications in statistics
- Box plots are built directly from quartiles
- The IQR (Q3 − Q1) is the standard measure of spread for box plots
- Outlier detection: values below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR are typically flagged as outliers
- Percentiles appear in non-parametric tests like the Mann-Whitney U test
- Fields like education and finance rely heavily on percentile-based reporting
Correlation and covariance
These measures quantify the relationship between two variables. They tell you whether variables tend to move together, move in opposite directions, or show no consistent pattern.

Pearson correlation coefficient
The Pearson correlation coefficient measures the strength and direction of a linear relationship between two continuous variables:

r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √(Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)²)
- r = +1: perfect positive linear relationship
- r = −1: perfect negative linear relationship
- r = 0: no linear relationship (but there could still be a non-linear one)
Two important caveats: Pearson's r assumes the variables are roughly normally distributed, and it's sensitive to outliers. A single extreme point can dramatically change the value of r.
Spearman rank correlation
Spearman's rank correlation is a non-parametric alternative. Instead of using the raw data values, it ranks them from smallest to largest and then computes the correlation on the ranks.
This makes it:
- Resistant to outliers
- Able to detect monotonic relationships (consistently increasing or decreasing) even if they're not linear
- Suitable for ordinal data (like survey responses on a 1-5 scale)
It also ranges from −1 to +1 and is interpreted similarly to Pearson's r.
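The difference between the two coefficients shows up clearly on a monotonic but non-linear relationship. This is a minimal sketch assuming tie-free data; real implementations (e.g. SciPy's) handle ties via average ranks:

```python
# Pearson vs Spearman on a monotonic but non-linear relationship.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def ranks(xs):
    """Rank values 1..n (assumes no ties)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(xs, ys):
    return pearson(ranks(xs), ranks(ys))

x = [1, 2, 3, 4, 5]
y = [v**3 for v in x]  # monotonic, but not linear

print(round(pearson(x, y), 3))  # < 1 — the relationship is not perfectly linear
print(spearman(x, y))           # 1.0 — the ranks match exactly
```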
Covariance matrix
Covariance measures how two variables change together:

cov(X, Y) = Σ(xᵢ − x̄)(yᵢ − ȳ) / (n − 1)
Positive covariance means the variables tend to increase together. Negative covariance means one tends to decrease as the other increases. Unlike correlation, covariance is not standardized, so its magnitude depends on the units of the variables.
A covariance matrix extends this to multiple variables at once. It's a square matrix where:
- The diagonal entries are the variances of each variable
- The off-diagonal entries are the pairwise covariances
Covariance matrices are central to multivariate analysis, principal component analysis (PCA), and portfolio optimization in finance.
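Both properties of the matrix (variances on the diagonal, symmetry off it) are easy to confirm with NumPy on randomly generated data:

```python
# A 3-variable covariance matrix with NumPy (rows = observations).
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 3))  # 100 observations of 3 variables

cov = np.cov(data, rowvar=False)  # 3x3 matrix; divides by n - 1 by default

print(cov.shape)                                            # (3, 3)
print(np.allclose(cov, cov.T))                              # True — symmetric
print(np.allclose(np.diag(cov), data.var(axis=0, ddof=1)))  # True — diagonal = variances
```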
Descriptive statistics in practice
Real-world data is messy. Before you can compute any of the statistics above, you typically need to clean and prepare your data. And once you have results, you need to present them clearly.
Data cleaning and preparation
Raw datasets almost always contain problems: missing values, duplicates, inconsistent formatting, or errors. Common steps include:
- Identify missing values and decide how to handle them (delete the row, fill in with the mean/median, or use interpolation)
- Remove duplicates that could skew your results
- Standardize or normalize variables when you need to compare across different scales
- Encode categorical variables (e.g., converting "Yes/No" to 1/0) for numerical analysis
Skipping data cleaning leads to unreliable results, so treat it as a required first step.
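A minimal cleaning pass over a small hypothetical survey table shows three of the steps above with pandas:

```python
# Basic cleaning: dedupe, impute missing values, encode categoricals.
import pandas as pd

df = pd.DataFrame({
    "age": [25, None, 31, 31, 47],
    "subscribed": ["Yes", "No", "Yes", "Yes", "No"],
})

df = df.drop_duplicates()                          # remove duplicate rows
df["age"] = df["age"].fillna(df["age"].median())   # fill missing age with the median
df["subscribed"] = (df["subscribed"] == "Yes").astype(int)  # encode Yes/No as 1/0

print(df)
```

Whether to fill with the mean, the median, or drop the row entirely depends on the data; the median is used here because it's robust to outliers.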
Outlier detection
Outliers are data points that fall far from the rest of the data. Common detection methods:
- Z-score method: flag points more than 2 or 3 standard deviations from the mean
- IQR method: flag points below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR
- Visual methods: box plots and scatter plots make outliers easy to spot
Once you find outliers, you need to decide what to do with them. Sometimes they're genuine extreme values worth keeping. Other times they're data entry errors that should be corrected or removed. The decision matters because it affects your results.
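On clean-cut cases the two numeric methods agree, as this sketch with illustrative data shows:

```python
# Z-score and IQR outlier flags applied to the same dataset.
import statistics

data = [12, 13, 14, 15, 15, 16, 17, 18, 95]

# Z-score method: flag |z| > 2
mean, sd = statistics.mean(data), statistics.stdev(data)
z_outliers = [x for x in data if abs(x - mean) / sd > 2]

# IQR method: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
iqr_outliers = [x for x in data if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]

print(z_outliers)    # [95]
print(iqr_outliers)  # [95]
```

On skewed data the two methods can disagree, since the z-score relies on the mean and standard deviation, which the outliers themselves distort.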
Summarizing large datasets
When working with large datasets, you need efficient ways to get an overview:
- Compute key statistics (mean, median, standard deviation) for each variable
- Use aggregation to summarize by groups or categories (e.g., average sales by region)
- Build summary tables that show descriptive statistics side by side
- For high-dimensional data, dimensionality reduction techniques like PCA can compress many variables into a few meaningful components
- Combine numerical summaries with visualizations for a complete picture
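Group-wise aggregation is typically a one-liner. A sketch with hypothetical sales data:

```python
# Per-group summary statistics with pandas groupby + agg.
import pandas as pd

sales = pd.DataFrame({
    "region": ["North", "North", "South", "South", "South"],
    "amount": [100, 150, 200, 250, 300],
})

summary = sales.groupby("region")["amount"].agg(["mean", "median", "std", "count"])
print(summary)
```

The result is a summary table with one row per region and one column per statistic, ready to present side by side.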
Statistical software tools
Modern statistics relies on software. The right tool depends on your dataset size, the complexity of your analysis, and your comfort with programming.
Excel for basic analysis
Excel is the most accessible option. It handles small to medium datasets well and includes built-in functions for mean, median, standard deviation, and more. Pivot tables and basic charts cover many common needs. The Analysis ToolPak add-in extends its capabilities with tools for regression, histograms, and hypothesis testing. The main limitation is that Excel struggles with very large datasets and more advanced techniques.
R and Python for advanced statistics
Both are free, open-source, and widely used in research and industry.
- R was built specifically for statistics. Key packages include ggplot2 (visualization), dplyr (data manipulation), and the tidyverse ecosystem.
- Python is a general-purpose language with powerful data libraries: NumPy (numerical computing), Pandas (data manipulation), SciPy (statistical functions), and Matplotlib/Seaborn (visualization).
Both require programming skills but offer far more flexibility, reproducibility, and scalability than spreadsheet tools.
Visualization software options
- Tableau and Power BI provide interactive, drag-and-drop visualization for business settings
- Matplotlib and Seaborn (Python) offer highly customizable statistical graphics
- ggplot2 (R) produces publication-quality plots with a consistent grammar of graphics
- D3.js creates interactive, web-based visualizations
Your choice depends on your audience, the complexity of your data, and whether you need static reports or interactive dashboards.