1.6 Describing the Distribution of a Quantitative Variable

8 min read • December 28, 2022

Jed Quiaoit

Lusine Ghazaryan

Once we have organized a dataset into a display of our choice, the next task is to describe the data. In other words, we should say what we see. There are three things to look for when trying to find trends and patterns: shape, center, and spread. ❗

Shape

To describe the shape of the display, check the following:

  • Symmetry. If you fold the histogram in half, do you have roughly equal amounts of data on each side? If yes, then your data are symmetric! Think of the shape of a butterfly and what happens when you "fold" it in half. 🦋

A histogram is symmetric when the data are distributed evenly around a central value. This means that if you were to fold the histogram in half at its center, the two halves would be approximately mirror images of each other.

A symmetric distribution is one in which the frequencies at equal distances on either side of the central value (such as the median or mean) are roughly equal. For example, a bell-shaped curve is a symmetric distribution because its left and right sides fall away from the peak in the same way.

To determine if a histogram is symmetric, you can visually inspect its shape and see if it appears to be roughly symmetrical. You can also compare the mean and median: in a roughly symmetric distribution they are close to each other, although similar values alone do not guarantee symmetry.

https://firebasestorage.googleapis.com/v0/b/fiveable-92889.appspot.com/o/images%2F-wMoG1j7cMvzI.png?alt=media&token=702d7e5a-b965-4f02-9817-efc0de9c51fa

Source: Padhai Time

It's worth noting that a histogram does not have to be perfectly symmetrical to be considered symmetric. Some degree of skewness or asymmetry is often present in real-world data, and a histogram may still be considered symmetric if the degree of asymmetry is relatively small.

  • Skewness. A distribution can be right-skewed or left-skewed: unusually low or high values pull the distribution toward their side and stretch it out. A skewed distribution has one tail longer than the other, whereas a symmetric distribution has tails of roughly equal length. If the longer tail is on the left, the distribution is left-skewed; if the longer tail is on the right, it is right-skewed. 🦝

Skewness is also described as positive or negative. Positive skewness occurs when the distribution is skewed to the right, with a long tail on the right side and a shorter tail on the left side. This means that the majority of the values in the distribution are clustered on the left side, with a few values on the right side that are much larger.

On the other hand, negative skewness occurs when the distribution is skewed to the left, with a long tail on the left side and a shorter tail on the right side. This means that the majority of the values in the distribution are clustered on the right side, with a few values on the left side that are much smaller.

https://firebasestorage.googleapis.com/v0/b/fiveable-92889.appspot.com/o/images%2F-HwUDbBMnHVES.png?alt=media&token=1d380d41-d96c-4a53-b4ee-befb7f4109e5

Source: ResearchGate
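
To make the idea concrete, here is a minimal Python sketch that guesses the direction of skew by comparing the mean with the median. It relies on a common rule of thumb (a long right tail tends to pull the mean above the median, a long left tail tends to pull it below), which usually holds but is not guaranteed. The function name `skew_hint`, the `tolerance` cutoff, and the three small datasets are all made up for illustration; a histogram is still the best way to judge shape.

```python
from statistics import mean, median

def skew_hint(values, tolerance=0.05):
    """Rough shape hint from comparing the mean with the median.

    tolerance is a hypothetical cutoff: if the gap between mean and
    median is a smaller fraction of the data's range than this, we call
    the data roughly symmetric.
    """
    data_range = max(values) - min(values)
    if data_range == 0:
        return "roughly symmetric"
    gap = (mean(values) - median(values)) / data_range
    if gap > tolerance:
        return "right-skewed"   # mean pulled above the median
    if gap < -tolerance:
        return "left-skewed"    # mean pulled below the median
    return "roughly symmetric"

# Made-up example datasets
symmetric = [1, 2, 3, 4, 5, 6, 7]
right_skewed = [1, 2, 2, 3, 3, 4, 20]
left_skewed = [1, 17, 18, 18, 19, 19, 20]

for name, values in [("symmetric", symmetric),
                     ("right_skewed", right_skewed),
                     ("left_skewed", left_skewed)]:
    print(name, "->", skew_hint(values))
```

In the right-skewed example the single large value (20) lifts the mean to 5 while the median stays at 3, which is exactly the "pull" described above.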

  • Peaks (modes). A mode represents the most frequent value or values in a distribution. In a histogram, stemplot, dotplot, or other graphical representation of data (except for boxplots), the mode is often indicated by the peak or peaks in the distribution. ⛰️

A distribution can have one mode, in which case it is called a unimodal distribution, or it can have two or more modes, in which case it is called a multimodal distribution. A bimodal distribution, for example, is a distribution with two modes.

It's worth noting that a distribution can have a mode even if it is not symmetrical or has skewed data. For example, a positively skewed distribution (with a long tail on the right side) can still have a mode if there is a value or values that occur more frequently than any other values in the distribution.

Uniform distributions, on the other hand, do not have a mode because all of the values in the distribution occur with roughly the same frequency. In a uniform distribution, there is no single value that stands out as being more common than any other value.

https://firebasestorage.googleapis.com/v0/b/fiveable-92889.appspot.com/o/images%2F-pb752WOVxfOR.jpg?alt=media&token=7a546a5e-f15c-4fb8-8fd1-f760c390dfa9

Source: Towards Data Science

It's important to be aware of the number and location of modes in a distribution because they can provide valuable insights into the underlying data and how it is distributed. For example, the presence of two modes in a distribution may indicate the presence of two distinct groups or subpopulations within the data.
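
For a small dataset of whole-number values, the modes can be found by counting. The sketch below uses made-up test scores and Python's standard `statistics.multimode` (available in Python 3.8 and later), and prints a rough text dotplot so the two peaks are visible. Keep in mind that for a histogram of continuous data, peaks depend on the bins chosen, so this exact-value count is only an approximation of "peaks."

```python
from collections import Counter
from statistics import multimode

# Made-up test scores with two equally common values
scores = [70, 72, 72, 72, 75, 80, 88, 90, 90, 90, 95]

modes = multimode(scores)   # every value tied for the highest count
counts = Counter(scores)    # value -> how many times it occurs

# Rough text dotplot: one star per observation
for value in sorted(counts):
    print(f"{value:>3} | {'*' * counts[value]}")

print("modes:", modes)
print("number of modes:", len(modes))
```

Here both 72 and 90 occur three times, so the data are bimodal, which might hint at two groups of students.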

  • Outlier. Beware of outliers. Outliers are values in a dataset that are significantly different from the majority of the other values in the dataset. They can be either extremely high or extremely low, and they can have a significant impact on statistical measures such as the mean, standard deviation, and range of the data, while the median is much less affected. 🗺️

It's important to be aware of outliers in a dataset because they can skew the results of statistical analyses and cause them to be less representative of the underlying data. For this reason, it is often useful to analyze data both with and without outliers to see how they affect the results.

There are a few different ways to identify outliers in a dataset. One way is to use graphical methods such as boxplots, which can help you visualize the distribution of the data and identify any values that are significantly different from the rest of the data. You can also use statistical measures such as the mean and standard deviation to identify outliers.

https://firebasestorage.googleapis.com/v0/b/fiveable-92889.appspot.com/o/images%2F-5aBCRd37Bllz.png?alt=media&token=a3777a9d-9ec2-427c-8a8b-edec98e2a208

Source: Minitab

It's important to remember that outliers are not necessarily bad data, and they should not be automatically excluded from analysis. However, it is important to consider whether the outliers are representative of the underlying data or whether they may be the result of errors or other factors that need to be taken into account when interpreting the results of the analysis.
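
One widely used convention for flagging outliers, and the one boxplots typically use, is the 1.5 × IQR rule: a value is flagged if it lies more than 1.5 IQRs below Q1 or above Q3. The sketch below is one way to apply it, with a made-up dataset and a hypothetical helper name `iqr_outliers`. Quartiles can be computed by several slightly different methods, so the fences may differ a little from hand calculations or from a calculator.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values outside the fences Q1 - k*IQR and Q3 + k*IQR."""
    # method="inclusive" is one of several quartile conventions
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - k * iqr
    high_fence = q3 + k * iqr
    return [v for v in values if v < low_fence or v > high_fence]

# Made-up data with one unusually high value
data = [12, 14, 15, 15, 16, 17, 18, 19, 45]

flagged = iqr_outliers(data)
print("flagged as possible outliers:", flagged)
```

A flagged value is only a candidate for a closer look, not something to delete automatically.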

  • Gaps. A gap is a region of the display with no observations. Gaps help us detect multiple modes and can signal that the data come from different groups or sources. 🌉

https://firebasestorage.googleapis.com/v0/b/fiveable-92889.appspot.com/o/images%2F-4wvB8VHcaSLd.png?alt=media&token=7de14d5f-7384-409f-ab70-9c03f5969822

Source: Medium

Center

There are three commonly used measures of the "center" of a distribution: mean, median, and mode. 🎯

  • The mean, also known as the average, is calculated by summing all of the values in a dataset and dividing by the number of values. It is often considered the best measure of central tendency for symmetric distributions because it takes into account all of the values in the dataset and reflects the overall trend in the data.

  • The median is the middle value in a dataset when the values are ordered from least to greatest. It is often considered a better measure of central tendency for skewed distributions because it is resistant to the influence of outliers (values that are significantly different from the majority of the other values in the dataset).

  • The mode is the value that occurs most frequently in a dataset. It is a useful measure of central tendency when there are a few values that occur much more frequently than the others.

In a symmetric distribution, the mean, median, and mode are often close to each other or even equal, depending on the exact shape of the distribution. However, in skewed distributions or datasets with outliers, the mean, median, and mode can be significantly different from each other. It's important to consider which measure of central tendency is most appropriate for a given dataset, taking into account the symmetry or skewness of the data as well as the presence of outliers.
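
The three measures of center are easy to compute with Python's standard `statistics` module. This short sketch uses made-up incomes (in thousands of dollars) and then adds one extreme value to show how differently the mean and median react.

```python
from statistics import mean, median, mode

# Made-up incomes, in thousands of dollars
incomes = [30, 32, 35, 35, 38, 40, 42]
with_outlier = incomes + [250]   # add one extreme value

for label, values in [("original", incomes), ("with outlier", with_outlier)]:
    print(label)
    print("  mean:  ", mean(values))
    print("  median:", median(values))
    print("  mode:  ", mode(values))
```

Adding the single value 250 moves the mean from 36 to 62.75, while the median only moves from 35 to 36.5 and the mode stays at 35. That is what "resistant" means for the median.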

Spread

The center is a useful summary, but it is incomplete unless we report it together with the spread. There are several measures that can be used to describe the spread or dispersion of a dataset, including the range, standard deviation, and interquartile range (IQR). 🕸️

  • The range is calculated by subtracting the minimum value in a dataset from the maximum value. While it can be a useful measure in some cases, it has the disadvantage of not taking into account the values of all of the data points, only the maximum and minimum values. As a result, it may not accurately reflect the true variability in the data.

  • The standard deviation measures the dispersion of a dataset around the mean. It is calculated by taking the square root of the variance, which is the average of the squared differences between each value in the dataset and the mean (for a sample, the sum of squared differences is divided by n − 1 rather than n). The standard deviation is a useful measure for symmetric distributions because it takes into account all of the values in the dataset and reflects the overall pattern of the data.

  • The interquartile range (IQR) is calculated by subtracting the first quartile (Q1) from the third quartile (Q3). The first quartile is the value that divides the bottom 25% of the data from the top 75%, while the third quartile is the value that divides the bottom 75% of the data from the top 25%. The IQR is often used to describe the spread of skewed distributions or datasets with outliers because it is resistant to the influence of these values.

In general, it is a good practice to report both the center and spread of a dataset when describing its statistical properties. In symmetric distributions, it is common to report the mean with the standard deviation, while in skewed distributions, it is common to report the median with the IQR. This allows you to provide a more complete description of the distribution of the data and how it is dispersed around the center.
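
As a quick illustration, the three measures of spread can be computed like this. The dataset is made up, and the variable names are just for this sketch. Two standard deviations are shown: `pstdev` divides by n (the plain "average of squared differences"), while `stdev` divides by n − 1, the usual choice when the data are a sample. The quartile method (`"inclusive"`) is one of several conventions, so the IQR may differ slightly from a hand calculation.

```python
from statistics import pstdev, quantiles, stdev

# Made-up dataset
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Range: uses only the two most extreme values
data_range = max(data) - min(data)

# Standard deviation: uses every value
population_sd = pstdev(data)   # divides by n
sample_sd = stdev(data)        # divides by n - 1

# IQR: spread of the middle 50% of the data
q1, _, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print("range:", data_range)
print("population SD:", round(population_sd, 3))
print("sample SD:", round(sample_sd, 3))
print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
```

For roughly symmetric data you would typically report the mean with one of the standard deviations, and for skewed data the median with the IQR.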

As we go through the next couple of sections, you'll become more familiar and comfortable with describing distributions in terms of shape, center, and spread... woo-hoo! 🤩

Key Vocabulary

  • Shape

  • Center

  • Spread

  • Outliers

  • Symmetric

  • Skewed

Key Terms to Review (16)

Gaps

: Gaps refer to the spaces or intervals between data points in a dataset. They are regions of a display in which no observations fall.

Histogram

: A histogram is a graphical representation that displays the distribution or frequency of data in intervals or bins. It consists of bars where the height represents the frequency or count within each interval.

Interquartile Range (IQR)

: The interquartile range is the range of values that lie between the first quartile (Q1) and the third quartile (Q3). It measures the spread or variability of a dataset by focusing on the middle 50% of the data.

Mean

: The mean is the average of a set of numbers. It is found by adding up all the values and dividing by the total number of values.

Median

: The median is the middle value in a set of data when the values are arranged in order. It divides the data into two equal halves.

Mode

: The mode is the value or values that appear most frequently in a dataset.

Modes

: Modes refer to the values in a dataset that occur most frequently. In other words, they are the peaks or high points on a graph.

Multimodal Distribution

: A multimodal distribution refers to a dataset that has more than one peak or mode. It means that there are multiple values in the data set that occur with the highest frequency.

Outlier

: An outlier is a data point that significantly deviates from the rest of the dataset, either being unusually high or low. It is an observation that lies far away from most other observations.

Range

: The range is the difference between the largest and smallest values in a data set.

Right-Skew

: Right-skew refers to a distribution where the tail of the data points towards the right side. In other words, it means that there are more extreme values on the right side of the distribution.

Skewness

: Skewness refers to the asymmetry or lack of symmetry in a distribution. It indicates whether the data stretch out farther on one side of the center than on the other.

Spread

: Spread refers to how much variability or dispersion exists within a data set. It measures how far apart the values are from each other.

Standard Deviation

: The standard deviation measures the average amount of variation or dispersion in a set of data. It tells us how spread out the values are from the mean.

Symmetry

: Symmetry refers to a balanced distribution where the left and right sides are mirror images of each other. In statistics, it means that the data is evenly distributed around the center.

Uniform Distribution

: A uniform distribution refers to a dataset where all values have an equal probability of occurring. In other words, every value in the dataset is equally likely to be selected.


© 2024 Fiveable Inc. All rights reserved.

AP® and SAT® are trademarks registered by the College Board, which is not affiliated with, and does not endorse this website.

