1.7 Summary Statistics for a Quantitative Variable

8 min read · December 28, 2022

Lusine Ghazaryan

Jed Quiaoit

A statistic is a measure computed from a sample to help us analyze the data. A parameter, by contrast, is a measure computed from the population. In inferential statistics, we will use statistics to make inferences about parameters. For now, we'll focus on summary statistics. The mean, median, standard deviation, IQR, and range are all summary statistics for a quantitative variable.

  • The mean, median, quartiles, and percentiles measure the center and position for quantitative data.

  • The range, IQR, and standard deviation measure the variability for quantitative data.

These summary measures change when the data are converted to different units.
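As a rough illustration of the point about units, the sketch below converts a small set of heights from inches to centimeters and compares the mean and standard deviation before and after. The data and variable names are made up for illustration and are not from this guide.

```python
import statistics

# Hypothetical heights in inches (illustrative data, not from the guide).
heights_in = [60, 62, 65, 67, 71]

# Convert every value to centimeters (1 inch = 2.54 cm).
heights_cm = [h * 2.54 for h in heights_in]

mean_in = statistics.mean(heights_in)
mean_cm = statistics.mean(heights_cm)
sd_in = statistics.stdev(heights_in)
sd_cm = statistics.stdev(heights_cm)

# Both the center and the spread scale by the conversion factor.
print(mean_cm / mean_in)   # approximately 2.54
print(sd_cm / sd_in)       # approximately 2.54
```

Multiplying every value by a constant multiplies both the mean and the standard deviation by that constant, which is why the units should always be stated.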

Statistics of Center

The Mean

The mean, or average, as you learned before, is easy to calculate: we add up all the values of the variable and divide the sum by the number of values, n. The formula follows: x̄ = ∑x / n. The symbol x̄ is read as "x bar"; it is the mean of the x values in the data. By the way, it doesn't need to be x; it can be y as well (ȳ). Means are the best summary measures for a symmetric distribution because, as mentioned before, they are the balancing point of the distribution. However, the mean has a few drawbacks. It does not tell us about all individuals (that is why we also need summary measures of spread), and it is not resistant to outliers.

The mean can easily be affected by one high value in our data set, which can distort our study results and lead us to make wrong decisions if we report the mean when the median would have been the better choice.
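A minimal sketch of the formula x̄ = ∑x / n, using made-up allowance data, shows how a single high value pulls the mean. The helper name `mean` and the numbers are assumptions for illustration only.

```python
# Hypothetical weekly allowances in dollars (illustrative data).
allowances = [10, 12, 15, 15, 18]

def mean(values):
    """x-bar = (sum of x) / n: add up the values, divide by how many there are."""
    return sum(values) / len(values)

print(mean(allowances))            # 14.0
print(mean(allowances + [200]))    # 45.0
```

One extra value of 200 moves the mean from 14 to 45, even though five of the six values are 18 or less.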

The Median

The median is the middle number of the ordered data. When the number of data values is even, we calculate the median by finding the average of the middle two numbers. Medians are good alternatives for summarizing the center of skewed distributions or distributions with outliers, because the median is resistant to outliers. It is not easy to read the exact median off a histogram, but you don't need to.

We only need to find its position in the ordered data. If n is odd, the median is the value at position (n + 1)/2. If n is even, it is the average of the values at positions n/2 and n/2 + 1.

In the following section, when we compare two histograms, you will see how to locate the median from a histogram.
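The position rule for the median can be sketched as follows. This is an illustrative helper (the name `median` and the sample lists are assumptions), with positions counted from 1 in the ordered data and the even case handled by averaging the middle two values.

```python
def median(values):
    """Median via position in the ordered data (positions are 1-based)."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # Odd n: the single middle value sits at position (n + 1) / 2.
        return ordered[(n + 1) // 2 - 1]
    # Even n: average the values at positions n/2 and n/2 + 1.
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

print(median([7, 3, 9, 1, 5]))      # 5
print(median([7, 3, 9, 1, 5, 11]))  # 6.0
```

Note that the data must be sorted first; the position rule applies to the ordered list, not to the values in the order they were collected.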

Mean or Median?

Rule of thumb time! 👍

If the distribution is symmetric and unimodal, the mean is often the best measure of central tendency because it takes into account all of the values in the dataset and reflects the overall trend in the data.

On the other hand, if the distribution is skewed or has outliers, the median is often a better measure of central tendency because it is resistant to the influence of these values. In right-skewed distributions, the mean is generally higher than the median, while in left-skewed distributions, the mean is generally lower than the median.

It's always a good idea to report both the mean and median when describing the statistical properties of a dataset, and to explain why they are different if they are not close to each other. This can help to provide a more complete picture of the distribution of the data and how it is dispersed around the center.
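To see the skew rule of thumb in action, the sketch below reports both the mean and the median for two small made-up data sets, one with a long right tail and its mirror image with a long left tail. The data are hypothetical and chosen only to make the pattern visible.

```python
import statistics

# Hypothetical data with a long right tail, and its mirror image (21 - x).
right_skewed = [1, 2, 2, 3, 3, 3, 4, 10, 20]
left_skewed = sorted(21 - x for x in right_skewed)

for name, data in [("right-skewed", right_skewed), ("left-skewed", left_skewed)]:
    m = statistics.mean(data)
    med = statistics.median(data)
    print(f"{name}: mean = {m:.2f}, median = {med}")
```

In the right-skewed set the mean lands above the median, and in the left-skewed set it lands below, because the mean is pulled toward the tail.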

Likewise, remember to always report the units when describing summary measures of the center, as you would in any math class. This helps to provide context and allows others to interpret the results accurately.

Statistics of Spread

Standard Deviation

The standard deviation is like lungs in statistics. You cannot breathe without it. You cannot analyze data without it. It shows how far the values are dispersed, or deviate, from the mean. Calculating the standard deviation by hand is lengthy and time-consuming, as you probably know by now. You will mostly rely on your calculator to do it for you, but just in case, here is the formula:

s = √[ ∑(x − x̄)² / (n − 1) ]

You may wonder, if you haven't already, why we subtract one from n. In a sample, the deviations are measured from the sample mean x̄ rather than from the unknown population mean, and the values are, on average, slightly closer to x̄ than to the population mean. Dividing by n would therefore tend to underestimate the variability of the population. Dividing by n − 1, known as the "degrees of freedom", corrects for this so that the sample variance is an unbiased estimate of the population variance.

As you read more units, you will revisit the concept of standard deviation and understand it better.
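The standard deviation formula can be worked through step by step as in the sketch below, which uses made-up quiz scores and checks the hand calculation against Python's `statistics.stdev` (which also divides by n − 1). Variable names are illustrative.

```python
import math
import statistics

# Hypothetical quiz scores (illustrative data).
scores = [4, 8, 6, 5, 7]

n = len(scores)
x_bar = sum(scores) / n                                   # sample mean
squared_deviations = [(x - x_bar) ** 2 for x in scores]   # (x - x_bar)^2
s = math.sqrt(sum(squared_deviations) / (n - 1))          # divide by n - 1

print(round(s, 2))                          # 1.58
print(round(statistics.stdev(scores), 2))   # 1.58
```

Here the mean is 6, the squared deviations add up to 10, and 10 / (5 − 1) = 2.5, whose square root is about 1.58.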

Interquartile Range (IQR)

Recall that the interquartile range (IQR) is based on the difference between the upper and lower quartiles. It is calculated as the difference between the 75th percentile (Q3) and the 25th percentile (Q1) in a data set.

IQR = upper quartile (Q3) - lower quartile (Q1).

The first quartile, Q1, is the median of the half of the ordered data set from the minimum to the position of the median. The third quartile, Q3, is the median of the half of the ordered data set from the position of the median to the maximum. Q1 and Q3 form the boundaries for the middle 50% of values in an ordered data set.

However, the IQR does not capture the entire distribution of values in the data set and therefore may not fully reflect the variability of the data. Other measures such as the range, standard deviation, and variance can provide a more comprehensive view of the dispersion and variability in a data set. These measures are often used in conjunction with the IQR to provide a more complete understanding of the characteristics of a data set.
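The median-of-halves description of Q1 and Q3 can be sketched as below. One detail the description leaves open is whether, for an odd number of values, the median itself counts as part of each half; conventions differ, so the hypothetical `include_median` flag is an assumption added here to show both choices. The data are made up.

```python
import statistics

def quartiles(values, include_median=True):
    """Q1 and Q3 as medians of the lower and upper halves of the ordered data.

    When n is odd, include_median decides whether the middle value belongs
    to both halves (True) or to neither (False); conventions differ.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    if n % 2 == 1 and include_median:
        lower, upper = ordered[: half + 1], ordered[half:]
    else:
        lower, upper = ordered[:half], ordered[n - half:]
    return statistics.median(lower), statistics.median(upper)

# Hypothetical data set with nine values (median is 12).
data = [3, 5, 7, 8, 12, 13, 14, 18, 21]

q1, q3 = quartiles(data)
print(q1, q3, q3 - q1)          # 7 14 7

q1, q3 = quartiles(data, include_median=False)
print(q1, q3, q3 - q1)          # 6.0 16.0 10.0
```

Because the two conventions can give different quartiles for the same data, it helps to check which one your textbook or calculator uses.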

Standard Deviation or IQR?

It's generally true that the IQR is larger than the standard deviation for symmetric distributions without outliers, although the specific relationship between these measures will depend on the characteristics of the data set.

  • For a symmetric, unimodal distribution, the mean and median will be approximately equal, and the standard deviation and IQR will provide complementary information about the dispersion of the data. In this case, it is appropriate to report both the mean and standard deviation to provide a sense of the center and spread of the distribution.

  • For skewed distributions, the median is often a better measure of central tendency than the mean, as the mean can be influenced by extreme values or outliers. In this case, it is appropriate to report both the median and IQR to provide a sense of the center and spread of the distribution.
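As a quick check of the claim that the IQR tends to exceed the standard deviation for symmetric data without outliers, the sketch below computes both for one small made-up symmetric data set. It is a single example under assumed data, not a proof; quartiles come from Python's `statistics.quantiles` with `method="inclusive"`.

```python
import statistics

# Hypothetical symmetric, mound-shaped data set centered at 3.
data = [1, 2, 2, 3, 3, 3, 4, 4, 5]

sd = statistics.stdev(data)
q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print(statistics.mean(data), statistics.median(data))  # 3 3
print(round(sd, 2), iqr)                               # 1.22 2.0
```

The mean and median agree, as expected for symmetric data, and the IQR of 2 is larger than the standard deviation of about 1.22.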

In general, reporting measures of center and spread together is a good plan of action, as this provides a more complete understanding of the characteristics of a data set. Reporting only one measure, such as the mean or IQR, can be misleading or incomplete, as it does not provide a full picture of the data.

A Note About Outliers

Previously, we've talked about what outliers are, but how do we know a data point is an outlier or not? There are many methods for determining outliers. Two methods frequently used in this course are:

Method I: 1.5 × IQR

Using the IQR to identify outliers involves calculating the IQR for the data set and then using this value to determine which values are outside the normal range of the data.

Specifically, values that are more than 1.5 × IQR above the third quartile (Q3) or more than 1.5 × IQR below the first quartile (Q1) are considered outliers. This method is based on the assumption that most of the values in the data set should fall between Q1 − 1.5 × IQR and Q3 + 1.5 × IQR, with only a small number of values falling outside this range.

Example

To determine whether a value is an outlier using the 1.5 × IQR method, you will need to calculate the interquartile range (IQR) for the data set and then compare the value to the upper and lower bounds of the data set. Here is an example of how this might be done:

Suppose you have the following data set: 10, 15, 20, 25, 30, 35, 40, 45, 50

To determine whether any of these values are outliers using the 1.5 × IQR method, you would first need to calculate the IQR. To do this, you would need to find the first quartile (Q1), the median (Q2), and the third quartile (Q3).

For this data set, the first quartile (Q1) is 20, the median (Q2) is 30, and the third quartile (Q3) is 40. The IQR is then calculated as the difference between Q3 and Q1, or 40 - 20 = 20.

Next, you would need to determine the upper and lower bounds of the data set using the IQR. The upper bound is calculated as Q3 + 1.5 × IQR, or 40 + (1.5 × 20) = 70. The lower bound is calculated as Q1 - 1.5 × IQR, or 20 - (1.5 × 20) = -10.

Finally, you would need to compare the value you are interested in to these bounds. If the value is less than the lower bound or greater than the upper bound, it is considered an outlier. For example, if the value you are interested in is 100, it is an outlier because it is greater than the upper bound of 70. If the value you are interested in is 5, it is not an outlier because it is within the bounds of the data set (-10 to 70).
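The worked example above can be reproduced with the short sketch below. It assumes Python's `statistics.quantiles` with `method="inclusive"`, which happens to give the same quartiles as the example (20, 30, 40); the helper name `is_outlier` is made up for illustration.

```python
import statistics

data = [10, 15, 20, 25, 30, 35, 40, 45, 50]

# method="inclusive" reproduces the quartiles used in the example above;
# other quartile conventions can give different values for the same data.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

def is_outlier(value):
    """True if value lies below the lower bound or above the upper bound."""
    return value < lower_bound or value > upper_bound

print(q1, q2, q3, iqr)            # 20.0 30.0 40.0 20.0
print(lower_bound, upper_bound)   # -10.0 70.0
print(is_outlier(100))            # True
print(is_outlier(5))              # False
```

None of the nine values in the data set itself falls outside −10 to 70, so this data set has no outliers under the 1.5 × IQR rule.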

Method II: Standard Deviations

Using standard deviations to identify outliers involves calculating the mean and standard deviation for the data set and then using these values to determine which values are outside the normal range of the data. Specifically, values that are more than 2 standard deviations above or below the mean are considered outliers. This method is based on the assumption that most of the values in the data set should fall within two standard deviations of the mean, with only a small number of values falling outside this range.
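A minimal sketch of the two-standard-deviation rule, using made-up commute times, might look like this. The data and variable names are assumptions for illustration.

```python
import statistics

# Hypothetical commute times in minutes (illustrative data).
times = [22, 25, 24, 27, 23, 26, 25, 24, 26, 60]

x_bar = statistics.mean(times)
s = statistics.stdev(times)   # sample standard deviation (divides by n - 1)

# Anything more than 2 standard deviations from the mean is flagged.
low, high = x_bar - 2 * s, x_bar + 2 * s
outliers = [t for t in times if t < low or t > high]

print(outliers)   # [60]
```

Keep in mind that the mean and standard deviation are themselves pulled by extreme values, so in a small data set a large outlier widens the very interval used to detect it.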

Both of these methods can be useful for identifying unusual or unexpected values in a data set, but they may not be suitable for all types of data or in all situations. It is important to consider the characteristics of the data set and the goals of the analysis when deciding which method to use to identify outliers.

Resistant and Nonresistant Measures

The mean, standard deviation, and range are considered nonresistant (or non-robust) because they are influenced by outliers. The median and IQR are considered resistant (or robust), because outliers do not greatly (if at all) affect their value.

For these reasons, the median and IQR are often preferred to the mean, standard deviation, and range when working with data sets that may contain outliers. They are more robust and provide a more accurate representation of the center and spread of the data, even in the presence of extreme values.
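To compare resistant and nonresistant measures side by side, the sketch below computes all five summary statistics for a made-up data set, then again after one extreme value is added. The helper `summary` is hypothetical, and the quartiles use `statistics.quantiles` with `method="inclusive"` as an assumed convention.

```python
import statistics

def summary(values):
    """Mean, median, standard deviation, IQR, and range of a data set."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),
        "iqr": q3 - q1,
        "range": max(values) - min(values),
    }

# Hypothetical data set, then the same data with one extreme value appended.
original = [12, 14, 15, 15, 16, 17, 18, 19, 21]
with_outlier = original + [95]

before = summary(original)
after = summary(with_outlier)
for name in before:
    print(f"{name:>6}: {before[name]:6.2f} -> {after[name]:6.2f}")
```

The median and IQR barely move, while the mean, standard deviation, and range all change substantially.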

Key Vocabulary

  • Mode

  • Range

  • IQR

  • Outliers


Key Terms to Review (5)

Mean: The mean is the average of a set of numbers. It is found by adding up all the values and dividing by the total number of values.

Median: The median is the middle value in a set of data when the values are arranged in order. It divides the data into two equal halves.

Nonresistant Measures: Nonresistant measures are statistical measures that can be greatly influenced by extreme values or outliers in a dataset. They may not accurately represent the central tendency or spread of the data.

Resistant Measures: Resistant measures are statistical measures that are not greatly affected by extreme values or outliers in a dataset. They provide a robust summary of the data's central tendency and spread.

Standard Deviation: The standard deviation measures the typical amount of variation or dispersion in a set of data. It tells us how spread out the values are from the mean.



© 2024 Fiveable Inc. All rights reserved.

AP® and SAT® are trademarks registered by the College Board, which is not affiliated with, and does not endorse this website.

