Unit 6 Overview: Inference for Categorical Data: Proportions

7 min read • January 3, 2023

Josh Argo

Jed Quiaoit


Inference Who?

"This unit introduces statistical inference, which will continue through the end of the course. Students will analyze categorical data to make inferences about binomial population proportions. Provided conditions are met, students will use statistical inference to construct and interpret confidence intervals to estimate population proportions and perform significance tests to evaluate claims about population proportions. Students begin by learning inference procedures for one proportion and then examine inference methods for a difference between two proportions. They will also interpret the two types of errors that can be made in a significance test, their probabilities, and possible consequences in context." -- College Board

Have you ever seen a statistic, perhaps on Facebook or Twitter, and had your doubts? Maybe you read a statistic such as this one: "The proportion of goofy-footed snowboarders who contract cancer is higher than the proportion of regular-footed snowboarders who do."

Sounds pretty goofy, right? 🤪

It's certainly possible to come across statistics that seem questionable or that you might have doubts about. When encountering a statistic like this, it's always a good idea to try to verify the information and consider the context in which it is presented. This might involve looking for additional sources or seeking out more information about the study or data that the statistic is based on. It's also important to be aware of potential biases or agendas that might be influencing the way the statistic is presented. 🧠

Scientists and data analysts reach a conclusion like that one through a process called statistical inference. In inference, a study is performed on a small sample of a population, and we compare two groups, or one group to a given population. Through calculations involving the normal distribution, we can use our sample statistics either to estimate the true population parameter or to test a claim about a population made in an article or study.

💡 BIG IDEAS:

  • To estimate or predict a population parameter, we use a confidence interval!

  • To test a claim, we use a significance test!

Confidence Intervals

For this unit, we are going to be estimating population parameters involving categorical data. This means that our sample statistic will be a sample proportion, and we will be using it to estimate, or test against, a population proportion.

The first process we are going to use is a confidence interval. A confidence interval is an interval of numbers, based on our sample proportion, that gives us a range where we can expect to find the true population proportion. A confidence interval is based on three things: the sample proportion, the sample size, and the confidence level (usually 95%).
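As a rough sketch of how those three ingredients fit together, the one-proportion interval is commonly written as below. The symbols are labels introduced here for illustration: $\hat{p}$ is the sample proportion, $n$ is the sample size, and $z^*$ is the critical value set by the confidence level.

$$\hat{p} \pm z^* \sqrt{\frac{\hat{p}\,(1-\hat{p})}{n}}$$

The piece after the $\pm$ sign is the margin of error: it shrinks as $n$ grows and widens as $z^*$ grows.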

(1) Sample Proportion

It's important that the sample used to calculate a statistic be randomly selected so that it accurately represents the population. If the sample is not randomly selected, it can be biased, and the resulting statistic (like the sample proportion) may not be a good estimate of the population parameter. This is why random sampling is important: it helps to ensure that the sample is representative of the population and that the resulting statistic is a good estimate of the population parameter. 🎩

That being said, the first aspect of our confidence interval is our sample proportion. In order for our sample proportion to be a good estimate of our population proportion, it must come from a random sample. As mentioned before, there is no way to fix a lack of randomness in a sample.

(2) Sample Size

Our sample size is also an important ingredient when calculating a confidence interval. Our sample size must be large enough that we can use a normal distribution to estimate our population proportion. To review that condition, refer back to what we said in Unit 5 about sampling distributions.
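The "large enough" check is not spelled out in this overview. A minimal sketch, assuming the usual AP Statistics large counts rule of thumb (at least 10 successes and at least 10 failures in the sample), could look like this. The function name and the threshold default are illustrative choices, not something from the guide.

```python
def large_counts_ok(successes: int, n: int, threshold: int = 10) -> bool:
    """Return True when both the success count and the failure count
    reach the threshold (assumed here to be 10, the common rule of thumb)."""
    failures = n - successes
    return successes >= threshold and failures >= threshold


# Hypothetical samples of 50 people each
print(large_counts_ok(27, 50))  # 27 successes and 23 failures
print(large_counts_ok(4, 50))   # only 4 successes
```

When the check fails, the normal approximation behind the interval is not trustworthy for that sample.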

The standard deviation of the sampling distribution of a statistic decreases as the sample size increases. This means that as the sample size increases, the sample statistic is less likely to be far from the true population parameter. As a result, the confidence interval for the population parameter will be narrower for a larger sample size.

https://firebasestorage.googleapis.com/v0/b/fiveable-92889.appspot.com/o/images%2F-svebL9Ybd6Ue.png?alt=media&token=d4709e1d-7bef-4ccf-b003-23d8f51e14b8

Source: Towards Data Science

For example, let's say we are trying to estimate the proportion of people in a population who support a certain policy. If we have a sample size of 50, the standard deviation of the sampling distribution of the sample proportion will be larger than if we had a sample size of 500. This means that the confidence interval for the population proportion will be wider for a sample size of 50 than for a sample size of 500. In other words, at the same confidence level, our estimate of the population proportion is less precise when the sample is smaller. 😢
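To put numbers on the 50 versus 500 comparison, here is a small sketch. It assumes a hypothetical sample proportion of 0.60 in both samples and a 95% interval with z* = 1.96; neither value comes from the guide.

```python
from math import sqrt


def standard_error(p_hat: float, n: int) -> float:
    """Estimated standard deviation of the sample proportion."""
    return sqrt(p_hat * (1 - p_hat) / n)


def margin_of_error(p_hat: float, n: int, z_star: float = 1.96) -> float:
    """Half-width of the interval; z_star = 1.96 assumes 95% confidence."""
    return z_star * standard_error(p_hat, n)


p_hat = 0.60  # hypothetical sample proportion, same for both samples
for n in (50, 500):
    moe = margin_of_error(p_hat, n)
    print(f"n={n}: SE={standard_error(p_hat, n):.4f}, "
          f"interval={p_hat - moe:.3f} to {p_hat + moe:.3f}")
```

Multiplying the sample size by 10 divides the standard error by the square root of 10 (about 3.16), so the interval shrinks, but not by a factor of 10.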

(3) Confidence Level

The confidence level is a measure of how confident we are that the method produces an interval containing the true population parameter (i.e., the true proportion of our population). It is expressed as a percentage and describes the share of confidence intervals that would contain the true population parameter if we were to take many samples from the same population and construct a confidence interval for each sample.

The confidence level is chosen by the researcher and is typically set at a high level, such as 90%, 95%, or 99%, to increase the confidence that the true population parameter is contained within the confidence interval. A higher confidence level will result in a wider confidence interval, but a larger share of such intervals will contain the true population parameter.

https://firebasestorage.googleapis.com/v0/b/fiveable-92889.appspot.com/o/images%2F-8FaPTeeyxCD4.png?alt=media&token=1be46971-b37d-44a0-bff5-1518746a5918

Source: Lumen Learning

For example, if we set the confidence level to 95%, this means that if we were to take many samples from the same population and create a confidence interval for each sample, approximately 95% of those intervals would contain the true population parameter.

Another way to think of this: if we were to take 100 different samples from the same population and create 100 different 95% confidence intervals, about 95 of those 100 would contain the true proportion we are trying to estimate.
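The repeated-sampling idea can be checked with a short simulation. This is only a sketch under stated assumptions: the true proportion (0.40), the sample size (200), the number of trials, and the seed are all made up for illustration, and the interval uses z* = 1.96 for 95% confidence.

```python
import random
from math import sqrt


def interval_captures(p_true: float, n: int, rng: random.Random,
                      z_star: float = 1.96) -> bool:
    """Draw one random sample and report whether its interval contains p_true."""
    successes = sum(rng.random() < p_true for _ in range(n))
    p_hat = successes / n
    moe = z_star * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - moe <= p_true <= p_hat + moe


def coverage_rate(p_true: float = 0.40, n: int = 200,
                  trials: int = 2000, seed: int = 1) -> float:
    """Share of simulated intervals that capture the (assumed) true proportion."""
    rng = random.Random(seed)
    hits = sum(interval_captures(p_true, n, rng) for _ in range(trials))
    return hits / trials


print(f"Share of intervals capturing p: {coverage_rate():.3f}")
```

The printed share should land near 0.95 rather than exactly on it, since the simulation itself is subject to chance and the normal approximation is not perfect.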

Our confidence level is also a key part of our confidence interval because it determines our z*, or critical value, based on the standard normal distribution. As our confidence level increases, so does our z*, which in turn increases the width of our confidence interval. 👍
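As a quick illustration of how z* follows from the chosen level, the sketch below uses Python's standard library `statistics.NormalDist`. The helper name `z_star` is an illustrative choice.

```python
from statistics import NormalDist


def z_star(confidence_level: float) -> float:
    """Critical value that leaves (1 - confidence_level) / 2 in each tail
    of the standard normal distribution."""
    tail = (1 - confidence_level) / 2
    return NormalDist().inv_cdf(1 - tail)


for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence -> z* = {z_star(level):.3f}")
```

The values come out near 1.645, 1.960, and 2.576, which matches the pattern described above: more confidence means a larger z* and a wider interval.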

Significance Tests

When we are given a claimed value for a population parameter and have some reason to doubt it, we can perform a significance test to check the claim. With a significance test, we assume that the claimed population proportion is correct and use the sampling distribution for our sample size to find the probability of obtaining a sample proportion at least as extreme as the one we collected. If that probability is low given those two factors (claimed population proportion and our sample size), we have reason to reject the claim or at least investigate it further. 🕵️
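A minimal sketch of that calculation for one proportion follows. The numbers are hypothetical (58 successes in a sample of 100, tested against a claimed proportion of 0.50), and the two-sided alternative is an assumption made for the example.

```python
from math import sqrt
from statistics import NormalDist


def one_proportion_z_test(successes: int, n: int, p0: float):
    """Return (z, two-sided p-value) for a claimed proportion p0.

    The standard deviation uses p0, because the test assumes the claim is true.
    """
    p_hat = successes / n
    sd = sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / sd
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical data: 58 of 100 sampled, claim is p = 0.50
z, p_value = one_proportion_z_test(successes=58, n=100, p0=0.50)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

Here z is 1.6 and the p-value is roughly 0.11, so a sample like this would not be unusual if the claim were true, and the data would not give convincing evidence against it.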

As with confidence intervals and sampling distributions, our significance test hinges on meeting the three conditions for inference: randomness, independence, and normality. Otherwise, our sample isn't reflective of the population, our standard deviation isn't accurate, or our sampling distribution isn't approximately normal, so we cannot accurately calculate the probability described above. 📖

Inference with Two Proportions

Just as we mentioned in Unit 5, we may also have to create confidence intervals or perform significance tests with two proportions. This is typically used in experimental design when comparing two samples to see the effectiveness of certain treatments. 2️⃣

As mentioned in Unit 5, our conditions for inference must be met for both samples, and we can subtract our two centers to find the center of the sampling distribution of the difference between two proportions. The standard deviation for this sampling distribution can be found on the reference page provided for AP testing.
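For reference, the center and standard deviation of that sampling distribution are usually written as follows. The subscripted symbols are labels introduced here: $p_1, p_2$ are the two population proportions and $n_1, n_2$ are the two sample sizes, with the samples assumed independent.

$$\mu_{\hat{p}_1 - \hat{p}_2} = p_1 - p_2, \qquad \sigma_{\hat{p}_1 - \hat{p}_2} = \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}$$

Note that the centers subtract while the variances add, so the difference is more variable than either sample proportion on its own.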

For example, if a researcher is testing the effectiveness of a particular medicine or drug, the experimental design would randomly assign participants to a placebo group or the new treatment group. We would assume that there is no difference between the two groups and then compare the sample proportions of participants who recovered more quickly in each group. If that difference is significant, we would have evidence that the treatment is effective.
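The comparison described above can be sketched as a two-proportion z-test. Everything specific here is assumed for illustration: the counts (60 of 100 recovered quickly on the treatment, 45 of 100 on the placebo), the one-sided alternative that the treatment does better, and the pooled proportion used because the test starts from "no difference."

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(x1: int, n1: int, x2: int, n2: int):
    """Return (z, one-sided p-value) for the alternative p1 > p2.

    Pools the two samples, since the starting assumption is that the
    groups share one common proportion.
    """
    p1_hat, p2_hat = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    sd = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1_hat - p2_hat) / sd
    p_value = 1 - NormalDist().cdf(z)
    return z, p_value


# Hypothetical experiment: treatment group first, placebo group second
z, p_value = two_proportion_z_test(x1=60, n1=100, x2=45, n2=100)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

With these made-up counts the p-value is below 0.05, which is the kind of result that would be read as evidence that the treatment helps, provided the conditions for inference hold in both groups.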

🎥 Watch: AP Stats -- Unit 6

Key Terms to Review (14)

Binomial Population Proportions

: Binomial population proportions refer to situations where we are interested in estimating or making inferences about proportions within two possible outcomes (success/failure). It involves counting how many times an event occurs within a fixed number of trials.

Categorical Data

: Categorical data refers to data that can be divided into categories or groups based on qualitative characteristics.

Confidence Intervals

: Confidence intervals are ranges of values calculated from sample data that are likely to contain an unknown population parameter with a certain level of confidence.

Confidence Level

: Confidence level refers to how confident we are that our interval estimate contains or captures the true population parameter. It represents our degree of certainty or reliability in estimating this parameter.

Critical Value

: A critical value is a specific value that separates the rejection region from the non-rejection region in hypothesis testing. It is compared to the test statistic to determine whether to reject or fail to reject the null hypothesis.

Experimental Design

: Experimental design refers to the process of planning and conducting an experiment to investigate cause-and-effect relationships between variables. It involves defining treatments, assigning participants to different groups, and controlling for confounding factors to ensure valid results.

Independence

: Independence refers to events or variables that do not influence each other. If two events are independent, knowing one event occurred does not affect our knowledge about whether or not the other event will occur.

Population Proportions

: Population proportions refer to the proportion or percentage of a specific characteristic or attribute within an entire population.

Randomness

: Randomness refers to outcomes that are individually unpredictable and governed by chance rather than by choice or a fixed pattern. It plays a crucial role in statistical experiments and sampling techniques.

Sample Proportion

: The sample proportion is the ratio of the number of successes in a sample to the total number of observations in that sample.

Sampling Distributions

: Sampling distributions refer to the probability distributions that describe statistics calculated from samples taken from populations. They help us make inferences about population parameters based on sample statistics.

Significance Tests

: Significance tests help determine whether an observed effect or difference between groups is statistically significant or simply due to chance variation.

Standard Deviation

: The standard deviation measures the average amount of variation or dispersion in a set of data. It tells us how spread out the values are from the mean.

Statistical Inference

: Statistical inference involves using sample data to make conclusions or predictions about a larger population. It allows us to draw meaningful insights and make decisions based on limited information.


© 2024 Fiveable Inc. All rights reserved.

AP® and SAT® are trademarks registered by the College Board, which is not affiliated with, and does not endorse this website.

