
8.7 Skills Focus: Selecting an Appropriate Inference Procedure for Categorical Data

6 min read · January 7, 2023

Josh Argo

Jed Quiaoit


The Most Important Part(s) of Unit 8...

The most difficult and most important part of Unit 8 is being able to select which chi-squared test to perform. Be sure to study these important distinctions for clarity on which test to select: 🔎

  1. Chi-squared test for goodness of fit: One sample, one categorical variable with more than two categories

  2. Chi-squared test for independence: One sample, two categorical variables with multiple categories

  3. Chi-squared test for homogeneity: Two (or more) samples, one categorical variable with possibly multiple categories
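
The three cases above amount to a small decision rule. Here is a minimal sketch of it in Python; `choose_chi_squared_test` and its parameter names are hypothetical and only encode the list above, not an official procedure.

```python
def choose_chi_squared_test(num_samples: int, num_variables: int) -> str:
    """Return the chi-squared test that matches a study design.

    Hypothetical helper for illustration: it only encodes the three
    cases listed in this guide.
    """
    if num_samples == 1 and num_variables == 1:
        # One sample, one categorical variable
        return "goodness of fit"
    if num_samples == 1 and num_variables == 2:
        # One sample, two categorical variables (two-way table)
        return "independence"
    if num_samples >= 2 and num_variables == 1:
        # Two or more samples, one categorical variable (two-way table)
        return "homogeneity"
    raise ValueError("design does not match one of the three chi-squared cases")


# One sample of seniors, asked about two variables (gender and job experience)
print(choose_chi_squared_test(num_samples=1, num_variables=2))
```

Counting samples first and variables second is usually the fastest way to sort a multiple-choice question into one of these three cases.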

It is very likely that you will see one or two multiple-choice questions on this exact content: selecting an appropriate inference method.

https://firebasestorage.googleapis.com/v0/b/fiveable-92889.appspot.com/o/images%2FScreenshot%202023-01-07%20at%209.07-fIzjPN8VrTcs.png?alt=media&token=2e152c66-60eb-481a-a60e-f1942dd5fe73

Source: Dan Shuster

Example

On the 2009 AP Statistics exam, the following question was presented in the FRQ section:

https://firebasestorage.googleapis.com/v0/b/fiveable-92889.appspot.com/o/images%2F8-v8s94GdcPh57.png?alt=media&token=eed0f0bf-883f-446b-ad36-1354d8b9dd61

Image from released College Board material

The first thing we should notice is that this question deals with categorical data. This tells us that we should use either a z-test or a chi-squared test, depending on how many variables and categories we are dealing with. 🤔

Then we notice that there are two categorical variables with two to three categories apiece. This rules out a z-test, since a 1-prop z-test or 2-prop z-test would only be valid if each variable had only two categories. Since we have two variables with multiple categories, this shifts us to a chi-squared test.

Now, we are stuck between the three types of chi-squared tests. Uh-oh...

The first thing to notice now is that we have a two-way table, not a one-way table, so goodness of fit is out. That narrows it down to either a test for independence or a test for homogeneity.

The last thing we need to check is how many samples/populations we have. Since we took only one sample and asked each person about their gender and job experience, we are looking at the association between gender and job experience, not the difference between two populations. Therefore, we should run a chi-squared test for independence.

Our hypotheses then should be:

  • H0: There is no association between gender and job experience for high school seniors in the district.

  • Ha: There is an association between gender and job experience for high school seniors in the district.

Things to remember: be sure to put your hypotheses in context, and remember that your null hypothesis is always the “expected” outcome (i.e., there is nothing special going on).

Practice Problem

(1) A researcher is interested in determining whether the distribution of favorite ice cream flavors among college students is the same as the distribution of favorite ice cream flavors among the general population. They survey a random sample of 500 college students and find that 280 students prefer chocolate, 120 students prefer vanilla, 50 students prefer strawberry, and 50 students prefer mint. The researcher also surveys a random sample of 1000 people from the general population and finds that 400 people prefer chocolate, 300 people prefer vanilla, 200 people prefer strawberry, and 100 people prefer mint. 🍨

To answer this question, the researcher plans to conduct a chi-squared test. However, the researcher is unsure whether a test for goodness of fit, homogeneity, or independence is the appropriate test to use.

Which test should the researcher use and why?

(2) A scientist is studying the effectiveness of a new treatment for a particular disease. They conduct a clinical trial with 100 patients and divide them into two groups: a treatment group and a control group. The treatment group receives the new treatment, while the control group receives a placebo. The scientist wants to determine whether the treatment is effective at reducing the occurrence of the disease in male and female patients. 🦠

To do this, the scientist plans to conduct a chi-squared test. However, the scientist is unsure whether a test for goodness of fit, homogeneity, or independence is the appropriate test to use.

Which test should the scientist use and why?

(3) A travel company is interested in determining whether the distribution of vacation package choices made by their customers fits a theoretical distribution that they formulated based on previous years' trends in domestic and international travel. The company surveyed 1000 customers and found that 400 customers chose a beach vacation package, 300 customers chose a mountain vacation package, 200 customers chose a city vacation package, and 100 customers chose a rural vacation package. 🚀

The travel company plans to conduct a chi-squared test to answer their research question. However, they are unsure whether a test for goodness of fit, homogeneity, or independence is the appropriate test to use.

Which test should the travel company use and why?

Answer

(1) The appropriate test for this situation is a chi-squared test for homogeneity. This is because the researcher took two separate samples (college students and the general population) and wants to know whether the distribution of one categorical variable, favorite ice cream flavor, is the same in both populations.

A goodness of fit test would be used if the researcher was interested in determining whether the observed distribution of favorite ice cream flavors among college students fits a theoretical distribution.

A test for independence would be used if the researcher had taken a single sample and recorded two categorical variables for each person (such as gender and favorite flavor) to look for an association between them.

Therefore, the researcher should use a chi-squared test for homogeneity to determine whether the distribution of favorite ice cream flavors is the same between college students and the general population.
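
Because the ice cream counts come from two samples, they form a two-way table, and the chi-squared statistic for a two-way table is computed the same way whether the test is labeled homogeneity or independence. Here is a minimal sketch in plain Python using the counts from problem (1); the function name `chi_squared_two_way` is hypothetical.

```python
def chi_squared_two_way(observed):
    """Return (statistic, degrees of freedom) for a two-way table of counts.

    Expected count for each cell = row total * column total / grand total.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (count - expected) ** 2 / expected

    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, df


# Rows: college students, general population
# Columns: chocolate, vanilla, strawberry, mint
ice_cream = [
    [280, 120, 50, 50],
    [400, 300, 200, 100],
]
statistic, df = chi_squared_two_way(ice_cream)
print(round(statistic, 2), df)
```

With these counts the statistic comes out to about 43.1 with 3 degrees of freedom, far above the 0.05 critical value of 7.815 for 3 degrees of freedom, so the data give strong evidence that the two flavor distributions differ.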

(2) The appropriate test for this situation is a chi-squared test for homogeneity. This is because the patients are split into two separate groups (the treatment group and the control group), and the scientist wants to compare the distribution of disease occurrence across those groups for male and female patients. A homogeneity test allows the scientist to determine whether there is a difference in the distribution of the disease between the treatment group and the control group.

A goodness of fit test would be used if the scientist was interested in determining whether the observed distribution of the disease among the treatment group fits a theoretical distribution.

A test for independence would be used if the scientist had taken a single random sample of patients, without assigning groups, and recorded two categorical variables for each patient to look for an association between them.

Therefore, the scientist should use a chi-squared test for homogeneity to determine whether there is a difference in the distribution of the disease between the treatment group and the control group.

(3) The appropriate test for this situation is a chi-squared test for goodness of fit. This is because the travel company is interested in determining whether the observed distribution of vacation package choices fits a theoretical distribution. A goodness of fit test allows the travel company to compare the observed distribution with the theoretical distribution and determine whether the two are similar.

A test for independence would be used if the travel company was interested in determining whether there is a relationship between two variables measured on the same sample (such as the type of vacation package and the destination chosen).

A test for homogeneity would be used if the travel company was interested in determining whether the distribution of vacation package choices is the same across separate samples from different groups (such as a sample of male customers and a sample of female customers).

Therefore, the travel company should use a chi-squared test for goodness of fit to determine whether the observed distribution of vacation package choices fits a theoretical distribution.
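
To see how a goodness of fit statistic is computed for problem (3), here is a minimal sketch in plain Python. The problem does not state the company's theoretical distribution, so the proportions below (35% beach, 30% mountain, 25% city, 10% rural) are made up purely for illustration, and the function name `chi_squared_gof` is hypothetical.

```python
def chi_squared_gof(observed, proportions):
    """Return (statistic, degrees of freedom) for a goodness of fit test.

    Expected count for each category = sample size * claimed proportion.
    """
    total = sum(observed)
    expected = [total * p for p in proportions]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1
    return statistic, df


# Observed counts from the problem: beach, mountain, city, rural
observed = [400, 300, 200, 100]

# ASSUMPTION: illustrative proportions only; not given in the problem
hypothetical_proportions = [0.35, 0.30, 0.25, 0.10]

statistic, df = chi_squared_gof(observed, hypothetical_proportions)
print(round(statistic, 2), df)
```

Notice that there is only one list of observed counts here, which is the hallmark of a one-way table and therefore of a goodness of fit test.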

🎥  Watch: AP Stats Unit 8 - Chi Squared Tests

Key Terms to Review (14)

Alternative Hypothesis (Ha)

: The alternative hypothesis, denoted as Ha, is a statement that contradicts or challenges the null hypothesis. It suggests that there is a significant relationship or difference between variables being studied.

Association

: Association refers to a statistical relationship between two variables where changes in one variable tend to be related to changes in another variable. It does not imply causation but indicates some form of connection.

Chi-squared procedure

: A chi-squared procedure is any of the inference methods for categorical data that compare observed counts with expected counts: the tests for goodness of fit, homogeneity, and independence. Each measures how far the observed counts fall from the counts expected under the null hypothesis.

Chi-Squared test

: The Chi-Squared test is a statistical test for categorical data. It compares the observed frequencies with the expected frequencies to assess whether the deviation from the expected values is larger than chance alone would plausibly produce.

Chi-squared test for goodness of fit

: The chi-squared test for goodness of fit is a statistical test used to determine if observed categorical data fits an expected distribution. It compares the observed frequencies with the expected frequencies and assesses whether any significant differences exist.

Chi-Squared Test for Homogeneity

: The Chi-Squared Test for Homogeneity assesses whether two or more populations (or treatment groups) have the same distribution across the categories of a single categorical variable.

Chi-Squared Test for Independence

: The Chi-Squared Test for Independence is used to determine if there is a relationship between two categorical variables in a population.

Control Group

: The control group refers to the group of participants in an experiment who do not receive any treatment or intervention. They serve as a baseline for comparison with the treatment group.

Goodness of Fit Test

: A Goodness of Fit Test is a statistical test used to determine how well an observed sample data fits an expected theoretical distribution. It assesses whether any differences between observed and expected frequencies are statistically significant or simply due to random chance.

Homogeneity Test

: A homogeneity test is a statistical test used to determine if the distribution of categorical data is similar across different groups or categories. It helps to assess whether there are significant differences in proportions or frequencies among the groups.

Independence Test

: An Independence Test is a statistical test used to determine if there is an association or relationship between two categorical variables. It assesses whether the occurrence of one variable is independent of (not influenced by) the occurrence of another variable.

Random Sample

: A random sample is a subset of individuals selected from a larger population in such a way that every individual has an equal chance of being chosen. It helps to ensure that the sample is representative of the population.

Theoretical Distribution

: A theoretical distribution represents all possible outcomes and their associated probabilities for a random variable under certain assumptions. It provides information about how likely different values are to occur.

Treatment Group

: The treatment group refers to the group of participants in an experiment who receive a specific treatment or intervention.


© 2024 Fiveable Inc. All rights reserved.

AP® and SAT® are trademarks registered by the College Board, which is not affiliated with, and does not endorse this website.

