3.4 Contingency Tables

3 min read • June 27, 2024

Contingency tables are powerful tools for analyzing relationships between categorical variables. They display frequency distributions and allow for calculating probabilities, helping us understand how variables like gender and color preference interact.

These tables enable us to explore independence, association, and conditional probabilities. By examining the data, we can determine whether variables are related and how strongly. This analysis forms the basis for statistical inference about population relationships.

Contingency Tables

Probabilities from contingency tables

  • Contingency tables (also called two-way tables) display the frequency distribution of two categorical variables
    • Rows represent the levels of one variable (gender, also known as the row variable)
    • Columns represent the levels of the other variable (favorite color, also known as the column variable)
    • Each cell contains the frequency or count of observations for a specific combination of the two variables (number of males who prefer blue, also called the cell frequency)
  • Calculate probabilities using a contingency table:
    1. Find the total number of observations by summing all cell frequencies
    2. Divide each cell frequency by the total number of observations to obtain the probability for that specific combination of the two variables
    3. Use the formula $P(A \cap B) = \frac{\text{frequency}(A \cap B)}{\text{total}}$, where $A$ and $B$ are specific levels of the two variables (probability of being male and preferring blue)
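
The steps above can be sketched in a few lines of Python. The counts below are hypothetical, invented only to illustrate the arithmetic, and the names `counts`, `total`, and `joint` are assumptions of this sketch rather than anything defined in the text.

```python
# Hypothetical counts: rows are gender, columns are favorite color.
counts = {
    "male":   {"blue": 30, "green": 10, "red": 10},
    "female": {"blue": 20, "green": 15, "red": 15},
}

# Step 1: total number of observations (sum of all cell frequencies).
total = sum(sum(row.values()) for row in counts.values())

# Step 2: divide each cell frequency by the total to get the joint probability
# P(A and B) = frequency(A and B) / total for every cell.
joint = {
    gender: {color: n / total for color, n in row.items()}
    for gender, row in counts.items()
}

print(f"total = {total}")
print(f"P(male and blue) = {joint['male']['blue']:.2f}")
```

With these made-up counts, 30 of the 100 respondents are males who prefer blue, so the joint probability for that cell is 0.30. The joint probabilities across all cells always sum to 1.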

Relationships in two-way tables

  • Independence: Two variables are independent if the probability of one variable is not affected by the level of the other variable
    • If variables are independent, joint probabilities (cell probabilities) equal the product of the corresponding marginal probabilities
    • $P(A \cap B) = P(A) \times P(B)$ for independent variables (probability of being male and preferring blue equals probability of being male times probability of preferring blue)
  • Association: Two variables are associated if the probability of one variable changes depending on the level of the other variable
    • If variables are associated, joint probabilities differ from the product of marginal probabilities
    • Assess strength and direction of association by comparing conditional probabilities across the levels of the other variable (probability of preferring blue may be higher for females than for males)
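
A minimal sketch of the independence check, comparing each joint probability with the product of its marginals. Both tables are hypothetical, and the helper names `compare_joint_to_product` and `looks_independent` are made up for this illustration. With real sample data the two sides almost never match exactly, so an exact comparison like this is a teaching device and not a substitute for a statistical test.

```python
def compare_joint_to_product(counts):
    """Return {(row, col): (joint, product of marginals)} for a table of counts."""
    total = sum(sum(row.values()) for row in counts.values())
    columns = list(next(iter(counts.values())))
    row_marginal = {r: sum(counts[r].values()) / total for r in counts}
    col_marginal = {c: sum(counts[r][c] for r in counts) / total for c in columns}
    return {
        (r, c): (counts[r][c] / total, row_marginal[r] * col_marginal[c])
        for r in counts
        for c in columns
    }


def looks_independent(counts, tol=1e-9):
    """True if every joint probability matches the product of its marginals."""
    return all(
        abs(joint - product) <= tol
        for joint, product in compare_joint_to_product(counts).values()
    )


# Hypothetical counts (gender x favorite color), invented for illustration.
associated = {
    "male":   {"blue": 30, "green": 10, "red": 10},
    "female": {"blue": 20, "green": 15, "red": 15},
}
# A second hypothetical table whose rows are proportional to each other.
independent = {
    "male":   {"blue": 20, "green": 10, "red": 20},
    "female": {"blue": 40, "green": 20, "red": 40},
}

joint, product = compare_joint_to_product(associated)[("male", "blue")]
print(f"P(male and blue) = {joint:.2f}, P(male) * P(blue) = {product:.2f}")
print(looks_independent(associated))   # joint and product differ
print(looks_independent(independent))  # every cell matches
```

In the first table the joint probability (0.30) exceeds the product of the marginals (0.25), which points to an association. In the second table the color split is identical for both genders, so every cell satisfies the product rule.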

Conditional and marginal probabilities

  • Marginal probability: Probability of an event occurring for one variable, regardless of the level of the other variable
    • Calculate marginal probabilities by summing probabilities across a row or column
    • $P(A) = \sum_{\text{all } B} P(A \cap B)$ and $P(B) = \sum_{\text{all } A} P(A \cap B)$ (probability of being male equals sum of probabilities of being male and preferring each color)
  • Conditional probability: Probability of an event occurring for one variable, given a specific level of the other variable
    • Calculate conditional probabilities by dividing joint probability by marginal probability of the given condition
    • $P(A \mid B) = \frac{P(A \cap B)}{P(B)}$ and $P(B \mid A) = \frac{P(A \cap B)}{P(A)}$ (probability of preferring blue given being male equals probability of being male and preferring blue divided by probability of being male)
  • Relationship between conditional and marginal probabilities:
    • If variables are independent, conditional probability equals marginal probability (probability of preferring blue is the same for males and females)
    • If variables are associated, conditional probability differs from marginal probability (probability of preferring blue may be different for males and females)
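
The marginal and conditional calculations can be illustrated as follows. This is a sketch using hypothetical counts; the variable names are assumptions chosen for readability.

```python
# Hypothetical counts (gender x favorite color), invented for illustration.
counts = {
    "male":   {"blue": 30, "green": 10, "red": 10},
    "female": {"blue": 20, "green": 15, "red": 15},
}
total = sum(sum(row.values()) for row in counts.values())
joint = {g: {c: n / total for c, n in row.items()} for g, row in counts.items()}

# Marginal probabilities: sum joint probabilities across a row or a column.
p_male = sum(joint["male"].values())
p_female = sum(joint["female"].values())
p_blue = sum(joint[g]["blue"] for g in joint)

# Conditional probabilities: joint probability / marginal of the condition.
p_blue_given_male = joint["male"]["blue"] / p_male
p_blue_given_female = joint["female"]["blue"] / p_female

print(f"P(male) = {p_male:.2f}, P(blue) = {p_blue:.2f}")
print(f"P(blue | male) = {p_blue_given_male:.2f}")
print(f"P(blue | female) = {p_blue_given_female:.2f}")
```

Here the conditional probabilities of preferring blue (0.60 for males, 0.40 for females) both differ from the marginal probability of preferring blue (0.50), which is what association looks like in these made-up numbers.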

Contingency Analysis and Statistical Inference

  • The cell frequencies in contingency tables are used to conduct statistical tests and make inferences about populations
  • Contingency analysis involves examining the relationship between categorical variables
  • Statistical inference techniques can be applied to determine if observed associations are statistically significant
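
One common way to test whether an observed association is statistically significant is the chi-square test of independence listed in the key terms below. The sketch that follows computes the test statistic, degrees of freedom, and Cramér's V by hand for hypothetical counts. It compares the statistic with 5.991, the standard chi-square critical value for 2 degrees of freedom at the 0.05 level, and does not compute a p-value, which would normally come from a statistics library.

```python
import math

# Hypothetical observed counts, invented for illustration.
observed = [
    [30, 10, 10],  # male:   blue, green, red
    [20, 15, 15],  # female: blue, green, red
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)  # grand total

# Expected count under independence: (row total * column total) / grand total.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Chi-square statistic: sum of (observed - expected)^2 / expected over all cells.
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Degrees of freedom: (rows - 1) * (columns - 1).
df = (len(row_totals) - 1) * (len(col_totals) - 1)

# Cramér's V: strength of association on a 0 to 1 scale.
cramers_v = math.sqrt(chi_square / (n * min(len(row_totals) - 1, len(col_totals) - 1)))

CRITICAL_VALUE = 5.991  # chi-square critical value for df = 2, alpha = 0.05

print(f"chi-square = {chi_square:.2f}, df = {df}, Cramer's V = {cramers_v:.2f}")
print("significant at 0.05:", chi_square > CRITICAL_VALUE)
```

For these made-up counts the statistic is 4.00 with 2 degrees of freedom, which falls below the critical value, so the apparent difference in color preference would not be judged significant at the 0.05 level.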

Key Terms to Review (30)

Association: Association is a statistical concept that describes the relationship between two or more variables, indicating the degree to which they are related or connected. It is a fundamental principle in the analysis of contingency tables, which examine the interdependence between categorical variables.
Categorical Variables: Categorical variables are variables that represent distinct categories or groups, rather than numerical values. They are used to classify data into different groups or types based on qualitative characteristics.
Cell Frequencies: Cell frequencies refer to the observed counts or values in the cells of a contingency table, which is a tabular representation of the relationship between two or more categorical variables. These cell frequencies provide information about the distribution and association between the variables being analyzed.
Chi-Square Distribution: The chi-square distribution is a probability distribution that arises when independent standard normal random variables are squared and summed. It is a continuous probability distribution that is widely used in statistical hypothesis testing, particularly in assessing the goodness of fit of observed data to a theoretical distribution, testing the independence of two attributes, and testing the homogeneity of multiple populations.
Chi-Square Test: The chi-square test is a statistical hypothesis test used to determine if there is a significant difference between observed and expected frequencies or proportions in one or more categories. It is a versatile test that can be applied in various contexts, including contingency tables, discrete distributions, and tests of independence or variance.
Column Variables: Column variables are the categorical variables displayed in the columns of a contingency table. They represent the different groups or categories being compared across the rows of the table.
Conditional Probabilities: Conditional probabilities refer to the likelihood of an event occurring given that another event has already occurred. They represent the probability of one event happening, taking into account the information provided by the occurrence of a related event.
Contingency Analysis: Contingency analysis is a statistical technique used to examine the relationship between two categorical variables. It involves the use of contingency tables to assess the degree of association or independence between the variables.
Contingency Tables: A contingency table, also known as a cross-tabulation or two-way table, is a statistical tool used to display and analyze the relationship between two or more categorical variables. It provides a way to investigate the association or dependence between these variables by organizing the data into a tabular format.
Cramér's V: Cramér's V is a measure of the strength of association between two categorical variables in a contingency table. It is a statistic that ranges from 0 to 1, with 0 indicating no association and 1 indicating a perfect association between the variables.
Crosstab Function: The crosstab function is a powerful data analysis tool that allows researchers to explore the relationship between two or more categorical variables. It creates a table that displays the frequency or count of observations that fall into each combination of the categories across the variables being analyzed.
Degrees of Freedom: Degrees of freedom (df) is a fundamental statistical concept that represents the number of independent values or observations that can vary in a given situation. It is an essential parameter that determines the appropriate statistical test or distribution to use in various data analysis techniques.
Expected Frequency: Expected frequency refers to the anticipated or predicted number of observations in each category or cell of a contingency table, assuming the null hypothesis is true. It is a crucial concept in chi-square tests, including the goodness-of-fit test, the test of independence, and the test of homogeneity.
Frequency Distribution: A frequency distribution is a tabular or graphical representation that organizes and summarizes data by grouping values into distinct classes or intervals and displaying the number of observations that fall into each class. It provides a concise way to understand the underlying patterns and characteristics of a dataset.
Grand Total: The grand total is a numerical value that represents the sum of all the individual values or subtotals in a data set or table. It provides an overall summary of the total quantity or magnitude across all the components being measured or analyzed.
Independence: Independence is a fundamental concept in statistics that describes the relationship between two or more variables or events. When variables or events are independent, the occurrence or value of one does not depend on or influence the occurrence or value of the other. This concept is crucial in understanding various statistical analyses and probability distributions.
Joint Probabilities: Joint probabilities refer to the likelihood of two or more events occurring together. It represents the probability of multiple events happening concurrently within a given scenario or experiment.
Karl Pearson: Karl Pearson was a pioneering British statistician who made significant contributions to the field of statistics, particularly in the areas of correlation, regression analysis, and the chi-square test. His work laid the foundation for many statistical techniques used in modern data analysis.
Marginal Probabilities: Marginal probabilities refer to the probabilities of individual events or variables, considered independently of their relationship to other events or variables. They represent the likelihood of an event occurring without taking into account the influence of other factors.
Marginal Totals: Marginal totals refer to the row and column totals in a contingency table. They provide important information about the overall distribution of the data and are essential for understanding the relationships between the variables in the table.
Nominal: Nominal refers to a variable or measurement that is classified into distinct categories or groups, without any inherent order or numerical value. It is a type of qualitative or categorical data that is used to represent discrete, non-numerical characteristics.
Null Hypothesis: The null hypothesis, denoted as $H_0$, is a statistical hypothesis that states there is no difference or relationship between the variables being studied. It represents the default or initial position that a researcher takes before conducting an analysis or experiment.
Observed Frequencies: Observed frequencies refer to the actual or empirical counts of occurrences in a dataset, often displayed in a contingency table or frequency distribution. This term is central to understanding the application of chi-square tests in statistics, which compare observed frequencies to expected frequencies to determine statistical significance.
Ordinal: Ordinal refers to a level of measurement where data is categorized and ranked in a specific order, but the differences between the categories are not necessarily equal. It represents an ordered sequence or hierarchy, but the precise numerical values are not meaningful.
Phi Coefficient: The phi coefficient, denoted as $\phi$, is a measure of the strength and direction of the association between two binary variables. It is a special case of the Pearson correlation coefficient, used when both variables are dichotomous or binary in nature.
Probabilities: Probabilities refer to the quantification of the likelihood or chance of an event occurring. They provide a numerical measure of how likely a particular outcome is within a given set of possible outcomes.
Row Variables: In the context of contingency tables, row variables refer to the categorical variables that are displayed as the rows of the table. These variables represent the different groups or categories being compared across the columns of the table.
Statistical Inference: Statistical inference is the process of using data analysis to infer properties about a population from a sample. It involves drawing conclusions and making predictions based on the information gathered from a subset of a larger group or dataset.
Two-Way Tables: A two-way table, also known as a contingency table or cross-tabulation, is a type of data display that organizes and summarizes categorical data by showing the relationship between two variables. It arranges the data into rows and columns, providing a clear visual representation of the frequencies or counts associated with the different combinations of the variables.
Yates' Correction: Yates' correction is a statistical adjustment used in the analysis of 2x2 contingency tables, particularly when the expected cell frequencies are small. It is designed to provide a more accurate p-value for the chi-square test of independence by compensating for the overestimation of the test statistic when the sample size is small.
© 2024 Fiveable Inc. All rights reserved.