Chi-Square Goodness-of-Fit Test
The chi-square goodness-of-fit test checks whether observed categorical data matches a specific expected distribution. You'll use it whenever you have counts across categories and want to know: does what we observed differ significantly from what we predicted?

Calculation of Chi-Square Test Statistics
The goodness-of-fit test compares observed frequencies to expected ones across categories, then combines those differences into a single test statistic.
- Null hypothesis (H₀): The data follows the hypothesized distribution
- Alternative hypothesis (Hₐ): The data does not follow the hypothesized distribution
To calculate the test statistic:
1. Record the observed frequency (Oᵢ) for each category from your data.
2. Calculate the expected frequency (Eᵢ) for each category using the hypothesized distribution:
   - Eᵢ = n × pᵢ, where n is the total sample size and pᵢ is the hypothesized proportion for category i
3. Compute the chi-square test statistic:
   - χ² = Σ (Oᵢ − Eᵢ)² / Eᵢ, summed over all k categories, where k is the number of categories
Each term in that sum measures how far one category's observed count is from its expected count, scaled by the expected count. Squaring the difference means both overestimates and underestimates contribute positively to the statistic. A larger χ² value means a worse fit between observed and expected.
Degrees of freedom: df = k − 1, where k is the number of categories. You lose one degree of freedom because the category counts must sum to n.
Condition check: Every expected frequency should be at least 5 for the chi-square approximation to be reliable. If any expected count falls below 5, consider combining categories.
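The steps above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the observed counts and proportions are made-up example data, not from any real study:

```python
# Chi-square goodness-of-fit statistic, computed by hand.
# The observed counts and hypothesized proportions are illustrative.
observed = [25, 30, 25, 20]             # O_i for each category
proportions = [0.25, 0.25, 0.25, 0.25]  # hypothesized p_i (must sum to 1)

n = sum(observed)                        # total sample size
expected = [n * p for p in proportions]  # E_i = n * p_i

# Condition check: every expected count should be at least 5.
assert all(e >= 5 for e in expected), "Combine categories: some E_i < 5"

# chi^2 = sum over categories of (O_i - E_i)^2 / E_i
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1                   # degrees of freedom: k - 1

print(f"chi-square = {chi_sq:.3f}, df = {df}")
```

With these numbers, n = 100 and every Eᵢ = 25, so only the two categories that miss 25 contribute, giving χ² = 2.0 on df = 3.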

Interpretation of P-Values for Distributions
The p-value is the probability of getting a statistic as large as (or larger than) the one you calculated, assuming H₀ is true. Because the goodness-of-fit test is always right-tailed, you're only looking at the upper end of the chi-square distribution.
- If p-value < significance level (typically 0.05): Reject H₀. There is sufficient evidence that the data does not follow the hypothesized distribution.
- Risk: You might be committing a Type I error (rejecting a true null hypothesis).
- If p-value ≥ significance level: Fail to reject H₀. There is not enough evidence to conclude the data differs from the hypothesized distribution.
- Risk: You might be committing a Type II error (failing to reject a false null hypothesis).
Always state your conclusion in context. Don't just say "reject H₀." Say something like: "At the 0.05 significance level, there is sufficient evidence that the distribution of candy colors differs from the company's claimed proportions."
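Instead of a chi-square table, the right-tailed p-value can also be estimated by simulation: draw many samples under H₀ and count how often the statistic is at least as large as the observed one. This is a stdlib-only sketch with invented counts, not the standard table-lookup procedure:

```python
import random

# Simulation-based (Monte Carlo) p-value for a goodness-of-fit test.
# The observed counts and proportions are made-up example data.
random.seed(0)

observed = [52, 40, 35, 38, 35]          # O_i, n = 200
proportions = [0.2, 0.2, 0.2, 0.2, 0.2]  # hypothesized p_i
n = sum(observed)
k = len(proportions)

def chi_sq_stat(counts):
    """chi^2 = sum of (O_i - E_i)^2 / E_i with E_i = n * p_i."""
    return sum((o - n * p) ** 2 / (n * p) for o, p in zip(counts, proportions))

stat = chi_sq_stat(observed)

# Right-tailed test: proportion of simulated statistics >= observed one.
trials = 5000
hits = 0
for _ in range(trials):
    draws = random.choices(range(k), weights=proportions, k=n)
    counts = [draws.count(i) for i in range(k)]
    if chi_sq_stat(counts) >= stat:
        hits += 1

p_value = hits / trials
print(f"statistic = {stat:.2f}, simulated p-value = {p_value:.3f}")
if p_value < 0.05:
    print("Reject H0: evidence the data does not follow the distribution")
else:
    print("Fail to reject H0: no evidence of a difference")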

Application of Chi-Square Tests
Here's the full process for conducting a goodness-of-fit test:
1. Identify the categorical variable and state the hypothesized distribution (the proportions you expect).
2. Collect a random sample and record observed frequencies for each category.
3. Calculate expected frequencies using Eᵢ = n × pᵢ.
4. Verify that all expected frequencies are at least 5.
5. Compute the χ² test statistic and determine df = k − 1.
6. Find the p-value from the chi-square distribution table or a calculator.
7. Compare the p-value to your significance level and state your conclusion in context.
Example: M&M Colors. Suppose Mars claims 20% of M&Ms are blue. You buy a bag of 200 and count 52 blue ones. The expected count is 200 × 0.20 = 40. That single category contributes (52 − 40)² / 40 = 3.6 to the statistic. You'd repeat this for every color, sum the contributions, and then find the p-value.
Example: Fair Die. Roll a die 120 times. Under a fair die, you expect 120 / 6 = 20 outcomes per face. You'd compare your observed counts for each face against 20 using the same formula, with df = 6 − 1 = 5.
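The fair-die example can be carried through end to end. The roll counts below are invented for illustration, and rather than a table lookup, the 0.05 critical value for df = 5 (about 11.07, a standard table entry) is hardcoded:

```python
# Goodness-of-fit test for a fair die; the observed counts are made up.
observed = [18, 22, 16, 25, 19, 20]   # counts for faces 1-6 over 120 rolls
n = sum(observed)                      # 120
expected = n / 6                       # 20 per face under a fair die

chi_sq = sum((o - expected) ** 2 / expected for o in observed)
df = len(observed) - 1                 # 6 - 1 = 5

# 0.05 critical value for the chi-square distribution with df = 5
CRITICAL_05 = 11.07

print(f"chi-square = {chi_sq:.2f} on df = {df}")
if chi_sq > CRITICAL_05:
    print("Reject H0: the die does not appear fair")
else:
    print("Fail to reject H0: no evidence the die is unfair")
```

Here the deviations contribute (4 + 4 + 16 + 25 + 1 + 0) / 20 = 2.5, well below 11.07, so these counts give no evidence against a fair die.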
Note: The goodness-of-fit test applies to one categorical variable. Analyzing the relationship between two categorical variables uses a chi-square test of independence, which is a different procedure.
Statistical Power and Effect Size
- Statistical power is the probability of correctly rejecting H₀ when it's actually false. Higher power means you're less likely to miss a real difference.
- Power increases with larger sample sizes, larger significance levels, and larger effect sizes.
- Effect size quantifies how much the observed distribution deviates from the expected one. A statistically significant result with a tiny effect size may not be practically meaningful.
In lab settings, you'll often work with fixed sample sizes, so the main takeaway is this: with a small sample, you might fail to detect a real difference simply because you lack the power to do so.
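The sample-size point can be made concrete with a small simulation: pick a true distribution that genuinely differs from H₀, repeatedly run the test, and see how often it rejects. The true proportions and sample sizes here are arbitrary choices for illustration:

```python
import random

# Monte Carlo sketch of statistical power: how often does the test
# reject H0 when the true distribution really differs from it?
random.seed(1)

hypothesized = [1 / 6] * 6                          # H0: fair die
true_props = [0.25, 0.15, 0.15, 0.15, 0.15, 0.15]   # actual (unfair) die
CRITICAL_05 = 11.07                                 # 0.05 critical value, df = 5

def rejects(n):
    """Roll the unfair die n times; return True if H0 (fair) is rejected."""
    draws = random.choices(range(6), weights=true_props, k=n)
    counts = [draws.count(i) for i in range(6)]
    stat = sum((c - n * p) ** 2 / (n * p) for c, p in zip(counts, hypothesized))
    return stat > CRITICAL_05

powers = {}
for n in (60, 240, 960):
    powers[n] = sum(rejects(n) for _ in range(2000)) / 2000
    print(f"n = {n:4d}: estimated power = {powers[n]:.2f}")
```

The same real effect goes from mostly undetected at n = 60 to almost always detected at n = 960, which is exactly the small-sample caveat above.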