Chi-Square Goodness-of-Fit Test
The chi-square goodness-of-fit test checks whether observed data matches a distribution you'd expect. For example, if you roll a die 60 times, you'd expect each face to come up about 10 times. This test tells you whether the differences between what you observed and what you expected are large enough to be statistically meaningful.
Calculating the Chi-Square Statistic
The core idea: compare what you observed in your data to what you expected under some hypothesis, then measure how far off things are.
Step 1: Find expected frequencies. Multiply each category's hypothesized proportion by the total sample size.
For a fair six-sided die rolled 60 times, each face has a hypothesized proportion of 1/6, so the expected frequency for each face is E = n × p = 60 × (1/6) = 10.
Step 2: Compute the chi-square statistic. For each category:
- Subtract the expected frequency from the observed frequency
- Square that difference
- Divide by the expected frequency
Then sum across all categories:

χ² = Σ (O − E)² / E

where O is the observed frequency and E is the expected frequency for each category.
The squaring matters because it prevents positive and negative differences from canceling each other out, and it gives extra weight to large deviations.
Step 3: Determine degrees of freedom.
df = k − 1

where k is the number of categories. For a six-sided die, df = 6 − 1 = 5. You lose one degree of freedom because the expected frequencies must add up to the total sample size, so the last category isn't free to vary.
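The three steps above can be sketched in a few lines of Python. The observed counts here are made-up example rolls (an assumption for illustration, not data from the text):

```python
# Hypothetical observed counts from 60 rolls of a six-sided die (assumed data).
observed = [8, 12, 9, 11, 10, 10]
n = sum(observed)                       # total sample size: 60
proportions = [1 / 6] * 6               # hypothesized proportions for a fair die

# Step 1: expected frequency for each category is n * p.
expected = [n * p for p in proportions]  # 10.0 for every face

# Step 2: chi-square statistic: sum of (O - E)^2 / E over all categories.
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Step 3: degrees of freedom: number of categories minus one.
df = len(observed) - 1

print(chi_square, df)  # 1.0 5
```

With these counts the deviations are small, so the statistic (1.0) is small too.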

Interpreting P-Values
The p-value is the probability of getting a chi-square statistic as large as (or larger than) the one you calculated, assuming the null hypothesis is true.
- Null hypothesis (H₀): The observed data follows the hypothesized distribution.
- Alternative hypothesis (Hₐ): The observed data does not follow the hypothesized distribution.
This is always a right-tailed test because larger values indicate bigger discrepancies between observed and expected.
How to interpret:
- If p-value < significance level (typically 0.05): reject H₀. There's sufficient evidence that the data does not fit the hypothesized distribution.
- If p-value ≥ significance level: fail to reject H₀. There's insufficient evidence to conclude the data doesn't fit.
"Fail to reject" is not the same as proving the distribution is correct. You're just saying the data doesn't give you enough reason to doubt it.

Applying the Goodness-of-Fit Test (Lab Workflow)
1. State the hypothesized distribution and collect observed data. For instance, a fair die should produce equal proportions (1/6 each), or a survey might hypothesize specific response proportions.
2. Calculate expected frequencies for each category using the formula above. Check that every expected frequency is at least 5; if not, the test may not be reliable.
3. Choose a significance level before running the test (0.05 is standard).
4. Enter your data into statistical software (such as R, SPSS, or a calculator) and run the chi-square goodness-of-fit test.
5. Read the output, which will include:
   - The test statistic
   - Degrees of freedom
   - P-value
6. State your conclusion in context. If p-value < 0.05, you'd write something like: "There is sufficient evidence to conclude that the die is not fair." If p-value ≥ 0.05: "There is insufficient evidence to conclude that the die is not fair."
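The workflow above can be sketched as a single helper function. An equivalent way to make the decision is to compare the statistic to a critical value, so this version uses 11.070, the chi-square critical value for 5 degrees of freedom at the 0.05 level (taken from a standard table); the observed counts are hypothetical:

```python
def goodness_of_fit_check(observed, proportions, critical_value):
    """Sketch of the lab workflow: expected counts, validity check,
    test statistic, and a conclusion based on a critical value."""
    n = sum(observed)
    expected = [n * p for p in proportions]

    # Rule of thumb: every expected frequency should be at least 5.
    if any(e < 5 for e in expected):
        return None, "Expected counts too small; test may be unreliable."

    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    if stat > critical_value:
        conclusion = ("There is sufficient evidence to conclude "
                      "that the die is not fair.")
    else:
        conclusion = ("There is insufficient evidence to conclude "
                      "that the die is not fair.")
    return stat, conclusion

# Hypothetical data: 60 rolls, fair-die hypothesis, alpha = 0.05.
# 11.070 is the chi-square critical value for df = 5 at the 0.05 level.
stat, conclusion = goodness_of_fit_check([5, 8, 9, 8, 10, 20], [1 / 6] * 6, 11.070)
print(stat, conclusion)
```

Since the statistic (13.4) exceeds the critical value, the function returns the "sufficient evidence" conclusion, matching the rejection rule in step 6.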
Statistical Inference and Error Types
The goodness-of-fit test is a form of statistical inference, where you use sample data to draw conclusions about a larger population or process.
Two types of errors can occur in any hypothesis test:
- Type I error: Rejecting a true null hypothesis. You conclude the data doesn't fit the distribution, but it actually does. The probability of this equals your significance level (α).
- Type II error: Failing to reject a false null hypothesis. You conclude the data fits the distribution, but it actually doesn't.
You can reduce Type I error risk by choosing a smaller , but that increases the risk of Type II error. There's always a tradeoff.
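The Type I error rate can be seen directly in simulation: when H₀ is actually true (the die really is fair), a test run at α = 0.05 still rejects about 5% of the time. A sketch, again using 11.070 as the 0.05 critical value for 5 degrees of freedom:

```python
import random

random.seed(1)  # reproducible example

CRITICAL = 11.070  # chi-square critical value, df = 5, alpha = 0.05

def one_experiment(n=60):
    """Roll a fair die n times and return the chi-square statistic."""
    counts = [0] * 6
    for _ in range(n):
        counts[random.randrange(6)] += 1
    e = n / 6
    return sum((c - e) ** 2 / e for c in counts)

# H0 is true in every trial, so every rejection is a Type I error.
trials = 10_000
rejections = sum(one_experiment() > CRITICAL for _ in range(trials))
rate = rejections / trials
print(rate)  # close to 0.05: the Type I error rate matches alpha
```

Lowering α shrinks this rejection rate, but (for data that genuinely deviate from the hypothesized distribution) it also makes the test miss real deviations more often, which is the Type II side of the tradeoff.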
One more practical note: a table listing observed and expected frequencies side by side is a helpful way to organize your data before computing the test statistic, especially when you have many categories.