Model Fitting, Interpretation, and Diagnostics

Model fitting, interpretation, and diagnostics are crucial steps in regression analysis. They help us assess how well our model explains the data and whether it meets key assumptions. These tools allow us to evaluate the model's overall performance and the significance of individual predictors.

By examining metrics like R-squared, residual plots, and influence measures, we can identify potential issues with our model. This process helps us refine our analysis, ensuring we draw valid conclusions about relationships between variables in our dataset.

Model Evaluation Metrics

Measuring Model Fit and Explanatory Power

  • R-squared measures the proportion of variance in the response variable explained by the predictor variables
    • Ranges from 0 to 1, with higher values indicating better model fit
    • Calculated as the ratio of the explained sum of squares to the total sum of squares: $R^2 = \frac{SSR}{SST}$
  • Adjusted R-squared accounts for the number of predictors in the model and penalizes adding unnecessary variables
    • Useful for comparing models with different numbers of predictors
    • Calculated as $1 - \frac{(1-R^2)(n-1)}{n-p-1}$, where $n$ is the sample size and $p$ is the number of predictors
  • Standard error of the estimate measures the average distance between the observed values and the predicted values
    • Smaller values indicate better model fit and more precise predictions
    • Calculated as $\sqrt{\frac{\sum(y_i - \hat{y}_i)^2}{n-p-1}}$, where $y_i$ are the observed values and $\hat{y}_i$ are the predicted values (all three metrics are computed in the code sketch after this list)
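
To make these metrics concrete, here is a minimal sketch using Python's statsmodels on made-up house-price data; the predictor names (sqft, age) and every number are hypothetical, chosen only so the example runs.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data: predict house price from square footage and age
rng = np.random.default_rng(0)
n = 100
sqft = rng.uniform(800, 3000, n)
age = rng.uniform(0, 50, n)
price = 50_000 + 120 * sqft - 500 * age + rng.normal(0, 20_000, n)

X = sm.add_constant(np.column_stack([sqft, age]))  # prepend intercept column
model = sm.OLS(price, X).fit()

print(f"R-squared:              {model.rsquared:.3f}")
print(f"Adjusted R-squared:     {model.rsquared_adj:.3f}")
# Standard error of the estimate = sqrt(SSE / (n - p - 1))
print(f"Std. error of estimate: {np.sqrt(model.mse_resid):,.0f}")
```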

Testing Overall Model Significance

  • F-test for model significance assesses whether the model as a whole is statistically significant
    • Null hypothesis: all regression coefficients are equal to zero (the model has no predictive power)
    • Alternative hypothesis: at least one regression coefficient is not equal to zero (the model has some predictive power)
    • Calculated as $F = \frac{MSR}{MSE}$, where $MSR$ is the mean square regression and $MSE$ is the mean square error (see the code sketch after this list)
    • A significant F-test (p-value < 0.05) indicates that the model is statistically significant and has predictive power
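
Continuing the hypothetical statsmodels fit above, the overall F-test is reported on the results object, and can be reproduced by hand from the ANOVA decomposition as a sanity check:

```python
# Overall model significance: F = MSR / MSE
print(f"F-statistic: {model.fvalue:.2f}")
print(f"p-value:     {model.f_pvalue:.4g}")

# Rebuild it from the ANOVA decomposition
msr = model.ess / model.df_model  # mean square regression (explained)
mse = model.ssr / model.df_resid  # mean square error (residual)
print(f"MSR / MSE:   {msr / mse:.2f}")  # matches model.fvalue
```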

Coefficient Significance

Assessing Individual Predictor Significance

  • T-test for coefficients assesses whether each individual predictor variable is statistically significant
    • Null hypothesis: the regression coefficient for the predictor is equal to zero (the predictor has no effect on the response)
    • Alternative hypothesis: the regression coefficient for the predictor is not equal to zero (the predictor has an effect on the response)
    • Calculated as $t = \frac{\hat{\beta}_i}{SE(\hat{\beta}_i)}$, where $\hat{\beta}_i$ is the estimated regression coefficient and $SE(\hat{\beta}_i)$ is its standard error (see the sketch after this list)
    • A significant t-test (p-value < 0.05) indicates that the predictor is statistically significant and contributes to the model's predictive power
    • Example: in a model predicting house prices, a significant t-test for the "square footage" predictor would suggest that square footage has a significant effect on house prices
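
The same fitted `model` carries the per-coefficient t-statistics and p-values; the names below are the hypothetical ones from the first sketch:

```python
# Coefficient-level t-tests: t = beta_hat / SE(beta_hat)
names = ["intercept", "sqft", "age"]
for name, b, se, t, p in zip(names, model.params, model.bse,
                             model.tvalues, model.pvalues):
    print(f"{name:>9}: coef={b:12.2f}  SE={se:10.2f}  t={t:7.2f}  p={p:.4g}")
```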

Residual Diagnostics

Assessing Model Assumptions Through Residual Plots

  • Residual plot displays the residuals (observed minus predicted values) against the predicted values
    • Used to check for linearity, homoscedasticity, and independence assumptions
    • Residuals should be randomly scattered around zero with no discernible pattern
    • Example: a funnel-shaped residual plot would indicate heteroscedasticity (non-constant variance of residuals)
  • Q-Q plot (Quantile-Quantile plot) compares the distribution of residuals to a normal distribution
    • Used to check the normality assumption of residuals
    • Points should fall close to a straight diagonal line if residuals are normally distributed
    • Deviations from the line indicate non-normality, such as heavy tails or skewness (both plots are sketched in code after this list)
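
Both plots take a few lines with matplotlib and scipy, again using the hypothetical fit from the first sketch:

```python
import matplotlib.pyplot as plt
from scipy import stats

fitted = model.fittedvalues
resid = model.resid

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Residuals vs. fitted: look for random scatter around zero,
# with no funnel shape (heteroscedasticity) or curvature (non-linearity)
ax1.scatter(fitted, resid, alpha=0.6)
ax1.axhline(0, color="red", linestyle="--")
ax1.set(xlabel="Fitted values", ylabel="Residuals", title="Residual plot")

# Q-Q plot: points should hug the diagonal if residuals are normal
stats.probplot(resid, dist="norm", plot=ax2)
ax2.set_title("Normal Q-Q plot")
plt.tight_layout()
plt.show()
```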

Identifying Influential Observations

  • Outliers are observations with unusually large residuals that may have a disproportionate effect on the model
    • Can be identified using residual plots or standardized residuals (residuals divided by their standard error)
    • Observations with standardized residuals greater than ±2 or ±3 are potential outliers; a screening sketch follows this list
  • Influential points are observations that have a large effect on the model coefficients when included or excluded
    • Can be identified using leverage and Cook's distance (discussed in the next section)
    • High leverage points are unusual combinations of predictor values that can greatly influence the model
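
One quick screen, assuming the statsmodels fit above, is to flag observations whose standardized (internally studentized) residuals exceed the ±2 cutoff:

```python
# Standardized residuals via statsmodels' influence diagnostics
influence = model.get_influence()
std_resid = influence.resid_studentized_internal

flagged = np.where(np.abs(std_resid) > 2)[0]
print(f"Potential outliers (|standardized residual| > 2): {flagged}")
```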

Influence and Collinearity

Measuring Observation Influence

  • Cook's distance measures the influence of each observation on the model coefficients
    • Combines information from residuals and leverage
    • Larger values indicate more influential observations
    • A common rule of thumb is that observations with Cook's distance > 1 are considered highly influential
  • Leverage measures the unusualness of an observation's predictor values compared to the rest of the data
    • High leverage points can greatly influence the model, even if they have small residuals
    • Leverage values range from 0 to 1, with values greater than $\frac{2p}{n}$ considered high leverage, where $p$ is the number of predictors and $n$ is the sample size (both measures are computed in the sketch after this list)
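
The `influence` object from the outlier sketch above also exposes Cook's distance and leverage, so both rules of thumb can be applied directly (here $p$ = 2 predictors and $n$ = 100, per the hypothetical data):

```python
# Cook's distance and leverage from the same influence diagnostics
cooks_d = influence.cooks_distance[0]  # first element holds the distances
leverage = influence.hat_matrix_diag   # diagonal of the hat matrix

p = int(model.df_model)                # number of predictors
print("Highly influential (Cook's D > 1):", np.where(cooks_d > 1)[0])
print(f"High leverage (> 2p/n = {2 * p / n:.3f}):",
      np.where(leverage > 2 * p / n)[0])
```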

Detecting Multicollinearity

  • Multicollinearity occurs when predictor variables are highly correlated with each other
    • Can lead to unstable coefficient estimates and difficulty interpreting individual predictor effects
    • Detected using correlation matrices, variance inflation factors (VIF), or condition indices
    • VIF measures how much the variance of a coefficient is inflated due to collinearity
      • $VIF_i = \frac{1}{1-R^2_i}$, where $R^2_i$ is the R-squared from regressing the $i$-th predictor on all other predictors (see the sketch after this list)
      • VIF > 5 or 10 indicates high collinearity for that predictor
    • Example: in a model predicting car prices, high collinearity between "engine size" and "horsepower" could make it difficult to determine their individual effects on price
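
statsmodels ships a VIF helper that performs this regress-each-predictor-on-the-others calculation; applied to the two hypothetical predictors from the first sketch (which were generated independently), both VIFs should land near 1:

```python
from statsmodels.stats.outliers_influence import variance_inflation_factor

# VIF per predictor; column 0 of X is the intercept, so start at column 1
for idx, name in [(1, "sqft"), (2, "age")]:
    vif = variance_inflation_factor(X, idx)
    print(f"VIF({name}) = {vif:.2f}")
```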