14.1 Regression Metrics: MSE, RMSE, MAE, and R-squared

3 min read · August 7, 2024

Regression metrics help us gauge how well our models predict outcomes. MSE, RMSE, and MAE measure prediction errors, while R-squared shows how much variation our model explains. These tools are crucial for evaluating and comparing regression models.

Understanding these metrics is key to assessing model performance in real-world scenarios. They help us identify which models are most accurate and reliable, guiding us in making better predictions and decisions based on our data.

Error Metrics

Measuring Prediction Errors

  • Mean Squared Error (MSE) calculates the average squared difference between the predicted and actual values
    • Formula: MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
    • Squaring the errors weights larger errors much more heavily than smaller ones
    • Sensitive to outliers due to the squaring of errors
  • Root Mean Squared Error (RMSE) takes the square root of the MSE to bring the units back to the original scale
    • Formula: RMSE = \sqrt{MSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}
    • Easier to interpret than MSE as it is in the same units as the target variable
    • Still sensitive to outliers, but less so than MSE
  • Mean Absolute Error (MAE) calculates the average absolute difference between the predicted and actual values
    • Formula: MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|
    • Less sensitive to outliers compared to MSE and RMSE
    • Provides a more intuitive understanding of the average error magnitude
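The three error metrics above can be computed directly with NumPy. The arrays below are illustrative values, not data from the text:

```python
import numpy as np

# Hypothetical actual and predicted values for illustration
y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.5, 5.0, 8.0, 11.0])

errors = y_true - y_pred

mse = np.mean(errors ** 2)       # average squared error; penalizes large errors
rmse = np.sqrt(mse)              # same units as the target variable
mae = np.mean(np.abs(errors))    # average error magnitude; robust to outliers

print(mse, rmse, mae)  # 0.375, ~0.612, 0.5
```

Note how the single largest error (1.0) dominates MSE after squaring, while MAE treats every unit of error equally, which is why MAE is the less outlier-sensitive choice.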

Percentage-based Error Metric

  • Mean Absolute Percentage Error (MAPE) expresses the average absolute error as a percentage of the actual values
    • Formula: MAPE = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|
    • Useful when the target variable has a wide range of values or when comparing models across different datasets
    • Can be misleading when actual values are close to zero, as it can lead to large percentage errors
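A minimal sketch of the MAPE formula, again with illustrative values (note that none of the actual values may be zero, or the division blows up):

```python
import numpy as np

# Hypothetical actuals on very different scales; MAPE normalizes them
y_true = np.array([100.0, 200.0, 50.0, 400.0])
y_pred = np.array([110.0, 190.0, 55.0, 380.0])

# Average absolute error as a percentage of each actual value
mape = 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

print(mape)  # 7.5 (percent)
```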

Analyzing Model Residuals

  • Residuals represent the differences between the actual and predicted values
    • Formula: residual_i = y_i - \hat{y}_i
    • Positive residuals indicate underestimation, while negative residuals indicate overestimation
    • Analyzing residuals helps assess model assumptions and identify patterns or biases in the predictions
    • Residual plots (residuals vs. predicted values) can reveal non-linear relationships or heteroscedasticity
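The sign convention above can be checked with a quick sketch (illustrative values):

```python
import numpy as np

y_true = np.array([4.0, 6.0, 9.0])
y_pred = np.array([3.0, 7.0, 9.0])

# residual = actual - predicted:
#   positive -> model underestimated, negative -> model overestimated
residuals = y_true - y_pred

print(residuals)  # [ 1. -1.  0.]
```

Plotting `residuals` against `y_pred` (e.g., with `matplotlib.pyplot.scatter`) is the standard way to spot the non-linear patterns or heteroscedasticity mentioned above; a well-behaved model shows a patternless band around zero.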

Coefficient of Determination

Measuring Model Fit

  • R-squared (Coefficient of Determination) measures the proportion of variance in the target variable explained by the model
    • Formula: R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
    • Typically ranges from 0 to 1, with higher values indicating a better fit (it can be negative when the model fits worse than simply predicting the mean)
    • Represents the improvement of the model compared to using the mean of the target variable as a prediction
    • Can be interpreted as the percentage of variance explained by the model (e.g., R-squared of 0.75 means 75% of the variance is explained)
  • Adjusted R-squared penalizes the addition of unnecessary predictors to the model
    • Formula: Adjusted\ R^2 = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}, where p is the number of predictors
    • Useful for comparing models with different numbers of predictors
    • Prevents overfitting by discouraging the inclusion of irrelevant variables
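Both formulas can be sketched from their sums of squares; the arrays and the choice of p = 1 predictor are illustrative assumptions:

```python
import numpy as np

# Hypothetical actuals and predictions from a one-predictor model
y_true = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y_pred = np.array([2.5, 3.5, 6.0, 8.5, 9.5])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - np.mean(y_true)) ** 2) # total sum of squares
r2 = 1 - ss_res / ss_tot

n, p = len(y_true), 1  # p = number of predictors (assumed 1 here)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(r2, adj_r2)  # 0.975, ~0.9667
```

Because adjusted R-squared subtracts a penalty that grows with p, adding a predictor that does not reduce `ss_res` enough will lower it even though plain R-squared can only stay the same or rise.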

Assessing Model Goodness of Fit

  • Goodness of fit refers to how well the model fits the observed data
    • A high R-squared or adjusted R-squared indicates a good fit, meaning the model captures a significant portion of the variability in the target variable
    • However, a high R-squared does not necessarily imply a good model, as it can be affected by outliers or overfitting
    • It is important to consider other diagnostic measures (residual plots, cross-validation) alongside R-squared to assess model performance and validity

Key Terms to Review (16)

Cross-validation: Cross-validation is a statistical technique used to assess the performance of a predictive model by dividing the dataset into subsets, training the model on some of these subsets while validating it on the remaining ones. This process helps to ensure that the model generalizes well to unseen data and reduces the risk of overfitting by providing a more reliable estimate of its predictive accuracy.
Dependent variable: A dependent variable is the outcome or response that researchers measure in an experiment or a statistical model, which is expected to change when the independent variable is altered. This variable relies on the values of the independent variable, allowing for the evaluation of relationships and effects within data analysis. Understanding how the dependent variable interacts with other variables is crucial for interpreting results and making predictions in various analytical methods.
Homoscedasticity: Homoscedasticity refers to the property of a dataset where the variance of the residuals or errors is constant across all levels of the independent variable. This concept is crucial because it impacts the validity of regression analyses and model diagnostics, ensuring that the predictions made by the model are reliable and unbiased. When homoscedasticity holds, it allows for better interpretation of regression coefficients and more accurate calculations of regression metrics.
Independent Variable: An independent variable is a variable that is manipulated or controlled in an experiment or model to test its effects on a dependent variable. It is essential for understanding relationships between variables in statistical methods, including regression analysis. The choice of independent variable is crucial, as it can greatly influence the predictions and interpretations derived from the model.
Lasso regression: Lasso regression is a linear regression technique that incorporates L1 regularization to prevent overfitting by adding a penalty equal to the absolute value of the magnitude of coefficients. This method effectively shrinks some coefficients to zero, which not only helps in reducing model complexity but also performs variable selection. By reducing the number of features used in the model, lasso regression enhances interpretability and can improve predictive performance.
Linear regression: Linear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables by fitting a linear equation to observed data. It serves as a foundational technique in statistical learning, helping in understanding relationships among variables and making predictions.
Linearity: Linearity refers to the relationship between two variables where a change in one variable results in a proportional change in another. In the context of regression, this means that the model assumes that the relationship between the independent and dependent variables can be represented as a straight line, which simplifies the analysis and interpretation of data. Understanding linearity is crucial for accurately predicting outcomes and evaluating model performance.
Mean Absolute Error (MAE): Mean Absolute Error (MAE) is a measure of prediction accuracy in a statistical model, calculated as the average of the absolute differences between predicted and actual values. It provides a straightforward interpretation of the average magnitude of errors in a set of predictions, without considering their direction, making it useful for understanding overall model performance.
Mean Squared Error (MSE): Mean Squared Error (MSE) is a common measure used to evaluate the accuracy of a predictive model, defined as the average of the squares of the errors—that is, the average squared difference between the predicted and actual values. This metric provides insights into how well a regression model performs, indicating the extent of deviation from the true values and allowing comparisons with other metrics like Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE). By assessing MSE, one can also gain insights into the overall goodness-of-fit for a model, connecting it to R-squared for a comprehensive evaluation.
Overfitting: Overfitting occurs when a statistical model or machine learning algorithm captures noise or random fluctuations in the training data instead of the underlying patterns, leading to poor generalization to new, unseen data. This results in a model that performs exceptionally well on training data but fails to predict accurately on validation or test sets.
R-squared: R-squared, also known as the coefficient of determination, is a statistical measure that represents the proportion of the variance for a dependent variable that can be explained by one or more independent variables in a regression model. It helps evaluate the effectiveness of a model and is crucial for understanding model diagnostics, bias-variance tradeoff, and regression metrics.
Residuals: Residuals are the differences between the observed values and the predicted values in a regression model. They provide crucial insight into how well a model is performing by indicating the errors in prediction for each data point. Analyzing residuals helps in assessing the model's accuracy, identifying patterns, and checking assumptions of linear regression.
Ridge regression: Ridge regression is a type of linear regression that incorporates L2 regularization to prevent overfitting by adding a penalty equal to the square of the magnitude of coefficients. This approach helps manage multicollinearity in multiple linear regression models and improves prediction accuracy, especially when dealing with high-dimensional data. Ridge regression is closely related to other regularization techniques and model evaluation criteria, making it a key concept in statistical modeling and machine learning.
Root mean squared error (RMSE): Root mean squared error (RMSE) is a widely used metric to evaluate the accuracy of a predictive model, calculated as the square root of the average of the squared differences between predicted and observed values. RMSE provides a clear indication of how well a model's predictions align with actual outcomes, making it a critical measure when assessing the performance of regression models. By taking the square root of the mean squared error (MSE), RMSE emphasizes larger errors more than smaller ones, which can be useful in understanding prediction quality.
Train-test split: Train-test split is a method used in machine learning to evaluate the performance of a model by dividing the dataset into two separate subsets: one for training the model and another for testing its performance. This technique is crucial because it helps to prevent overfitting, ensuring that the model generalizes well to unseen data. By using a portion of the data for training and another for validation, it provides insights into how well the model can make predictions in real-world scenarios.
Underfitting: Underfitting occurs when a statistical model is too simple to capture the underlying structure of the data, resulting in poor predictive performance. This typically happens when the model has high bias and fails to account for the complexity of the data, leading to systematic errors in both training and test datasets.
© 2024 Fiveable Inc. All rights reserved.
AP® and SAT® are trademarks registered by the College Board, which is not affiliated with, and does not endorse this website.