14.1 Regression Metrics: MSE, RMSE, MAE, and R-squared

3 min read · August 7, 2024

Regression metrics help us gauge how well our models predict outcomes. MSE, RMSE, and MAE measure prediction errors, while R-squared shows how much variation our model explains. These tools are crucial for evaluating and comparing regression models.

Understanding these metrics is key to assessing model performance in real-world scenarios. They help us identify which models are most accurate and reliable, guiding us in making better predictions and decisions based on our data.

Error Metrics

Measuring Prediction Errors

  • Mean Squared Error (MSE) calculates the average squared difference between the predicted and actual values
    • Formula: MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
    • Squaring the errors weights larger errors much more heavily than smaller ones
    • Sensitive to outliers due to the squaring of errors
  • Root Mean Squared Error (RMSE) takes the square root of the MSE to bring the units back to the original scale
    • Formula: RMSE = \sqrt{MSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}
    • Easier to interpret than MSE as it is in the same units as the target variable
    • Still sensitive to outliers, but less so than MSE
  • Mean Absolute Error (MAE) calculates the average absolute difference between the predicted and actual values
    • Formula: MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|
    • Less sensitive to outliers compared to MSE and RMSE
    • Provides a more intuitive understanding of the average error magnitude
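The three error metrics above can be computed directly with NumPy. The arrays below are illustrative values, not data from the text:

```python
import numpy as np

# Hypothetical actual and predicted values for illustration
y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.5, 5.0, 8.0, 11.0])

errors = y_true - y_pred

mse = np.mean(errors ** 2)       # average squared error; penalizes large errors
rmse = np.sqrt(mse)              # same units as the target variable
mae = np.mean(np.abs(errors))    # average error magnitude; robust to outliers

print(mse, rmse, mae)  # 0.375, ~0.612, 0.5
```

Note how the single largest error (1.0) dominates MSE after squaring, while MAE treats every unit of error equally, which is why MAE is the less outlier-sensitive choice.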

Percentage-based Error Metric

  • Mean Absolute Percentage Error (MAPE) expresses the average absolute error as a percentage of the actual values
    • Formula: MAPE = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|
    • Useful when the target variable has a wide range of values or when comparing models across different datasets
    • Can be misleading when actual values are close to zero, as it can lead to large percentage errors
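A minimal sketch of the MAPE formula, again with illustrative values (note that none of the actual values may be zero, or the division blows up):

```python
import numpy as np

# Hypothetical actuals on very different scales; MAPE normalizes them
y_true = np.array([100.0, 200.0, 50.0, 400.0])
y_pred = np.array([110.0, 190.0, 55.0, 380.0])

# Average absolute error as a percentage of each actual value
mape = 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

print(mape)  # 7.5 (percent)
```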

Analyzing Model Residuals

  • Residuals represent the differences between the actual and predicted values
    • Formula: residual_i = y_i - \hat{y}_i
    • Positive residuals indicate underestimation, while negative residuals indicate overestimation
    • Analyzing residuals helps assess model assumptions and identify patterns or biases in the predictions
    • Residual plots (residuals vs. predicted values) can reveal non-linear relationships or heteroscedasticity
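The sign convention above can be checked with a quick sketch (illustrative values):

```python
import numpy as np

y_true = np.array([4.0, 6.0, 9.0])
y_pred = np.array([3.0, 7.0, 9.0])

# residual = actual - predicted:
#   positive -> model underestimated, negative -> model overestimated
residuals = y_true - y_pred

print(residuals)  # [ 1. -1.  0.]
```

Plotting `residuals` against `y_pred` (e.g., with `matplotlib.pyplot.scatter`) is the standard way to spot the non-linear patterns or heteroscedasticity mentioned above; a well-behaved model shows a patternless band around zero.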

Coefficient of Determination

Measuring Model Fit

  • R-squared (Coefficient of Determination) measures the proportion of variance in the target variable explained by the model
    • Formula: R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
    • Typically ranges from 0 to 1, with higher values indicating a better fit (it can be negative when the model fits worse than simply predicting the mean)
    • Represents the improvement of the model compared to using the mean of the target variable as a prediction
    • Can be interpreted as the percentage of variance explained by the model (e.g., R-squared of 0.75 means 75% of the variance is explained)
  • Adjusted R-squared penalizes the addition of unnecessary predictors to the model
    • Formula: Adjusted\ R^2 = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}, where p is the number of predictors
    • Useful for comparing models with different numbers of predictors
    • Prevents overfitting by discouraging the inclusion of irrelevant variables
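Both formulas can be sketched from their sums of squares; the arrays and the choice of p = 1 predictor are illustrative assumptions:

```python
import numpy as np

# Hypothetical actuals and predictions from a one-predictor model
y_true = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y_pred = np.array([2.5, 3.5, 6.0, 8.5, 9.5])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - np.mean(y_true)) ** 2) # total sum of squares
r2 = 1 - ss_res / ss_tot

n, p = len(y_true), 1  # p = number of predictors (assumed 1 here)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(r2, adj_r2)  # 0.975, ~0.9667
```

Because adjusted R-squared subtracts a penalty that grows with p, adding a predictor that does not reduce `ss_res` enough will lower it even though plain R-squared can only stay the same or rise.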

Assessing Model Goodness of Fit

  • Goodness of fit refers to how well the model fits the observed data
    • A high R-squared or adjusted R-squared indicates a good fit, meaning the model captures a significant portion of the variability in the target variable
    • However, a high R-squared does not necessarily imply a good model, as it can be affected by outliers or overfitting
    • It is important to consider other diagnostic measures (residual plots, cross-validation) alongside R-squared to assess model performance and validity

Key Terms to Review (16)

Cross-validation: Cross-validation is a statistical technique used to assess the performance of a predictive model by dividing the dataset into subsets, training the model on some of these subsets while validating it on the remaining ones. This process helps to ensure that the model generalizes well to unseen data and reduces the risk of overfitting by providing a more reliable estimate of its predictive accuracy.
Dependent variable: A dependent variable is the outcome or response that researchers measure in an experiment or a statistical model, which is expected to change when the independent variable is altered. This variable relies on the values of the independent variable, allowing for the evaluation of relationships and effects within data analysis. Understanding how the dependent variable interacts with other variables is crucial for interpreting results and making predictions in various analytical methods.
Homoscedasticity: Homoscedasticity refers to the property of a dataset where the variance of the residuals or errors is constant across all levels of the independent variable. This concept is crucial because it impacts the validity of regression analyses and model diagnostics, ensuring that the predictions made by the model are reliable and unbiased. When homoscedasticity holds, it allows for better interpretation of regression coefficients and more accurate calculations of regression metrics.
Independent Variable: An independent variable is a variable that is manipulated or controlled in an experiment or model to test its effects on a dependent variable. It is essential for understanding relationships between variables in statistical methods, including regression analysis. The choice of independent variable is crucial, as it can greatly influence the predictions and interpretations derived from the model.
Lasso regression: Lasso regression is a linear regression technique that incorporates L1 regularization to prevent overfitting by adding a penalty equal to the absolute value of the magnitude of coefficients. This method effectively shrinks some coefficients to zero, which not only helps in reducing model complexity but also performs variable selection. By reducing the number of features used in the model, lasso regression enhances interpretability and can improve predictive performance.
Linear regression: Linear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables by fitting a linear equation to observed data. It serves as a foundational technique in statistical learning, helping in understanding relationships among variables and making predictions.
Linearity: Linearity refers to the relationship between two variables where a change in one variable results in a proportional change in another. In the context of regression, this means that the model assumes that the relationship between the independent and dependent variables can be represented as a straight line, which simplifies the analysis and interpretation of data. Understanding linearity is crucial for accurately predicting outcomes and evaluating model performance.
Mean Absolute Error (MAE): Mean Absolute Error (MAE) is a measure of prediction accuracy in a statistical model, calculated as the average of the absolute differences between predicted and actual values. It provides a straightforward interpretation of the average magnitude of errors in a set of predictions, without considering their direction, making it useful for understanding overall model performance.
Mean Squared Error (MSE): Mean Squared Error (MSE) is a common measure used to evaluate the accuracy of a predictive model, defined as the average of the squares of the errors—that is, the average squared difference between the predicted and actual values. This metric provides insights into how well a regression model performs, indicating the extent of deviation from the true values and allowing comparisons with other metrics like Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE). By assessing MSE, one can also gain insights into the overall goodness-of-fit for a model, connecting it to R-squared for a comprehensive evaluation.
Overfitting: Overfitting occurs when a statistical model or machine learning algorithm captures noise or random fluctuations in the training data instead of the underlying patterns, leading to poor generalization to new, unseen data. This results in a model that performs exceptionally well on training data but fails to predict accurately on validation or test sets.
R-squared: R-squared, also known as the coefficient of determination, is a statistical measure that represents the proportion of the variance for a dependent variable that can be explained by one or more independent variables in a regression model. It helps evaluate the effectiveness of a model and is crucial for understanding model diagnostics, bias-variance tradeoff, and regression metrics.
Residuals: Residuals are the differences between the observed values and the predicted values in a regression model. They provide crucial insight into how well a model is performing by indicating the errors in prediction for each data point. Analyzing residuals helps in assessing the model's accuracy, identifying patterns, and checking assumptions of linear regression.
Ridge regression: Ridge regression is a type of linear regression that incorporates L2 regularization to prevent overfitting by adding a penalty equal to the square of the magnitude of coefficients. This approach helps manage multicollinearity in multiple linear regression models and improves prediction accuracy, especially when dealing with high-dimensional data. Ridge regression is closely related to other regularization techniques and model evaluation criteria, making it a key concept in statistical modeling and machine learning.
Root mean squared error (RMSE): Root mean squared error (RMSE) is a widely used metric to evaluate the accuracy of a predictive model, calculated as the square root of the average of the squared differences between predicted and observed values. RMSE provides a clear indication of how well a model's predictions align with actual outcomes, making it a critical measure when assessing the performance of regression models. By taking the square root of the mean squared error (MSE), RMSE emphasizes larger errors more than smaller ones, which can be useful in understanding prediction quality.
Train-test split: Train-test split is a method used in machine learning to evaluate the performance of a model by dividing the dataset into two separate subsets: one for training the model and another for testing its performance. This technique is crucial because it helps to prevent overfitting, ensuring that the model generalizes well to unseen data. By using a portion of the data for training and another for validation, it provides insights into how well the model can make predictions in real-world scenarios.
Underfitting: Underfitting occurs when a statistical model is too simple to capture the underlying structure of the data, resulting in poor predictive performance. This typically happens when the model has high bias and fails to account for the complexity of the data, leading to systematic errors in both training and test datasets.
© 2024 Fiveable Inc. All rights reserved.
AP® and SAT® are trademarks registered by the College Board, which is not affiliated with, and does not endorse this website.