📊 Principles of Data Science Unit 7 – Supervised Learning: Regression

Regression is a powerful supervised learning technique used to predict continuous numerical values. It establishes relationships between independent variables and a dependent variable, fitting mathematical functions to training data to minimize prediction errors. Various regression models exist, from simple linear regression to more complex non-linear approaches. Key concepts include features, targets, coefficients, and regularization techniques. Understanding these elements helps data scientists choose the right model and avoid common pitfalls in real-world applications.

What's Regression All About?

  • Regression is a supervised learning technique used to predict continuous numerical values
  • Aims to establish a relationship between independent variables (features) and a dependent variable (target)
  • Fits a mathematical function to the training data to minimize the difference between predicted and actual values
  • Can be used for forecasting, trend analysis, and understanding the impact of variables on an outcome
  • Assumes a linear or non-linear relationship exists between the features and the target variable
    • Linear regression assumes a straight-line relationship
    • Non-linear regression captures more complex relationships (polynomial, exponential, etc.)
  • Requires a labeled dataset with input features and corresponding target values for training
  • Produces a trained model that can make predictions on new, unseen data points
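
A minimal sketch of this workflow on synthetic data, using NumPy (the numbers and variable names are purely illustrative):

  import numpy as np

  rng = np.random.default_rng(0)
  x = np.linspace(0, 10, 50)                           # single input feature
  y = 2.5 * x + 1.0 + rng.normal(0, 1, size=x.shape)   # continuous target with noise

  # Fit a straight line by least squares, then predict on unseen inputs
  slope, intercept = np.polyfit(x, y, deg=1)
  x_new = np.array([11.0, 12.0])
  y_pred = slope * x_new + intercept
  print(slope, intercept, y_pred)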

Types of Regression Models

  • Linear Regression: Assumes a linear relationship between features and the target variable
    • Simple Linear Regression: One independent variable and one dependent variable
    • Multiple Linear Regression: Multiple independent variables and one dependent variable
  • Polynomial Regression: Models non-linear relationships by adding polynomial terms to the linear equation
  • Ridge Regression: Linear regression with L2 regularization to handle multicollinearity and prevent overfitting
  • Lasso Regression: Linear regression with L1 regularization for feature selection and model simplification
  • Elastic Net Regression: Combines L1 and L2 regularization to balance between Lasso and Ridge regression
  • Stepwise Regression: Iteratively adds or removes features based on their statistical significance
  • Decision Tree Regression: Builds a tree-like model by splitting the data based on feature values
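
As a sketch, most of these model types map directly onto scikit-learn estimators; the hyperparameter values below are placeholders rather than recommendations, and Stepwise Regression is omitted because scikit-learn has no dedicated estimator for it:

  from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
  from sklearn.pipeline import make_pipeline
  from sklearn.preprocessing import PolynomialFeatures
  from sklearn.tree import DecisionTreeRegressor

  models = {
      "linear": LinearRegression(),
      "polynomial": make_pipeline(PolynomialFeatures(degree=2), LinearRegression()),
      "ridge": Ridge(alpha=1.0),                          # L2 regularization
      "lasso": Lasso(alpha=0.1),                          # L1 regularization
      "elastic_net": ElasticNet(alpha=0.1, l1_ratio=0.5), # mix of L1 and L2
      "decision_tree": DecisionTreeRegressor(max_depth=3),
  }
  # All of these expose the same fit()/predict() interface used later in this unit.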

Key Concepts and Terminology

  • Features: Independent variables used to predict the target variable (denoted as X)
  • Target: Dependent variable that we aim to predict (denoted as y)
  • Coefficients: Weights assigned to each feature in the regression equation (denoted as β)
  • Intercept: The value of the target variable when all features are zero (denoted as β₀)
  • Residuals: Differences between the predicted values and the actual values
  • Overfitting: When a model learns the noise in the training data and fails to generalize well to new data
  • Underfitting: When a model is too simple to capture the underlying patterns in the data
  • Regularization: Techniques used to prevent overfitting by adding a penalty term to the loss function
    • L1 Regularization (Lasso): Adds the absolute values of coefficients to the loss function
    • L2 Regularization (Ridge): Adds the squared values of coefficients to the loss function
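
To tie this terminology to code, here is a small sketch (the data and the λ value are made up for illustration):

  import numpy as np
  from sklearn.linear_model import LinearRegression

  X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])  # features
  y = np.array([3.0, 4.0, 8.0, 9.0])                              # target

  model = LinearRegression().fit(X, y)
  beta = model.coef_                      # coefficients β₁..βₙ
  beta0 = model.intercept_                # intercept β₀
  residuals = y - model.predict(X)        # residuals

  lam = 0.1                               # regularization strength (example value)
  l1_penalty = lam * np.sum(np.abs(beta)) # Lasso-style penalty term
  l2_penalty = lam * np.sum(beta ** 2)    # Ridge-style penalty term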

The Math Behind Regression

  • Linear Regression Equation: y = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ + ε
    • y: Predicted target variable
    • β₀: Intercept
    • β₁, β₂, ..., βₙ: Coefficients for each feature
    • x₁, x₂, ..., xₙ: Feature values
    • ε: Error term (residuals)
  • Ordinary Least Squares (OLS): Method used to estimate the coefficients by minimizing the sum of squared residuals
    • Objective: Minimize Σ (yᵢ − ŷᵢ)², summed over all n observations, where ŷᵢ is the predicted value for the i-th observation
  • Gradient Descent: Iterative optimization algorithm used to find the minimum of the cost function
    • Updates the coefficients in the direction of steepest descent to minimize the cost function
  • Cost Function: Measures the difference between predicted and actual values (e.g., Mean Squared Error)
  • Regularization Terms:
    • L1 (Lasso): λ Σ |βⱼ|, summed over all p coefficients
    • L2 (Ridge): λ Σ βⱼ², summed over all p coefficients
    • λ: Regularization parameter that controls the strength of regularization
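
A NumPy sketch of minimizing the MSE cost by gradient descent, with an optional L2 term matching the ridge penalty above (the synthetic data, learning rate, and λ are illustrative):

  import numpy as np

  def gradient_descent(X, y, lr=0.01, lam=0.0, n_iter=5000):
      """Return (intercept, coefficients) minimizing MSE plus lam * sum(beta**2)."""
      n, p = X.shape
      beta0, beta = 0.0, np.zeros(p)
      for _ in range(n_iter):
          error = beta0 + X @ beta - y                        # residuals at this step
          grad_beta0 = 2 * error.mean()                       # d(cost)/d(β₀)
          grad_beta = 2 * (X.T @ error) / n + 2 * lam * beta  # d(cost)/d(β)
          beta0 -= lr * grad_beta0
          beta -= lr * grad_beta
      return beta0, beta

  rng = np.random.default_rng(42)
  X = rng.normal(size=(200, 2))
  y = 1.0 + 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.1, 200)
  print(gradient_descent(X, y))   # approximately (1.0, [3.0, -2.0])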

Implementing Regression in Python

  • Popular libraries: scikit-learn, statsmodels, TensorFlow, PyTorch
  • Preprocessing steps:
    • Handling missing values (imputation or removal)
    • Encoding categorical variables (one-hot encoding, label encoding)
    • Scaling features (standardization, normalization)
  • Splitting the data into training and testing sets using train_test_split from scikit-learn
  • Creating and training the regression model:
    • Linear Regression: from sklearn.linear_model import LinearRegression
    • Ridge Regression: from sklearn.linear_model import Ridge
    • Lasso Regression: from sklearn.linear_model import Lasso
  • Fitting the model to the training data using the fit() method
  • Making predictions on the testing set using the predict() method
  • Evaluating the model's performance using metrics like Mean Squared Error (MSE) or R-squared
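
An end-to-end sketch of these steps; the synthetic data stands in for a real dataset, and the hyperparameters are illustrative:

  import numpy as np
  from sklearn.linear_model import Ridge
  from sklearn.metrics import mean_squared_error, r2_score
  from sklearn.model_selection import train_test_split
  from sklearn.preprocessing import StandardScaler

  rng = np.random.default_rng(0)
  X = rng.normal(size=(500, 3))
  y = 4.0 + X @ np.array([1.5, -2.0, 0.5]) + rng.normal(0, 0.5, 500)

  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

  scaler = StandardScaler().fit(X_train)            # fit the scaler on training data only
  X_train_s = scaler.transform(X_train)
  X_test_s = scaler.transform(X_test)

  model = Ridge(alpha=1.0).fit(X_train_s, y_train)  # train the model
  y_pred = model.predict(X_test_s)                  # predict on unseen data

  print("MSE:", mean_squared_error(y_test, y_pred))
  print("R²: ", r2_score(y_test, y_pred))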

Model Evaluation Techniques

  • Train-Test Split: Dividing the dataset into separate training and testing sets
    • Training set: Used to train the model and learn the parameters
    • Testing set: Used to evaluate the model's performance on unseen data
  • Cross-Validation: Technique to assess the model's performance and generalization ability
    • K-Fold Cross-Validation: Divides the data into K equal-sized folds, trains and evaluates the model K times
    • Leave-One-Out Cross-Validation (LOOCV): Uses each data point as a separate testing set
  • Evaluation Metrics:
    • Mean Squared Error (MSE): Average of the squared differences between predicted and actual values
    • Root Mean Squared Error (RMSE): Square root of MSE, provides an interpretable metric in the same units as the target variable
    • Mean Absolute Error (MAE): Average of the absolute differences between predicted and actual values
    • R-squared (R²): Proportion of the variance in the target variable explained by the model
  • Residual Analysis: Examining the differences between predicted and actual values to assess model assumptions and identify patterns or outliers
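
A sketch of k-fold cross-validation using the metrics above (the data and fold count are illustrative; note that scikit-learn reports errors as negative scores so that higher always means better):

  import numpy as np
  from sklearn.linear_model import LinearRegression
  from sklearn.model_selection import KFold, cross_val_score

  rng = np.random.default_rng(1)
  X = rng.normal(size=(300, 4))
  y = X @ np.array([2.0, 0.0, -1.0, 0.5]) + rng.normal(0, 1.0, 300)

  cv = KFold(n_splits=5, shuffle=True, random_state=1)
  neg_mse = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="neg_mean_squared_error")
  r2 = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")

  print("RMSE per fold:", np.sqrt(-neg_mse))
  print("Mean R²:", r2.mean())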

Real-World Applications

  • House Price Prediction: Estimating the price of a house based on features like area, number of rooms, location, etc.
  • Sales Forecasting: Predicting future sales based on historical data, seasonality, and other relevant factors
  • Customer Lifetime Value Prediction: Estimating the total revenue a customer will generate over their lifetime
  • Stock Price Prediction: Forecasting future stock prices based on historical data, market trends, and economic indicators
  • Weather Forecasting: Predicting temperature, precipitation, or other weather variables based on atmospheric conditions
  • Energy Consumption Prediction: Estimating energy usage based on factors like temperature, time of day, and building characteristics
  • Medical Risk Scoring: Predicting a continuous risk score for a disease based on patient symptoms, test results, and demographic information

Common Pitfalls and How to Avoid Them

  • Multicollinearity: High correlation among independent variables, leading to unstable coefficients
    • Solution: Remove one of the correlated variables or use regularization techniques (Ridge, Lasso)
  • Overfitting: Model fits the training data too closely, failing to generalize well to new data
    • Solution: Use regularization, cross-validation, or simplify the model
  • Underfitting: Model is too simple to capture the underlying patterns in the data
    • Solution: Increase model complexity, add more relevant features, or use non-linear models
  • Outliers: Data points that significantly deviate from the general trend, influencing the model's fit
    • Solution: Identify and handle outliers appropriately (remove, transform, or use robust regression methods)
  • Non-linearity: Linear models may not capture non-linear relationships between features and the target variable
    • Solution: Use polynomial regression, decision trees, or other non-linear models
  • Heteroscedasticity: Non-constant variance of the residuals across the range of predicted values
    • Solution: Use weighted least squares, transform the target variable, or consider non-linear models
  • Autocorrelation: Correlation between the residuals in a time series or spatial data
    • Solution: Use time series models (e.g., ARIMA) or incorporate spatial dependencies
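
A sketch of one pitfall and one mitigation: flag multicollinearity with a correlation matrix, then compare ordinary least squares and Ridge coefficients on the same collinear data (the data is synthetic and the "true" weights are made up for illustration):

  import numpy as np
  from sklearn.linear_model import LinearRegression, Ridge

  rng = np.random.default_rng(7)
  x1 = rng.normal(size=200)
  x2 = x1 + rng.normal(0, 0.01, 200)      # nearly identical to x1, i.e. collinear
  X = np.column_stack([x1, x2])
  y = 3.0 * x1 + rng.normal(0, 0.5, 200)

  print(np.corrcoef(X, rowvar=False))     # off-diagonal values near 1 flag collinearity

  ols = LinearRegression().fit(X, y)
  ridge = Ridge(alpha=1.0).fit(X, y)
  print("OLS coefficients:  ", ols.coef_)   # can be large and unstable
  print("Ridge coefficients:", ridge.coef_) # shrunk toward similar, stable values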

