📊 Intro to Business Analytics Unit 8 – Predictive Analytics: Classification & Regression
Predictive analytics uses historical data and statistical techniques to forecast future outcomes. It combines data mining, machine learning, and AI to analyze patterns and make predictions, enabling proactive decision-making and improved business performance.
Classification and regression are key components of predictive analytics. Classification models categorize data into predefined classes, while regression models estimate relationships between variables. Both techniques involve data preparation, feature engineering, and model evaluation to ensure accurate predictions.
Involves using historical data, statistical algorithms, and machine learning techniques to identify the likelihood of future outcomes
Aims to go beyond describing what has happened to providing the best possible assessment of what will happen in the future
Encompasses techniques from data mining, statistics, predictive modeling, machine learning, and artificial intelligence that analyze current and historical data to make predictions
Enables companies to become proactive, forward-looking, and anticipate outcomes and behaviors based upon the data
Allows for better decisions, more efficient operations, higher profits, and more satisfied customers
Predictive analytics models capture relationships among many factors to allow assessment of risk or potential associated with a particular set of conditions
Delivers measurable results by applying various techniques on data to gain insights and guide decision-making
Key Concepts in Classification
Classification is a supervised learning approach in which the model learns from labeled training data and uses that learning to classify new observations
Involves predicting a categorical target variable (class label) based on one or more predictor variables (features)
The goal is to accurately assign observations into predetermined categories or classes
Binary classification deals with two possible outcomes (spam or not spam, fraud or not fraud)
Multi-class classification involves more than two classes (categorizing images as dogs, cats, or birds)
Features are the input variables or attributes used to make predictions
Feature selection involves identifying the most relevant features for the classification task
Class imbalance occurs when the distribution of instances across the classes is not equal, which can impact model performance
Overfitting happens when a model learns the noise in the training data to the extent that it negatively impacts the performance on new data
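A minimal sketch of these ideas in Python, assuming scikit-learn is installed; the synthetic dataset, feature count, and `max_depth` setting are illustrative choices, not part of the original notes:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# 1,000 observations, 10 features, binary target (e.g., spam / not spam)
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# Hold out 25% of the observations to test the model on unseen data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# max_depth limits how closely the tree can fit the training data;
# an unlimited tree often memorizes noise (overfitting)
model = DecisionTreeClassifier(max_depth=3, random_state=42)
model.fit(X_train, y_train)

# A large gap between these two scores signals overfitting
print("Train accuracy:", model.score(X_train, y_train))
print("Test accuracy:", model.score(X_test, y_test))
```

Comparing the training score to the test score is the quickest check for the overfitting described above.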
Regression Techniques Explained
Regression analysis is a statistical method used to examine the relationship between a dependent variable and one or more independent variables
Linear regression models the relationship between the dependent variable and independent variables as a linear equation
Simple linear regression involves one independent variable
Multiple linear regression involves two or more independent variables
Logistic regression is used when the dependent variable is categorical (binary or multi-class)
Estimates the probability of an event occurring based on the independent variables
Polynomial regression models the relationship between the dependent and independent variables as an nth degree polynomial
Stepwise regression is a method of fitting regression models in which the choice of predictive variables is carried out by an automatic procedure
Forward selection starts with no variables and adds the most significant variable at each step
Backward elimination starts with all variables and removes the least significant variable at each step
Ridge regression (L2 penalty) handles multicollinearity by shrinking coefficients, while Lasso regression (L1 penalty) can shrink coefficients all the way to zero, effectively performing feature selection (see the sketch after this list)
Regression trees and random forests are non-parametric approaches that can handle non-linear relationships and interactions between variables
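A sketch of how several of these techniques look in practice, assuming NumPy and scikit-learn are installed; the synthetic data and penalty strengths (`alpha`) are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))  # three independent variables
# True relationship uses only the first two features, plus noise
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Multiple linear regression: y modeled as a linear combination of features
ols = LinearRegression().fit(X, y)
print("OLS coefficients:  ", ols.coef_)

# Ridge (L2) shrinks coefficients toward zero; Lasso (L1) can zero them
# out entirely, which is why Lasso doubles as a feature-selection method
print("Ridge coefficients:", Ridge(alpha=1.0).fit(X, y).coef_)
print("Lasso coefficients:", Lasso(alpha=0.1).fit(X, y).coef_)
```

With this data, Lasso typically drives the third (irrelevant) coefficient to exactly zero, while Ridge only shrinks it.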
Data Prep and Feature Engineering
Data preparation involves cleaning, transforming, and formatting raw data into a suitable format for analysis
Handling missing values by either removing instances with missing data or imputing missing values using techniques like mean, median, or mode imputation
Dealing with outliers, which are extreme values that can significantly impact the model's performance
Outliers can be identified using statistical methods or visualization techniques
They can be removed, transformed, or treated as a separate category
Feature scaling involves standardizing or normalizing the range of independent variables to avoid features with larger ranges dominating those with smaller ranges
Min-max scaling rescales the features to a fixed range, usually between 0 and 1
Z-score standardization transforms the features to have zero mean and unit variance
Encoding categorical variables by converting them into numerical representations
One-hot encoding creates binary dummy variables for each category
Label encoding assigns a unique numerical value to each category
Feature engineering is the process of creating new features or transforming existing features to improve model performance
Involves domain knowledge and creativity to derive meaningful features from raw data
Dimensionality reduction techniques like Principal Component Analysis (PCA) can be used to reduce the number of features while retaining most of the information
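A sketch of a typical preparation pipeline, assuming pandas and a recent scikit-learn (1.2+ for the `sparse_output` argument); the column names and values are hypothetical:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

df = pd.DataFrame({
    "income": [52_000, 61_000, None, 48_000],      # numeric, one missing value
    "region": ["north", "south", "south", "west"],  # categorical
})

# Numeric columns: median imputation, then min-max scaling to [0, 1]
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", MinMaxScaler()),
])

# Categorical columns: one-hot encoding into binary dummy variables
prep = ColumnTransformer([
    ("num", numeric, ["income"]),
    ("cat", OneHotEncoder(sparse_output=False), ["region"]),
])

print(prep.fit_transform(df))
```

Wrapping these steps in a pipeline ensures the same transformations learned on training data are applied consistently to new observations.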
Popular Algorithms and Models
Decision Trees are a non-parametric supervised learning method used for classification and regression
They learn simple decision rules inferred from the data features to predict the value of a target variable
Random Forests are an ensemble learning method that constructs multiple decision trees and outputs the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees
Support Vector Machines (SVM) are a discriminative classifier formally defined by a separating hyperplane
Given labeled training data, the algorithm outputs an optimal hyperplane which categorizes new examples
Naive Bayes classifiers are a family of simple "probabilistic classifiers" based on applying Bayes' theorem with strong (naive) independence assumptions between the features
K-Nearest Neighbors (KNN) is a non-parametric method used for classification and regression
An object is classified by a plurality vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors
Artificial Neural Networks (ANN) are computing systems vaguely inspired by the biological neural networks that constitute animal brains
They are based on a collection of connected units or nodes called artificial neurons, which loosely model the neurons in a biological brain
Gradient Boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees
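A sketch that fits several of these algorithms on one synthetic dataset for a side-by-side comparison (scikit-learn assumed installed; the hyperparameters are library defaults, not tuned recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "Random Forest": RandomForestClassifier(random_state=0),
    "SVM": SVC(),
    "Naive Bayes": GaussianNB(),
    "KNN (k=5)": KNeighborsClassifier(n_neighbors=5),
}

# Train each model on the same split and compare held-out accuracy
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: test accuracy = {model.score(X_test, y_test):.3f}")
```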
Model Evaluation Metrics
Accuracy measures the overall correctness of the model's predictions
Calculated as the ratio of correct predictions to the total number of predictions
Precision is the ratio of true positive predictions to the total number of positive predictions (true positives + false positives)
Measures the model's ability to avoid false positive predictions
Recall (sensitivity) is the ratio of true positive predictions to the total number of actual positive instances (true positives + false negatives)
Measures the model's ability to identify all positive instances
F1 score is the harmonic mean of precision and recall, providing a balanced measure of the model's performance
Confusion matrix is a table that summarizes the model's performance in terms of true positives, true negatives, false positives, and false negatives
ROC (Receiver Operating Characteristic) curve is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied
The area under the ROC curve (AUC-ROC) is a measure of the model's ability to discriminate between classes
Mean Squared Error (MSE) and Root Mean Squared Error (RMSE) are commonly used evaluation metrics for regression problems
MSE measures the average squared difference between the predicted and actual values; RMSE is its square root, expressed in the same units as the target variable
R-squared (R2) is a statistical measure that represents the proportion of the variance in the dependent variable that is predictable from the independent variable(s)
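A sketch computing these metrics with scikit-learn (assumed installed); the label vectors are toy values chosen only to exercise the functions:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             mean_squared_error, precision_score,
                             r2_score, recall_score)

# Classification: actual vs. predicted class labels
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("Recall   :", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("F1 score :", f1_score(y_true, y_pred))

# Regression: MSE, RMSE (its square root), and R-squared
actual = [3.0, 5.0, 2.5, 7.0]
predicted = [2.8, 5.4, 2.9, 6.5]
mse = mean_squared_error(actual, predicted)
print("MSE :", mse, " RMSE:", mse ** 0.5)
print("R^2 :", r2_score(actual, predicted))
```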
Real-World Applications
Fraud detection in financial transactions and insurance claims
Predictive models can identify patterns and anomalies indicative of fraudulent activities
Customer churn prediction in telecommunications and subscription-based services
Identifying customers likely to churn allows for proactive retention strategies
Credit risk assessment in banking and lending institutions
Predictive models help evaluate the creditworthiness of borrowers and estimate the likelihood of default
Predictive maintenance in manufacturing and industrial settings
Analyzing sensor data and historical maintenance records to predict equipment failures and optimize maintenance schedules
Disease diagnosis and prognosis in healthcare
Predictive models can assist in early detection, risk assessment, and treatment planning for various diseases
Sentiment analysis in social media and customer feedback
Classifying text data into positive, negative, or neutral sentiments to gauge public opinion and customer satisfaction
Demand forecasting in retail and supply chain management
Predicting future demand for products or services based on historical sales data, seasonality, and external factors
Recommendation systems in e-commerce and content platforms
Personalized product or content recommendations based on user preferences and behavior
Tools and Software for Predictive Analytics
R is a programming language and software environment for statistical computing and graphics
Provides extensive libraries and packages for data manipulation, modeling, and visualization
Python is a high-level, general-purpose programming language with a wide range of libraries for data analysis and machine learning
Popular libraries include NumPy, Pandas, Scikit-learn, and TensorFlow
SAS (Statistical Analysis System) is a software suite developed by SAS Institute for advanced analytics, multivariate analyses, business intelligence, data management, and predictive analytics
IBM SPSS (Statistical Package for the Social Sciences) is a software package used for interactive or batched statistical analysis
Offers a user-friendly interface for data manipulation, statistical modeling, and visualization
RapidMiner is a data science software platform that provides an integrated environment for data preparation, machine learning, deep learning, text mining, and predictive analytics
KNIME (Konstanz Information Miner) is an open-source data analytics, reporting, and integration platform
Allows users to visually create data flows, selectively execute analysis steps, and investigate the results
Microsoft Azure Machine Learning is a cloud-based service that enables data scientists and developers to build, train, and deploy machine learning models
Amazon SageMaker is a fully managed platform that provides developers and data scientists with the ability to build, train, and deploy machine learning models quickly