Foundations of Machine Learning
Machine learning gives actuaries the ability to find patterns in complex datasets and generate predictions that traditional methods might miss. In actuarial work, this translates directly to better pricing, reserving, and risk classification. The field breaks down into a few core categories you need to understand before diving into specific algorithms.
Supervised vs. Unsupervised Learning
Supervised learning uses labeled data where the target variable is already known. The algorithm learns the mapping from inputs to outputs. In actuarial contexts, this means training on historical data where outcomes are observed: mortality rates, claim amounts, lapse indicators.
Unsupervised learning works with unlabeled data, discovering hidden structure without a predefined target. Think customer segmentation (grouping policyholders by behavior) or anomaly detection for fraud.
Semi-supervised learning sits between the two, combining a small amount of labeled data with a larger pool of unlabeled data. This is useful when labeling is expensive or time-consuming, which is common with actuarial datasets where expert judgment is needed to classify outcomes.
Regression vs. Classification Problems
- Regression predicts continuous numerical values: insurance premiums, loss reserves, expected claim costs
- Classification predicts discrete categories: risk level (preferred/standard/substandard), whether a policyholder will lapse (yes/no), fraud vs. legitimate claim
- Some algorithms handle both with minor modifications. Decision trees, for example, predict a mean value in each leaf node for regression or a majority class for classification. Neural networks simply swap the output layer and loss function.
Predictive Modeling Process
Building a predictive model isn't just about picking an algorithm. It's a structured workflow, and skipping steps leads to unreliable results. Here's how the process flows from raw data to a deployed model.
Data Preprocessing and Cleaning
Raw actuarial data is rarely ready for modeling. Preprocessing transforms it into a format algorithms can work with:
- Handle missing values by imputation (mean, median, or model-based) or removal, depending on the missingness mechanism (MCAR, MAR, MNAR)
- Encode categorical variables using one-hot encoding, target encoding, or ordinal encoding depending on the variable's nature
- Scale numerical features via standardization (z-score) or normalization (min-max) so that features on different scales don't dominate distance-based algorithms
Data cleaning goes hand-in-hand with preprocessing. You're looking for duplicate records, inconsistent coding (e.g., "M" vs. "Male"), and outliers that may reflect data entry errors rather than genuine extreme values.
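As a rough sketch of those steps, here is how the imputation, scaling, and encoding above might look with scikit-learn; the tiny policy dataset is purely illustrative:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Illustrative policy records: [age, annual_premium], with one missing age
X_num = np.array([[35.0, 1200.0], [np.nan, 950.0], [52.0, 1800.0]])
X_cat = np.array([["M"], ["F"], ["M"]])  # a categorical rating variable

# 1) Impute the missing value with the column median
X_num = SimpleImputer(strategy="median").fit_transform(X_num)

# 2) Standardize numeric features to zero mean and unit variance
X_num = StandardScaler().fit_transform(X_num)

# 3) One-hot encode the categorical feature
X_cat = OneHotEncoder().fit_transform(X_cat).toarray()

# Model-ready design matrix: 2 scaled numeric + 2 one-hot columns
X = np.hstack([X_num, X_cat])
```

In production you would wrap these steps in a `Pipeline` so the same transformations fit on training data are applied to new data at scoring time.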
Feature Selection and Engineering
Feature engineering creates new variables that capture domain knowledge the raw data doesn't express directly. For example, computing a loss ratio (incurred losses divided by earned premium) or creating an exposure measure from policy effective and termination dates. These engineered features often matter more than algorithm choice.
Feature selection then narrows down the variable set to the most predictive inputs. This reduces overfitting and makes models easier to interpret. Common approaches include filter methods (correlation, mutual information), wrapper methods (recursive feature elimination), and embedded methods (Lasso regularization).
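A minimal sketch of an embedded method: fit a Lasso on synthetic data where only two of five features matter, and let the L1 penalty zero out the rest (the penalty strength shown is an arbitrary choice):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Synthetic target: only the first two features actually matter
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Embedded selection: the L1 penalty drives irrelevant coefficients to zero
model = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(model.coef_)  # indices of surviving features
```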
Model Training and Validation
Training fits the algorithm to your data so it learns the underlying patterns. The standard approach:
- Split the data into training (typically 60-70%), validation (15-20%), and test (15-20%) sets
- Fit the model on the training set
- Tune and compare using the validation set
- Evaluate final performance on the held-out test set, which the model has never seen
Cross-validation provides more robust performance estimates, especially with smaller datasets. In k-fold cross-validation, you split the data into k subsets, train on k-1 folds, and validate on the remaining fold, rotating through all k combinations. Stratified k-fold ensures each fold preserves the class distribution of the target variable.
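The split-then-cross-validate workflow above might look like this in scikit-learn, using synthetic data as a stand-in for policyholder records:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

# Synthetic binary data (a stand-in for, e.g., lapse indicators)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Hold out a test set the model never sees during development
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Stratified 5-fold CV on the development data: each fold preserves
# the class distribution of the target
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X_dev, y_dev, cv=cv)
```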
Hyperparameter Tuning
Hyperparameters are settings you choose before training begins. They control model complexity and learning behavior. Examples: the learning rate in gradient boosting, maximum tree depth, or the number of hidden neurons in a neural network.
Tuning strategies:
- Grid search: exhaustively evaluates every combination from a predefined set of values. Thorough but computationally expensive.
- Random search: samples random combinations from the hyperparameter space. Often finds good solutions faster than grid search, especially when only a few hyperparameters matter most.
- Bayesian optimization: builds a probabilistic model of the objective function and selects the next hyperparameter combination to evaluate based on expected improvement. More efficient for expensive-to-evaluate models.
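Grid and random search can be sketched with scikit-learn and SciPy; the parameter ranges below are illustrative, not recommendations:

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=300, random_state=0)

# Grid search: every combination from the predefined sets (2 x 2 = 4)
grid = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid={"learning_rate": [0.05, 0.1], "max_depth": [2, 3]},
    cv=3,
).fit(X, y)

# Random search: sample 4 combinations, including from a continuous range
rand = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions={"learning_rate": loguniform(0.01, 0.3), "max_depth": [2, 3, 4]},
    n_iter=4,
    cv=3,
    random_state=0,
).fit(X, y)
```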
Model Evaluation and Selection
The metric you optimize should align with the business problem:
- Regression: Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE)
- Classification: Accuracy, Precision, Recall, F1-score, Area Under the ROC Curve (AUC)
A model with the best score on a single metric isn't automatically the right choice. If the business objective is minimizing reserve inadequacy, you might prioritize a metric that penalizes underestimation more heavily. Always weigh statistical performance against interpretability and domain requirements.
Regression Algorithms
Linear Regression
Linear regression models the target as a linear combination of input features:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \varepsilon$$

Each coefficient $\beta_j$ represents the expected change in the target for a one-unit increase in $x_j$, holding other features constant. This direct interpretability makes linear regression a natural starting point for actuarial models.
The main limitation: it assumes a linear relationship. If the true relationship is curved or involves interactions, a plain linear model will underfit.
Polynomial Regression
Polynomial regression extends the linear framework by adding higher-order terms:

$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_d x^d + \varepsilon$$

This lets the model capture non-linear patterns. The degree $d$ controls complexity. A degree-2 polynomial fits a parabola; higher degrees fit increasingly flexible curves. Be cautious with high-degree polynomials because they tend to overfit, especially at the edges of the data range.
Regularization Techniques
Regularization adds a penalty to the loss function that discourages large coefficients, controlling model complexity:
- Lasso (L1): Adds $\lambda \sum_j |\beta_j|$ to the loss. Tends to drive some coefficients exactly to zero, performing automatic feature selection.
- Ridge (L2): Adds $\lambda \sum_j \beta_j^2$ to the loss. Shrinks all coefficients toward zero but rarely eliminates any entirely.
- Elastic Net: Combines both penalties, $\lambda_1 \sum_j |\beta_j| + \lambda_2 \sum_j \beta_j^2$, balancing feature selection with coefficient shrinkage.
The regularization parameter $\lambda$ controls the penalty strength. Higher $\lambda$ means simpler models with smaller coefficients.
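The shrinkage effect is easy to see numerically: fit a ridge regression at increasing penalty strengths and watch the coefficient norm fall (scikit-learn calls the penalty parameter `alpha`; the data and alpha values here are illustrative):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=100)

# Stronger penalty (larger alpha) -> smaller coefficients overall
norms = [np.linalg.norm(Ridge(alpha=a).fit(X, y).coef_) for a in (0.01, 1.0, 100.0)]
```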
Tree-Based Regression Models
Decision trees partition the feature space into rectangular regions by recursively splitting on the most informative feature at each node. Predictions within each region are the mean target value of the training observations that fall there.
Single trees are intuitive and easy to visualize, but they're prone to high variance (small changes in data can produce very different trees). Ensemble methods address this:
- Random forests train many trees on bootstrapped samples with random feature subsets, then average predictions. This reduces variance substantially.
- Gradient boosting builds trees sequentially, with each new tree fitting the residual errors of the combined model so far. This reduces both bias and variance.
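A quick way to see the averaging mechanics: a random forest's regression prediction is exactly the mean of its individual trees' predictions (the sine-shaped synthetic target is just for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=300)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Each tree gives a different prediction (bootstrap variance);
# the forest averages them, which smooths out that variance
x_new = np.array([[1.0]])
tree_preds = np.array([tree.predict(x_new)[0] for tree in forest.estimators_])
```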

Classification Algorithms
Logistic Regression
Despite its name, logistic regression is a classification algorithm. It models the probability that an observation belongs to a class using the logistic (sigmoid) function:

$$P(y = 1 \mid \mathbf{x}) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p)}}$$

The model assumes a linear relationship between features and the log-odds of the target. Coefficients are directly interpretable: $e^{\beta_j}$ gives the odds ratio for a one-unit increase in $x_j$. This interpretability makes logistic regression widely used in actuarial classification tasks like distinguishing high-risk from low-risk policyholders.
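A short sketch of that interpretation with scikit-learn: exponentiating the fitted coefficients gives odds ratios, and the predicted probability is the sigmoid of the linear score (the data is synthetic):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=4, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# exp(coefficient) is the odds ratio for a one-unit increase in that feature
odds_ratios = np.exp(model.coef_[0])

# The predicted probability is the sigmoid of the linear score
z = X @ model.coef_[0] + model.intercept_[0]
p_manual = 1.0 / (1.0 + np.exp(-z))
```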
Decision Trees and Random Forests
Decision trees for classification work similarly to regression trees, but each leaf predicts a class (typically the majority class of training observations in that region). Splits are chosen to maximize information gain or minimize Gini impurity.
Random forests aggregate many decorrelated trees. Each tree is trained on a bootstrap sample, and at each split, only a random subset of features is considered. The final prediction is the majority vote across all trees. This ensemble approach dramatically reduces overfitting compared to a single tree while maintaining the ability to capture non-linear relationships and feature interactions.
Support Vector Machines (SVM)
SVMs find the hyperplane that maximally separates classes in feature space. The "margin" is the distance between the hyperplane and the nearest data points from each class (the support vectors). Maximizing this margin tends to produce good generalization.
For non-linearly separable data, SVMs use kernel functions (polynomial, radial basis function) to implicitly map features into a higher-dimensional space where a linear separator exists. SVMs perform well in high-dimensional settings but can be sensitive to the choice of kernel and regularization parameter $C$.
Naive Bayes Classifier
Naive Bayes applies Bayes' theorem to compute the posterior probability of each class:

$$P(C_k \mid \mathbf{x}) \propto P(C_k) \prod_{j=1}^{p} P(x_j \mid C_k)$$
The "naive" part is the assumption that all features are conditionally independent given the class. This assumption rarely holds perfectly, but the classifier still performs surprisingly well in many practical settings, especially with high-dimensional data. It's fast to train and requires very little data to estimate parameters.
K-Nearest Neighbors (KNN)
KNN classifies a new observation by finding the $k$ closest training observations (using a distance metric like Euclidean or Manhattan distance) and assigning the majority class among those neighbors.
KNN is simple and makes no assumptions about the data distribution, but it has drawbacks: it's computationally expensive at prediction time for large datasets (every prediction requires scanning the training data), and performance degrades in high-dimensional spaces where distance metrics become less meaningful (the "curse of dimensionality"). The choice of $k$ matters: too small and the model is noisy; too large and it oversmooths.
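A sketch of choosing k by cross-validation, with scaling in a pipeline since KNN is distance-based (the candidate k values are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, n_features=6, random_state=0)

# Scale inside the pipeline so CV folds are scaled using training data only
scores = {
    k: cross_val_score(
        make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k)),
        X, y, cv=5,
    ).mean()
    for k in (1, 5, 25)
}
best_k = max(scores, key=scores.get)
```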
Ensemble Methods
Ensemble methods combine multiple models to produce predictions that are more accurate and stable than any single model alone. They work because individual models tend to make different errors, and aggregation cancels out some of that noise.
Bagging and Boosting
Bagging (bootstrap aggregating) trains each base model on a different random sample (drawn with replacement) from the training data. Predictions are combined by averaging (regression) or majority vote (classification). Random forests are the most well-known bagging method. Bagging primarily reduces variance.
Boosting trains models sequentially. Each new model focuses on the observations that previous models got wrong, by upweighting misclassified instances. Boosting primarily reduces bias, though it also reduces variance. The risk is overfitting if you run too many iterations without regularization.
Gradient Boosting Machines (GBM)
GBM is a specific boosting framework that fits each new tree to the negative gradient of the loss function (which, for squared error loss, equals the residuals). This generalizes boosting to arbitrary differentiable loss functions.
Key hyperparameters:
- Number of trees: more trees improve fit but risk overfitting
- Learning rate (shrinkage): smaller values require more trees but often generalize better
- Tree depth: shallow trees (depth 3-6) work well as weak learners
Popular implementations include XGBoost, LightGBM, and CatBoost, each adding optimizations for speed and handling of categorical features. GBMs are among the most widely used algorithms in actuarial predictive modeling due to their strong performance on tabular data.
Stacking and Blending
Stacking uses a meta-model (also called a "level-2 model") that takes the predictions of several base models as inputs and learns the optimal way to combine them. The base models are typically diverse (e.g., a GBM, a logistic regression, and a neural network), so the meta-model can exploit their complementary strengths.
Blending is simpler: it takes a weighted average of base model predictions, with weights determined by validation set performance. Both techniques can squeeze out additional predictive accuracy, but they add complexity to the pipeline.
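A minimal stacking sketch with scikit-learn's `StackingClassifier`; the base models and meta-model shown are just one plausible combination:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)

# Diverse base models; a logistic regression meta-model learns how to
# combine their out-of-fold predictions
stack = StackingClassifier(
    estimators=[
        ("gbm", GradientBoostingClassifier(random_state=0)),
        ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
    ],
    final_estimator=LogisticRegression(),
    cv=3,  # base-model predictions for the meta-model come from CV folds
).fit(X, y)
```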
Neural Networks and Deep Learning
Neural networks consist of layers of interconnected nodes (neurons). Each neuron applies a weighted sum of its inputs, adds a bias, and passes the result through a non-linear activation function (ReLU, sigmoid, tanh). Training adjusts the weights via backpropagation, which computes the gradient of the loss function with respect to each weight using the chain rule.
"Deep learning" refers to networks with multiple hidden layers, which can learn hierarchical representations of the data.
Feedforward Neural Networks
Feedforward networks are the most basic architecture. Information flows in one direction: input layer → hidden layer(s) → output layer. Each layer is fully connected to the next.
For regression, the output layer typically has a single neuron with a linear activation. For classification, it uses a softmax activation (multi-class) or sigmoid (binary). Feedforward networks are flexible function approximators, but they require more data and tuning than simpler algorithms to perform well.
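The forward pass of a small feedforward network can be written in a few lines of NumPy: weighted sums, biases, ReLU in the hidden layer, and a sigmoid at the binary-classification output. The weights here are random, i.e., untrained:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# One hidden layer: 3 inputs -> 4 hidden neurons -> 1 output probability
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)

def forward(X):
    h = relu(X @ W1 + b1)        # hidden layer: weighted sum + bias + ReLU
    return sigmoid(h @ W2 + b2)  # output layer: sigmoid for binary classes

p = forward(rng.normal(size=(5, 3)))  # probabilities for 5 observations
```

Training would then use backpropagation to compute gradients of a loss with respect to `W1`, `b1`, `W2`, `b2` and update them iteratively.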
Convolutional Neural Networks (CNN)
CNNs are designed for grid-structured data like images or regularly sampled time series. Convolutional layers apply learned filters (kernels) that slide across the input, detecting local patterns (edges, textures, shapes). Pooling layers downsample the feature maps, reducing dimensionality.
In actuarial work, CNNs have niche applications: analyzing images of property damage for claims processing, or extracting features from satellite imagery for catastrophe modeling.
Recurrent Neural Networks (RNN)
RNNs process sequential data by maintaining a hidden state that carries information from previous time steps. At each step, the network takes the current input and the previous hidden state to produce an output and an updated hidden state.
Standard RNNs struggle with long sequences due to the vanishing gradient problem (gradients shrink exponentially during backpropagation through time). Two variants address this:
- LSTM (Long Short-Term Memory): uses gate mechanisms (input, forget, output gates) to control what information is retained or discarded
- GRU (Gated Recurrent Unit): a simplified version of LSTM with fewer parameters, often performing comparably
Actuarial applications include modeling claim development patterns over time and forecasting mortality trends.
Autoencoders and Generative Models
Autoencoders are unsupervised networks trained to reconstruct their own input. The architecture has two parts: an encoder compresses the input into a lower-dimensional latent representation, and a decoder reconstructs the input from that representation. The bottleneck forces the network to learn the most essential features of the data. Actuarial uses include dimensionality reduction and anomaly detection (poorly reconstructed observations may be anomalous).
Variational Autoencoders (VAEs) learn a probabilistic latent space, enabling generation of new synthetic data points. Generative Adversarial Networks (GANs) pit two networks against each other (a generator and a discriminator) to produce realistic synthetic data. Both can be used to augment small actuarial datasets or simulate scenarios.
Model Interpretation and Explainability
Regulators and stakeholders increasingly require that actuarial models be explainable, not just accurate. Interpretation techniques fall into two categories: global methods that describe overall model behavior, and local methods that explain individual predictions.
Feature Importance and Selection
Feature importance quantifies how much each variable contributes to the model's predictions. Common approaches:
- Permutation importance: randomly shuffle one feature's values and measure how much model performance drops. Works for any model.
- Drop-column importance: retrain the model without each feature and compare performance. More accurate but computationally expensive.
- SHAP-based importance: average the absolute SHAP values for each feature across all observations.
These measures help actuaries identify which rating variables drive predictions and can inform feature selection to build simpler, more interpretable models.
Partial Dependence Plots (PDP)
PDPs show the marginal effect of one or two features on the model's predicted outcome, averaging over the values of all other features. For example, a PDP might show how predicted claim frequency changes as driver age varies from 16 to 80, holding all other policyholder characteristics at their observed values.
PDPs are model-agnostic and work for both regression and classification. Their main limitation is that they assume feature independence. If features are correlated, the plot may show predictions for unrealistic feature combinations (e.g., a 20-year-old with 30 years of driving experience).
Local Interpretable Model-Agnostic Explanations (LIME)
LIME explains a single prediction by fitting a simple, interpretable model in the neighborhood of that observation:
- Generate perturbed versions of the instance by slightly altering feature values
- Get the complex model's predictions for each perturbed instance
- Fit a simple model (e.g., linear regression) to these local predictions, weighted by proximity to the original instance
- The simple model's coefficients reveal which features most influenced the prediction
LIME is useful for explaining why a specific policyholder was classified as high-risk, for example. The trade-off is that explanations can be unstable if the neighborhood is defined too broadly or too narrowly.
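The four steps above can be sketched by hand; a toy non-linear function stands in for the complex model, and the kernel width is an arbitrary choice:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Stand-in "complex" model: non-linear in both features
def black_box(X):
    return np.sin(X[:, 0]) + X[:, 1] ** 2

x0 = np.array([0.5, 1.0])  # the instance to explain

# 1) Perturb the instance locally
Z = x0 + rng.normal(scale=0.1, size=(500, 2))
# 2) Query the complex model on the perturbations
preds = black_box(Z)
# 3) Weight perturbations by proximity to x0 (Gaussian kernel)
weights = np.exp(-np.sum((Z - x0) ** 2, axis=1) / 0.1 ** 2)
# 4) Fit a simple local surrogate; its coefficients are the explanation
local = LinearRegression().fit(Z, preds, sample_weight=weights)
```

Near `x0`, the surrogate's coefficients approximate the local slopes of the black box, which is exactly what a local explanation should recover.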
Shapley Additive Explanations (SHAP)
SHAP values come from cooperative game theory. The idea: treat each feature as a "player" and the prediction as the "payout." A feature's SHAP value is its average marginal contribution across all possible coalitions (subsets) of features.
SHAP values satisfy three desirable properties:
- Local accuracy: SHAP values for all features sum to the difference between the prediction and the average prediction
- Consistency: if a feature's contribution increases in a new model, its SHAP value won't decrease
- Missingness: features not in the model receive a SHAP value of zero
SHAP provides both local explanations (per-observation) and global explanations (aggregated across observations), making it one of the most versatile interpretability tools available.
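For a linear model with independent features, SHAP values have a closed form, $\phi_j = \beta_j (x_j - \mathbb{E}[x_j])$, which makes the local accuracy property easy to verify numerically (toy coefficients, synthetic data):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
beta, b = np.array([2.0, -1.0, 0.5]), 4.0

def f(X):
    return X @ beta + b  # a linear model, where SHAP values are exact

# Closed-form SHAP values for one instance
x = X[0]
phi = beta * (x - X.mean(axis=0))

# Local accuracy: SHAP values sum to (prediction - average prediction)
lhs = phi.sum()
rhs = f(x[None, :])[0] - f(X).mean()
```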
Deployment and Monitoring
A model that sits in a notebook isn't delivering value. Deployment puts the model into a production system where it can score new data and inform decisions in real time or batch processes.
Model Serialization and Deployment
Serialization saves the trained model's structure and parameters to a file so it can be loaded later without retraining. Common formats include pickle (Python), ONNX (cross-platform), and PMML (XML-based, widely supported by actuarial platforms).
Deployment options range from embedding the model in an API endpoint to packaging it in a Docker container for scalable cloud deployment. Containerization (Docker, Kubernetes) is especially useful because it bundles the model with its exact software dependencies, ensuring consistent behavior across environments.
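A minimal serialization round trip with pickle (`pickle.dumps`/`loads` for brevity; in practice you would write to a file, or use joblib for large models):

```python
import pickle

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = X @ np.array([1.5, -2.0]) + rng.normal(scale=0.1, size=50)

model = LinearRegression().fit(X, y)

# Serialize the fitted model (structure + parameters) to bytes
blob = pickle.dumps(model)

# Later, in the scoring service: load and predict without retraining
restored = pickle.loads(blob)
```

Note that pickle files should only be loaded from trusted sources, and the scoring environment should pin the same library versions used at training time.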
Monitoring Model Performance
Once deployed, you need to track whether the model continues to perform as expected. Set up dashboards that monitor key metrics (e.g., RMSE, AUC, F1-score) on incoming data and compare them against baseline values established during validation.
Automated alerts should trigger when performance drops below acceptable thresholds. Degradation might indicate data quality issues, changes in the business environment, or concept drift.
Concept Drift and Model Retraining
Concept drift occurs when the statistical properties of the target variable, or its relationship with the features, change over time. For example, a claims frequency model trained on pre-pandemic data may perform poorly on post-pandemic data because driving patterns shifted.
Detecting drift:
- Statistical tests: Kolmogorov-Smirnov test (for continuous distributions), Chi-squared test (for categorical distributions), Population Stability Index (PSI)
- Performance monitoring: a sustained decline in predictive accuracy is often the most practical signal
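The Population Stability Index can be computed in a few lines of NumPy. This sketch uses decile bins from the baseline distribution and the common rule of thumb that PSI below 0.1 indicates a stable population; the bin count and clipping floor are arbitrary choices:

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline sample and new data."""
    # Bin edges from the baseline distribution's quantiles
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    # Widen the outer edges so all of the new data falls inside the bins
    edges[0] = min(edges[0], actual.min()) - 1e-9
    edges[-1] = max(edges[-1], actual.max()) + 1e-9
    e_pct = np.histogram(expected, edges)[0] / len(expected)
    a_pct = np.histogram(actual, edges)[0] / len(actual)
    # Small floor avoids log(0) for empty bins
    e_pct, a_pct = np.clip(e_pct, 1e-6, None), np.clip(a_pct, 1e-6, None)
    return np.sum((a_pct - e_pct) * np.log(a_pct / e_pct))

rng = np.random.default_rng(0)
baseline = rng.normal(0, 1, 5000)
stable = rng.normal(0, 1, 5000)     # same distribution: PSI near zero
shifted = rng.normal(0.5, 1, 5000)  # mean shift: PSI flags drift

psi_stable, psi_shifted = psi(baseline, stable), psi(baseline, shifted)
```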
Responding to drift:
- Full retraining on updated data is the most straightforward approach
- Incremental learning (online learning) updates model parameters as new data arrives without retraining from scratch
- Transfer learning adapts a model trained on one domain to a related but shifted domain
Establish a retraining schedule (quarterly, annually) and supplement it with drift-triggered retraining when monitoring detects significant shifts.
Ethical Considerations in Machine Learning
Actuarial models directly affect people's access to insurance and the prices they pay. This creates a responsibility to ensure models are fair, transparent, and accountable.
Bias and Fairness in Models
Bias can enter at multiple stages:
- Training data bias: historical data may reflect past discrimination (e.g., redlining in property insurance)
- Feature bias: proxy variables can encode protected characteristics even when those characteristics are excluded (zip code as a proxy for race)
- Evaluation bias: optimizing for overall accuracy can mask poor performance for minority subgroups
Mitigation strategies include auditing training data for representativeness, applying fairness constraints during optimization (e.g., demographic parity, equalized odds), and testing model outputs across protected groups before deployment.
Privacy and Data Protection
Actuarial models often use sensitive data: health records, financial history, location data. Key regulations include GDPR (EU), HIPAA (U.S. health data), and various state-level insurance data privacy laws.
Privacy-preserving techniques:
- Differential privacy: adds calibrated noise to data or model outputs so individual records can't be reverse-engineered
- Federated learning: trains models across decentralized data sources without centralizing the raw data
- Homomorphic encryption: allows computation on encrypted data without decrypting it
Transparency and Accountability
Transparency means stakeholders can understand how the model reaches its decisions. Accountability means someone is responsible when the model causes harm.
Practical tools for promoting both:
- Model cards: standardized documentation describing the model's purpose, training data, performance metrics, and known limitations
- Datasheets for datasets: documentation of how training data was collected, its composition, and potential biases
- Algorithmic impact assessments: formal evaluations of a model's potential effects on different populations before deployment
Actuaries should engage with regulators early in the modeling process, not just at the point of filing. Building trust requires ongoing communication about what models do, how they're monitored, and what safeguards are in place.