Interpolation and curve fitting are essential techniques in data science for estimating values and understanding relationships between variables. These methods allow us to create continuous functions from discrete data points, enabling predictions and insights from limited information.
Interpolation estimates values between known data points, while curve fitting finds a function that best approximates a set of data. Both techniques are crucial for filling gaps in data, smoothing out noise, and revealing underlying patterns in various fields like science, engineering, and finance.
Interpolation estimates values between known data points by constructing new data points within the range of a discrete set of known data points
Curve fitting finds a curve that best fits a series of data points, potentially including points outside the range of the original data
Interpolation and curve fitting are techniques used to create continuous functions from discrete data points
Interpolation is used when the function is known at certain points and the goal is to estimate values between those points
Curve fitting is used when the underlying function is unknown and the goal is to find a function that best approximates the data
Interpolation assumes the data points are exact and constructs a function that passes through them, while curve fitting assumes the data may be noisy and tries to discover the underlying pattern or function
Interpolation is an exact fit to the data points, while curve fitting is an approximation that minimizes the difference between the fitted curve and the data points
Why Do We Need It?
Interpolation and curve fitting allow us to estimate values that were not directly measured or observed
These techniques enable us to fill in missing data points, which is useful when data collection is expensive, time-consuming, or impossible
Interpolation is necessary when we need to know the value of a function at a specific point that falls between measured data points
For example, if we have temperature readings at 9 AM and 11 AM, interpolation can estimate the temperature at 10 AM
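The temperature example above can be sketched with NumPy's np.interp; the readings are made-up values for illustration:

```python
import numpy as np

# Hypothetical temperature readings at 9 AM and 11 AM
hours = np.array([9.0, 11.0])
temps = np.array([15.0, 19.0])  # degrees Celsius

# Linear interpolation estimates the 10 AM temperature
temp_10am = np.interp(10.0, hours, temps)
print(temp_10am)  # 17.0, the midpoint of 15 and 19
```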
Curve fitting helps us understand the underlying relationship between variables and make predictions based on that relationship
Fitting a curve to data allows us to smooth out noise and irregularities in the data, making it easier to identify trends and patterns
Interpolation and curve fitting are essential for creating continuous models from discrete data, which is necessary for many applications in science, engineering, and data analysis
These techniques enable us to make informed decisions and predictions based on limited data
Key Concepts & Terminology
Interpolation nodes are the known data points used to construct the interpolation function
Interpolation interval is the range between two consecutive interpolation nodes
Extrapolation estimates values outside the range of known data points, while interpolation estimates values within the range
Overfitting occurs when a curve fits the noise in the data rather than the underlying relationship, leading to poor generalization
Underfitting occurs when a curve is too simple to capture the underlying relationship in the data
Goodness of fit measures how well a curve fits the data, often using metrics like mean squared error (MSE) or coefficient of determination (R²)
Smoothing reduces noise and irregularities in the data to reveal the underlying trend or relationship
Spline interpolation uses piecewise low-degree polynomials to create a smooth curve that passes through all the data points
Common Interpolation Methods
Linear interpolation connects adjacent data points with straight lines, resulting in a piecewise linear function
It is the simplest interpolation method, but the resulting curve is not smooth: the slope changes abruptly at each data point
Polynomial interpolation fits a polynomial function of degree n to n+1 data points
Higher-degree polynomials can fit the data more closely but may lead to overfitting and oscillations
Lagrange interpolation is a polynomial interpolation method that constructs a polynomial of degree at most n-1 passing through n data points
It is easy to implement but can be computationally expensive for large datasets
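As a quick sketch, SciPy's scipy.interpolate.lagrange builds this polynomial directly; the sample points here are chosen to lie on y = x², so the degree-2 interpolant reproduces it exactly:

```python
import numpy as np
from scipy.interpolate import lagrange

# Three points sampled from y = x**2
x = np.array([0.0, 1.0, 2.0])
y = x ** 2

# Construct the Lagrange interpolating polynomial (a numpy.poly1d)
poly = lagrange(x, y)
print(poly(1.5))  # approximately 2.25
```

SciPy's documentation warns that this explicit construction is numerically unstable beyond roughly 20 points, consistent with the caveat above.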
Newton's divided difference interpolation is another polynomial interpolation method that uses divided differences to construct the interpolating polynomial
Cubic spline interpolation uses piecewise cubic polynomials to create a smooth curve that passes through all the data points
It ensures continuity and smoothness at the interpolation nodes
Hermite interpolation uses both the function values and derivatives at the interpolation nodes to construct a polynomial that matches both
Trigonometric interpolation uses trigonometric functions (sine and cosine) to interpolate periodic data
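A minimal cubic spline sketch with SciPy, using a sine wave as the assumed underlying function:

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Sample a sine wave at 12 points, then build a cubic spline through them
x = np.linspace(0.0, 2.0 * np.pi, 12)
y = np.sin(x)

cs = CubicSpline(x, y)

# The spline passes through every data point exactly
print(np.allclose(cs(x), y))  # True

# Between the nodes it closely tracks the true function
print(abs(cs(1.0) - np.sin(1.0)))  # small interpolation error
```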
Curve Fitting Techniques
Least squares regression minimizes the sum of the squared differences between the observed data and the fitted curve
It is commonly used for linear regression, polynomial regression, and multiple linear regression
Linear regression fits a linear function (y=mx+b) to the data, where m is the slope and b is the y-intercept
Polynomial regression fits a polynomial function of degree n to the data, where n is chosen to balance fitting the data and avoiding overfitting
Multiple linear regression fits a linear function with multiple independent variables to the data
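A minimal linear regression sketch using NumPy's polyfit; the data here are made-up, noise-free points on y = 2x + 1, so the least squares fit recovers the slope and intercept:

```python
import numpy as np

# Noise-free data lying on y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0

# Degree-1 least squares fit returns the slope m and intercept b
m, b = np.polyfit(x, y, 1)
print(m, b)  # approximately 2.0 and 1.0
```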
Nonlinear regression fits a nonlinear function to the data, such as exponential, logarithmic, or sinusoidal functions
Robust regression methods, such as RANSAC (Random Sample Consensus), are less sensitive to outliers in the data
Regularization techniques, such as Ridge regression and Lasso regression, add a penalty term to the least squares objective function to prevent overfitting
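As one concrete sketch of nonlinear least squares, scipy.optimize.curve_fit can recover the parameters of an assumed exponential-decay model from noisy synthetic data (the model and parameter values here are invented for illustration):

```python
import numpy as np
from scipy.optimize import curve_fit

# Assumed model: exponential decay y = a * exp(-b * x)
def model(x, a, b):
    return a * np.exp(-b * x)

# Synthetic noisy data generated with a = 2.5, b = 1.3
rng = np.random.default_rng(0)
x = np.linspace(0.0, 4.0, 50)
y = model(x, 2.5, 1.3) + 0.05 * rng.normal(size=x.size)

# Least squares fit; the estimates should land near the true values
params, cov = curve_fit(model, x, y)
print(params)  # roughly [2.5, 1.3]
```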
Error Analysis & Accuracy
Interpolation error is the difference between the interpolated value and the true value of the function at that point
Approximation error is the difference between the fitted curve and the true underlying function
Mean squared error (MSE) measures the average squared difference between the observed data and the fitted curve
A smaller MSE indicates a better fit
Root mean squared error (RMSE) is the square root of the MSE and has the same units as the dependent variable
Mean absolute error (MAE) measures the average absolute difference between the observed data and the fitted curve
Coefficient of determination (R²) measures the proportion of variance in the dependent variable that is predictable from the independent variable(s)
An R² value closer to 1 indicates a better fit
Cross-validation techniques, such as k-fold cross-validation, assess the accuracy of the fitted model on unseen data
Residual analysis examines the differences between the observed data and the fitted curve to check for patterns or trends that indicate a poor fit
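The metrics above are straightforward to compute by hand; the observed and fitted values below are made-up numbers:

```python
import numpy as np

# Hypothetical observed data and values from a fitted curve
y_obs = np.array([3.0, 5.0, 7.0, 9.0])
y_fit = np.array([2.8, 5.1, 7.3, 8.9])

residuals = y_obs - y_fit

mse = np.mean(residuals ** 2)      # mean squared error
rmse = np.sqrt(mse)                # same units as the dependent variable
mae = np.mean(np.abs(residuals))   # mean absolute error

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_obs - y_obs.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(mse, rmse, mae, r2)  # approximately 0.0375, 0.194, 0.175, 0.9925
```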
Real-World Applications
Interpolation is used in digital signal processing to increase the sampling rate of a signal (upsampling) or to estimate missing samples
In image processing, interpolation methods like bilinear and bicubic interpolation are used to resize images and estimate pixel values
Curve fitting is used in finance to model the relationship between risk and return, and to price financial derivatives
In physics, curve fitting is used to determine the laws governing physical phenomena from experimental data
Interpolation and curve fitting are used in weather forecasting to estimate values between weather stations and to create continuous weather maps
In machine learning, curve fitting techniques are used to train models that can make predictions based on input data
Interpolation is used in computer graphics to create smooth animations and to estimate values between keyframes
In engineering, interpolation and curve fitting are used to create empirical models of physical systems based on measured data
Tools & Software for Implementation
MATLAB provides built-in functions for interpolation (interp1, interp2, interp3) and curve fitting (polyfit, lsqcurvefit)
Python libraries like NumPy, SciPy, and Pandas offer functions for interpolation (np.interp, scipy.interpolate) and curve fitting (np.polyfit, scipy.optimize.curve_fit)
R provides functions for interpolation (approx, spline) and curve fitting (lm, nls) in its built-in stats package, with additional capabilities in packages like zoo
Excel provides built-in functions for linear interpolation (FORECAST, TREND) and can perform curve fitting using the Trendline feature
Wolfram Mathematica has extensive capabilities for interpolation (Interpolation) and curve fitting (Fit, NonlinearModelFit) with support for various methods and options
LabVIEW offers interpolation and curve fitting functions in its Signal Processing and Mathematics palettes
C++ libraries like Boost and GNU Scientific Library (GSL) provide functions for interpolation and curve fitting that can be integrated into custom software
Many specialized software packages for scientific computing, such as Origin, SigmaPlot, and GraphPad Prism, include tools for interpolation and curve fitting