
🧮 Data Science Numerical Analysis

Data Preprocessing Methods


Why This Matters

Data preprocessing isn't just busywork before the "real" analysis begins—it's where numerical analysis meets practical data science. You're being tested on your understanding of how raw data transforms into algorithm-ready inputs, and why certain preprocessing choices can make or break model performance. The mathematical foundations here connect directly to concepts like numerical stability, convergence rates, distance metrics, and bias-variance tradeoffs that appear throughout your coursework.

Every preprocessing technique addresses a specific numerical or statistical problem: scaling prevents features from dominating distance calculations, transformations stabilize variance for parametric methods, and dimensionality reduction tackles the curse of dimensionality. Don't just memorize that you should "normalize your data"—know when min-max scaling beats standardization, why log transforms help skewed distributions, and how missing value strategies affect your downstream statistics.


Scaling and Normalization for Numerical Stability

Many algorithms depend on distance calculations or gradient-based optimization, making feature scales critically important. When features exist on vastly different scales, algorithms can become numerically unstable or converge slowly.

Data Normalization

  • Min-Max scaling transforms values to a fixed range (typically $[0, 1]$) using $x' = \frac{x - x_{min}}{x_{max} - x_{min}}$—preserves zero values and bounded outputs
  • Z-score normalization centers data with $x' = \frac{x - \mu}{\sigma}$, producing mean = 0 and standard deviation = 1 (both formulas are sketched in code right after this list)
  • Distance-based algorithms like k-NN and k-means clustering require normalized features to prevent high-magnitude variables from dominating similarity calculations
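
Here's a minimal NumPy sketch of both formulas (the array values are invented purely for illustration):

```python
import numpy as np

x = np.array([2.0, 5.0, 9.0, 14.0, 20.0])  # made-up feature values

# Min-Max scaling: bounded to [0, 1], preserves relative spacing
x_minmax = (x - x.min()) / (x.max() - x.min())

# Z-score standardization: mean 0, standard deviation 1, unbounded
x_zscore = (x - x.mean()) / x.std()

print(x_minmax)                         # all values fall in [0, 1]
print(x_zscore.mean(), x_zscore.std())  # ~0.0 and ~1.0 (up to floating-point error)
```

scikit-learn's MinMaxScaler and StandardScaler perform the same arithmetic while remembering the training-set statistics, so test data can be transformed consistently.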

Feature Scaling

  • Standardization is preferred when data contains outliers since it doesn't bound values to a specific range
  • Gradient descent convergence improves dramatically with scaled features—unscaled data creates elongated contours that slow optimization (the condition-number sketch after this list makes this concrete)
  • Neural networks and SVMs are particularly sensitive to feature scales; tree-based methods like Random Forests are generally scale-invariant
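
To make the elongated-contours claim concrete, here's a small sketch on made-up feature scales: for a least-squares loss the Hessian is proportional to $X^T X$, so its condition number measures how stretched the loss contours are, and standardizing the columns shrinks it by orders of magnitude.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
# Two made-up features on very different scales (age-like vs. income-like)
X = np.column_stack([rng.normal(40, 10, n),
                     rng.normal(60000, 15000, n)])

def condition_number(X):
    # Ratio of the largest to smallest eigenvalue of X^T X
    eigvals = np.linalg.eigvalsh(X.T @ X)
    return eigvals.max() / eigvals.min()

X_std = (X - X.mean(axis=0)) / X.std(axis=0)   # z-score each column

print(f"unscaled:     {condition_number(X):.1e}")      # enormous -> slow, zig-zagging descent
print(f"standardized: {condition_number(X_std):.1e}")  # near 1 -> well-rounded contours
```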

Compare: Min-Max Scaling vs. Z-score Standardization—both rescale features, but min-max bounds output to $[0, 1]$ while z-score allows unbounded values. Use min-max for algorithms requiring bounded inputs; use z-score when outliers are present or when you need interpretable standard deviations.


Handling Data Quality Issues

Real-world datasets arrive messy. Addressing quality issues before analysis prevents garbage-in-garbage-out scenarios and ensures your numerical methods operate on valid inputs.

Data Cleaning

  • Duplicate removal and error correction directly impact statistical estimates—duplicates inflate sample size artificially and skew distributions
  • Validation rules catch impossible values (negative ages, future dates) that would otherwise propagate through calculations
  • Data profiling provides summary statistics and distributions to identify systematic issues before they corrupt downstream analysis (a short pandas sketch follows this list)
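
Here's a short pandas sketch of these checks on an invented table with one duplicate row and one impossible age:

```python
import pandas as pd

# Made-up raw records: one exact duplicate and one impossible value
df = pd.DataFrame({
    "id":    [1, 2, 2, 3],
    "age":   [34, 41, 41, -5],
    "spend": [120.0, 85.5, 85.5, 60.0],
})

df = df.drop_duplicates()              # duplicate removal
valid_age = df["age"].between(0, 120)  # validation rule: plausible age range
print(df[~valid_age])                  # rows that violate the rule
df = df[valid_age]

print(df.describe())                   # quick profile: counts, means, quartiles
```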

Handling Missing Values

  • Deletion strategies include listwise deletion (removes entire rows) and pairwise deletion (uses available data per calculation)—both reduce sample size and may introduce bias
  • Imputation methods range from simple (mean, median, mode) to sophisticated (k-NN imputation, multiple imputation, regression-based prediction); median and k-NN imputation are sketched after this list
  • Missing data mechanisms matter: MCAR (missing completely at random) allows simple imputation, while MNAR (missing not at random) requires careful modeling to avoid bias
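
A minimal scikit-learn sketch contrasting simple (median) imputation with k-NN imputation; the tiny matrix is invented, and in real use you would fit the imputer on training data only:

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

# Two related features with one missing entry in the second column
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, np.nan],
              [4.0, 40.0]])

median_imputer = SimpleImputer(strategy="median")
knn_imputer = KNNImputer(n_neighbors=2)

print(median_imputer.fit_transform(X))  # fills with the column median (20.0)
print(knn_imputer.fit_transform(X))     # averages the 2 nearest rows (30.0)
```

Multiple imputation goes a step further: it repeats this kind of fill several times with added randomness and pools the results, which is why it preserves uncertainty in standard errors.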

Outlier Detection and Treatment

  • Z-score method flags points where $|z| > 3$; the IQR method identifies values below $Q_1 - 1.5 \times IQR$ or above $Q_3 + 1.5 \times IQR$
  • Outlier impact on statistics varies—means and standard deviations are sensitive; medians and IQR are robust
  • Treatment options include removal, winsorization (capping at percentiles), or transformation—choice depends on whether outliers represent errors or genuine extreme values (both detection rules and winsorization are sketched below)
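
Both rules, plus winsorization by clipping, in a NumPy sketch on simulated data with a single injected extreme value:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.append(rng.normal(50, 5, 200), 120.0)   # 200 ordinary points + one extreme value

# Z-score rule: flag |z| > 3
z = (x - x.mean()) / x.std()
print(x[np.abs(z) > 3])                        # the 120.0 is flagged

# IQR rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
print(x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)])

# Winsorization: cap the extremes at the 1st and 99th percentiles instead of deleting them
lo, hi = np.percentile(x, [1, 99])
print(np.clip(x, lo, hi).max())                # the 120.0 has been pulled back toward the bulk
```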

Compare: Mean Imputation vs. Multiple Imputation—mean imputation is fast but underestimates variance and distorts correlations. Multiple imputation preserves uncertainty by creating several plausible datasets. If an FRQ asks about imputation effects on standard errors, multiple imputation is your go-to example.


Transformations for Statistical Assumptions

Many statistical methods assume normally distributed data with constant variance. Transformations reshape distributions to better satisfy these assumptions and improve numerical behavior.

Data Transformation

  • Log transformation ($x' = \log(x)$) compresses right-skewed data and is ideal for multiplicative relationships—common in financial and biological data
  • Box-Cox transformation finds the optimal power parameter $\lambda$ in $x' = \frac{x^\lambda - 1}{\lambda}$ to maximize normality (both transforms are sketched after this list)
  • Variance stabilization is critical for methods like linear regression and ANOVA that assume homoscedasticity (constant variance across groups)
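
A sketch of both transforms on simulated right-skewed (lognormal) data; scipy.stats.boxcox estimates $\lambda$ by maximum likelihood:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.lognormal(mean=2.0, sigma=0.8, size=500)   # right-skewed, strictly positive

x_log = np.log(x)                # log transform (the lambda = 0 special case)
x_bc, lam = stats.boxcox(x)      # Box-Cox with lambda chosen by maximum likelihood

print(f"skewness raw:     {stats.skew(x):.2f}")      # strongly positive
print(f"skewness log:     {stats.skew(x_log):.2f}")  # near zero
print(f"skewness Box-Cox: {stats.skew(x_bc):.2f}  (lambda = {lam:.2f})")
```

Because this data is exactly lognormal, the fitted $\lambda$ comes out close to zero, which is the log transform in disguise.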

Data Discretization

  • Equal-width binning divides the range into $k$ bins of width $\frac{x_{max} - x_{min}}{k}$—simple but sensitive to outliers (see the binning sketch after this list)
  • Equal-frequency binning ensures each bin contains approximately the same number of observations—better for skewed distributions
  • Information loss tradeoff: discretization simplifies models and can improve interpretability, but destroys fine-grained numerical information
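
A quick pandas sketch contrasting the two strategies on simulated skewed data, using pd.cut for equal width and pd.qcut for equal frequency:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = pd.Series(rng.exponential(scale=10, size=1000))   # right-skewed values to discretize

equal_width = pd.cut(x, bins=4)    # 4 bins of equal width across the range
equal_freq = pd.qcut(x, q=4)       # 4 bins with roughly equal counts (quartiles)

print(equal_width.value_counts().sort_index())   # most observations pile into the lowest bin
print(equal_freq.value_counts().sort_index())    # ~250 observations per bin
```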

Compare: Log Transformation vs. Box-Cox—log is a special case of Box-Cox where $\lambda = 0$. Box-Cox is more flexible but requires positive data and parameter estimation. When you know data is right-skewed with multiplicative effects, log is simpler; when seeking optimal normality, Box-Cox provides data-driven selection.


Dimensionality Reduction and Feature Engineering

High-dimensional data creates computational challenges and statistical problems like overfitting. Reducing dimensions while preserving information is both an art and a mathematically rigorous process.

Dimensionality Reduction

  • Principal Component Analysis (PCA) finds orthogonal directions of maximum variance using eigendecomposition of the covariance matrix $\Sigma$ (sketched in code after this list)
  • t-SNE preserves local neighborhood structure for visualization but is non-deterministic and computationally expensive—not suitable for preprocessing before prediction
  • Curse of dimensionality causes distance metrics to become meaningless in high dimensions; PCA mitigates this by projecting to a lower-dimensional subspace
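
Here's a sketch of PCA done by hand with NumPy's eigendecomposition on invented data where two of the three features are strongly correlated (scikit-learn's PCA wraps the same idea via an SVD):

```python
import numpy as np

rng = np.random.default_rng(0)
# Invented data: features 0 and 1 share a common factor, feature 2 is independent noise
z = rng.normal(size=(500, 1))
X = np.hstack([z + 0.1 * rng.normal(size=(500, 1)),
               2 * z + 0.1 * rng.normal(size=(500, 1)),
               rng.normal(size=(500, 1))])

Xc = X - X.mean(axis=0)                     # center the data first
cov = np.cov(Xc, rowvar=False)              # covariance matrix Sigma
eigvals, eigvecs = np.linalg.eigh(cov)      # eigendecomposition (ascending eigenvalues)

order = np.argsort(eigvals)[::-1]           # sort directions by variance, largest first
W = eigvecs[:, order[:2]]                   # keep the top 2 principal components
X_reduced = Xc @ W                          # project onto the 2-D subspace

explained = eigvals[order] / eigvals.sum()
print(f"variance explained by 2 of 3 components: {explained[:2].sum():.1%}")
```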

Feature Selection

  • Filter methods rank features by statistical measures (correlation, mutual information, chi-squared) independent of any model—fast but may miss feature interactions
  • Wrapper methods evaluate feature subsets using model performance (forward selection, backward elimination)—accurate but computationally expensive
  • Embedded methods perform selection during model training (LASSO regularization with $L_1$ penalty, tree-based importance)—a balance between accuracy and efficiency; the LASSO sketch after this list shows embedded selection in action
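
Here's a sketch of embedded selection with an $L_1$ penalty on invented data where only two of eight features actually drive the target; the alpha value is arbitrary and would normally be tuned by cross-validation:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = 3 * X[:, 0] - 2 * X[:, 3] + rng.normal(scale=0.5, size=200)   # only features 0 and 3 matter

# Standardize first: the L1 penalty treats all coefficients alike, so scales must be comparable
X_scaled = StandardScaler().fit_transform(X)

lasso = Lasso(alpha=0.2).fit(X_scaled, y)
print("coefficients:", np.round(lasso.coef_, 2))          # irrelevant features are driven exactly to zero
print("selected features:", np.flatnonzero(lasso.coef_))  # features 0 and 3 survive
```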

Compare: PCA vs. Feature Selection—PCA creates new composite features (principal components) that are linear combinations of originals, while feature selection keeps original features intact. PCA maximizes variance explained; feature selection maximizes predictive relevance. For interpretability, feature selection wins; for maximum variance retention, use PCA.


Encoding for Algorithm Compatibility

Machine learning algorithms require numerical inputs, but real data includes categorical variables. Encoding schemes convert categories to numbers while preserving (or intentionally ignoring) ordinal relationships.

Encoding Categorical Variables

  • One-hot encoding creates binary columns for each category—avoids imposing false ordinal relationships but increases dimensionality by $k-1$ for $k$ categories
  • Label encoding assigns integers to categories—suitable for ordinal data or tree-based models that can learn arbitrary splits
  • Target encoding replaces categories with mean target values—powerful but prone to overfitting and data leakage if not implemented with proper cross-validation (all three schemes are sketched after this list)
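
All three schemes in one short pandas sketch on an invented table; the target encoding shown is the naive version, and in practice you would compute it inside cross-validation folds to avoid leakage:

```python
import pandas as pd

df = pd.DataFrame({
    "color": ["red", "blue", "green", "blue", "red"],   # nominal
    "size":  ["S", "M", "L", "M", "S"],                 # ordinal
    "sold":  [1, 0, 1, 1, 0],                           # binary target
})

# One-hot: binary indicator columns (drop_first removes one redundant column)
onehot = pd.get_dummies(df["color"], prefix="color", drop_first=True)

# Label/ordinal encoding: an explicit mapping preserves the true order S < M < L
label = df["size"].map({"S": 0, "M": 1, "L": 2}).rename("size_encoded")

# Target encoding: each category becomes the mean of the target within that category
target = df["color"].map(df.groupby("color")["sold"].mean()).rename("color_target_enc")

print(pd.concat([df, onehot, label, target], axis=1))
```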

Compare: One-Hot vs. Label Encoding—one-hot prevents algorithms from inferring false ordering (red < blue < green) but explodes feature space for high-cardinality variables. Label encoding is compact but implies ordinality. For nominal categories with few levels, use one-hot; for ordinal data or trees, label encoding works well.


Quick Reference Table

| Concept | Best Examples |
| --- | --- |
| Distance-sensitive scaling | Min-Max normalization, Z-score standardization |
| Gradient descent optimization | Feature scaling, standardization |
| Missing data handling | Mean/median imputation, multiple imputation, k-NN imputation |
| Distributional assumptions | Log transformation, Box-Cox, square root |
| Outlier identification | Z-score method, IQR method, box plots |
| Variance reduction | PCA, feature selection, LASSO |
| Categorical conversion | One-hot encoding, label encoding, target encoding |
| Continuous-to-discrete | Equal-width binning, equal-frequency binning |

Self-Check Questions

  1. Which two preprocessing methods both involve rescaling features but differ in how they handle outliers? Explain when you'd choose each.

  2. A dataset has 15% missing values in a key predictor. Compare mean imputation versus multiple imputation—how would each affect your standard error estimates?

  3. You're preparing data for k-means clustering with features ranging from $[0, 1]$ to $[0, 100000]$. Which preprocessing step is essential, and what numerical problem does it solve?

  4. FRQ-style: A variable showing right-skewed count data violates the normality assumption for linear regression. Identify two transformation approaches and explain the tradeoff between them.

  5. When would you choose PCA over feature selection for dimensionality reduction? Consider both interpretability and the mathematical properties of each method.