
🧮 Data Science Numerical Analysis

Data Preprocessing Methods


Why This Matters

Data preprocessing isn't just busywork before the "real" analysis begins—it's where numerical analysis meets practical data science. You're being tested on your understanding of how raw data transforms into algorithm-ready inputs, and why certain preprocessing choices can make or break model performance. The mathematical foundations here connect directly to concepts like numerical stability, convergence rates, distance metrics, and bias-variance tradeoffs that appear throughout your coursework.

Every preprocessing technique addresses a specific numerical or statistical problem: scaling prevents features from dominating distance calculations, transformations stabilize variance for parametric methods, and dimensionality reduction tackles the curse of dimensionality. Don't just memorize that you should "normalize your data"—know when min-max scaling beats standardization, why log transforms help skewed distributions, and how missing value strategies affect your downstream statistics.


Scaling and Normalization for Numerical Stability

Many algorithms depend on distance calculations or gradient-based optimization, making feature scales critically important. When features exist on vastly different scales, algorithms can become numerically unstable or converge slowly.

Data Normalization

  • Min-Max scaling transforms values to a fixed range (typically $[0, 1]$) using $x' = \frac{x - x_{min}}{x_{max} - x_{min}}$—preserves zero values and bounded outputs
  • Z-score normalization centers data with $x' = \frac{x - \mu}{\sigma}$, producing mean = 0 and standard deviation = 1 (both formulas are sketched in code right after this list)
  • Distance-based algorithms like k-NN and k-means clustering require normalized features to prevent high-magnitude variables from dominating similarity calculations
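
Here's a minimal NumPy sketch of both formulas (the array values are invented purely for illustration):

```python
import numpy as np

x = np.array([2.0, 5.0, 9.0, 14.0, 20.0])  # made-up feature values

# Min-Max scaling: bounded to [0, 1], preserves relative spacing
x_minmax = (x - x.min()) / (x.max() - x.min())

# Z-score standardization: mean 0, standard deviation 1, unbounded
x_zscore = (x - x.mean()) / x.std()

print(x_minmax)                         # all values fall in [0, 1]
print(x_zscore.mean(), x_zscore.std())  # ~0.0 and ~1.0 (up to floating-point error)
```

scikit-learn's MinMaxScaler and StandardScaler perform the same arithmetic while remembering the training-set statistics, so test data can be transformed consistently.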

Feature Scaling

  • Standardization is preferred when data contains outliers since it doesn't bound values to a specific range
  • Gradient descent convergence improves dramatically with scaled features—unscaled data creates elongated contours that slow optimization (the condition-number sketch after this list makes this concrete)
  • Neural networks and SVMs are particularly sensitive to feature scales; tree-based methods like Random Forests are generally scale-invariant
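
To make the elongated-contours claim concrete, here's a small sketch on made-up feature scales: for a least-squares loss the Hessian is proportional to $X^T X$, so its condition number measures how stretched the loss contours are, and standardizing the columns shrinks it by orders of magnitude.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
# Two made-up features on very different scales (age-like vs. income-like)
X = np.column_stack([rng.normal(40, 10, n),
                     rng.normal(60000, 15000, n)])

def condition_number(X):
    # Ratio of the largest to smallest eigenvalue of X^T X
    eigvals = np.linalg.eigvalsh(X.T @ X)
    return eigvals.max() / eigvals.min()

X_std = (X - X.mean(axis=0)) / X.std(axis=0)   # z-score each column

print(f"unscaled:     {condition_number(X):.1e}")      # enormous -> slow, zig-zagging descent
print(f"standardized: {condition_number(X_std):.1e}")  # near 1 -> well-rounded contours
```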

Compare: Min-Max Scaling vs. Z-score Standardization—both rescale features, but min-max bounds output to $[0, 1]$ while z-score allows unbounded values. Use min-max for algorithms requiring bounded inputs; use z-score when outliers are present or when you need interpretable standard deviations.


Handling Data Quality Issues

Real-world datasets arrive messy. Addressing quality issues before analysis prevents garbage-in-garbage-out scenarios and ensures your numerical methods operate on valid inputs.

Data Cleaning

  • Duplicate removal and error correction directly impact statistical estimates—duplicates inflate sample size artificially and skew distributions
  • Validation rules catch impossible values (negative ages, future dates) that would otherwise propagate through calculations
  • Data profiling provides summary statistics and distributions to identify systematic issues before they corrupt downstream analysis (a short pandas sketch follows this list)
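
Here's a short pandas sketch of these checks on an invented table with one duplicate row and one impossible age:

```python
import pandas as pd

# Made-up raw records: one exact duplicate and one impossible value
df = pd.DataFrame({
    "id":    [1, 2, 2, 3],
    "age":   [34, 41, 41, -5],
    "spend": [120.0, 85.5, 85.5, 60.0],
})

df = df.drop_duplicates()              # duplicate removal
valid_age = df["age"].between(0, 120)  # validation rule: plausible age range
print(df[~valid_age])                  # rows that violate the rule
df = df[valid_age]

print(df.describe())                   # quick profile: counts, means, quartiles
```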

Handling Missing Values

  • Deletion strategies include listwise deletion (removes entire rows) and pairwise deletion (uses available data per calculation)—both reduce sample size and may introduce bias
  • Imputation methods range from simple (mean, median, mode) to sophisticated (k-NN imputation, multiple imputation, regression-based prediction); median and k-NN imputation are sketched after this list
  • Missing data mechanisms matter: MCAR (missing completely at random) allows simple imputation, while MNAR (missing not at random) requires careful modeling to avoid bias
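
A minimal scikit-learn sketch contrasting simple (median) imputation with k-NN imputation; the tiny matrix is invented, and in real use you would fit the imputer on training data only:

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

# Two related features with one missing entry in the second column
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, np.nan],
              [4.0, 40.0]])

median_imputer = SimpleImputer(strategy="median")
knn_imputer = KNNImputer(n_neighbors=2)

print(median_imputer.fit_transform(X))  # fills with the column median (20.0)
print(knn_imputer.fit_transform(X))     # averages the 2 nearest rows (30.0)
```

Multiple imputation goes a step further: it repeats this kind of fill several times with added randomness and pools the results, which is why it preserves uncertainty in standard errors.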

Outlier Detection and Treatment

  • Z-score method flags points where $|z| > 3$; the IQR method identifies values below $Q_1 - 1.5 \times IQR$ or above $Q_3 + 1.5 \times IQR$
  • Outlier impact on statistics varies—means and standard deviations are sensitive; medians and IQR are robust
  • Treatment options include removal, winsorization (capping at percentiles), or transformation—choice depends on whether outliers represent errors or genuine extreme values (both detection rules and winsorization are sketched below)
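
Both rules, plus winsorization by clipping, in a NumPy sketch on simulated data with a single injected extreme value:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.append(rng.normal(50, 5, 200), 120.0)   # 200 ordinary points + one extreme value

# Z-score rule: flag |z| > 3
z = (x - x.mean()) / x.std()
print(x[np.abs(z) > 3])                        # the 120.0 is flagged

# IQR rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
print(x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)])

# Winsorization: cap the extremes at the 1st and 99th percentiles instead of deleting them
lo, hi = np.percentile(x, [1, 99])
print(np.clip(x, lo, hi).max())                # the 120.0 has been pulled back toward the bulk
```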

Compare: Mean Imputation vs. Multiple Imputation—mean imputation is fast but underestimates variance and distorts correlations. Multiple imputation preserves uncertainty by creating several plausible datasets. If an FRQ asks about imputation effects on standard errors, multiple imputation is your go-to example.


Transformations for Statistical Assumptions

Many statistical methods assume normally distributed data with constant variance. Transformations reshape distributions to better satisfy these assumptions and improve numerical behavior.

Data Transformation

  • Log transformation ($x' = \log(x)$) compresses right-skewed data and is ideal for multiplicative relationships—common in financial and biological data
  • Box-Cox transformation finds the optimal power parameter $\lambda$ in $x' = \frac{x^\lambda - 1}{\lambda}$ to maximize normality (both transforms are sketched after this list)
  • Variance stabilization is critical for methods like linear regression and ANOVA that assume homoscedasticity (constant variance across groups)
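
A sketch of both transforms on simulated right-skewed (lognormal) data; scipy.stats.boxcox estimates $\lambda$ by maximum likelihood:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.lognormal(mean=2.0, sigma=0.8, size=500)   # right-skewed, strictly positive

x_log = np.log(x)                # log transform (the lambda = 0 special case)
x_bc, lam = stats.boxcox(x)      # Box-Cox with lambda chosen by maximum likelihood

print(f"skewness raw:     {stats.skew(x):.2f}")      # strongly positive
print(f"skewness log:     {stats.skew(x_log):.2f}")  # near zero
print(f"skewness Box-Cox: {stats.skew(x_bc):.2f}  (lambda = {lam:.2f})")
```

Because this data is exactly lognormal, the fitted $\lambda$ comes out close to zero, which is the log transform in disguise.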

Data Discretization

  • Equal-width binning divides the range into $k$ bins of width $\frac{x_{max} - x_{min}}{k}$—simple but sensitive to outliers (see the binning sketch after this list)
  • Equal-frequency binning ensures each bin contains approximately the same number of observations—better for skewed distributions
  • Information loss tradeoff: discretization simplifies models and can improve interpretability, but destroys fine-grained numerical information
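
A quick pandas sketch contrasting the two strategies on simulated skewed data, using pd.cut for equal width and pd.qcut for equal frequency:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = pd.Series(rng.exponential(scale=10, size=1000))   # right-skewed values to discretize

equal_width = pd.cut(x, bins=4)    # 4 bins of equal width across the range
equal_freq = pd.qcut(x, q=4)       # 4 bins with roughly equal counts (quartiles)

print(equal_width.value_counts().sort_index())   # most observations pile into the lowest bin
print(equal_freq.value_counts().sort_index())    # ~250 observations per bin
```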

Compare: Log Transformation vs. Box-Cox—log is a special case of Box-Cox where $\lambda = 0$. Box-Cox is more flexible but requires positive data and parameter estimation. When you know data is right-skewed with multiplicative effects, log is simpler; when seeking optimal normality, Box-Cox provides data-driven selection.


Dimensionality Reduction and Feature Engineering

High-dimensional data creates computational challenges and statistical problems like overfitting. Reducing dimensions while preserving information is both an art and a mathematically rigorous process.

Dimensionality Reduction

  • Principal Component Analysis (PCA) finds orthogonal directions of maximum variance using eigendecomposition of the covariance matrix $\Sigma$ (sketched in code after this list)
  • t-SNE preserves local neighborhood structure for visualization but is non-deterministic and computationally expensive—not suitable for preprocessing before prediction
  • Curse of dimensionality causes distance metrics to become meaningless in high dimensions; PCA mitigates this by projecting to a lower-dimensional subspace
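
Here's a sketch of PCA done by hand with NumPy's eigendecomposition on invented data where two of the three features are strongly correlated (scikit-learn's PCA wraps the same idea via an SVD):

```python
import numpy as np

rng = np.random.default_rng(0)
# Invented data: features 0 and 1 share a common factor, feature 2 is independent noise
z = rng.normal(size=(500, 1))
X = np.hstack([z + 0.1 * rng.normal(size=(500, 1)),
               2 * z + 0.1 * rng.normal(size=(500, 1)),
               rng.normal(size=(500, 1))])

Xc = X - X.mean(axis=0)                     # center the data first
cov = np.cov(Xc, rowvar=False)              # covariance matrix Sigma
eigvals, eigvecs = np.linalg.eigh(cov)      # eigendecomposition (ascending eigenvalues)

order = np.argsort(eigvals)[::-1]           # sort directions by variance, largest first
W = eigvecs[:, order[:2]]                   # keep the top 2 principal components
X_reduced = Xc @ W                          # project onto the 2-D subspace

explained = eigvals[order] / eigvals.sum()
print(f"variance explained by 2 of 3 components: {explained[:2].sum():.1%}")
```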

Feature Selection

  • Filter methods rank features by statistical measures (correlation, mutual information, chi-squared) independent of any model—fast but may miss feature interactions
  • Wrapper methods evaluate feature subsets using model performance (forward selection, backward elimination)—accurate but computationally expensive
  • Embedded methods perform selection during model training (LASSO regularization with $L_1$ penalty, tree-based importance)—a balance between accuracy and efficiency; the LASSO sketch after this list shows embedded selection in action
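
Here's a sketch of embedded selection with an $L_1$ penalty on invented data where only two of eight features actually drive the target; the alpha value is arbitrary and would normally be tuned by cross-validation:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = 3 * X[:, 0] - 2 * X[:, 3] + rng.normal(scale=0.5, size=200)   # only features 0 and 3 matter

# Standardize first: the L1 penalty treats all coefficients alike, so scales must be comparable
X_scaled = StandardScaler().fit_transform(X)

lasso = Lasso(alpha=0.2).fit(X_scaled, y)
print("coefficients:", np.round(lasso.coef_, 2))          # irrelevant features are driven exactly to zero
print("selected features:", np.flatnonzero(lasso.coef_))  # features 0 and 3 survive
```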

Compare: PCA vs. Feature Selection—PCA creates new composite features (principal components) that are linear combinations of originals, while feature selection keeps original features intact. PCA maximizes variance explained; feature selection maximizes predictive relevance. For interpretability, feature selection wins; for maximum variance retention, use PCA.


Encoding for Algorithm Compatibility

Machine learning algorithms require numerical inputs, but real data includes categorical variables. Encoding schemes convert categories to numbers while preserving (or intentionally ignoring) ordinal relationships.

Encoding Categorical Variables

  • One-hot encoding creates binary columns for each category—avoids imposing false ordinal relationships but increases dimensionality by $k-1$ for $k$ categories
  • Label encoding assigns integers to categories—suitable for ordinal data or tree-based models that can learn arbitrary splits
  • Target encoding replaces categories with mean target values—powerful but prone to overfitting and data leakage if not implemented with proper cross-validation (all three schemes are sketched after this list)
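
All three schemes in one short pandas sketch on an invented table; the target encoding shown is the naive version, and in practice you would compute it inside cross-validation folds to avoid leakage:

```python
import pandas as pd

df = pd.DataFrame({
    "color": ["red", "blue", "green", "blue", "red"],   # nominal
    "size":  ["S", "M", "L", "M", "S"],                 # ordinal
    "sold":  [1, 0, 1, 1, 0],                           # binary target
})

# One-hot: binary indicator columns (drop_first removes one redundant column)
onehot = pd.get_dummies(df["color"], prefix="color", drop_first=True)

# Label/ordinal encoding: an explicit mapping preserves the true order S < M < L
label = df["size"].map({"S": 0, "M": 1, "L": 2}).rename("size_encoded")

# Target encoding: each category becomes the mean of the target within that category
target = df["color"].map(df.groupby("color")["sold"].mean()).rename("color_target_enc")

print(pd.concat([df, onehot, label, target], axis=1))
```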

Compare: One-Hot vs. Label Encoding—one-hot prevents algorithms from inferring false ordering (red < blue < green) but explodes feature space for high-cardinality variables. Label encoding is compact but implies ordinality. For nominal categories with few levels, use one-hot; for ordinal data or trees, label encoding works well.


Quick Reference Table

| Concept | Best Examples |
| --- | --- |
| Distance-sensitive scaling | Min-Max normalization, Z-score standardization |
| Gradient descent optimization | Feature scaling, standardization |
| Missing data handling | Mean/median imputation, multiple imputation, k-NN imputation |
| Distributional assumptions | Log transformation, Box-Cox, square root |
| Outlier identification | Z-score method, IQR method, box plots |
| Variance reduction | PCA, feature selection, LASSO |
| Categorical conversion | One-hot encoding, label encoding, target encoding |
| Continuous-to-discrete | Equal-width binning, equal-frequency binning |

Self-Check Questions

  1. Which two preprocessing methods both involve rescaling features but differ in how they handle outliers? Explain when you'd choose each.

  2. A dataset has 15% missing values in a key predictor. Compare mean imputation versus multiple imputation—how would each affect your standard error estimates?

  3. You're preparing data for k-means clustering with features ranging from $[0, 1]$ to $[0, 100000]$. Which preprocessing step is essential, and what numerical problem does it solve?

  4. FRQ-style: A variable showing right-skewed count data violates the normality assumption for linear regression. Identify two transformation approaches and explain the tradeoff between them.

  5. When would you choose PCA over feature selection for dimensionality reduction? Consider both interpretability and the mathematical properties of each method.