🤝 Collaborative Data Science

Data Preprocessing Techniques

Why This Matters

Data preprocessing is where reproducible data science lives or dies. You're being tested on your ability to transform messy, real-world data into analysis-ready datasets—and more importantly, to document every decision so collaborators can understand and replicate your work. The techniques here connect directly to core concepts like statistical validity, model assumptions, algorithmic fairness, and computational reproducibility.

Think of preprocessing as the bridge between raw data and trustworthy results. Each technique addresses a specific problem: missing data threatens statistical power, outliers violate model assumptions, and inconsistent scales break distance-based algorithms. Don't just memorize what each technique does—know when to apply it, why it matters for your analysis, and how to document it for your team.


Ensuring Data Quality and Integrity

Before any analysis begins, your dataset needs to be accurate and consistent. Garbage in, garbage out isn't just a cliché—it's a fundamental principle of statistical inference.

Data Cleaning

  • Identifies and corrects errors—typos, inconsistent formats, and duplicate records that compromise analysis validity
  • Standardizes formats across variables, ensuring dates, categories, and text fields follow consistent conventions
  • Documents all corrections in version-controlled scripts, making your cleaning process fully reproducible for collaborators
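
Here's a minimal pandas sketch of these steps; the column names and values are hypothetical:

```python
import pandas as pd

# Hypothetical messy records; column names are illustrative only.
df = pd.DataFrame({
    "signup_date": ["2023-01-05", "2023-01-05", "unknown"],
    "region": ["North ", "north", "NORTH"],
    "score": [10, 10, 12],
})

# Standardize text categories: trim whitespace and unify case.
df["region"] = df["region"].str.strip().str.lower()

# Parse dates; unparseable entries become NaT so they can be handled explicitly.
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")

# Drop exact duplicate rows created by repeated data entry.
df = df.drop_duplicates()
print(df)
```

Keeping these steps in a version-controlled script, rather than editing the raw file by hand, is what makes the cleaning reproducible.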

Handling Missing Values

  • Missingness mechanisms matter—data missing completely at random (MCAR), at random (MAR), or not at random (MNAR) require different approaches
  • Imputation methods include mean/median substitution, regression imputation, and multiple imputation for preserving statistical properties
  • Deletion strategies like listwise or pairwise deletion trade sample size for simplicity—document your choice and justify it
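
A quick scikit-learn sketch of median imputation on a made-up income column:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical numeric feature with missing entries.
df = pd.DataFrame({"income": [42_000, np.nan, 58_000, 61_000, np.nan]})

# Keep a missingness indicator so collaborators can see which values were filled.
df["income_was_missing"] = df["income"].isna()

# Median imputation: simple and robust to skew, but it shrinks variance;
# document and justify the choice in your analysis scripts.
imputer = SimpleImputer(strategy="median")
df["income"] = imputer.fit_transform(df[["income"]]).ravel()
print(df)
```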

Data Integration and Merging

  • Combines multiple data sources using keys or identifiers, requiring careful attention to join types (inner, outer, left, right)
  • Resolves semantic conflicts when the same concept has different names or formats across sources
  • Creates audit trails documenting source provenance, merge logic, and any records lost or duplicated in the process
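
A small pandas example of a left join with a built-in audit column; the tables and keys are invented for illustration:

```python
import pandas as pd

# Hypothetical sources keyed on a shared patient identifier.
visits = pd.DataFrame({"patient_id": [1, 2, 3], "visits": [4, 2, 7]})
labs = pd.DataFrame({"patient_id": [2, 3, 4], "glucose": [5.4, 6.1, 7.0]})

# Left join keeps every visit record; indicator=True adds a _merge column
# documenting which rows matched, so unmatched records are easy to audit.
merged = visits.merge(labs, on="patient_id", how="left", indicator=True)

print(merged["_merge"].value_counts())  # e.g., how many rows were left_only
print(merged)
```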

Compare: Data cleaning vs. handling missing values—both address data quality, but cleaning fixes incorrect values while missing value handling addresses absent values. FRQs often ask you to distinguish between these and justify your approach for each.


Addressing Statistical Assumptions

Many statistical models and machine learning algorithms assume your data meets certain conditions. These techniques help you get there—or help you choose models that don't require these assumptions.

Outlier Detection and Treatment

  • Detection methods include z-scores (values beyond $\pm 3$ standard deviations), IQR-based rules, and visual tools like box plots
  • Treatment decisions depend on context—removal, winsorization (capping), or transformation each have tradeoffs for bias and variance
  • Domain knowledge matters—an "outlier" might be your most important observation, so never automate removal without investigation
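
A short sketch of both detection rules on a made-up series containing one extreme value; treatment is left as a separate, documented decision:

```python
import pandas as pd

# Hypothetical measurements with one extreme value at the end.
x = pd.Series(
    [12, 14, 15, 13, 16, 14, 13, 15, 12, 14,
     16, 13, 15, 14, 12, 16, 13, 14, 15, 95],
    name="reaction_time",
)

# Z-score rule: flag values more than 3 standard deviations from the mean.
z = (x - x.mean()) / x.std()
z_outliers = x[z.abs() > 3]

# IQR rule: flag values beyond 1.5 * IQR outside the quartiles.
q1, q3 = x.quantile([0.25, 0.75])
iqr = q3 - q1
iqr_outliers = x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)]

print(z_outliers)
print(iqr_outliers)
```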

Data Transformation

  • Log transformation stabilizes variance and linearizes exponential relationships, converting $y = ab^x$ to $\log(y) = \log(a) + x\log(b)$
  • Box-Cox transformation finds the optimal power transformation to achieve normality, with $\lambda = 0$ equivalent to log transformation
  • Square root transformation is useful for count data and Poisson-distributed variables where variance scales with the mean
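
A brief NumPy/SciPy sketch of these transformations on hypothetical skewed and count data:

```python
import numpy as np
from scipy import stats

# Hypothetical right-skewed positive values (e.g., household incomes).
y = np.array([20_000, 25_000, 31_000, 40_000, 52_000, 70_000, 150_000], dtype=float)

# Log transform: compresses the right tail and linearizes exponential growth.
y_log = np.log(y)

# Box-Cox: estimates the power lambda that best normalizes the data
# (lambda = 0 reproduces the log transform); requires strictly positive values.
y_boxcox, lam = stats.boxcox(y)
print(f"estimated lambda: {lam:.3f}")

# Square-root transform for count-like data where variance grows with the mean.
counts = np.array([0, 1, 3, 7, 12, 30], dtype=float)
counts_sqrt = np.sqrt(counts)
```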

Compare: Outlier treatment vs. data transformation—outlier treatment targets individual extreme values, while transformation reshapes the entire distribution. If an FRQ asks about meeting normality assumptions, transformation is usually your answer; if it asks about influential points, discuss outliers.


Preparing Features for Algorithms

Different algorithms have different requirements for how features are represented and scaled. Getting this wrong can silently break your model.

Feature Scaling

  • Standardization transforms features to mean $\mu = 0$ and standard deviation $\sigma = 1$ using $z = \frac{x - \mu}{\sigma}$
  • Normalization rescales features to $[0, 1]$ using $x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$, but is sensitive to outliers because the minimum and maximum define the range
  • Essential for distance-based algorithms like k-means, KNN, and gradient descent optimization—unscaled features with larger ranges dominate calculations
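
A minimal scikit-learn sketch contrasting the two scalers on made-up features; in practice, fit the scaler on training data only and reuse it on the test set:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical features on very different scales (age in years, income in dollars).
X = np.array([[25, 40_000], [32, 55_000], [47, 120_000], [51, 62_000]], dtype=float)

# Standardization: each column gets mean 0 and standard deviation 1.
X_std = StandardScaler().fit_transform(X)

# Normalization: each column is rescaled to the [0, 1] range.
X_norm = MinMaxScaler().fit_transform(X)

print(X_std.round(2))
print(X_norm.round(2))
```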

Encoding Categorical Variables

  • One-hot encoding creates binary indicator columns for each category, avoiding false ordinal relationships
  • Label encoding assigns integers to categories—appropriate only for ordinal variables or tree-based models
  • High-cardinality categories may require target encoding or embedding techniques to avoid dimensionality explosion
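
A small pandas sketch showing both encodings; the variables and the category order are hypothetical:

```python
import pandas as pd

# Hypothetical nominal (color) and ordinal (size) categorical features.
df = pd.DataFrame({
    "color": ["red", "blue", "green", "blue"],
    "size": ["small", "large", "medium", "small"],
})

# One-hot encoding for the nominal variable: one indicator column per category.
one_hot = pd.get_dummies(df["color"], prefix="color")

# Integer encoding for the ordinal variable, with an explicit, documented order.
size_order = {"small": 0, "medium": 1, "large": 2}
df["size_encoded"] = df["size"].map(size_order)

print(pd.concat([df, one_hot], axis=1))
```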

Compare: Standardization vs. normalization—both are scaling methods, but standardization preserves outlier information while normalization compresses everything to a fixed range. Use standardization for algorithms assuming Gaussian distributions; use normalization when you need bounded values.


Reducing Complexity and Redundancy

High-dimensional data creates computational challenges and can degrade model performance through the curse of dimensionality. These techniques help you work smarter, not harder.

Feature Selection

  • Filter methods use statistical tests like correlation coefficients or mutual information to rank features independently of any model
  • Wrapper methods like recursive feature elimination (RFE) evaluate feature subsets using model performance, but are computationally expensive
  • Embedded methods like LASSO ($L_1$ regularization) perform selection during model training by shrinking irrelevant coefficients to zero
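
A brief scikit-learn sketch of embedded selection with LASSO, using synthetic data where only a few features carry signal:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data: 10 features, only 3 of which are truly informative.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

# Scale first so the L1 penalty treats all coefficients comparably.
X_scaled = StandardScaler().fit_transform(X)

# LASSO shrinks uninformative coefficients to exactly zero.
lasso = Lasso(alpha=1.0).fit(X_scaled, y)
selected = np.flatnonzero(lasso.coef_ != 0)
print("selected feature indices:", selected)
```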

Dimensionality Reduction

  • PCA finds orthogonal components that maximize variance, with each component a linear combination of original features
  • t-SNE preserves local neighborhood structure for visualization but doesn't provide interpretable components or work on new data
  • Variance explained metrics help you choose how many components to retain—typically aiming for 80-95% cumulative variance
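
A minimal scikit-learn sketch that keeps enough principal components to explain 90% of the variance; the data here is synthetic, with 8 columns driven by only 3 underlying signals:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic correlated features: 8 columns built from 3 underlying signals.
rng = np.random.default_rng(0)
signals = rng.normal(size=(100, 3))
X = np.hstack([signals, signals @ rng.normal(size=(3, 5))])

# Standardize first so no single column dominates the variance.
X_scaled = StandardScaler().fit_transform(X)

# Passing a fraction keeps the smallest number of components reaching it.
pca = PCA(n_components=0.90)
X_reduced = pca.fit_transform(X_scaled)

print("components kept:", pca.n_components_)
print("cumulative variance explained:", pca.explained_variance_ratio_.cumsum().round(3))
```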

Compare: Feature selection vs. dimensionality reduction—selection keeps original features (interpretable), while reduction creates new synthetic features (often more powerful but harder to explain). If interpretability matters for your analysis, prefer selection; if prediction performance is paramount, consider reduction.


Handling Class Imbalance

When your outcome variable has unequal class frequencies, standard algorithms optimize for the majority class and ignore the minority—often the class you care most about.

Handling Imbalanced Datasets

  • Oversampling techniques like SMOTE generate synthetic minority examples by interpolating between existing observations
  • Undersampling randomly removes majority class examples, risking information loss but reducing computational cost
  • Algorithmic approaches include class weights, cost-sensitive learning, and evaluation metrics like F1-score and AUC-ROC that don't reward majority-class accuracy
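
A short scikit-learn sketch of the class-weight approach on synthetic imbalanced data; resampling methods such as SMOTE (from the imbalanced-learn package) would slot into the same pipeline before model fitting:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic binary data: roughly 95% negative, 5% positive cases.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" reweights errors inversely to class frequency,
# so the minority class is not ignored during training.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_train, y_train)

# Report per-class precision, recall, and F1 rather than raw accuracy.
print(classification_report(y_test, clf.predict(X_test)))
```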

Compare: Oversampling vs. undersampling—oversampling preserves all your data but can cause overfitting to synthetic examples; undersampling avoids this but discards potentially useful information. SMOTE is often the default choice, but always validate with cross-validation that doesn't leak synthetic data into test sets.


Quick Reference Table

Concept | Best Examples
--- | ---
Data quality | Data cleaning, handling missing values, data integration
Statistical assumptions | Outlier treatment, log transformation, Box-Cox
Algorithm requirements | Feature scaling, categorical encoding
Complexity reduction | Feature selection, PCA, t-SNE
Class imbalance | SMOTE, undersampling, class weights
Reproducibility | Version-controlled scripts, documented decisions, audit trails
Distance-based methods | Standardization, normalization, one-hot encoding
Interpretability | Feature selection, filter methods, original feature retention

Self-Check Questions

  1. You're building a k-means clustering model and notice one feature ranges from 0-1 while another ranges from 0-100,000. Which preprocessing technique is essential here, and would you choose standardization or normalization? Why?

  2. Compare and contrast how you would handle a dataset with 5% missing values versus one with 40% missing values. What factors influence your choice of imputation vs. deletion?

  3. A collaborator sends you a dataset where they removed all outliers but didn't document which observations were removed or why. What reproducibility problems does this create, and how should this have been handled?

  4. You have a categorical variable "country" with 195 unique values. Compare one-hot encoding vs. label encoding for this feature—which would you choose and what problems might each approach cause?

  5. Your binary classification target has 95% negative cases and 5% positive cases. If an FRQ asks you to preprocess this data for a logistic regression model, which techniques would you apply and in what order?