Data preprocessing is where reproducible data science lives or dies. You're being tested on your ability to transform messy, real-world data into analysis-ready datasets—and more importantly, to document every decision so collaborators can understand and replicate your work. The techniques here connect directly to core concepts like statistical validity, model assumptions, algorithmic fairness, and computational reproducibility.
Think of preprocessing as the bridge between raw data and trustworthy results. Each technique addresses a specific problem: missing data threatens statistical power, outliers violate model assumptions, and inconsistent scales break distance-based algorithms. Don't just memorize what each technique does—know when to apply it, why it matters for your analysis, and how to document it for your team.
Before any analysis begins, your dataset needs to be accurate and consistent. Garbage in, garbage out isn't just a cliché—it's a fundamental principle of statistical inference.
Compare: Data cleaning vs. handling missing values—both address data quality, but cleaning fixes values that are recorded incorrectly, while missing-value handling deals with values that are absent altogether. FRQs often ask you to distinguish between these and justify your approach for each.
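Here is a minimal sketch of that distinction in pandas and scikit-learn, using a hypothetical table with made-up columns (`age`, `income`): cleaning converts an impossible sentinel value into an explicit missing value, and missing-value handling then imputes whatever is absent.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical survey data: 'age' contains an impossible sentinel value
# (a cleaning problem), while 'income' has genuinely missing entries
# (a missing-value problem).
df = pd.DataFrame({
    "age": [34.0, 29.0, -1.0, 51.0, 42.0],      # -1 is a sentinel for "unknown"
    "income": [52000, np.nan, 61000, np.nan, 48000],
})

# Cleaning: turn the invalid sentinel into an explicit missing value,
# and document the rule so collaborators can reproduce it.
df.loc[df["age"] < 0, "age"] = np.nan

# Missing-value handling: impute with the median (robust to skew).
imputer = SimpleImputer(strategy="median")
df[["age", "income"]] = imputer.fit_transform(df[["age", "income"]])

print(df)
```

In a real project, the sentinel rule and the imputation strategy would both go in a versioned preprocessing script, not a one-off notebook cell.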
Many statistical models and machine learning algorithms assume your data meets certain conditions. These techniques help you get there—or help you choose models that don't require these assumptions.
Compare: Outlier treatment vs. data transformation—outlier treatment targets individual extreme values, while transformation reshapes the entire distribution. If an FRQ asks about meeting normality assumptions, transformation is usually your answer; if it asks about influential points, discuss outliers.
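A short illustration of that contrast, assuming a hypothetical right-skewed `amounts` feature: the 1.5×IQR rule flags individual extreme points, while a log transform reshapes the whole distribution toward symmetry.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical right-skewed feature (e.g., transaction amounts).
amounts = pd.Series(rng.lognormal(mean=3, sigma=1, size=1000))

# Outlier treatment: flag individual extreme points with the 1.5*IQR rule.
q1, q3 = amounts.quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (amounts < q1 - 1.5 * iqr) | (amounts > q3 + 1.5 * iqr)
print(f"Flagged {is_outlier.sum()} potential outliers")

# Transformation: reshape the entire distribution, not just a few points.
log_amounts = np.log1p(amounts)  # log1p handles zeros safely
print(f"Skewness before: {amounts.skew():.2f}, after: {log_amounts.skew():.2f}")
```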
Different algorithms have different requirements for how features are represented and scaled. Getting this wrong can silently break your model.
Compare: Standardization vs. normalization—both are scaling methods, but standardization (z-scores: subtract the mean, divide by the standard deviation) leaves outliers recognizably extreme, while min-max normalization compresses every value into a fixed range such as [0, 1] and is itself distorted by outliers. Use standardization for algorithms that assume roughly Gaussian inputs; use normalization when you need bounded values.
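As a rough sketch with toy numbers (not from any real dataset), here is how the two scalers treat the same extreme value, plus a one-line look at one-hot encoding, since representation matters as much as scale:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy feature with one extreme value.
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Standardization: mean 0, standard deviation 1; the outlier stays far from the rest.
print(StandardScaler().fit_transform(x).ravel())

# Normalization: every value squeezed into [0, 1]; the single outlier
# compresses the other points toward 0.
print(MinMaxScaler().fit_transform(x).ravel())

# Categorical representation: one-hot encoding turns a nominal column into
# binary indicator columns that distance-based algorithms can use.
print(pd.get_dummies(pd.Series(["red", "blue", "red"]), prefix="color"))
```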
High-dimensional data creates computational challenges and can degrade model performance through the curse of dimensionality. These techniques help you work smarter, not harder.
Compare: Feature selection vs. dimensionality reduction—selection keeps original features (interpretable), while reduction creates new synthetic features (often more powerful but harder to explain). If interpretability matters for your analysis, prefer selection; if prediction performance is paramount, consider reduction.
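The contrast in code, using scikit-learn's built-in breast cancer dataset purely as a stand-in (a sketch, not a prescribed workflow): `SelectKBest` keeps five of the original, nameable columns, while PCA builds five new components that blend every input feature.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

data = load_breast_cancer()
X, y = data.data, data.target

# Feature selection: keep 5 original columns, which remain interpretable by name.
selector = SelectKBest(score_func=f_classif, k=5).fit(X, y)
print("Kept original features:", data.feature_names[selector.get_support()])

# Dimensionality reduction: 5 synthetic components that mix all 30 inputs,
# so no single column name describes any of them.
X_reduced = PCA(n_components=5).fit_transform(X)
print("PCA output shape:", X_reduced.shape)
```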
When your outcome variable has unequal class frequencies, standard algorithms optimize for the majority class and ignore the minority—often the class you care most about.
Compare: Oversampling vs. undersampling—oversampling preserves all your data but can cause overfitting to synthetic examples; undersampling avoids this but discards potentially useful information. SMOTE is often the default choice, but always validate with cross-validation that doesn't leak synthetic data into test sets.
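A sketch of that validation point, assuming the third-party imbalanced-learn (imblearn) package is available and using a synthetic dataset: placing SMOTE inside the pipeline means synthetic examples are generated only from each training fold, so they never leak into the held-out fold.

```python
# Assumes scikit-learn and imbalanced-learn (imblearn) are installed.
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced dataset: roughly 5% positive class.
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=42)

# SMOTE sits inside the pipeline, so oversampling is refit on each
# training fold and the test fold stays untouched by synthetic points.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("smote", SMOTE(random_state=42)),
    ("clf", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipe, X, y, cv=5, scoring="f1")
print(f"Cross-validated F1: {scores.mean():.3f}")
```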
| Concept | Best Examples |
|---|---|
| Data quality | Data cleaning, handling missing values, data integration |
| Statistical assumptions | Outlier treatment, log transformation, Box-Cox |
| Algorithm requirements | Feature scaling, categorical encoding |
| Complexity reduction | Feature selection, PCA, t-SNE |
| Class imbalance | SMOTE, undersampling, class weights |
| Reproducibility | Version-controlled scripts, documented decisions, audit trails |
| Distance-based methods | Standardization, normalization, one-hot encoding |
| Interpretability | Feature selection, filter methods, original feature retention |
1. You're building a k-means clustering model and notice one feature ranges from 0-1 while another ranges from 0-100,000. Which preprocessing technique is essential here, and would you choose standardization or normalization? Why?
2. Compare and contrast how you would handle a dataset with 5% missing values versus one with 40% missing values. What factors influence your choice of imputation vs. deletion?
3. A collaborator sends you a dataset where they removed all outliers but didn't document which observations were removed or why. What reproducibility problems does this create, and how should this have been handled?
4. You have a categorical variable "country" with 195 unique values. Compare one-hot encoding vs. label encoding for this feature—which would you choose and what problems might each approach cause?
5. Your binary classification target has 95% negative cases and 5% positive cases. If an FRQ asks you to preprocess this data for a logistic regression model, which techniques would you apply and in what order?