Data cleaning isn't just a preliminary chore—it's where reproducible science lives or dies. When you're working collaboratively on statistical analyses, messy data creates cascading problems: your teammate can't replicate your results, your models behave unpredictably, and your conclusions become suspect. The techniques in this guide address fundamental challenges in data integrity, consistency, and preparation that determine whether your analysis pipeline produces trustworthy, reproducible outputs.
You're being tested on your ability to recognize when each technique applies and why it matters for downstream analysis. A missing value strategy that works for one dataset might introduce bias in another. Scaling that improves one model could destroy interpretability in another. Don't just memorize the functions—know what problem each technique solves and what tradeoffs you're accepting when you apply it.
Missing data is rarely random. Understanding the mechanism behind missingness—whether it's Missing Completely at Random (MCAR), Missing at Random (MAR), or Missing Not at Random (MNAR)—determines which strategy will preserve the integrity of your analysis.
Use `df.isnull().sum()` or missingness heatmaps to reveal the structure of missing values across variables.

Compare: Simple deletion vs. imputation—both handle missing data, but deletion reduces sample size while imputation preserves it at the cost of introducing assumptions. If your data is MCAR and missingness is below 5%, deletion is often defensible; otherwise, imputation with documented assumptions is more reproducible.
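To make the tradeoff concrete, here is a minimal pandas sketch; the `df`, `income`, and `education` names are invented for illustration. It checks the missingness pattern, then shows deletion and median imputation side by side.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'income' and 'education' are invented column names.
df = pd.DataFrame({
    "income": [52000, np.nan, 61000, 48000, np.nan, 75000],
    "education": ["HS", "BA", "BA", "HS", "MA", "MA"],
})

# Step 1: inspect the missingness structure before choosing a strategy.
print(df.isnull().sum())

# Option A: deletion -- defensible only when missingness is MCAR and small.
deleted = df.dropna(subset=["income"])

# Option B: median imputation -- keeps all rows, but shrinks variance and
# assumes the missing incomes look like the observed ones.
imputed = df.copy()
imputed["income"] = imputed["income"].fillna(imputed["income"].median())
```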
Inconsistent data is a collaboration killer. When one team member formats dates as "MM/DD/YYYY" and another uses "YYYY-MM-DD," merge operations fail silently and analyses diverge. Standardization creates a shared language for your data.
`str.lower()`, `str.strip()`, and regex patterns enforce uniformity across text fields.

Compare: Formatting fixes vs. duplicate removal—both improve consistency, but formatting preserves all records while deduplication reduces them. Always deduplicate after standardizing formats, or you'll miss duplicates that differ only in formatting.
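A short sketch of the "standardize first, then deduplicate" order, using hypothetical `country` and `amount` columns:

```python
import pandas as pd

# Hypothetical messy values; 'country' and 'amount' are invented columns.
df = pd.DataFrame({
    "country": [" USA", "usa", "U.S.A.", "Canada ", "CANADA"],
    "amount": [100, 100, 100, 250, 250],
})

# Standardize formatting first: trim whitespace, lowercase, strip punctuation.
df["country"] = (
    df["country"]
    .str.strip()
    .str.lower()
    .str.replace(r"[^\w\s]", "", regex=True)
)

# Only now deduplicate, so rows that differed only in formatting collapse.
df = df.drop_duplicates()
```

Run the deduplication before standardizing and the five rows above would all survive as "distinct" records.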
The wrong data type doesn't just cause errors—it causes silent errors. A year stored as a float might sort correctly but produce nonsense when you calculate differences. Type discipline is foundational to reproducible analysis.
`astype()` ensures columns behave as expected in calculations and joins, and a quick check like `assert df['year'].dtype == 'int64'` catches silent type drift before it propagates.

Compare: Inner join vs. left join—both combine datasets, but inner joins keep only matches while left joins preserve all records from the primary table. For reproducibility, document which join type you used and why unmatched records were acceptable to lose (or keep as nulls).
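The following sketch uses two invented tables, `sales` and `targets`, to show explicit casting of the join key, a dtype assertion, and pandas' `validate` argument as one way to make join behavior explicit:

```python
import pandas as pd

# Hypothetical tables; 'sales' and 'targets' are illustrative names.
sales = pd.DataFrame({"year": ["2021", "2022", "2023"], "revenue": [1.2, 1.5, 1.9]})
targets = pd.DataFrame({"year": [2021, 2022], "target": [1.0, 1.4]})

# Explicit casting keeps the join keys type-compatible.
sales["year"] = sales["year"].astype("int64")
assert sales["year"].dtype == "int64"

# Left join keeps every sales record (2023 gets a null target);
# validate="one_to_one" raises if duplicate keys would silently multiply rows.
merged = sales.merge(targets, on="year", how="left", validate="one_to_one")
```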
Outliers aren't inherently bad—they might be your most interesting data points or they might be errors. The key is understanding their source before deciding how to handle them.
Compare: Z-score vs. IQR methods—both detect outliers, but Z-scores assume normality while IQR works for any distribution. Use IQR for skewed data; use Z-scores when you've verified approximate normality.
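A quick illustration with an invented `response_time` series; both rules are shown so you can compare what each flags:

```python
import pandas as pd

# Hypothetical series with one extreme value.
s = pd.Series([10, 12, 11, 13, 12, 11, 95], name="response_time")

# Z-score rule (assumes approximate normality). On a tiny sample like this,
# the extreme value inflates the mean and std, so it can escape the |z| > 3 cut.
z = (s - s.mean()) / s.std()
z_outliers = s[z.abs() > 3]

# IQR rule (distribution-free): flag anything beyond 1.5 * IQR from the quartiles.
q1, q3 = s.quantile([0.25, 0.75])
iqr = q3 - q1
iqr_outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
```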
Many algorithms assume features are on comparable scales. Without scaling, a feature measured in millions will dominate one measured in decimals—not because it's more important, but because it's bigger.
Compare: Standardization vs. normalization—both rescale features, but standardization centers around zero while normalization bounds to a range. Use standardization for most regression and clustering; use normalization for neural networks or when you need strict bounds.
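A minimal scikit-learn sketch (the feature values are invented) showing both scalers and the fit-transform separation that keeps test data from leaking into the scaling statistics:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Invented feature matrix with a train/test split.
X_train = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
X_test = np.array([[1.5, 300.0]])

# Standardization: zero mean, unit variance per feature.
std = StandardScaler().fit(X_train)   # fit on training data only
X_train_std = std.transform(X_train)
X_test_std = std.transform(X_test)    # reuse training statistics (no leakage)

# Min-Max normalization: bound each feature to [0, 1] based on the training range.
mm = MinMaxScaler().fit(X_train)
X_train_mm = mm.transform(X_train)
X_test_mm = mm.transform(X_test)
```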
Statistical models need numbers. Converting categories and text into numerical representations requires careful choices that affect both model performance and interpretability.
Compare: One-hot vs. label encoding—both convert categories to numbers, but one-hot treats categories as independent while label encoding implies order. Use label encoding only for truly ordinal variables (like education level); use one-hot for nominal categories (like country).
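A small pandas sketch, assuming `country` is nominal and `education` is ordinal with the order HS < BA < MA:

```python
import pandas as pd

# Invented frame: 'country' is nominal, 'education' is ordinal.
df = pd.DataFrame({
    "country": ["USA", "Canada", "Mexico", "Canada"],
    "education": ["HS", "BA", "MA", "HS"],
})

# Nominal category: one-hot encoding, no implied order between countries.
df = pd.get_dummies(df, columns=["country"], prefix="country")

# Ordinal category: integer codes whose order is meaningful (HS < BA < MA).
df["education"] = df["education"].map({"HS": 0, "BA": 1, "MA": 2})
```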
Some data problems aren't about individual values—they're about relationships between features or imbalances in your target variable that can mislead your entire analysis.
Compare: Oversampling vs. undersampling—both address imbalance, but oversampling preserves all data while undersampling discards majority examples. Use oversampling when data is scarce; use undersampling when the majority class is very large and you can afford to lose examples.
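A bare-bones illustration of the two directions using plain random resampling in pandas (column names and class counts are invented); SMOTE, listed in the table below, is a more sophisticated oversampling variant that synthesizes new minority examples rather than duplicating existing ones.

```python
import pandas as pd

# Invented imbalanced data: label 1 is the rare class (2 of 10 rows).
df = pd.DataFrame({"feature": range(10), "label": [0] * 8 + [1] * 2})
majority = df[df["label"] == 0]
minority = df[df["label"] == 1]

# Random oversampling: duplicate minority rows until the classes match.
oversampled = pd.concat(
    [majority, minority.sample(len(majority), replace=True, random_state=0)]
)

# Random undersampling: discard majority rows until the classes match.
undersampled = pd.concat(
    [majority.sample(len(minority), random_state=0), minority]
)
```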
| Concept | Best Techniques |
|---|---|
| Missing Data | Null checks, mean/median imputation, multiple imputation, predictive imputation |
| Consistency | Format standardization, spelling correction, duplicate removal |
| Data Types | Explicit casting, datetime conversion, merge validation |
| Outliers | Z-score detection, IQR method, transformation, robust statistics |
| Scaling | Standardization (Z-score), Min-Max normalization, fit-transform separation |
| Categorical Encoding | One-hot encoding, label encoding, target encoding |
| Text Preparation | Case normalization, tokenization, stop word removal |
| Structural Issues | VIF for multicollinearity, SMOTE for imbalance, appropriate metrics |
You have a dataset where 15% of income values are missing, and missingness correlates with education level. Which imputation strategy is most appropriate, and why would simple deletion bias your results?
Compare standardization and Min-Max normalization: which would you choose for a k-means clustering task, and which for a neural network with sigmoid activations? Explain your reasoning.
Your merged dataset has more rows than either source table. What type of join likely caused this, and what validation step should you add to your pipeline?
A colleague encoded a "country" variable using label encoding (USA=1, Canada=2, Mexico=3) before training a linear regression. What problem does this create, and how would you fix it?
You're building a fraud detection model where only 0.5% of transactions are fraudulent. If your model achieves 99.5% accuracy, why might this be meaningless? What metrics and data cleaning techniques should you use instead?