Data cleaning isn't just busywork—it's the foundation that determines whether your entire analysis is trustworthy or garbage. In this course, you're being tested on your ability to recognize why certain cleaning methods exist, when to apply them, and how your choices affect downstream modeling and inference. The methods you'll learn here connect directly to concepts like bias-variance tradeoffs, model assumptions, and the reproducibility of your analytical pipeline.
Think of data cleaning as making decisions under uncertainty. Every choice—whether to impute a missing value, remove an outlier, or encode a categorical variable—introduces assumptions into your analysis. The exam will push you to justify these decisions, not just execute them mechanically. Don't just memorize the techniques—know what problem each method solves and what tradeoffs it introduces.
Before any analysis can begin, you need complete, unique records. Missing values and duplicates distort summary statistics, bias model training, and can cause code to fail entirely. These are often the first issues you'll encounter in any real dataset.
- `df.isnull().sum()` or visualization tools like missingno to understand the pattern and extent of missingness
- `df.drop_duplicates()` with the `subset` parameter to specify which columns determine uniqueness

Compare: Missing values vs. duplicates—both corrupt your dataset, but missing values reduce information while duplicates artificially inflate it. For FRQs asking about data quality, identify which problem you're solving and why your approach is appropriate.
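Here's a minimal sketch of what these checks might look like in pandas, using a small hypothetical DataFrame (the column names `order_id`, `email`, and `amount` are illustrative, not from any specific dataset):

```python
import pandas as pd

# Hypothetical example: one missing email and one duplicated row
df = pd.DataFrame({
    "order_id": [101, 102, 102, 103],
    "email": ["a@x.com", "b@x.com", "b@x.com", None],
    "amount": [20.0, 35.5, 35.5, 12.0],
})

# Count missing values per column to gauge the extent of missingness
print(df.isnull().sum())

# Drop duplicates, treating rows with the same order_id and email as identical
deduped = df.drop_duplicates(subset=["order_id", "email"])
print(deduped)
```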
Outliers can represent errors, rare events, or genuinely extreme values. The statistical methods you use for detection depend on your assumptions about the underlying distribution. Your handling decision should be context-driven, not automatic.
Compare: Z-score vs. IQR detection—Z-scores assume approximately normal data and are sensitive to the very outliers you're trying to detect, while IQR is distribution-free and more robust. If an FRQ gives you skewed data, IQR is usually the safer choice.
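A short illustration of both detection rules on a hypothetical numeric Series; the thresholds shown (|z| > 3 and 1.5 × IQR) are common defaults, not the only valid choices:

```python
import numpy as np
import pandas as pd

values = pd.Series([10, 12, 11, 13, 12, 95])  # hypothetical data with one extreme point

# Z-score rule: flag points more than 3 standard deviations from the mean
z_scores = (values - values.mean()) / values.std()
z_outliers = values[np.abs(z_scores) > 3]

# IQR rule: flag points beyond 1.5 * IQR from the quartiles (distribution-free)
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
iqr_outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

print(z_outliers)    # empty here: the extreme point inflates the std, so its z-score stays below 3
print(iqr_outliers)  # the IQR rule still flags the extreme point
```

This is exactly the sensitivity issue the comparison above describes: the outlier distorts the mean and standard deviation used to detect it, while the quartiles barely move.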
Computers are literal—they can't add a string "5" to an integer 3 or compare dates stored as text. Proper data types enable correct operations and often dramatically improve memory efficiency and computation speed.
- Converting 3.7 to an integer yields 3; converting "N/A" strings to numeric may produce unexpected NaN values
- Downcasting `float64` to `float32` or integers to smaller types can halve memory usage on large datasets
- Use `pd.to_datetime()` for consistent parsing and manipulation of dates

Compare: String dates vs. datetime objects—strings allow storage but prevent arithmetic (calculating days between events), sorting correctly, or extracting components. Always convert dates to proper datetime types before analysis.
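A small sketch of these conversions with hypothetical column names; passing `errors="coerce"` turns unparseable values into NaN/NaT instead of raising:

```python
import pandas as pd

df = pd.DataFrame({
    "price": ["19.99", "N/A", "42.50"],                     # numbers stored as strings
    "signup": ["2023-01-15", "2023-02-30", "2023-03-01"],   # one invalid date
})

# String -> numeric: "N/A" becomes NaN instead of raising an error
df["price"] = pd.to_numeric(df["price"], errors="coerce")

# String -> datetime: the invalid date becomes NaT, and date arithmetic now works
df["signup"] = pd.to_datetime(df["signup"], errors="coerce")
df["days_since_signup"] = (pd.Timestamp("2023-06-01") - df["signup"]).dt.days

# Downcast to a smaller float type to save memory on large datasets
df["price"] = df["price"].astype("float32")
print(df.dtypes)
```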
Many algorithms assume features are on comparable scales. Gradient-based methods converge faster with scaled data, and distance-based algorithms like k-NN can be dominated by high-magnitude features. Choosing between standardization and normalization depends on your data and model.
Compare: Standardization vs. normalization—standardization preserves outlier information (outliers become large Z-scores) while normalization compresses everything to a fixed range. Choose standardization for algorithms assuming Gaussian inputs; choose normalization when bounded outputs matter.
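A brief side-by-side on a hypothetical single feature with one outlier; scikit-learn's StandardScaler and MinMaxScaler are used here for convenience, though both transforms can be written by hand:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[1.0], [2.0], [3.0], [100.0]])  # hypothetical feature with one outlier

# Standardization: zero mean, unit variance; the outlier remains visible as a large z-score
standardized = StandardScaler().fit_transform(X)

# Min-max normalization: squeezes everything into [0, 1]; the outlier compresses the other values
normalized = MinMaxScaler().fit_transform(X)

print(standardized.ravel())
print(normalized.ravel())
```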
Non-numeric data requires special handling before most algorithms can use it. The encoding choices you make affect model interpretability, dimensionality, and the relationships your model can learn.
- Clean text by trimming whitespace (`.strip()`), removing punctuation, and converting to consistent case—inconsistent text creates false distinctions

Compare: One-hot vs. ordinal encoding—one-hot treats categories as unrelated (no implied ordering) while ordinal encoding imposes a numeric relationship. Using ordinal encoding for nominal categories (like colors) introduces false assumptions your model will learn incorrectly.
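A sketch of text standardization followed by both encoding styles, with hypothetical `color` and `size` columns; `pd.get_dummies` handles the one-hot case, while the ordinal mapping is only appropriate when the categories genuinely have an order:

```python
import pandas as pd

df = pd.DataFrame({
    "color": [" Red", "red ", "BLUE", "blue"],       # nominal: no natural order
    "size": ["small", "large", "medium", "small"],   # ordinal: has a natural order
})

# Standardize text first so "Red" and "red " don't become separate categories
df["color"] = df["color"].str.strip().str.lower()

# One-hot encoding for nominal categories: no ordering implied
one_hot = pd.get_dummies(df["color"], prefix="color")

# Ordinal encoding only where an order genuinely exists
size_order = {"small": 0, "medium": 1, "large": 2}
df["size_encoded"] = df["size"].map(size_order)

print(pd.concat([df, one_hot], axis=1))
```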
Even after individual cleaning steps, data can contain logical inconsistencies and integrity violations. Validation rules act as guardrails that catch errors before they propagate through your analysis.
Compare: Cleaning vs. validation—cleaning fixes known issues while validation catches unexpected ones. A robust pipeline does both: clean what you anticipate, validate to catch what you didn't.
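A minimal sketch of rule-based validation using plain pandas, with hypothetical rules (non-negative ages, end dates not before start dates); dedicated libraries such as pandera or Great Expectations formalize the same idea at scale:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [34, -2, 51],
    "start": pd.to_datetime(["2023-01-01", "2023-02-01", "2023-03-01"]),
    "end": pd.to_datetime(["2023-01-10", "2023-01-15", "2023-03-05"]),
})

# Validation rules: flag rows that violate logical constraints
bad_age = df["age"] < 0               # ages must be non-negative
bad_dates = df["end"] < df["start"]   # end date must not precede start date

violations = df[bad_age | bad_dates]
if not violations.empty:
    print("Integrity violations found:")
    print(violations)
```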
| Concept | Best Examples |
|---|---|
| Information loss tradeoffs | Missing value deletion, outlier removal, float-to-int conversion |
| Distribution assumptions | Z-score outlier detection, standardization, mean imputation |
| Encoding decisions | One-hot encoding, ordinal encoding, rare category grouping |
| Scale sensitivity | Standardization, min-max normalization, feature scaling for gradient descent |
| Format standardization | Date parsing, text case normalization, categorical value mapping |
| Data leakage prevention | Fit scalers on training only, validate before splitting |
| Robustness to outliers | IQR method, median imputation, winsorizing |
You have a feature with 15% missing values that appear to be missing not at random (patients with severe symptoms skipped certain survey questions). Why might mean imputation be problematic here, and what alternative would you consider?
Compare Z-score standardization and min-max normalization: which would you choose for a dataset with significant outliers that you want to preserve, and why?
A categorical feature has 500 unique values, with 450 of them appearing fewer than 10 times each. What problem does this create for one-hot encoding, and how would you address it?
You're building a model and need to scale your features. Explain why you should fit your scaler on the training data only and transform both training and test sets with those parameters.
Your dataset contains a "country" column with entries like "USA", "U.S.", "United States", and "united states". Describe the cleaning steps you would take and explain why this inconsistency matters for analysis.