
📊 Principles of Data Science

Data Cleaning Methods


Why This Matters

Data cleaning isn't just busywork—it's the foundation that determines whether your entire analysis is trustworthy or garbage. In this course, you're being tested on your ability to recognize why certain cleaning methods exist, when to apply them, and how your choices affect downstream modeling and inference. The methods you'll learn here connect directly to concepts like bias-variance tradeoffs, model assumptions, and the reproducibility of your analytical pipeline.

Think of data cleaning as making decisions under uncertainty. Every choice—whether to impute a missing value, remove an outlier, or encode a categorical variable—introduces assumptions into your analysis. The exam will push you to justify these decisions, not just execute them mechanically. Don't just memorize the techniques—know what problem each method solves and what tradeoffs it introduces.


Handling Missing and Duplicate Data

Before any analysis can begin, you need complete, unique records. Missing values and duplicates distort summary statistics, bias model training, and can cause code to fail entirely. These are often the first issues you'll encounter in any real dataset.

Handling Missing Values

  • Identification comes first—use methods like df.isnull().sum() or visualization tools like missingno to understand the pattern and extent of missingness
  • Imputation strategies include mean, median, mode, or predictive models; the right choice depends on whether data is missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR)
  • Deletion vs. imputation tradeoff—removing rows loses information and can introduce bias, while imputation adds assumptions about the underlying distribution
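
A minimal pandas sketch of the identify-then-impute workflow, using a small hypothetical DataFrame (column names and values are illustrative):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age": [29, np.nan, 41, 35, np.nan],
    "city": ["NY", "LA", None, "NY", "LA"],
})

# Identification: extent of missingness per column
print(df.isnull().sum())

# Median imputation for a numeric column (robust to skew)
df["age"] = df["age"].fillna(df["age"].median())

# Mode imputation for a categorical column
df["city"] = df["city"].fillna(df["city"].mode()[0])

# Alternative: drop any rows that remain incomplete (at the cost of information)
df = df.dropna()
```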

Removing Duplicates

  • Define your duplicate criteria carefully—exact duplicates across all columns differ from duplicates on key identifiers only
  • Use built-in functions like df.drop_duplicates() with the subset parameter to specify which columns determine uniqueness
  • Assess downstream impact—duplicates inflate counts, skew distributions, and can cause data leakage if the same observation appears in both training and test sets
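
A short pandas sketch, assuming a hypothetical transactions table where customer_id is the key identifier:

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "purchase": [9.99, 9.99, 24.50, 5.00],
})

# Count exact duplicates across all columns
print(df.duplicated().sum())

# Drop exact duplicates, keeping the first occurrence
df = df.drop_duplicates()

# Drop duplicates on a key identifier only
df = df.drop_duplicates(subset=["customer_id"], keep="first")
```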

Compare: Missing values vs. duplicates—both corrupt your dataset, but missing values reduce information while duplicates artificially inflate it. For FRQs asking about data quality, identify which problem you're solving and why your approach is appropriate.


Detecting and Managing Outliers

Outliers can represent errors, rare events, or genuinely extreme values. The statistical methods you use for detection depend on your assumptions about the underlying distribution. Your handling decision should be context-driven, not automatic.

Dealing with Outliers

  • Detection methods include Z-scores (assumes normality, flags points beyond $\pm 3\sigma$), the IQR method (robust to non-normality, flags points outside $Q_1 - 1.5 \times IQR$ to $Q_3 + 1.5 \times IQR$), and visualization
  • Context determines action—a $10 million transaction might be an error in retail data but valid in real estate; domain knowledge matters more than statistical rules
  • Transformation options include winsorizing (capping extreme values), log transforms to reduce skew, or robust statistical methods that downweight outliers automatically
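
A minimal sketch of both detection rules plus IQR-based capping, on a tiny hypothetical feature:

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 11, 95])  # one extreme value

# Z-score rule: flag points more than 3 standard deviations from the mean.
# On a small sample like this the extreme point inflates the std, so the rule
# can miss it (the sensitivity noted in the comparison below).
z = (s - s.mean()) / s.std()
z_outliers = s[z.abs() > 3]

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
iqr_outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]  # flags 95

# Winsorize: cap values at the IQR fences instead of removing them
s_capped = s.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)
```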

Compare: Z-score vs. IQR detection—Z-scores assume approximately normal data and are sensitive to the very outliers you're trying to detect, while IQR is distribution-free and more robust. If an FRQ gives you skewed data, IQR is usually the safer choice.


Type Conversion and Formatting

Computers are literal—they can't add a string "5" to an integer 3 or compare dates stored as text. Proper data types enable correct operations and often dramatically improve memory efficiency and computation speed.

Data Type Conversion

  • Match types to operations—numerical analysis requires numeric types; categorical grouping works best with proper category dtypes; dates need datetime objects
  • Watch for silent data loss—converting 3.7 to integer yields 3; converting "N/A" strings to numeric may produce unexpected NaN values
  • Memory optimization matters—downcasting floats from float64 to float32 or integers to smaller types can halve memory usage on large datasets
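
A minimal pandas sketch of these conversions (the column names and values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "price": ["19.99", "5.00", "N/A"],
    "count": [3.0, 7.0, 2.0],
    "city": ["NY", "LA", "NY"],
})

# Coerce strings to numeric; unparseable values like "N/A" become NaN
df["price"] = pd.to_numeric(df["price"], errors="coerce")

# Downcast to smaller types to cut memory use (float64 -> float32 here)
df["price"] = pd.to_numeric(df["price"], downcast="float")
df["count"] = df["count"].astype("int32")  # a value like 3.7 would truncate to 3

# Repetitive strings group faster and store smaller as the category dtype
df["city"] = df["city"].astype("category")

print(df.dtypes)
```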

Date and Time Formatting

  • Standardize to ISO 8601 format (YYYY-MM-DD) or convert to datetime objects using pd.to_datetime() for consistent parsing and manipulation
  • Time zone handling is critical—naive datetime objects (no timezone) can cause errors when comparing timestamps from different sources
  • Extract useful features—datetime objects enable extraction of day-of-week, month, hour, and other temporal features for analysis
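
A minimal pandas sketch, assuming ISO 8601 date strings in a hypothetical order_date column:

```python
import pandas as pd

df = pd.DataFrame({"order_date": ["2024-01-15", "2024-02-03", "2024-02-17"]})

# Parse strings into datetime objects
df["order_date"] = pd.to_datetime(df["order_date"])

# Attach a time zone so timestamps from different sources compare correctly
df["order_date"] = df["order_date"].dt.tz_localize("UTC")

# Extract temporal features for analysis
df["day_of_week"] = df["order_date"].dt.day_name()
df["month"] = df["order_date"].dt.month

# Date arithmetic now works: days between consecutive orders
df["days_since_prev"] = df["order_date"].diff().dt.days
```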

Compare: String dates vs. datetime objects—strings allow storage but prevent arithmetic (calculating days between events), sorting correctly, or extracting components. Always convert dates to proper datetime types before analysis.


Scaling and Normalization

Many algorithms assume features are on comparable scales. Gradient-based methods converge faster with scaled data, and distance-based algorithms like k-NN can be dominated by high-magnitude features. Choosing between standardization and normalization depends on your data and model.

Standardization and Normalization

  • Standardization (Z-score scaling) transforms data to mean $\mu = 0$ and standard deviation $\sigma = 1$ using $z = \frac{x - \mu}{\sigma}$; preferred when data is approximately Gaussian
  • Min-max normalization rescales to a fixed range (typically $[0, 1]$) using $x_{norm} = \frac{x - x_{min}}{x_{max} - x_{min}}$; useful when you need bounded values or for neural networks
  • Fit on training data only—calculate scaling parameters from training set and apply to test set to prevent data leakage
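
A minimal sketch assuming scikit-learn is available; the train/test arrays are illustrative. Both scalers are fit on the training data only:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X_train = np.array([[1.0], [2.0], [3.0], [100.0]])  # hypothetical feature
X_test = np.array([[2.5], [50.0]])

# Standardization: learn mean and std from the training set only
std_scaler = StandardScaler().fit(X_train)
X_train_std = std_scaler.transform(X_train)
X_test_std = std_scaler.transform(X_test)   # reuse training parameters, so no leakage

# Min-max normalization to [0, 1] with the same fit/transform discipline
mm_scaler = MinMaxScaler().fit(X_train)
X_train_mm = mm_scaler.transform(X_train)
X_test_mm = mm_scaler.transform(X_test)     # test values outside the training range can map outside [0, 1]
```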

Compare: Standardization vs. normalization—standardization preserves outlier information (outliers become large Z-scores) while normalization compresses everything to a fixed range. Choose standardization for algorithms assuming Gaussian inputs; choose normalization when bounded outputs matter.


Text and Categorical Data Processing

Non-numeric data requires special handling before most algorithms can use it. The encoding choices you make affect model interpretability, dimensionality, and the relationships your model can learn.

String Cleaning and Text Preprocessing

  • Basic cleaning includes stripping whitespace (.strip()), removing punctuation, and converting to consistent case—inconsistent text creates false distinctions
  • Tokenization splits text into units (words, characters, subwords) for analysis; stop word removal eliminates common words like "the" and "is" that add noise
  • Stemming vs. lemmatization—stemming chops word endings (running → run) while lemmatization uses vocabulary to find true base forms (better → good); lemmatization is more accurate but slower
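
A minimal sketch using only the Python standard library; the stop list is illustrative, and stemming or lemmatization would require a library such as NLTK or spaCy:

```python
import re

text = "  The Runners were running FASTER than expected!  "

# Basic cleaning: strip whitespace, lowercase, remove punctuation
cleaned = re.sub(r"[^\w\s]", "", text.strip().lower())

# Tokenization: split into word tokens
tokens = cleaned.split()

# Stop word removal with a small illustrative stop list (real lists are larger)
stop_words = {"the", "is", "a", "an", "were", "than"}
tokens = [t for t in tokens if t not in stop_words]

print(tokens)  # ['runners', 'running', 'faster', 'expected']
```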

Handling Categorical Variables

  • One-hot encoding creates binary columns for each category—necessary for most algorithms but increases dimensionality with high-cardinality features
  • Ordinal encoding assigns integers to categories—appropriate only when categories have meaningful order (low < medium < high)
  • Group rare categories into an "Other" bucket to reduce dimensionality and prevent overfitting to categories with few observations
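
A minimal pandas sketch of these encoding choices; the rarity threshold and ordering dictionary are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "color": ["red", "blue", "red", "teal"],   # nominal: no inherent order
    "size": ["low", "high", "medium", "low"],  # ordinal: meaningful order
})

# Group rare categories into "Other" before encoding
counts = df["color"].value_counts()
rare = counts[counts < 2].index.tolist()
df["color"] = df["color"].replace(rare, "Other")

# One-hot encode the nominal feature
df = pd.get_dummies(df, columns=["color"])

# Ordinal encode the feature with a meaningful order
df["size"] = df["size"].map({"low": 0, "medium": 1, "high": 2})
```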

Compare: One-hot vs. ordinal encoding—one-hot treats categories as unrelated (no implied ordering) while ordinal encoding imposes a numeric relationship. Using ordinal encoding for nominal categories (like colors) introduces false assumptions your model will learn incorrectly.


Consistency and Validation

Even after individual cleaning steps, data can contain logical inconsistencies and integrity violations. Validation rules act as guardrails that catch errors before they propagate through your analysis.

Handling Inconsistent Data

  • Standardize representations—"USA", "United States", "US", and "U.S.A." should map to a single canonical form using dictionaries or fuzzy matching
  • Establish data validation rules—ages should be positive, dates should be in valid ranges, categories should match allowed values
  • Regular auditing catches drift over time—data entry practices change, upstream systems evolve, and new inconsistencies emerge
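
A minimal sketch of dictionary-based standardization; the mapping is illustrative and would normally be built from the values actually observed in the column:

```python
import pandas as pd

df = pd.DataFrame({"country": ["USA", "United States", "U.S.A.", "us"]})

# Map known variants to one canonical form
canonical = {
    "usa": "United States",
    "u.s.": "United States",
    "u.s.a.": "United States",
    "us": "United States",
    "united states": "United States",
}
normalized = df["country"].str.strip().str.lower().map(canonical)
df["country"] = normalized.fillna(df["country"])  # keep unmapped values for review
```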

Data Validation and Integrity Checks

  • Schema validation ensures data conforms to expected types, ranges, and formats before processing begins
  • Referential integrity verifies that relationships between tables are valid—foreign keys should reference existing primary keys
  • Automated pipelines should include validation checkpoints that flag anomalies and halt processing when critical rules are violated
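
A minimal sketch of validation checkpoints using plain assertions; the rules, column names, and tables are illustrative (dedicated validation libraries exist for production pipelines):

```python
import pandas as pd

def validate(customers: pd.DataFrame, orders: pd.DataFrame) -> None:
    """Halt the pipeline if a critical rule is violated."""
    # Schema check: expected column with a numeric type
    assert pd.api.types.is_numeric_dtype(customers["age"]), "age must be numeric"
    # Range check: flag impossible values
    assert customers["age"].between(0, 120).all(), "age outside valid range"
    # Allowed-values check
    assert set(customers["status"]) <= {"active", "inactive"}, "unexpected status"
    # Referential integrity: every order must reference an existing customer
    assert orders["customer_id"].isin(customers["customer_id"]).all(), "orphaned order"

customers = pd.DataFrame({"customer_id": [1, 2], "age": [34, 29],
                          "status": ["active", "inactive"]})
orders = pd.DataFrame({"customer_id": [1, 2, 2], "amount": [10.0, 5.5, 7.0]})
validate(customers, orders)  # passes silently; raises AssertionError on violations
```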

Compare: Cleaning vs. validation—cleaning fixes known issues while validation catches unexpected ones. A robust pipeline does both: clean what you anticipate, validate to catch what you didn't.


Quick Reference Table

| Concept | Best Examples |
| --- | --- |
| Information loss tradeoffs | Missing value deletion, outlier removal, float-to-int conversion |
| Distribution assumptions | Z-score outlier detection, standardization, mean imputation |
| Encoding decisions | One-hot encoding, ordinal encoding, rare category grouping |
| Scale sensitivity | Standardization, min-max normalization, feature scaling for gradient descent |
| Format standardization | Date parsing, text case normalization, categorical value mapping |
| Data leakage prevention | Fit scalers on training only, validate before splitting |
| Robustness to outliers | IQR method, median imputation, winsorizing |

Self-Check Questions

  1. You have a feature with 15% missing values that appear to be missing not at random (patients with severe symptoms skipped certain survey questions). Why might mean imputation be problematic here, and what alternative would you consider?

  2. Compare Z-score standardization and min-max normalization: which would you choose for a dataset with significant outliers that you want to preserve, and why?

  3. A categorical feature has 500 unique values, with 450 of them appearing fewer than 10 times each. What problem does this create for one-hot encoding, and how would you address it?

  4. You're building a model and need to scale your features. Explain why you should fit your scaler on the training data only and transform both training and test sets with those parameters.

  5. Your dataset contains a "country" column with entries like "USA", "U.S.", "United States", and "united states". Describe the cleaning steps you would take and explain why this inconsistency matters for analysis.