
📊 Principles of Data Science

Data Cleaning Methods


Why This Matters

Data cleaning isn't just busywork—it's the foundation that determines whether your entire analysis is trustworthy or garbage. In this course, you're being tested on your ability to recognize why certain cleaning methods exist, when to apply them, and how your choices affect downstream modeling and inference. The methods you'll learn here connect directly to concepts like bias-variance tradeoffs, model assumptions, and the reproducibility of your analytical pipeline.

Think of data cleaning as making decisions under uncertainty. Every choice—whether to impute a missing value, remove an outlier, or encode a categorical variable—introduces assumptions into your analysis. The exam will push you to justify these decisions, not just execute them mechanically. Don't just memorize the techniques—know what problem each method solves and what tradeoffs it introduces.


Handling Missing and Duplicate Data

Before any analysis can begin, you need complete, unique records. Missing values and duplicates distort summary statistics, bias model training, and can cause code to fail entirely. These are often the first issues you'll encounter in any real dataset.

Handling Missing Values

  • Identification comes first—use methods like df.isnull().sum() or visualization tools like missingno to understand the pattern and extent of missingness
  • Imputation strategies include mean, median, mode, or predictive models; the right choice depends on whether data is missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR)
  • Deletion vs. imputation tradeoff—removing rows loses information and can introduce bias, while imputation adds assumptions about the underlying distribution
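
A minimal pandas sketch of the identify-then-impute workflow, using a small hypothetical DataFrame (column names and values are illustrative):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age": [29, np.nan, 41, 35, np.nan],
    "city": ["NY", "LA", None, "NY", "LA"],
})

# Identification: extent of missingness per column
print(df.isnull().sum())

# Median imputation for a numeric column (robust to skew)
df["age"] = df["age"].fillna(df["age"].median())

# Mode imputation for a categorical column
df["city"] = df["city"].fillna(df["city"].mode()[0])

# Alternative: drop any rows that remain incomplete (at the cost of information)
df = df.dropna()
```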

Removing Duplicates

  • Define your duplicate criteria carefully—exact duplicates across all columns differ from duplicates on key identifiers only
  • Use built-in functions like df.drop_duplicates() with the subset parameter to specify which columns determine uniqueness
  • Assess downstream impact—duplicates inflate counts, skew distributions, and can cause data leakage if the same observation appears in both training and test sets
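
A short pandas sketch, assuming a hypothetical transactions table where customer_id is the key identifier:

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "purchase": [9.99, 9.99, 24.50, 5.00],
})

# Count exact duplicates across all columns
print(df.duplicated().sum())

# Drop exact duplicates, keeping the first occurrence
df = df.drop_duplicates()

# Drop duplicates on a key identifier only
df = df.drop_duplicates(subset=["customer_id"], keep="first")
```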

Compare: Missing values vs. duplicates—both corrupt your dataset, but missing values reduce information while duplicates artificially inflate it. For FRQs asking about data quality, identify which problem you're solving and why your approach is appropriate.


Detecting and Managing Outliers

Outliers can represent errors, rare events, or genuinely extreme values. The statistical methods you use for detection depend on your assumptions about the underlying distribution. Your handling decision should be context-driven, not automatic.

Dealing with Outliers

  • Detection methods include Z-scores (assumes normality, flags points beyond $\pm 3\sigma$), the IQR method (robust to non-normality, flags points outside $Q_1 - 1.5 \times IQR$ to $Q_3 + 1.5 \times IQR$), and visualization
  • Context determines action—a $10 million transaction might be an error in retail data but valid in real estate; domain knowledge matters more than statistical rules
  • Transformation options include winsorizing (capping extreme values), log transforms to reduce skew, or robust statistical methods that downweight outliers automatically
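
A minimal sketch of both detection rules plus IQR-based capping, on a tiny hypothetical feature:

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 11, 95])  # one extreme value

# Z-score rule: flag points more than 3 standard deviations from the mean.
# On a small sample like this the extreme point inflates the std, so the rule
# can miss it (the sensitivity noted in the comparison below).
z = (s - s.mean()) / s.std()
z_outliers = s[z.abs() > 3]

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
iqr_outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]  # flags 95

# Winsorize: cap values at the IQR fences instead of removing them
s_capped = s.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)
```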

Compare: Z-score vs. IQR detection—Z-scores assume approximately normal data and are sensitive to the very outliers you're trying to detect, while IQR is distribution-free and more robust. If an FRQ gives you skewed data, IQR is usually the safer choice.


Type Conversion and Formatting

Computers are literal—they can't add a string "5" to an integer 3 or compare dates stored as text. Proper data types enable correct operations and often dramatically improve memory efficiency and computation speed.

Data Type Conversion

  • Match types to operations—numerical analysis requires numeric types; categorical grouping works best with proper category dtypes; dates need datetime objects
  • Watch for silent data loss—converting 3.7 to integer yields 3; converting "N/A" strings to numeric may produce unexpected NaN values
  • Memory optimization matters—downcasting floats from float64 to float32 or integers to smaller types can halve memory usage on large datasets
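
A minimal pandas sketch of these conversions (the column names and values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "price": ["19.99", "5.00", "N/A"],
    "count": [3.0, 7.0, 2.0],
    "city": ["NY", "LA", "NY"],
})

# Coerce strings to numeric; unparseable values like "N/A" become NaN
df["price"] = pd.to_numeric(df["price"], errors="coerce")

# Downcast to smaller types to cut memory use (float64 -> float32 here)
df["price"] = pd.to_numeric(df["price"], downcast="float")
df["count"] = df["count"].astype("int32")  # a value like 3.7 would truncate to 3

# Repetitive strings group faster and store smaller as the category dtype
df["city"] = df["city"].astype("category")

print(df.dtypes)
```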

Date and Time Formatting

  • Standardize to ISO 8601 format (YYYY-MM-DD) or convert to datetime objects using pd.to_datetime() for consistent parsing and manipulation
  • Time zone handling is critical—naive datetime objects (no timezone) can cause errors when comparing timestamps from different sources
  • Extract useful features—datetime objects enable extraction of day-of-week, month, hour, and other temporal features for analysis
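
A minimal pandas sketch, assuming ISO 8601 date strings in a hypothetical order_date column:

```python
import pandas as pd

df = pd.DataFrame({"order_date": ["2024-01-15", "2024-02-03", "2024-02-17"]})

# Parse strings into datetime objects
df["order_date"] = pd.to_datetime(df["order_date"])

# Attach a time zone so timestamps from different sources compare correctly
df["order_date"] = df["order_date"].dt.tz_localize("UTC")

# Extract temporal features for analysis
df["day_of_week"] = df["order_date"].dt.day_name()
df["month"] = df["order_date"].dt.month

# Date arithmetic now works: days between consecutive orders
df["days_since_prev"] = df["order_date"].diff().dt.days
```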

Compare: String dates vs. datetime objects—strings allow storage but prevent arithmetic (calculating days between events), sorting correctly, or extracting components. Always convert dates to proper datetime types before analysis.


Scaling and Normalization

Many algorithms assume features are on comparable scales. Gradient-based methods converge faster with scaled data, and distance-based algorithms like k-NN can be dominated by high-magnitude features. Choosing between standardization and normalization depends on your data and model.

Standardization and Normalization

  • Standardization (Z-score scaling) transforms data to mean $\mu = 0$ and standard deviation $\sigma = 1$ using $z = \frac{x - \mu}{\sigma}$; preferred when data is approximately Gaussian
  • Min-max normalization rescales to a fixed range (typically $[0, 1]$) using $x_{norm} = \frac{x - x_{min}}{x_{max} - x_{min}}$; useful when you need bounded values or for neural networks
  • Fit on training data only—calculate scaling parameters from training set and apply to test set to prevent data leakage
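
A minimal sketch assuming scikit-learn is available; the train/test arrays are illustrative. Both scalers are fit on the training data only:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X_train = np.array([[1.0], [2.0], [3.0], [100.0]])  # hypothetical feature
X_test = np.array([[2.5], [50.0]])

# Standardization: learn mean and std from the training set only
std_scaler = StandardScaler().fit(X_train)
X_train_std = std_scaler.transform(X_train)
X_test_std = std_scaler.transform(X_test)   # reuse training parameters, so no leakage

# Min-max normalization to [0, 1] with the same fit/transform discipline
mm_scaler = MinMaxScaler().fit(X_train)
X_train_mm = mm_scaler.transform(X_train)
X_test_mm = mm_scaler.transform(X_test)     # test values outside the training range can map outside [0, 1]
```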

Compare: Standardization vs. normalization—standardization preserves outlier information (outliers become large Z-scores) while normalization compresses everything to a fixed range. Choose standardization for algorithms assuming Gaussian inputs; choose normalization when bounded outputs matter.


Text and Categorical Data Processing

Non-numeric data requires special handling before most algorithms can use it. The encoding choices you make affect model interpretability, dimensionality, and the relationships your model can learn.

String Cleaning and Text Preprocessing

  • Basic cleaning includes stripping whitespace (.strip()), removing punctuation, and converting to consistent case—inconsistent text creates false distinctions
  • Tokenization splits text into units (words, characters, subwords) for analysis; stop word removal eliminates common words like "the" and "is" that add noise
  • Stemming vs. lemmatization—stemming chops word endings (running → run) while lemmatization uses vocabulary to find true base forms (better → good); lemmatization is more accurate but slower
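
A minimal sketch using only the Python standard library; the stop list is illustrative, and stemming or lemmatization would require a library such as NLTK or spaCy:

```python
import re

text = "  The Runners were running FASTER than expected!  "

# Basic cleaning: strip whitespace, lowercase, remove punctuation
cleaned = re.sub(r"[^\w\s]", "", text.strip().lower())

# Tokenization: split into word tokens
tokens = cleaned.split()

# Stop word removal with a small illustrative stop list (real lists are larger)
stop_words = {"the", "is", "a", "an", "were", "than"}
tokens = [t for t in tokens if t not in stop_words]

print(tokens)  # ['runners', 'running', 'faster', 'expected']
```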

Handling Categorical Variables

  • One-hot encoding creates binary columns for each category—necessary for most algorithms but increases dimensionality with high-cardinality features
  • Ordinal encoding assigns integers to categories—appropriate only when categories have meaningful order (low < medium < high)
  • Group rare categories into an "Other" bucket to reduce dimensionality and prevent overfitting to categories with few observations
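
A minimal pandas sketch of these encoding choices; the rarity threshold and ordering dictionary are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "color": ["red", "blue", "red", "teal"],   # nominal: no inherent order
    "size": ["low", "high", "medium", "low"],  # ordinal: meaningful order
})

# Group rare categories into "Other" before encoding
counts = df["color"].value_counts()
rare = counts[counts < 2].index.tolist()
df["color"] = df["color"].replace(rare, "Other")

# One-hot encode the nominal feature
df = pd.get_dummies(df, columns=["color"])

# Ordinal encode the feature with a meaningful order
df["size"] = df["size"].map({"low": 0, "medium": 1, "high": 2})
```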

Compare: One-hot vs. ordinal encoding—one-hot treats categories as unrelated (no implied ordering) while ordinal encoding imposes a numeric relationship. Using ordinal encoding for nominal categories (like colors) introduces false assumptions your model will learn incorrectly.


Consistency and Validation

Even after individual cleaning steps, data can contain logical inconsistencies and integrity violations. Validation rules act as guardrails that catch errors before they propagate through your analysis.

Handling Inconsistent Data

  • Standardize representations—"USA", "United States", "US", and "U.S.A." should map to a single canonical form using dictionaries or fuzzy matching
  • Establish data validation rules—ages should be positive, dates should be in valid ranges, categories should match allowed values
  • Regular auditing catches drift over time—data entry practices change, upstream systems evolve, and new inconsistencies emerge
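
A minimal sketch of dictionary-based standardization; the mapping is illustrative and would normally be built from the values actually observed in the column:

```python
import pandas as pd

df = pd.DataFrame({"country": ["USA", "United States", "U.S.A.", "us"]})

# Map known variants to one canonical form
canonical = {
    "usa": "United States",
    "u.s.": "United States",
    "u.s.a.": "United States",
    "us": "United States",
    "united states": "United States",
}
normalized = df["country"].str.strip().str.lower().map(canonical)
df["country"] = normalized.fillna(df["country"])  # keep unmapped values for review
```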

Data Validation and Integrity Checks

  • Schema validation ensures data conforms to expected types, ranges, and formats before processing begins
  • Referential integrity verifies that relationships between tables are valid—foreign keys should reference existing primary keys
  • Automated pipelines should include validation checkpoints that flag anomalies and halt processing when critical rules are violated
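
A minimal sketch of validation checkpoints using plain assertions; the rules, column names, and tables are illustrative (dedicated validation libraries exist for production pipelines):

```python
import pandas as pd

def validate(customers: pd.DataFrame, orders: pd.DataFrame) -> None:
    """Halt the pipeline if a critical rule is violated."""
    # Schema check: expected column with a numeric type
    assert pd.api.types.is_numeric_dtype(customers["age"]), "age must be numeric"
    # Range check: flag impossible values
    assert customers["age"].between(0, 120).all(), "age outside valid range"
    # Allowed-values check
    assert set(customers["status"]) <= {"active", "inactive"}, "unexpected status"
    # Referential integrity: every order must reference an existing customer
    assert orders["customer_id"].isin(customers["customer_id"]).all(), "orphaned order"

customers = pd.DataFrame({"customer_id": [1, 2], "age": [34, 29],
                          "status": ["active", "inactive"]})
orders = pd.DataFrame({"customer_id": [1, 2, 2], "amount": [10.0, 5.5, 7.0]})
validate(customers, orders)  # passes silently; raises AssertionError on violations
```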

Compare: Cleaning vs. validation—cleaning fixes known issues while validation catches unexpected ones. A robust pipeline does both: clean what you anticipate, validate to catch what you didn't.


Quick Reference Table

| Concept | Best Examples |
| --- | --- |
| Information loss tradeoffs | Missing value deletion, outlier removal, float-to-int conversion |
| Distribution assumptions | Z-score outlier detection, standardization, mean imputation |
| Encoding decisions | One-hot encoding, ordinal encoding, rare category grouping |
| Scale sensitivity | Standardization, min-max normalization, feature scaling for gradient descent |
| Format standardization | Date parsing, text case normalization, categorical value mapping |
| Data leakage prevention | Fit scalers on training only, validate before splitting |
| Robustness to outliers | IQR method, median imputation, winsorizing |

Self-Check Questions

  1. You have a feature with 15% missing values that appear to be missing not at random (patients with severe symptoms skipped certain survey questions). Why might mean imputation be problematic here, and what alternative would you consider?

  2. Compare Z-score standardization and min-max normalization: which would you choose for a dataset with significant outliers that you want to preserve, and why?

  3. A categorical feature has 500 unique values, with 450 of them appearing fewer than 10 times each. What problem does this create for one-hot encoding, and how would you address it?

  4. You're building a model and need to scale your features. Explain why you should fit your scaler on the training data only and transform both training and test sets with those parameters.

  5. Your dataset contains a "country" column with entries like "USA", "U.S.", "United States", and "united states". Describe the cleaning steps you would take and explain why this inconsistency matters for analysis.