
🎲Data, Inference, and Decisions

Data Cleaning Procedures to Know

Why This Matters

Data cleaning isn't just busywork before the "real" analysis begins—it's where you make critical decisions that directly affect your conclusions. Every choice you make about missing values, outliers, or inconsistent entries shapes the story your data tells. On exams, you're being tested on whether you understand why certain cleaning procedures exist and when to apply them, not just whether you can define them.

These procedures connect to core concepts like statistical validity, bias prevention, and reproducibility. Whether you're calculating confidence intervals, building regression models, or making inferences about populations, dirty data undermines everything downstream. Don't just memorize the techniques—know what problem each one solves and what can go wrong if you skip it or apply it incorrectly.


Handling Incomplete Information

Missing data isn't random noise—it often reflects systematic patterns that can bias your analysis if ignored. The mechanism behind missingness (MCAR, MAR, or MNAR) determines which solutions are statistically appropriate.

Handling Missing Data

  • Identify the missingness pattern first—random gaps require different treatment than systematic ones
  • Imputation methods (mean, median, regression, or multiple imputation) preserve sample size but can introduce bias if misapplied
  • Listwise deletion removes entire records and works only when data is missing completely at random (MCAR); both approaches are sketched after this list
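
A minimal sketch of these two options in Python with pandas, using a small hypothetical DataFrame (the age and income values are invented for illustration):

```python
import pandas as pd
import numpy as np

# Hypothetical survey data with missing income values
df = pd.DataFrame({
    "age": [25, 31, 47, 52, 60],
    "income": [42000, np.nan, 58000, np.nan, 61000],
})

# Option 1: median imputation -- preserves sample size (n = 5),
# but can bias estimates if the missingness is not random
imputed = df.copy()
imputed["income"] = imputed["income"].fillna(imputed["income"].median())

# Option 2: listwise deletion -- only defensible under MCAR (n drops to 3 here)
deleted = df.dropna(subset=["income"])

print(imputed)
print(deleted)
```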

Addressing Data Entry Errors

  • Validation checks at entry (range limits, format requirements, dropdown menus) prevent errors before they enter your dataset (a minimal check is sketched after this list)
  • Regular audits catch systematic errors that validation misses, especially in manually entered data
  • Human error patterns are often predictable—transposed digits, default values left unchanged, copy-paste mistakes
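
As a rough illustration, the function below mimics a range limit, a format requirement, and a dropdown-style controlled vocabulary. The field names and allowed values are hypothetical, not a standard API:

```python
import re

def validate_record(record):
    """Flag common entry errors before a record enters the dataset (illustrative rules)."""
    errors = []
    # Range limit: ages outside a plausible range suggest typos or transposed digits
    if not (0 <= record.get("age", -1) <= 120):
        errors.append("age out of range")
    # Format requirement: a US-style 5-digit zip code kept as a string
    if not re.fullmatch(r"\d{5}", str(record.get("zip", ""))):
        errors.append("zip code malformed")
    # Controlled vocabulary: mimics a dropdown menu (allowed set invented here)
    if record.get("state") not in {"NY", "CA", "TX"}:
        errors.append("state not in allowed list")
    return errors

print(validate_record({"age": 34, "zip": "02139", "state": "NY"}))        # []
print(validate_record({"age": 340, "zip": "2139", "state": "new york"}))  # three errors
```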

Compare: Missing data vs. data entry errors—both create gaps in your dataset, but missing data may be structurally meaningful while entry errors are simply mistakes. FRQs may ask you to distinguish between data that should be imputed versus data that should be corrected.


Detecting and Managing Anomalies

Outliers and duplicates can dramatically skew your statistics, but removing them without justification is just as problematic as ignoring them. The key principle: anomalies require investigation, not automatic deletion.

Dealing with Outliers

  • Detection methods include Z-scores (values beyond ±3) and IQR fencing (below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR); both rules are sketched after this list
  • Context determines treatment—a valid extreme value should be retained; a measurement error should be removed or corrected
  • Document your rationale because outlier decisions directly affect means, standard deviations, and regression coefficients
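
Here is a small sketch of both detection rules on made-up values. Note how a single extreme value can inflate the standard deviation enough that the Z-score rule misses it while the IQR fence catches it:

```python
import pandas as pd

values = pd.Series([12, 14, 15, 15, 16, 17, 18, 19, 20, 95])  # 95 is suspicious

# Z-score rule: flag observations more than 3 standard deviations from the mean
z = (values - values.mean()) / values.std()
z_outliers = values[z.abs() > 3]

# IQR fencing: flag observations below Q1 - 1.5*IQR or above Q3 + 1.5*IQR
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
iqr_outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

print(z_outliers)    # empty here: the outlier itself inflates the standard deviation
print(iqr_outliers)  # flags 95
```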

Removing Duplicates

  • Unique identifiers (ID numbers, timestamps, composite keys) are the most reliable way to detect exact duplicates (see the sketch after this list)
  • Fuzzy duplicates occur when the same entity appears with slight variations—these require judgment calls
  • Duplicate retention artificially inflates sample size and can create false precision in your estimates
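
A minimal sketch of exact-duplicate detection on a composite key, using invented patient records:

```python
import pandas as pd

df = pd.DataFrame({
    "patient_id": [101, 102, 102, 103],
    "visit_date": ["2024-01-05", "2024-01-06", "2024-01-06", "2024-01-07"],
    "weight_kg":  [70.2, 81.5, 81.5, 65.0],
})

# Exact duplicates on a composite key (patient_id + visit_date)
dupes = df[df.duplicated(subset=["patient_id", "visit_date"], keep=False)]

# Keep the first occurrence of each key; retaining both would inflate n
deduped = df.drop_duplicates(subset=["patient_id", "visit_date"], keep="first")

print(dupes)
print(deduped)
```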

Compare: Outliers vs. duplicates—outliers are extreme but potentially valid observations, while duplicates are redundant records that should never remain. If an FRQ asks about threats to statistical validity, both are fair game but for different reasons.


Ensuring Consistency and Compatibility

Analysis tools expect uniform data structures. Inconsistent formats, mixed data types, and unstandardized entries create errors that may be silent—your code runs, but your results are wrong. Standardization is about making data comparable across observations.

Standardizing Data Formats

  • Date formats are notorious culprits—"01/02/2024" means different things in different locales (a parsing sketch follows this list)
  • Numerical precision should match the measurement's actual accuracy (don't report 3.141592653 if you measured 3.1)
  • Unit consistency prevents catastrophic errors—the Mars Climate Orbiter crashed due to a metric/imperial unit mismatch
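
A brief sketch of declaring a date format explicitly and converting a column to one canonical unit; the column contents are hypothetical:

```python
import pandas as pd

# Ambiguous entries: "01/02/2024" is Jan 2 in US notation or Feb 1 in most of Europe
raw = pd.Series(["01/02/2024", "15/03/2024", "07/04/2024"])

# Declaring the expected format removes the guesswork; any unparsable row becomes NaT
dates = pd.to_datetime(raw, format="%d/%m/%Y", errors="coerce")
print(dates)

# Unit consistency: convert everything to a single unit before analysis
heights_in = pd.Series([70, 65, 72])   # inches
heights_cm = heights_in * 2.54         # one canonical unit for all rows
print(heights_cm)
```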

Correcting Inconsistent Data

  • Spelling variations ("New York," "NY," "N.Y.," "new york") fragment your categories and skew frequency counts
  • Reference tables or lookup dictionaries standardize entries to canonical forms (a minimal mapping is sketched after this list)
  • Regular expressions and string-matching algorithms can automate detection of near-duplicates
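
A minimal lookup-dictionary sketch: normalize case and whitespace first, then map known variants to one canonical form. The variant list is invented and would need to be built from your own data:

```python
import pandas as pd

cities = pd.Series(["New York", "NY", "N.Y.", "new york", "Boston"])

# Lookup dictionary mapping known variants to one canonical form
canonical = {
    "new york": "New York",
    "ny": "New York",
    "n.y.": "New York",
    "boston": "Boston",
}

# Normalize, then map; unknown variants fall back to the original value for review
cleaned = cities.str.strip().str.lower().map(canonical).fillna(cities)
print(cleaned.value_counts())  # one "New York" category instead of four fragments
```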

Handling Data Type Conversions

  • Type mismatches cause analysis failures—a zip code stored as a number loses leading zeros (see the sketch after this list)
  • String-to-numeric conversion can silently produce NaN values or truncate decimal places
  • Categorical vs. numeric treatment matters: a 5-point Likert scale isn't necessarily interval data
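
A short sketch of both conversion pitfalls, using invented zip codes and scores: zip codes are labels and belong in strings, and coerced numeric conversion should be followed by a check of how many values were lost:

```python
import pandas as pd

df = pd.DataFrame({"zip": [2139, 90210, 501], "score": ["3.5", "4.0", "n/a"]})

# Zip codes are labels, not quantities: store as zero-padded strings
df["zip"] = df["zip"].astype(str).str.zfill(5)   # 2139 -> "02139"

# String-to-numeric conversion: errors="coerce" turns "n/a" into NaN instead of
# raising, so count the NaNs to see what was silently dropped
df["score"] = pd.to_numeric(df["score"], errors="coerce")
print(df)
print(df["score"].isna().sum(), "value(s) silently became NaN")
```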

Compare: Format standardization vs. inconsistency correction—standardization addresses how data is recorded (dates, units), while inconsistency correction addresses what was recorded (variant spellings, naming conventions). Both prevent the same category from being counted multiple times.


Preparing Data for Modeling

Some cleaning procedures exist specifically to make data compatible with statistical and machine learning methods. These transformations change the data's scale or structure without changing its underlying information.

Normalizing and Scaling Data

  • Min-Max scaling transforms values to a [0, 1] range using (x - x_min) / (x_max - x_min)
  • Z-score normalization converts to standard units: (x - μ) / σ, centering at 0 with standard deviation 1 (both transforms are sketched after this list)
  • Algorithm sensitivity varies—distance-based methods (k-NN, clustering) require scaling; tree-based methods typically don't
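
A minimal sketch of both transforms on an invented column (note that pandas' std() uses the sample standard deviation by default, a slight difference from the population σ in the formula):

```python
import pandas as pd

x = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])

# Min-Max scaling: (x - min) / (max - min) maps values onto [0, 1]
min_max = (x - x.min()) / (x.max() - x.min())

# Z-score normalization: (x - mean) / std centers at 0 with unit spread
z_score = (x - x.mean()) / x.std()

print(min_max.tolist())            # [0.0, 0.25, 0.5, 0.75, 1.0]
print(z_score.round(2).tolist())
```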

Handling Categorical Variables

  • One-hot encoding creates binary columns for each category, avoiding false ordinal relationships (see the sketch after this list)
  • Label encoding assigns integers to categories—appropriate only when order is meaningful
  • High-cardinality variables (hundreds of categories) may need grouping or embedding techniques to remain tractable
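
A small sketch contrasting the two encodings on invented columns: color has no natural order, so it gets one-hot columns, while size is genuinely ordinal, so integer codes are defensible:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"],
                   "size":  ["S", "M", "L", "M"]})

# One-hot encoding: one binary column per category, no implied ordering
one_hot = pd.get_dummies(df["color"], prefix="color")

# Label encoding: integers are only appropriate when the order is meaningful,
# as it is for size (S < M < L)
size_order = {"S": 0, "M": 1, "L": 2}
df["size_code"] = df["size"].map(size_order)

print(one_hot)
print(df)
```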

Compare: Normalization vs. encoding—normalization adjusts the scale of continuous variables, while encoding converts categorical variables to numeric form. Both prepare data for algorithms, but they solve fundamentally different problems. Exam questions may ask which transformation is appropriate for which variable type.


Building Sustainable Data Quality

Data cleaning isn't a one-time task—it's an ongoing process that requires systems and accountability. Quality frameworks treat data as an organizational asset requiring continuous stewardship.

Addressing Data Quality Issues

  • Quality dimensions include accuracy, completeness, consistency, timeliness, and validity—each requires different monitoring approaches
  • Data profiling generates summary statistics that reveal quality issues before they affect analysis (a short profiling sketch follows this list)
  • Stakeholder engagement ensures that people closest to data collection understand quality standards and their importance
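
A quick profiling sketch on invented data: simple summaries surface both an implausible range (a maximum age of 470) and incomplete columns before any modeling begins:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age":    [25, 31, 470, 52, np.nan],      # 470 is an obvious entry error
    "income": [42000, np.nan, 58000, np.nan, 61000],
})

# Summary statistics reveal implausible ranges at a glance
print(df.describe())

# Completeness check: share of missing values per column
print(df.isna().mean().round(2))
```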

Compare: Reactive cleaning vs. proactive quality management—cleaning fixes problems after they occur, while quality frameworks prevent problems through validation, training, and monitoring. Strong FRQ responses demonstrate understanding of both approaches.


Quick Reference Table

Concept | Best Examples
Handling incomplete data | Missing data imputation, data entry error correction
Anomaly management | Outlier detection/treatment, duplicate removal
Format standardization | Date formats, unit consistency, numerical precision
Data consistency | Spelling corrections, naming conventions, reference tables
Type compatibility | Data type conversions, categorical encoding
Scale preparation | Min-Max scaling, Z-score normalization
Quality assurance | Data profiling, validation checks, quality frameworks

Self-Check Questions

  1. You discover that 15% of income values are missing, and the missingness correlates with age. Which imputation approach would introduce the least bias, and why?

  2. Compare Z-score detection and IQR fencing for identifying outliers. In what type of distribution would these methods give substantially different results?

  3. A dataset contains "United States," "USA," "U.S.," and "US" in the country field. Which cleaning procedure addresses this, and what's the risk if you skip it?

  4. When would you choose one-hot encoding over label encoding for a categorical variable? Give an example where using the wrong method would produce misleading results.

  5. An FRQ asks you to evaluate a researcher's data cleaning decisions. They removed all outliers and imputed missing values with column means. What potential problems should you identify in your response?