
🎲Data, Inference, and Decisions

Data Cleaning Procedures to Know

Why This Matters

Data cleaning isn't just busywork before the "real" analysis begins—it's where you make critical decisions that directly affect your conclusions. Every choice you make about missing values, outliers, or inconsistent entries shapes the story your data tells. On exams, you're being tested on whether you understand why certain cleaning procedures exist and when to apply them, not just whether you can define them.

These procedures connect to core concepts like statistical validity, bias prevention, and reproducibility. Whether you're calculating confidence intervals, building regression models, or making inferences about populations, dirty data undermines everything downstream. Don't just memorize the techniques—know what problem each one solves and what can go wrong if you skip it or apply it incorrectly.


Handling Incomplete Information

Missing data isn't random noise—it often reflects systematic patterns that can bias your analysis if ignored. The mechanism behind missingness (MCAR, MAR, or MNAR) determines which solutions are statistically appropriate.

Handling Missing Data

  • Identify the missingness pattern first—random gaps require different treatment than systematic ones
  • Imputation methods (mean, median, regression, or multiple imputation) preserve sample size but can introduce bias if misapplied
  • Listwise deletion removes entire records and works only when data is missing completely at random (MCAR); both approaches are sketched after this list
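
A minimal sketch of these two options in Python with pandas, using a small hypothetical DataFrame (the age and income values are invented for illustration):

```python
import pandas as pd
import numpy as np

# Hypothetical survey data with missing income values
df = pd.DataFrame({
    "age": [25, 31, 47, 52, 60],
    "income": [42000, np.nan, 58000, np.nan, 61000],
})

# Option 1: median imputation -- preserves sample size (n = 5),
# but can bias estimates if the missingness is not random
imputed = df.copy()
imputed["income"] = imputed["income"].fillna(imputed["income"].median())

# Option 2: listwise deletion -- only defensible under MCAR (n drops to 3 here)
deleted = df.dropna(subset=["income"])

print(imputed)
print(deleted)
```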

Addressing Data Entry Errors

  • Validation checks at entry (range limits, format requirements, dropdown menus) prevent errors before they enter your dataset (a minimal check is sketched after this list)
  • Regular audits catch systematic errors that validation misses, especially in manually entered data
  • Human error patterns are often predictable—transposed digits, default values left unchanged, copy-paste mistakes
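
As a rough illustration, the function below mimics a range limit, a format requirement, and a dropdown-style controlled vocabulary. The field names and allowed values are hypothetical, not a standard API:

```python
import re

def validate_record(record):
    """Flag common entry errors before a record enters the dataset (illustrative rules)."""
    errors = []
    # Range limit: ages outside a plausible range suggest typos or transposed digits
    if not (0 <= record.get("age", -1) <= 120):
        errors.append("age out of range")
    # Format requirement: a US-style 5-digit zip code kept as a string
    if not re.fullmatch(r"\d{5}", str(record.get("zip", ""))):
        errors.append("zip code malformed")
    # Controlled vocabulary: mimics a dropdown menu (allowed set invented here)
    if record.get("state") not in {"NY", "CA", "TX"}:
        errors.append("state not in allowed list")
    return errors

print(validate_record({"age": 34, "zip": "02139", "state": "NY"}))        # []
print(validate_record({"age": 340, "zip": "2139", "state": "new york"}))  # three errors
```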

Compare: Missing data vs. data entry errors—both create gaps in your dataset, but missing data may be structurally meaningful while entry errors are simply mistakes. FRQs may ask you to distinguish between data that should be imputed versus data that should be corrected.


Detecting and Managing Anomalies

Outliers and duplicates can dramatically skew your statistics, but removing them without justification is just as problematic as ignoring them. The key principle: anomalies require investigation, not automatic deletion.

Dealing with Outliers

  • Detection methods include Z-scores (values beyond ±3) and IQR fencing (below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR); both rules are sketched after this list
  • Context determines treatment—a valid extreme value should be retained; a measurement error should be removed or corrected
  • Document your rationale because outlier decisions directly affect means, standard deviations, and regression coefficients
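
Here is a small sketch of both detection rules on made-up values. Note how a single extreme value can inflate the standard deviation enough that the Z-score rule misses it while the IQR fence catches it:

```python
import pandas as pd

values = pd.Series([12, 14, 15, 15, 16, 17, 18, 19, 20, 95])  # 95 is suspicious

# Z-score rule: flag observations more than 3 standard deviations from the mean
z = (values - values.mean()) / values.std()
z_outliers = values[z.abs() > 3]

# IQR fencing: flag observations below Q1 - 1.5*IQR or above Q3 + 1.5*IQR
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
iqr_outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

print(z_outliers)    # empty here: the outlier itself inflates the standard deviation
print(iqr_outliers)  # flags 95
```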

Removing Duplicates

  • Unique identifiers (ID numbers, timestamps, composite keys) are the most reliable way to detect exact duplicates (see the sketch after this list)
  • Fuzzy duplicates occur when the same entity appears with slight variations—these require judgment calls
  • Duplicate retention artificially inflates sample size and can create false precision in your estimates
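
A minimal sketch of exact-duplicate detection on a composite key, using invented patient records:

```python
import pandas as pd

df = pd.DataFrame({
    "patient_id": [101, 102, 102, 103],
    "visit_date": ["2024-01-05", "2024-01-06", "2024-01-06", "2024-01-07"],
    "weight_kg":  [70.2, 81.5, 81.5, 65.0],
})

# Exact duplicates on a composite key (patient_id + visit_date)
dupes = df[df.duplicated(subset=["patient_id", "visit_date"], keep=False)]

# Keep the first occurrence of each key; retaining both would inflate n
deduped = df.drop_duplicates(subset=["patient_id", "visit_date"], keep="first")

print(dupes)
print(deduped)
```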

Compare: Outliers vs. duplicates—outliers are extreme but potentially valid observations, while duplicates are redundant records that should never remain. If an FRQ asks about threats to statistical validity, both are fair game but for different reasons.


Ensuring Consistency and Compatibility

Analysis tools expect uniform data structures. Inconsistent formats, mixed data types, and unstandardized entries create errors that may be silent—your code runs, but your results are wrong. Standardization is about making data comparable across observations.

Standardizing Data Formats

  • Date formats are notorious culprits—"01/02/2024" means different things in different locales (a parsing sketch follows this list)
  • Numerical precision should match the measurement's actual accuracy (don't report 3.141592653 if you measured 3.1)
  • Unit consistency prevents catastrophic errors—the Mars Climate Orbiter crashed due to a metric/imperial unit mismatch
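
A brief sketch of declaring a date format explicitly and converting a column to one canonical unit; the column contents are hypothetical:

```python
import pandas as pd

# Ambiguous entries: "01/02/2024" is Jan 2 in US notation or Feb 1 in most of Europe
raw = pd.Series(["01/02/2024", "15/03/2024", "07/04/2024"])

# Declaring the expected format removes the guesswork; any unparsable row becomes NaT
dates = pd.to_datetime(raw, format="%d/%m/%Y", errors="coerce")
print(dates)

# Unit consistency: convert everything to a single unit before analysis
heights_in = pd.Series([70, 65, 72])   # inches
heights_cm = heights_in * 2.54         # one canonical unit for all rows
print(heights_cm)
```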

Correcting Inconsistent Data

  • Spelling variations ("New York," "NY," "N.Y.," "new york") fragment your categories and skew frequency counts
  • Reference tables or lookup dictionaries standardize entries to canonical forms (a minimal mapping is sketched after this list)
  • Regular expressions and string-matching algorithms can automate detection of near-duplicates
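
A minimal lookup-dictionary sketch: normalize case and whitespace first, then map known variants to one canonical form. The variant list is invented and would need to be built from your own data:

```python
import pandas as pd

cities = pd.Series(["New York", "NY", "N.Y.", "new york", "Boston"])

# Lookup dictionary mapping known variants to one canonical form
canonical = {
    "new york": "New York",
    "ny": "New York",
    "n.y.": "New York",
    "boston": "Boston",
}

# Normalize, then map; unknown variants fall back to the original value for review
cleaned = cities.str.strip().str.lower().map(canonical).fillna(cities)
print(cleaned.value_counts())  # one "New York" category instead of four fragments
```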

Handling Data Type Conversions

  • Type mismatches cause analysis failures—a zip code stored as a number loses leading zeros (see the sketch after this list)
  • String-to-numeric conversion can silently produce NaN values or truncate decimal places
  • Categorical vs. numeric treatment matters: a 5-point Likert scale isn't necessarily interval data
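
A short sketch of both conversion pitfalls, using invented zip codes and scores: zip codes are labels and belong in strings, and coerced numeric conversion should be followed by a check of how many values were lost:

```python
import pandas as pd

df = pd.DataFrame({"zip": [2139, 90210, 501], "score": ["3.5", "4.0", "n/a"]})

# Zip codes are labels, not quantities: store as zero-padded strings
df["zip"] = df["zip"].astype(str).str.zfill(5)   # 2139 -> "02139"

# String-to-numeric conversion: errors="coerce" turns "n/a" into NaN instead of
# raising, so count the NaNs to see what was silently dropped
df["score"] = pd.to_numeric(df["score"], errors="coerce")
print(df)
print(df["score"].isna().sum(), "value(s) silently became NaN")
```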

Compare: Format standardization vs. inconsistency correction—standardization addresses how data is recorded (dates, units), while inconsistency correction addresses what was recorded (variant spellings, naming conventions). Both prevent the same category from being counted multiple times.


Preparing Data for Modeling

Some cleaning procedures exist specifically to make data compatible with statistical and machine learning methods. These transformations change the data's scale or structure without changing its underlying information.

Normalizing and Scaling Data

  • Min-Max scaling transforms values to a [0, 1] range using (x - x_min) / (x_max - x_min)
  • Z-score normalization converts to standard units: (x - μ) / σ, centering at 0 with standard deviation 1 (both transforms are sketched after this list)
  • Algorithm sensitivity varies—distance-based methods (k-NN, clustering) require scaling; tree-based methods typically don't
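
A minimal sketch of both transforms on an invented column (note that pandas' std() uses the sample standard deviation by default, a slight difference from the population σ in the formula):

```python
import pandas as pd

x = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])

# Min-Max scaling: (x - min) / (max - min) maps values onto [0, 1]
min_max = (x - x.min()) / (x.max() - x.min())

# Z-score normalization: (x - mean) / std centers at 0 with unit spread
z_score = (x - x.mean()) / x.std()

print(min_max.tolist())            # [0.0, 0.25, 0.5, 0.75, 1.0]
print(z_score.round(2).tolist())
```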

Handling Categorical Variables

  • One-hot encoding creates binary columns for each category, avoiding false ordinal relationships (see the sketch after this list)
  • Label encoding assigns integers to categories—appropriate only when order is meaningful
  • High-cardinality variables (hundreds of categories) may need grouping or embedding techniques to remain tractable
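
A small sketch contrasting the two encodings on invented columns: color has no natural order, so it gets one-hot columns, while size is genuinely ordinal, so integer codes are defensible:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"],
                   "size":  ["S", "M", "L", "M"]})

# One-hot encoding: one binary column per category, no implied ordering
one_hot = pd.get_dummies(df["color"], prefix="color")

# Label encoding: integers are only appropriate when the order is meaningful,
# as it is for size (S < M < L)
size_order = {"S": 0, "M": 1, "L": 2}
df["size_code"] = df["size"].map(size_order)

print(one_hot)
print(df)
```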

Compare: Normalization vs. encoding—normalization adjusts the scale of continuous variables, while encoding converts categorical variables to numeric form. Both prepare data for algorithms, but they solve fundamentally different problems. Exam questions may ask which transformation is appropriate for which variable type.


Building Sustainable Data Quality

Data cleaning isn't a one-time task—it's an ongoing process that requires systems and accountability. Quality frameworks treat data as an organizational asset requiring continuous stewardship.

Addressing Data Quality Issues

  • Quality dimensions include accuracy, completeness, consistency, timeliness, and validity—each requires different monitoring approaches
  • Data profiling generates summary statistics that reveal quality issues before they affect analysis (a short profiling sketch follows this list)
  • Stakeholder engagement ensures that people closest to data collection understand quality standards and their importance
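
A quick profiling sketch on invented data: simple summaries surface both an implausible range (a maximum age of 470) and incomplete columns before any modeling begins:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age":    [25, 31, 470, 52, np.nan],      # 470 is an obvious entry error
    "income": [42000, np.nan, 58000, np.nan, 61000],
})

# Summary statistics reveal implausible ranges at a glance
print(df.describe())

# Completeness check: share of missing values per column
print(df.isna().mean().round(2))
```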

Compare: Reactive cleaning vs. proactive quality management—cleaning fixes problems after they occur, while quality frameworks prevent problems through validation, training, and monitoring. Strong FRQ responses demonstrate understanding of both approaches.


Quick Reference Table

Concept | Best Examples
Handling incomplete data | Missing data imputation, data entry error correction
Anomaly management | Outlier detection/treatment, duplicate removal
Format standardization | Date formats, unit consistency, numerical precision
Data consistency | Spelling corrections, naming conventions, reference tables
Type compatibility | Data type conversions, categorical encoding
Scale preparation | Min-Max scaling, Z-score normalization
Quality assurance | Data profiling, validation checks, quality frameworks

Self-Check Questions

  1. You discover that 15% of income values are missing, and the missingness correlates with age. Which imputation approach would introduce the least bias, and why?

  2. Compare Z-score detection and IQR fencing for identifying outliers. In what type of distribution would these methods give substantially different results?

  3. A dataset contains "United States," "USA," "U.S.," and "US" in the country field. Which cleaning procedure addresses this, and what's the risk if you skip it?

  4. When would you choose one-hot encoding over label encoding for a categorical variable? Give an example where using the wrong method would produce misleading results.

  5. An FRQ asks you to evaluate a researcher's data cleaning decisions. They removed all outliers and imputed missing values with column means. What potential problems should you identify in your response?