Fiveable

🏭Intro to Industrial Engineering Unit 15 Review


15.1 Data Collection and Preprocessing

Written by the Fiveable Content Team • Last updated August 2025

Data collection and preprocessing form the backbone of any data-driven decision in industrial engineering. Before you can optimize a production line, forecast demand, or improve quality, you need data that's accurate, consistent, and properly formatted. This unit covers how industrial engineers gather data, clean it up, combine it from different sources, and watch out for biases that can quietly wreck an analysis.

Data Sources in Industrial Engineering

Primary and Secondary Data Sources

Primary data is information you collect yourself, directly from the process or environment you're studying. Secondary data comes from sources that already exist, collected by someone else for a different purpose.

Common primary sources in industrial engineering:

  • Production logs and quality control reports
  • Sensor data from equipment on the factory floor
  • Employee time tracking systems
  • Supply chain management databases
  • Direct observation, interviews, surveys, and designed experiments

Two classic primary collection methods deserve special attention:

  • Time study involves directly measuring how long specific tasks take, then using those measurements to set standard times for operations.
  • Work sampling takes random observations throughout the day to estimate what proportion of time workers spend on various activities. It's less precise for individual tasks but covers a broader picture with less effort.
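A work sampling study boils down to estimating a proportion from random observations. The sketch below, using hypothetical observation labels and the standard proportion sample-size formula $n = z^2 p(1-p)/e^2$, shows both the estimate and how many observations a full study would need:

```python
import math

def work_sampling_estimate(observations, activity):
    """Estimate the proportion of time spent on an activity
    from random work-sampling observations."""
    hits = sum(1 for obs in observations if obs == activity)
    return hits / len(observations)

def required_observations(p, error=0.05, z=1.96):
    """Observations needed so the proportion estimate is within
    +/- error at roughly 95% confidence (z = 1.96)."""
    return math.ceil(z**2 * p * (1 - p) / error**2)

# Hypothetical pilot sample of 10 random observations
obs = ["machining", "idle", "machining", "setup", "machining",
       "idle", "machining", "machining", "setup", "machining"]
p_hat = work_sampling_estimate(obs, "machining")  # 6/10 = 0.6
n = required_observations(p_hat)  # sample size for the full study
```

A small pilot like this is typically used to get a rough p, which then sets the sample size for the real study.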

Common secondary sources include industry reports, government databases (like Bureau of Labor Statistics data), academic research, and historical company records.

Automated Data Collection Systems

Modern industrial facilities rely heavily on automated systems for real-time data gathering:

  • RFID tags track inventory movement and asset locations without requiring line-of-sight scanning.
  • Barcode scanners capture product information and transaction data quickly at specific checkpoints.
  • IoT devices continuously monitor equipment performance, environmental conditions (temperature, humidity, vibration), and energy usage.

These automated systems improve accuracy, reduce human error, and enable continuous monitoring rather than periodic snapshots. They also generate large volumes of data, which is where big data analytics comes in. Industrial engineers increasingly pull from diverse sources like customer feedback, social media sentiment, and market trend data to inform strategic decisions alongside traditional operational data.

Data Source Selection Factors

Choosing the right data source depends on several considerations:

  • Research question: What exactly are you trying to answer? This narrows your options fast.
  • Available resources and time constraints: Primary data collection is more tailored but costs more time and money than pulling existing secondary data.
  • Data quality and reliability requirements: A rough estimate might be fine for initial scoping, but a process capability study demands high-precision measurements.
  • Ethical and privacy considerations: Collecting employee performance data, for instance, requires careful attention to privacy regulations and company policy.
  • Scalability and integration: Can the collection system grow with your needs and feed into existing databases?

Data Cleaning and Preprocessing

Raw data almost always has problems. Data cleaning is the process of identifying and fixing errors, inconsistencies, and gaps before you run any analysis.

Error Identification and Correction

Errors generally fall into two categories:

  • Syntax errors: Incorrect formatting, invalid characters, or data entered in the wrong field (e.g., a date stored as "13/32/2024").
  • Semantic errors: Values that are technically valid but don't make sense in context (e.g., a machine cycle time of -5 seconds, or a temperature reading of 900°F for a room thermostat).

Handling missing data is one of the most common preprocessing tasks. Three main approaches:

  1. Mean or median imputation: Replace missing values with the average or median of that variable. Simple and fast, but it reduces variability in your dataset.
  2. Multiple imputation: Generate several plausible replacement values based on statistical models, then combine results. More sophisticated and preserves uncertainty.
  3. Listwise deletion: Remove entire records that have missing values. Only works well if the missing data is a small fraction of your dataset and is missing randomly.
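Approaches 1 and 3 can be sketched in a few lines (multiple imputation needs a statistical model, so it is omitted here). The cycle-time values below are hypothetical, with `None` standing in for missing readings:

```python
from statistics import mean, median

readings = [12.1, None, 11.8, 12.4, None, 12.0, 35.0]  # hypothetical cycle times

# 1. Mean or median imputation: fill gaps with a central value
present = [x for x in readings if x is not None]
mean_imputed = [x if x is not None else round(mean(present), 2) for x in readings]
median_imputed = [x if x is not None else median(present) for x in readings]

# 3. Listwise deletion: drop records with missing values entirely
deleted = [x for x in readings if x is not None]
```

Note how the outlier (35.0) pulls the mean-imputed value well above typical readings, while the median is robust to it; that is one reason median imputation is often preferred for skewed process data.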

Outlier detection catches anomalous data points that could distort your results. Common methods include:

  • Z-score method: Flag points more than 2 or 3 standard deviations from the mean.
  • Interquartile range (IQR) method: Flag points that fall below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR.
  • Machine learning approaches: Techniques like clustering or isolation forests can catch outliers in more complex, multi-variable datasets.

Not every outlier is an error. A sudden spike in defect rate might be a real event worth investigating, not a data mistake. Always check before deleting.
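The z-score and IQR methods above can both be applied with the standard library. This sketch uses hypothetical cycle-time data with one obvious anomaly, and flags z-scores beyond 2 standard deviations:

```python
from statistics import mean, stdev, quantiles

data = [10.2, 10.5, 9.8, 10.1, 10.4, 9.9, 10.3, 25.0]  # hypothetical cycle times

# Z-score method: flag points more than 2 standard deviations from the mean
m, s = mean(data), stdev(data)
z_outliers = [x for x in data if abs(x - m) / s > 2]

# IQR method: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, _, q3 = quantiles(data, n=4)
iqr = q3 - q1
iqr_outliers = [x for x in data if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]
```

Both methods flag 25.0 here, but they can disagree on borderline points: the z-score method is itself distorted by extreme values (which inflate the standard deviation), while the IQR method is more robust for small, skewed samples.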

Data Standardization and Formatting

When your variables are measured on different scales, you need to bring them to a common footing before analysis:

  • Min-max scaling transforms values to a fixed range, typically 0 to 1, using x_scaled = (x - x_min) / (x_max - x_min).
  • Z-score standardization centers data around a mean of 0 with a standard deviation of 1, using z = (x - μ) / σ.
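Both formulas translate directly into code. The temperature readings below are hypothetical; note that z-score standardization here uses the population standard deviation, matching the σ in the formula:

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """Transform values to the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

def z_standardize(values):
    """Center values at mean 0 with (population) standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(x - mu) / sigma for x in values]

temps = [60, 70, 80, 90, 100]      # hypothetical sensor readings
scaled = min_max_scale(temps)       # [0.0, 0.25, 0.5, 0.75, 1.0]
standardized = z_standardize(temps)
```

Min-max scaling preserves the shape of the distribution but is sensitive to outliers (one extreme value compresses everything else toward 0); z-score standardization handles that better but produces unbounded values.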

Other formatting tasks include:

  • Data type conversion: Turning categorical text values into numerical codes (e.g., encoding "Pass/Fail" as 1/0) so they can be used in quantitative models.
  • Date standardization: Making sure all timestamps follow the same format across datasets.
  • Deduplication: Removing redundant records. Exact matching catches identical entries, while fuzzy matching catches near-duplicates (e.g., "Acme Corp" vs. "Acme Corporation").
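Exact deduplication is a one-liner; fuzzy matching in practice uses string-similarity metrics, but even a crude normalization key (an assumption here, lowercasing and stripping common suffixes) shows the idea with the "Acme Corp" example:

```python
def normalize(name):
    """Crude normalization key for fuzzy matching (an assumption,
    not a full similarity algorithm): lowercase, strip common suffixes."""
    name = name.lower().strip()
    for suffix in (" corporation", " corp", " inc"):
        name = name.replace(suffix, "")
    return name

records = ["Acme Corp", "Acme Corporation", "Beta Inc", "Acme Corp"]

# Exact deduplication: keep the first occurrence of identical entries
exact = list(dict.fromkeys(records))

# Fuzzy deduplication: collapse near-duplicates sharing a normalized key
fuzzy = list({normalize(r): r for r in records}.values())
```

Exact matching still leaves "Acme Corp" and "Acme Corporation" as two records; the normalized key collapses them into one.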

Quality Assurance and Validation

Before moving forward with clean data, validate it:

  • Range checks ensure values fall within expected limits (e.g., a machine's operating temperature should be between specified bounds).
  • Cross-field validation verifies logical relationships between variables (e.g., a shipment's arrival date should never be before its dispatch date).
  • Data profiling summarizes the characteristics of your dataset, such as distributions, missing value counts, and unique value counts, to spot potential issues at a glance.
  • Data auditing verifies accuracy and completeness against known benchmarks or source records.
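Range checks and cross-field validation are easy to encode as rules that return a list of failures per record. The field names and temperature bounds below are hypothetical:

```python
from datetime import date

def validate_record(rec):
    """Return a list of validation failures for one shipment record."""
    errors = []
    # Range check: temperature must fall within specified bounds
    if not (50 <= rec["temp_f"] <= 150):
        errors.append("temp_f out of range")
    # Cross-field check: arrival must not precede dispatch
    if rec["arrival"] < rec["dispatch"]:
        errors.append("arrival before dispatch")
    return errors

good = {"temp_f": 72, "dispatch": date(2024, 3, 1), "arrival": date(2024, 3, 4)}
bad = {"temp_f": 900, "dispatch": date(2024, 3, 4), "arrival": date(2024, 3, 1)}
```

Returning all failures per record, rather than stopping at the first, makes the validation report far more useful for auditing.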

Document every cleaning step you take. This makes your work reproducible and lets others understand what transformations were applied.

Data Integration and Transformation

Data Integration Techniques

Industrial engineers rarely work with a single data source. Combining data from multiple systems enables more comprehensive analysis. For example:

  • Merging production data with quality control reports lets you trace defects back to specific process conditions.
  • Integrating supply chain data with sales records supports more accurate demand forecasting.

The integration process typically involves:

  1. Data mapping: Aligning fields from different sources so they correspond correctly (e.g., making sure "Part_ID" in one system matches "Component_Number" in another).
  2. Entity resolution: Identifying and linking records that refer to the same real-world entity across datasets, even when identifiers differ.
  3. Schema integration: Creating a unified structure that accommodates the combined data without losing information from either source.
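Steps 1 and 2 can be sketched with the "Part_ID" / "Component_Number" example from above. The record contents are hypothetical; the point is mapping the mismatched key fields and then joining on the shared key:

```python
# Production system uses "Part_ID"; quality system uses "Component_Number".
production = [
    {"Part_ID": "A-100", "units": 480},
    {"Part_ID": "A-200", "units": 350},
]
quality = [
    {"Component_Number": "A-100", "defects": 3},
    {"Component_Number": "A-200", "defects": 12},
]

# Data mapping + entity resolution: index quality data by the shared key
quality_by_part = {q["Component_Number"]: q["defects"] for q in quality}

# Schema integration: a unified record keeping fields from both sources
merged = [
    {**p, "defects": quality_by_part.get(p["Part_ID"], 0)}
    for p in production
]
```

Defaulting missing matches to 0 is itself a modeling choice; a left join that keeps a missing-value marker instead is often safer, since "no defect record" and "zero defects" are not the same thing.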

Data Transformation Methods

Raw data often needs to be reshaped before it's useful for analysis:

  • Aggregation: Rolling up granular data to a higher level, like converting hourly production counts into daily or weekly totals.
  • KPI derivation: Calculating key performance indicators from raw data, such as overall equipment effectiveness (OEE) from availability, performance, and quality metrics.
  • Feature engineering: Creating new variables that capture domain knowledge. For example, deriving cycle time by subtracting a start timestamp from an end timestamp, or creating a ratio of defects per unit produced.
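KPI derivation is concrete with the OEE example: OEE is the product of availability, performance, and quality. The shift numbers below are hypothetical:

```python
def oee(planned_min, run_min, ideal_cycle_min, total_units, good_units):
    """Overall equipment effectiveness from its three standard factors."""
    availability = run_min / planned_min                   # uptime fraction
    performance = (ideal_cycle_min * total_units) / run_min  # speed vs. ideal
    quality = good_units / total_units                     # first-pass yield
    return availability * performance * quality

# Hypothetical shift: 480 planned minutes, 420 running,
# 1.0 min ideal cycle time, 378 units made, 360 of them good
score = oee(planned_min=480, run_min=420, ideal_cycle_min=1.0,
            total_units=378, good_units=360)
```

Here availability is 0.875, performance 0.9, and quality about 0.952, giving an OEE of 0.75. Decomposing the score this way tells you which of the three factors to attack first.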

Transformation also applies to unstructured data:

  • Text mining can extract useful patterns from maintenance logs or incident reports.
  • Image processing can analyze quality control photographs for defect detection.

Benefits of Integration and Transformation

  • A single source of truth reduces conflicting information across departments. When production, quality, and logistics teams all reference the same integrated dataset, decisions are more consistent.
  • Integrated data is more accessible and usable, cutting down the time engineers spend hunting for and reconciling information.
  • Cross-referencing different sources can reveal data quality issues that aren't visible in any single dataset alone, like discrepancies between what a production system recorded and what a quality system logged.

Biases and Errors in Data Collection

Bias in data is sneaky. It can lead you to confident but wrong conclusions. Understanding common biases helps you design better data collection and interpret results more carefully.

Types of Biases in Industrial Data

  • Selection bias happens when your sample doesn't represent the full population. If you only study high-performing production lines to benchmark efficiency, you'll miss the factors causing problems on other lines. If you only survey day-shift workers about satisfaction, you're ignoring a whole segment of the workforce.
  • Measurement bias comes from systematic errors in how data is collected. Uncalibrated sensors give consistently inaccurate readings. Different operators using slightly different measurement techniques introduce inconsistency.
  • Reporting bias occurs when information is selectively shared or withheld. Near-miss safety incidents are commonly underreported. Self-reported productivity logs tend to be optimistic.

Temporal and Cognitive Biases

  • Survivorship bias means you're only looking at what "survived." Analyzing only successful product launches while ignoring discontinued ones gives a distorted picture of what drives success. Studying only long-standing suppliers ignores the ones that failed.
  • Temporal bias results from not accounting for time-dependent variation. Collecting maintenance data only during day shifts misses problems that occur at night. Ignoring seasonal demand fluctuations leads to inaccurate sales analysis.
  • Confirmation bias affects data interpretation. Analysts may unconsciously focus on evidence supporting a preferred hypothesis while dismissing contradictory findings. This is especially dangerous in process improvement studies where there's pressure to justify a particular approach.

Impact on Decision Making

Biased data leads directly to flawed decisions:

  • Misallocating maintenance resources because equipment failure data is incomplete
  • Managing inventory poorly because demand forecasts are based on skewed samples
  • Missing process improvement opportunities because time study data only captured certain conditions

To mitigate these risks:

  1. Design robust collection protocols with clear procedures that minimize subjective judgment.
  2. Use multiple data sources and methods (triangulation) so that bias in one source can be caught by another.
  3. Conduct sensitivity analyses to test how much your conclusions would change if the data contained certain biases.
  4. Train data collectors on consistent methods and the importance of complete, honest reporting.