Data Analysis for Simulation Models
Input analysis and model validation sit at the front end of any simulation project. Before you can trust a simulation's output, you need to make sure the inputs accurately reflect the real system and that the model behaves the way that system actually does. This topic covers how to collect and clean data, fit probability distributions, validate your model, and use sensitivity analysis to understand which inputs matter most.
Input Data Collection and Preprocessing
The quality of your simulation depends directly on the quality of your input data. Garbage in, garbage out.
Statistical techniques for analyzing input data:
- Descriptive statistics summarize your data's center, spread, and shape (mean, standard deviation, skewness)
- Time series analysis looks for trends, cycles, or seasonal patterns in data collected over time
- Probability distribution fitting matches your observed data to a theoretical distribution you can use in the model
Data cleaning and preprocessing catch problems before they corrupt your model:
- Remove duplicate entries
- Handle missing values (imputation, deletion, or flagging)
- Correct formatting inconsistencies (e.g., mixed date formats, unit mismatches)
Sampling methods help you collect representative data when you can't observe every event:
- Simple random sampling gives every data point an equal chance of being selected
- Stratified sampling divides the population into subgroups (strata) first, then samples from each. This is useful when subgroups behave differently, like day-shift vs. night-shift workers.
- Cluster sampling selects entire groups (clusters) rather than individuals, which can be more practical when data is naturally grouped by location or time period
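The stratified approach above can be sketched in a few lines. This is a minimal illustration with hypothetical shift data (the strata names, sizes, and service-time parameters are invented for the example), using proportional allocation so each stratum's share of the sample matches its share of the population:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical service-time observations tagged by shift (the strata)
day_shift = rng.normal(5.0, 1.0, size=600)    # day-shift service times (min)
night_shift = rng.normal(8.0, 2.0, size=400)  # night-shift service times (min)
strata = {"day": day_shift, "night": night_shift}

def stratified_sample(strata, n_total, rng):
    """Draw a proportionally allocated stratified random sample."""
    pop_size = sum(len(v) for v in strata.values())
    sample = {}
    for name, values in strata.items():
        n_stratum = round(n_total * len(values) / pop_size)
        sample[name] = rng.choice(values, size=n_stratum, replace=False)
    return sample

sample = stratified_sample(strata, n_total=100, rng=rng)
print({k: len(v) for k, v in sample.items()})  # proportional allocation: 60 day, 40 night
```

Other allocation rules (equal allocation, or Neyman allocation weighted by stratum variance) drop in by changing how `n_stratum` is computed.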
Data Analysis Techniques
Correlation analysis identifies relationships between input variables, which matters because correlated inputs need to be modeled together rather than independently.
- Pearson correlation coefficient measures the strength of linear relationships between two variables
- Spearman rank correlation measures monotonic relationships (one variable consistently increases as the other does, even if not in a straight line)
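A quick sketch of the difference, using `scipy.stats` on hypothetical temperature/energy data (the variables and the nonlinear relationship are invented, and noise is omitted for clarity): because the relationship is monotonic but not linear, Spearman reports a perfect rank correlation while Pearson reports something lower.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical inputs: energy consumption rises nonlinearly with temperature
temperature = rng.uniform(10, 35, size=200)
energy = np.exp(temperature / 8.0)  # monotonic but not linear (noise-free for clarity)

r_pearson, _ = stats.pearsonr(temperature, energy)    # linear association
r_spearman, _ = stats.spearmanr(temperature, energy)  # monotonic association

# Spearman is exactly 1 here; Pearson is noticeably below 1
print(f"Pearson: {r_pearson:.2f}, Spearman: {r_spearman:.2f}")
```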
Outlier detection protects your data integrity. A single extreme value can distort a fitted distribution.
- The z-score method flags data points that fall more than a set number of standard deviations from the mean (commonly 2 or 3)
- The interquartile range (IQR) method flags points below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR, where IQR = Q3 − Q1
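Both rules are easy to apply directly. Here is a minimal sketch on hypothetical service-time data with two injected extreme values (the dataset and thresholds are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical service times with two injected extreme values
data = np.concatenate([rng.normal(10, 2, size=100), [25.0, 30.0]])

# Z-score method: flag points more than 3 standard deviations from the mean
z = (data - data.mean()) / data.std()
z_outliers = data[np.abs(z) > 3]

# IQR method: flag points below Q1 - 1.5*IQR or above Q3 + 1.5*IQR
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
iqr_outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]

print(sorted(z_outliers), sorted(iqr_outliers))
```

Note that the IQR fences are tighter than a 3-sigma cutoff on roughly normal data, so the IQR method may flag a few legitimate tail points in addition to the true extremes.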
Data visualization helps you spot patterns and problems that summary statistics alone can miss:
- Histograms show the shape of a frequency distribution, helping you guess which theoretical distribution might fit
- Scatter plots reveal relationships between two variables (e.g., temperature vs. energy consumption)
- Box plots summarize the median, spread, and outliers in a dataset at a glance
Probability Distributions for Input Data

Distribution Fitting Process
Distribution fitting is the process of selecting a theoretical probability distribution that best represents your observed data. You need this because your simulation will generate random variates from these distributions during each run.
Common distributions in simulation modeling:
- Normal distribution models symmetric, bell-shaped data (e.g., measurement errors, human heights)
- Exponential distribution models the time between independent events (e.g., time between customer arrivals at a service counter)
- Poisson distribution models the count of events in a fixed interval (e.g., number of defects per batch in manufacturing)
- Weibull distribution is flexible for modeling failure times and reliability data (e.g., time until a machine breaks down). Its shape parameter lets it represent increasing, decreasing, or constant failure rates.
Parameter estimation determines the specific values that define your chosen distribution:
- Maximum Likelihood Estimation (MLE) finds parameter values that make the observed data most probable under the assumed distribution. It's the most widely used method.
- Method of Moments sets the sample moments (mean, variance, etc.) equal to the theoretical moments and solves for the parameters. It's simpler but generally less efficient than MLE.
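The two estimation methods can be compared side by side. This sketch fits a gamma distribution to hypothetical repair-time data (the data are synthetic, drawn from a known gamma so the estimates can be checked): the Method of Moments uses the gamma's moment relations mean = shape × scale and variance = shape × scale², while MLE is delegated to scipy's numerical fitter.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
# Hypothetical repair-time data, actually drawn from a gamma distribution
data = rng.gamma(shape=2.0, scale=3.0, size=2000)

# Method of Moments: match sample mean and variance to the gamma's moments
# mean = shape * scale, variance = shape * scale^2
m, v = data.mean(), data.var()
shape_mom = m**2 / v
scale_mom = v / m

# Maximum Likelihood Estimation via scipy (location fixed at 0)
shape_mle, loc, scale_mle = stats.gamma.fit(data, floc=0)

print(f"MoM:  shape={shape_mom:.2f}, scale={scale_mom:.2f}")
print(f"MLE:  shape={shape_mle:.2f}, scale={scale_mle:.2f}")
```

With 2,000 observations both methods recover the true parameters (shape 2, scale 3) closely; with small samples the MLE estimates are typically the more efficient of the two.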
Goodness-of-Fit Assessment
After fitting a distribution, you need to test whether the fit is actually good enough. Don't just assume it works.
Formal statistical tests:
- Chi-square test bins the data and compares observed vs. expected frequencies in each bin
- Kolmogorov-Smirnov (K-S) test measures the maximum vertical distance between the empirical CDF and the theoretical CDF. It works on continuous data without binning.
- Anderson-Darling test is similar to K-S but gives more weight to the tails of the distribution, which is useful when tail behavior matters for your simulation
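Both tests are available in `scipy.stats`. A minimal sketch on hypothetical interarrival-time data (synthetic, so the fit should pass): one caveat worth knowing is that the standard K-S p-value is optimistic when the distribution's parameters were estimated from the same data being tested.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Hypothetical interarrival times drawn from an exponential distribution
data = rng.exponential(scale=4.0, size=500)

# Fit an exponential distribution (location fixed at 0), then test the fit
loc, scale = stats.expon.fit(data, floc=0)

# Kolmogorov-Smirnov test: max distance between empirical and fitted CDF
ks_stat, ks_p = stats.kstest(data, stats.expon(loc=loc, scale=scale).cdf)

# Anderson-Darling test (scipy supports 'expon' directly and estimates
# the parameters itself; compare the statistic to the critical values)
ad_result = stats.anderson(data, dist="expon")

print(f"K-S statistic={ks_stat:.3f}, p-value={ks_p:.3f}")
print(f"A-D statistic={ad_result.statistic:.3f}")
```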
Visual methods complement the formal tests and often reveal problems the tests miss:
- Q-Q plots plot sample quantiles against theoretical quantiles. If the fit is good, points fall close to a straight diagonal line. Deviations at the ends indicate poor tail fit.
- P-P plots compare cumulative probabilities and are more sensitive to differences in the middle of the distribution
- ECDF plots overlay the empirical CDF on the theoretical CDF so you can visually inspect the gap
Model selection criteria help you choose between competing distributions that all pass goodness-of-fit tests:
- Akaike Information Criterion (AIC) balances model fit against complexity. Lower AIC is better.
- Bayesian Information Criterion (BIC) also balances fit and complexity but penalizes extra parameters more heavily than AIC, favoring simpler models.
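Computing AIC and BIC from a fitted scipy distribution takes only the log-likelihood and the parameter count. This sketch compares a Weibull fit against an exponential fit on hypothetical failure-time data (synthetic, Weibull-distributed, so the Weibull should win):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
# Hypothetical failure times, actually Weibull-distributed (shape 1.5)
data = rng.weibull(1.5, size=1000) * 10.0

def aic_bic(dist, data, **fit_kwargs):
    """Fit a scipy distribution and return (AIC, BIC)."""
    params = dist.fit(data, **fit_kwargs)
    loglik = np.sum(dist.logpdf(data, *params))
    # len(params) counts the fixed location too; the extra +1 cancels
    # in comparisons where both candidates fix loc the same way
    k = len(params)
    n = len(data)
    return 2 * k - 2 * loglik, k * np.log(n) - 2 * loglik

aic_weib, bic_weib = aic_bic(stats.weibull_min, data, floc=0)
aic_expon, bic_expon = aic_bic(stats.expon, data, floc=0)

# The Weibull should score lower (better) since the data's shape parameter != 1
print(f"Weibull:     AIC={aic_weib:.1f}, BIC={bic_weib:.1f}")
print(f"Exponential: AIC={aic_expon:.1f}, BIC={bic_expon:.1f}")
```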
Software tools that handle distribution fitting include Arena's Input Analyzer, @RISK, and R's fitdistrplus package.
Model Validation and Comparison

Validation Techniques
Model validation answers a critical question: does your simulation accurately represent the real-world system? A model that runs without errors isn't necessarily a valid model.
Face validation is the simplest starting point. Subject matter experts review the model's logic, assumptions, structure, inputs, and outputs to check whether they seem reasonable. This catches obvious errors early.
Statistical validation uses formal comparisons between simulation output and real-world data:
- Hypothesis tests (t-tests, ANOVA) check whether the difference between simulated and real output is statistically significant
- Confidence intervals estimate the range within which the true parameter value likely falls, giving you a sense of how precise your model's predictions are
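Both checks are a few lines with `scipy.stats`. This sketch compares hypothetical real observations against simulation replications of a daily throughput metric (all data are invented for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
# Hypothetical daily throughput: real observations vs. simulation replications
real = rng.normal(100.0, 5.0, size=30)
simulated = rng.normal(101.0, 5.0, size=30)

# Two-sample t-test: is the difference in means statistically significant?
t_stat, p_value = stats.ttest_ind(real, simulated)

# 95% confidence interval for the mean of the simulated output
mean = simulated.mean()
sem = stats.sem(simulated)  # standard error of the mean
ci_low, ci_high = stats.t.interval(0.95, df=len(simulated) - 1, loc=mean, scale=sem)

print(f"t={t_stat:.2f}, p={p_value:.3f}")
print(f"95% CI for simulated mean: ({ci_low:.1f}, {ci_high:.1f})")
```

A large p-value here is evidence of consistency, not proof of validity; pair the test with the confidence interval to judge whether the model's precision is adequate for its purpose.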
Historical data validation tests predictive accuracy by running the model against past data the model wasn't built on. Compare predictions to known outcomes across different time periods to see if the model generalizes well.
Operational Validation and Visualization
Operational validation tests whether the model performs well enough for its intended purpose. This goes beyond statistical accuracy.
- Run pilot simulations to verify the model behaves as expected under normal and extreme conditions
- Check whether the model meets specific performance criteria defined by stakeholders
Animation and visualization are surprisingly effective validation tools. Animating an assembly line simulation, for example, lets you visually spot illogical behavior (parts moving backward, queues forming in impossible places). Dynamic charts tracking metrics like queue lengths or resource utilization over time can reveal transient problems that summary statistics would hide.
Sensitivity analysis also serves a validation role here. If the model's output changes dramatically in response to a small change in an input you know is stable in reality, that's a red flag worth investigating.
Sensitivity Analysis for Input Parameters
Local and Global Sensitivity Analysis
Sensitivity analysis tells you which input parameters have the biggest effect on your model's output. This helps you focus data collection efforts on the inputs that actually matter.
Local sensitivity analysis examines one parameter at a time:
- Hold all parameters at their baseline values
- Vary a single parameter by a small amount
- Measure how much the output changes
- Repeat for each parameter
This approach is straightforward, but it only captures the effect of each parameter around one specific point and can miss interactions between parameters.
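The steps above can be sketched with a toy model standing in for the simulation (the utilization formula, parameter names, and baseline values are all hypothetical). Each sensitivity here is a normalized central-difference estimate, i.e. the relative change in output per relative change in input:

```python
import numpy as np

def model(arrival_rate, service_rate, servers):
    """Toy stand-in for a simulation: utilization of a multi-server queue."""
    return arrival_rate / (servers * service_rate)

baseline = {"arrival_rate": 8.0, "service_rate": 3.0, "servers": 4}
delta = 0.05  # perturb each parameter by +/-5%

sensitivities = {}
for name in baseline:
    lo = dict(baseline); lo[name] = baseline[name] * (1 - delta)
    hi = dict(baseline); hi[name] = baseline[name] * (1 + delta)
    # central-difference estimate, normalized by the baseline output
    sensitivities[name] = (model(**hi) - model(**lo)) / (2 * delta * model(**baseline))

for name, s in sorted(sensitivities.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name:>12}: {s:+.2f}")
```

For this model the arrival-rate sensitivity is exactly +1 (the output is linear in it), while service rate and server count come out near −1; a real simulation would replace `model` with averaged replications to tame output noise.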
Global sensitivity analysis varies all parameters simultaneously across their full ranges, capturing interactions that local methods miss. Key methods include:
- Sobol indices decompose the total output variance into contributions from each parameter (first-order effects) and from parameter interactions (higher-order effects). The total-effect index for a parameter includes both its direct effect and all its interactions.
- FAST (Fourier Amplitude Sensitivity Test) uses frequency-based decomposition to achieve similar results, often with fewer simulation runs than Sobol.
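A first-order Sobol index can be estimated with the classic pick-freeze (Saltelli-style) scheme: evaluate the model on one sample matrix, then again on a second matrix that shares one column with the first, and take the covariance. This sketch uses a toy additive model with known indices (0.9 and 0.1), so the estimator can be checked; libraries such as SALib package this machinery for real studies.

```python
import numpy as np

def model(x):
    """Toy model: Y = 3*X1 + X2; analytic first-order Sobol indices are 0.9 and 0.1."""
    return 3.0 * x[:, 0] + x[:, 1]

rng = np.random.default_rng(21)
n, d = 100_000, 2

# Two independent sample matrices (the "pick-freeze" scheme)
A = rng.uniform(0, 1, size=(n, d))
B = rng.uniform(0, 1, size=(n, d))

yA = model(A)
var_y = yA.var()

S = np.empty(d)
for i in range(d):
    # AB_i: matrix B with column i replaced by column i of A
    AB = B.copy()
    AB[:, i] = A[:, i]
    yAB = model(AB)
    # First-order index: Cov(Y_A, Y_AB_i) / Var(Y)
    S[i] = (np.mean(yA * yAB) - yA.mean() * yAB.mean()) / var_y

print(np.round(S, 2))  # close to the analytic values [0.9, 0.1]
```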
Sensitivity Analysis Techniques
One-at-a-time (OAT) analysis is the simplest approach. You vary each parameter individually while holding everything else constant. It's a good starting point for screening, but it won't detect interactions between parameters.
Morris method (elementary effects) is a more efficient screening technique:
- Sample a set of randomized one-at-a-time trajectories through the full parameter space
- Compute the "elementary effect" of each parameter (the output change per unit step) along each trajectory
- Parameters with a high mean elementary effect are influential; parameters with a high standard deviation have effects that depend on the values of other parameters (indicating interactions)
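A minimal version of the Morris screening loop, on a toy model with a built-in interaction (the model and step size are illustrative). The purely additive parameter x1 shows a constant elementary effect (zero standard deviation), while the interacting parameters show effects that vary with the trajectory, which is exactly the interaction signal Morris looks for:

```python
import numpy as np

def model(x):
    """Toy model with an interaction: Y = x0 + 2*x1 + 5*x0*x2."""
    return x[0] + 2.0 * x[1] + 5.0 * x[0] * x[2]

rng = np.random.default_rng(13)
d, r, delta = 3, 200, 0.1  # 3 parameters, 200 trajectories, step size 0.1

effects = [[] for _ in range(d)]
for _ in range(r):
    x = rng.uniform(0, 1 - delta, size=d)  # random base point (steps stay in [0,1])
    for i in rng.permutation(d):           # one-at-a-time steps in random order
        x_step = x.copy()
        x_step[i] += delta
        effects[i].append((model(x_step) - model(x)) / delta)
        x = x_step                         # continue the trajectory from the new point

for i in range(d):
    ee = np.array(effects[i])
    print(f"x{i}: mean|EE|={np.abs(ee).mean():.2f}, std={ee.std():.2f}")
```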
Metamodeling techniques build simplified surrogate models when running the full simulation is too expensive:
- Response surface methodology fits polynomial equations (often quadratic) to simulation output, letting you explore the input-output relationship quickly
- Gaussian process emulation (kriging) builds a statistical surrogate that interpolates between simulation runs and provides uncertainty estimates
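A response surface can be fitted with ordinary least squares. This sketch uses a cheap deterministic function as a stand-in for an expensive simulation (the function and its coefficients are invented so the fit can be verified); in practice `y` would come from a modest number of real simulation runs.

```python
import numpy as np

def expensive_simulation(x1, x2):
    """Toy stand-in for a costly simulation run."""
    return 2.0 + 1.5 * x1 - 0.8 * x2 + 0.5 * x1 * x2 + 0.3 * x1**2

rng = np.random.default_rng(9)
x1 = rng.uniform(-1, 1, size=50)
x2 = rng.uniform(-1, 1, size=50)
y = expensive_simulation(x1, x2)  # 50 "simulation runs"

# Quadratic response surface: y ~ b0 + b1*x1 + b2*x2 + b3*x1*x2 + b4*x1^2 + b5*x2^2
X = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2, x1**2, x2**2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

def surrogate(x1, x2):
    """Cheap surrogate: predicts new points without re-running the simulation."""
    return coef @ np.array([1.0, x1, x2, x1 * x2, x1**2, x2**2])

print(np.round(coef, 2))  # recovers the underlying coefficients (x2^2 term ~ 0)
```

Once fitted, the surrogate can be evaluated thousands of times for sensitivity analysis or optimization at negligible cost, which is the whole point of metamodeling.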
Graphical tools make results easier to interpret:
- Tornado diagrams rank parameters by their impact on the output, with the most influential parameter at the top. Each bar shows the output range when that parameter varies across its range.
- Spider plots show how the output changes as each parameter moves across its range, with all curves plotted on the same axes for comparison