Data Analysis for Simulation Models
Input analysis and model validation sit at the front end of any simulation project. Before you can trust a simulation's output, you need to make sure the inputs accurately reflect the real system and that the model behaves the way that system actually does. This topic covers how to collect and clean data, fit probability distributions, validate your model, and use sensitivity analysis to understand which inputs matter most.
Input Data Collection and Preprocessing
The quality of your simulation depends directly on the quality of your input data. Garbage in, garbage out.
Statistical techniques for analyzing input data:
- Descriptive statistics summarize your data's center, spread, and shape (mean, standard deviation, skewness)
- Time series analysis looks for trends, cycles, or seasonal patterns in data collected over time
- Probability distribution fitting matches your observed data to a theoretical distribution you can use in the model
Data cleaning and preprocessing catch problems before they corrupt your model:
- Remove duplicate entries
- Handle missing values (imputation, deletion, or flagging)
- Correct formatting inconsistencies (e.g., mixed date formats, unit mismatches)
Sampling methods help you collect representative data when you can't observe every event:
- Simple random sampling gives every data point an equal chance of being selected
- Stratified sampling divides the population into subgroups (strata) first, then samples from each. This is useful when subgroups behave differently, like day-shift vs. night-shift workers.
- Cluster sampling selects entire groups (clusters) rather than individuals, which can be more practical when data is naturally grouped by location or time period
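The stratified approach above can be sketched in a few lines. This is a minimal illustration with hypothetical shift data (the strata names, sizes, and service-time parameters are invented for the example), using proportional allocation so each stratum's share of the sample matches its share of the population:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical service-time observations tagged by shift (the strata)
day_shift = rng.normal(5.0, 1.0, size=600)    # day-shift service times (min)
night_shift = rng.normal(8.0, 2.0, size=400)  # night-shift service times (min)
strata = {"day": day_shift, "night": night_shift}

def stratified_sample(strata, n_total, rng):
    """Draw a proportionally allocated stratified random sample."""
    pop_size = sum(len(v) for v in strata.values())
    sample = {}
    for name, values in strata.items():
        n_stratum = round(n_total * len(values) / pop_size)
        sample[name] = rng.choice(values, size=n_stratum, replace=False)
    return sample

sample = stratified_sample(strata, n_total=100, rng=rng)
print({k: len(v) for k, v in sample.items()})  # proportional allocation: 60 day, 40 night
```

Other allocation rules (equal allocation, or Neyman allocation weighted by stratum variance) drop in by changing how `n_stratum` is computed.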
Data Analysis Techniques
Correlation analysis identifies relationships between input variables, which matters because correlated inputs need to be modeled together rather than independently.
- Pearson correlation coefficient measures the strength of linear relationships between two variables
- Spearman rank correlation measures monotonic relationships (one variable consistently increases as the other does, even if not in a straight line)
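A quick sketch of the difference, using `scipy.stats` on hypothetical temperature/energy data (the variables and the nonlinear relationship are invented, and noise is omitted for clarity): because the relationship is monotonic but not linear, Spearman reports a perfect rank correlation while Pearson reports something lower.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical inputs: energy consumption rises nonlinearly with temperature
temperature = rng.uniform(10, 35, size=200)
energy = np.exp(temperature / 8.0)  # monotonic but not linear (noise-free for clarity)

r_pearson, _ = stats.pearsonr(temperature, energy)    # linear association
r_spearman, _ = stats.spearmanr(temperature, energy)  # monotonic association

# Spearman is exactly 1 here; Pearson is noticeably below 1
print(f"Pearson: {r_pearson:.2f}, Spearman: {r_spearman:.2f}")
```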
Outlier detection protects your data integrity. A single extreme value can distort a fitted distribution.
- The z-score method flags data points that fall more than a set number of standard deviations from the mean (commonly 2 or 3)
- The interquartile range (IQR) method flags points below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR, where IQR = Q3 − Q1
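Both rules are easy to apply directly. Here is a minimal sketch on hypothetical service-time data with two injected extreme values (the dataset and thresholds are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical service times with two injected extreme values
data = np.concatenate([rng.normal(10, 2, size=100), [25.0, 30.0]])

# Z-score method: flag points more than 3 standard deviations from the mean
z = (data - data.mean()) / data.std()
z_outliers = data[np.abs(z) > 3]

# IQR method: flag points below Q1 - 1.5*IQR or above Q3 + 1.5*IQR
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
iqr_outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]

print(sorted(z_outliers), sorted(iqr_outliers))
```

Note that the IQR fences are tighter than a 3-sigma cutoff on roughly normal data, so the IQR method may flag a few legitimate tail points in addition to the true extremes.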
Data visualization helps you spot patterns and problems that summary statistics alone can miss:
- Histograms show the shape of a frequency distribution, helping you guess which theoretical distribution might fit
- Scatter plots reveal relationships between two variables (e.g., temperature vs. energy consumption)
- Box plots summarize the median, spread, and outliers in a dataset at a glance
Probability Distributions for Input Data

Distribution Fitting Process
Distribution fitting is the process of selecting a theoretical probability distribution that best represents your observed data. You need this because your simulation will generate random variates from these distributions during each run.
Common distributions in simulation modeling:
- Normal distribution models symmetric, bell-shaped data (e.g., measurement errors, human heights)
- Exponential distribution models the time between independent events (e.g., time between customer arrivals at a service counter)
- Poisson distribution models the count of events in a fixed interval (e.g., number of defects per batch in manufacturing)
- Weibull distribution is flexible for modeling failure times and reliability data (e.g., time until a machine breaks down). Its shape parameter lets it represent increasing, decreasing, or constant failure rates.
Parameter estimation determines the specific values that define your chosen distribution:
- Maximum Likelihood Estimation (MLE) finds parameter values that make the observed data most probable under the assumed distribution. It's the most widely used method.
- Method of Moments sets the sample moments (mean, variance, etc.) equal to the theoretical moments and solves for the parameters. It's simpler but generally less efficient than MLE.
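The two estimation methods can be compared side by side. This sketch fits a gamma distribution to hypothetical repair-time data (the data are synthetic, drawn from a known gamma so the estimates can be checked): the Method of Moments uses the gamma's moment relations mean = shape × scale and variance = shape × scale², while MLE is delegated to scipy's numerical fitter.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
# Hypothetical repair-time data, actually drawn from a gamma distribution
data = rng.gamma(shape=2.0, scale=3.0, size=2000)

# Method of Moments: match sample mean and variance to the gamma's moments
# mean = shape * scale, variance = shape * scale^2
m, v = data.mean(), data.var()
shape_mom = m**2 / v
scale_mom = v / m

# Maximum Likelihood Estimation via scipy (location fixed at 0)
shape_mle, loc, scale_mle = stats.gamma.fit(data, floc=0)

print(f"MoM:  shape={shape_mom:.2f}, scale={scale_mom:.2f}")
print(f"MLE:  shape={shape_mle:.2f}, scale={scale_mle:.2f}")
```

With 2,000 observations both methods recover the true parameters (shape 2, scale 3) closely; with small samples the MLE estimates are typically the more efficient of the two.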
Goodness-of-Fit Assessment
After fitting a distribution, you need to test whether the fit is actually good enough. Don't just assume it works.
Formal statistical tests:
- Chi-square test bins the data and compares observed vs. expected frequencies in each bin
- Kolmogorov-Smirnov (K-S) test measures the maximum vertical distance between the empirical CDF and the theoretical CDF. It works on continuous data without binning.
- Anderson-Darling test is similar to K-S but gives more weight to the tails of the distribution, which is useful when tail behavior matters for your simulation
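Both tests are available in `scipy.stats`. A minimal sketch on hypothetical interarrival-time data (synthetic, so the fit should pass): one caveat worth knowing is that the standard K-S p-value is optimistic when the distribution's parameters were estimated from the same data being tested.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Hypothetical interarrival times drawn from an exponential distribution
data = rng.exponential(scale=4.0, size=500)

# Fit an exponential distribution (location fixed at 0), then test the fit
loc, scale = stats.expon.fit(data, floc=0)

# Kolmogorov-Smirnov test: max distance between empirical and fitted CDF
ks_stat, ks_p = stats.kstest(data, stats.expon(loc=loc, scale=scale).cdf)

# Anderson-Darling test (scipy supports 'expon' directly and estimates
# the parameters itself; compare the statistic to the critical values)
ad_result = stats.anderson(data, dist="expon")

print(f"K-S statistic={ks_stat:.3f}, p-value={ks_p:.3f}")
print(f"A-D statistic={ad_result.statistic:.3f}")
```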
Visual methods complement the formal tests and often reveal problems the tests miss:
- Q-Q plots plot sample quantiles against theoretical quantiles. If the fit is good, points fall close to a straight diagonal line. Deviations at the ends indicate poor tail fit.
- P-P plots compare cumulative probabilities and are more sensitive to differences in the middle of the distribution
- ECDF plots overlay the empirical CDF on the theoretical CDF so you can visually inspect the gap
Model selection criteria help you choose between competing distributions that all pass goodness-of-fit tests:
- Akaike Information Criterion (AIC) balances model fit against complexity. Lower AIC is better.
- Bayesian Information Criterion (BIC) also balances fit and complexity but penalizes extra parameters more heavily than AIC, favoring simpler models.
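Computing AIC and BIC from a fitted scipy distribution takes only the log-likelihood and the parameter count. This sketch compares a Weibull fit against an exponential fit on hypothetical failure-time data (synthetic, Weibull-distributed, so the Weibull should win):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
# Hypothetical failure times, actually Weibull-distributed (shape 1.5)
data = rng.weibull(1.5, size=1000) * 10.0

def aic_bic(dist, data, **fit_kwargs):
    """Fit a scipy distribution and return (AIC, BIC)."""
    params = dist.fit(data, **fit_kwargs)
    loglik = np.sum(dist.logpdf(data, *params))
    # len(params) counts the fixed location too; the extra +1 cancels
    # in comparisons where both candidates fix loc the same way
    k = len(params)
    n = len(data)
    return 2 * k - 2 * loglik, k * np.log(n) - 2 * loglik

aic_weib, bic_weib = aic_bic(stats.weibull_min, data, floc=0)
aic_expon, bic_expon = aic_bic(stats.expon, data, floc=0)

# The Weibull should score lower (better) since the data's shape parameter != 1
print(f"Weibull:     AIC={aic_weib:.1f}, BIC={bic_weib:.1f}")
print(f"Exponential: AIC={aic_expon:.1f}, BIC={bic_expon:.1f}")
```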
Software tools that handle distribution fitting include Arena's Input Analyzer, @RISK, and R's fitdistrplus package.
Model Validation and Comparison

Validation Techniques
Model validation answers a critical question: does your simulation accurately represent the real-world system? A model that runs without errors isn't necessarily a valid model.
Face validation is the simplest starting point. Subject matter experts review the model's logic, assumptions, structure, inputs, and outputs to check whether they seem reasonable. This catches obvious errors early.
Statistical validation uses formal comparisons between simulation output and real-world data:
- Hypothesis tests (t-tests, ANOVA) check whether the difference between simulated and real output is statistically significant
- Confidence intervals estimate the range within which the true parameter value likely falls, giving you a sense of how precise your model's predictions are
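Both checks are a few lines with `scipy.stats`. This sketch compares hypothetical real observations against simulation replications of a daily throughput metric (all data are invented for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
# Hypothetical daily throughput: real observations vs. simulation replications
real = rng.normal(100.0, 5.0, size=30)
simulated = rng.normal(101.0, 5.0, size=30)

# Two-sample t-test: is the difference in means statistically significant?
t_stat, p_value = stats.ttest_ind(real, simulated)

# 95% confidence interval for the mean of the simulated output
mean = simulated.mean()
sem = stats.sem(simulated)  # standard error of the mean
ci_low, ci_high = stats.t.interval(0.95, df=len(simulated) - 1, loc=mean, scale=sem)

print(f"t={t_stat:.2f}, p={p_value:.3f}")
print(f"95% CI for simulated mean: ({ci_low:.1f}, {ci_high:.1f})")
```

A large p-value here is evidence of consistency, not proof of validity; pair the test with the confidence interval to judge whether the model's precision is adequate for its purpose.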
Historical data validation tests predictive accuracy by running the model against past data the model wasn't built on. Compare predictions to known outcomes across different time periods to see if the model generalizes well.
Operational Validation and Visualization
Operational validation tests whether the model performs well enough for its intended purpose. This goes beyond statistical accuracy.
- Run pilot simulations to verify the model behaves as expected under normal and extreme conditions
- Check whether the model meets specific performance criteria defined by stakeholders
Animation and visualization are surprisingly effective validation tools. Animating an assembly line simulation, for example, lets you visually spot illogical behavior (parts moving backward, queues forming in impossible places). Dynamic charts tracking metrics like queue lengths or resource utilization over time can reveal transient problems that summary statistics would hide.
Sensitivity analysis also serves a validation role here. If the model's output changes dramatically in response to a small change in an input you know is stable in reality, that's a red flag worth investigating.
Sensitivity Analysis for Input Parameters
Local and Global Sensitivity Analysis
Sensitivity analysis tells you which input parameters have the biggest effect on your model's output. This helps you focus data collection efforts on the inputs that actually matter.
Local sensitivity analysis examines one parameter at a time:
- Hold all parameters at their baseline values
- Vary a single parameter by a small amount
- Measure how much the output changes
- Repeat for each parameter
This approach is straightforward, but it only captures the effect of each parameter around one specific point and can miss interactions between parameters.
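The steps above can be sketched with a toy model standing in for the simulation (the utilization formula, parameter names, and baseline values are all hypothetical). Each sensitivity here is a normalized central-difference estimate, i.e. the relative change in output per relative change in input:

```python
import numpy as np

def model(arrival_rate, service_rate, servers):
    """Toy stand-in for a simulation: utilization of a multi-server queue."""
    return arrival_rate / (servers * service_rate)

baseline = {"arrival_rate": 8.0, "service_rate": 3.0, "servers": 4}
delta = 0.05  # perturb each parameter by +/-5%

sensitivities = {}
for name in baseline:
    lo = dict(baseline); lo[name] = baseline[name] * (1 - delta)
    hi = dict(baseline); hi[name] = baseline[name] * (1 + delta)
    # central-difference estimate, normalized by the baseline output
    sensitivities[name] = (model(**hi) - model(**lo)) / (2 * delta * model(**baseline))

for name, s in sorted(sensitivities.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name:>12}: {s:+.2f}")
```

For this model the arrival-rate sensitivity is exactly +1 (the output is linear in it), while service rate and server count come out near −1; a real simulation would replace `model` with averaged replications to tame output noise.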
Global sensitivity analysis varies all parameters simultaneously across their full ranges, capturing interactions that local methods miss. Key methods include:
- Sobol indices decompose the total output variance into contributions from each parameter (first-order effects) and from parameter interactions (higher-order effects). The total-effect index for a parameter includes both its direct effect and all its interactions.
- FAST (Fourier Amplitude Sensitivity Test) uses frequency-based decomposition to achieve similar results, often with fewer simulation runs than Sobol.
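A first-order Sobol index can be estimated with the classic pick-freeze (Saltelli-style) scheme: evaluate the model on one sample matrix, then again on a second matrix that shares one column with the first, and take the covariance. This sketch uses a toy additive model with known indices (0.9 and 0.1), so the estimator can be checked; libraries such as SALib package this machinery for real studies.

```python
import numpy as np

def model(x):
    """Toy model: Y = 3*X1 + X2; analytic first-order Sobol indices are 0.9 and 0.1."""
    return 3.0 * x[:, 0] + x[:, 1]

rng = np.random.default_rng(21)
n, d = 100_000, 2

# Two independent sample matrices (the "pick-freeze" scheme)
A = rng.uniform(0, 1, size=(n, d))
B = rng.uniform(0, 1, size=(n, d))

yA = model(A)
var_y = yA.var()

S = np.empty(d)
for i in range(d):
    # AB_i: matrix B with column i replaced by column i of A
    AB = B.copy()
    AB[:, i] = A[:, i]
    yAB = model(AB)
    # First-order index: Cov(Y_A, Y_AB_i) / Var(Y)
    S[i] = (np.mean(yA * yAB) - yA.mean() * yAB.mean()) / var_y

print(np.round(S, 2))  # close to the analytic values [0.9, 0.1]
```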
Sensitivity Analysis Techniques
One-at-a-time (OAT) analysis is the simplest approach. You vary each parameter individually while holding everything else constant. It's a good starting point for screening, but it won't detect interactions between parameters.
Morris method (elementary effects) is a more efficient screening technique:
- Sample a set of randomized one-at-a-time trajectories through the full parameter space
- Compute the "elementary effect" of each parameter (the output change per unit step) along each trajectory
- Parameters with a high mean elementary effect are influential; parameters with a high standard deviation have effects that depend on the values of other parameters (indicating interactions)
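A minimal version of the Morris screening loop, on a toy model with a built-in interaction (the model and step size are illustrative). The purely additive parameter x1 shows a constant elementary effect (zero standard deviation), while the interacting parameters show effects that vary with the trajectory, which is exactly the interaction signal Morris looks for:

```python
import numpy as np

def model(x):
    """Toy model with an interaction: Y = x0 + 2*x1 + 5*x0*x2."""
    return x[0] + 2.0 * x[1] + 5.0 * x[0] * x[2]

rng = np.random.default_rng(13)
d, r, delta = 3, 200, 0.1  # 3 parameters, 200 trajectories, step size 0.1

effects = [[] for _ in range(d)]
for _ in range(r):
    x = rng.uniform(0, 1 - delta, size=d)  # random base point (steps stay in [0,1])
    for i in rng.permutation(d):           # one-at-a-time steps in random order
        x_step = x.copy()
        x_step[i] += delta
        effects[i].append((model(x_step) - model(x)) / delta)
        x = x_step                         # continue the trajectory from the new point

for i in range(d):
    ee = np.array(effects[i])
    print(f"x{i}: mean|EE|={np.abs(ee).mean():.2f}, std={ee.std():.2f}")
```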
Metamodeling techniques build simplified surrogate models when running the full simulation is too expensive:
- Response surface methodology fits polynomial equations (often quadratic) to simulation output, letting you explore the input-output relationship quickly
- Gaussian process emulation (kriging) builds a statistical surrogate that interpolates between simulation runs and provides uncertainty estimates
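A response surface can be fitted with ordinary least squares. This sketch uses a cheap deterministic function as a stand-in for an expensive simulation (the function and its coefficients are invented so the fit can be verified); in practice `y` would come from a modest number of real simulation runs.

```python
import numpy as np

def expensive_simulation(x1, x2):
    """Toy stand-in for a costly simulation run."""
    return 2.0 + 1.5 * x1 - 0.8 * x2 + 0.5 * x1 * x2 + 0.3 * x1**2

rng = np.random.default_rng(9)
x1 = rng.uniform(-1, 1, size=50)
x2 = rng.uniform(-1, 1, size=50)
y = expensive_simulation(x1, x2)  # 50 "simulation runs"

# Quadratic response surface: y ~ b0 + b1*x1 + b2*x2 + b3*x1*x2 + b4*x1^2 + b5*x2^2
X = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2, x1**2, x2**2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

def surrogate(x1, x2):
    """Cheap surrogate: predicts new points without re-running the simulation."""
    return coef @ np.array([1.0, x1, x2, x1 * x2, x1**2, x2**2])

print(np.round(coef, 2))  # recovers the underlying coefficients (x2^2 term ~ 0)
```

Once fitted, the surrogate can be evaluated thousands of times for sensitivity analysis or optimization at negligible cost, which is the whole point of metamodeling.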
Graphical tools make results easier to interpret:
- Tornado diagrams rank parameters by their impact on the output, with the most influential parameter at the top. Each bar shows the output range when that parameter varies across its range.
- Spider plots show how the output changes as each parameter moves across its range, with all curves plotted on the same axes for comparison