Reliability theory gives you the mathematical tools to analyze how and when systems and components fail. It's central to engineering, manufacturing, and software development because predicting failure lets you design better systems, plan smarter maintenance, and optimize costs. These notes cover failure time distributions, failure rate functions, system reliability structures, maintenance strategies, and optimization techniques.
Basics of reliability theory
Reliability theory sits at the intersection of probability, statistics, and engineering. Its core question: given a system or component, what's the probability it keeps working for a specified period of time?
The three pillars you need to understand are:
- Failure time distributions describe when failures happen probabilistically
- Failure rate functions describe how the risk of failure changes over time
- System reliability describes how individual component reliabilities combine based on system architecture
Failure time distributions
Probability density functions
The probability density function (PDF) of a failure time random variable $T$ describes the relative likelihood of failure at different times. It's denoted $f(t)$ and must satisfy two properties:
- $f(t) \ge 0$ for all $t$
- $\int_0^\infty f(t)\,dt = 1$
You use the PDF to calculate probabilities of failure within specific time intervals. For example, $P(a \le T \le b) = \int_a^b f(t)\,dt$. The PDF also serves as the building block for deriving the CDF, survival function, and hazard rate.
Cumulative distribution functions
The cumulative distribution function (CDF) gives the probability that failure occurs at or before time $t$:
$F(t) = P(T \le t) = \int_0^t f(u)\,du$
The CDF is non-decreasing, with $F(0) = 0$ and $\lim_{t \to \infty} F(t) = 1$. In reliability terms, $F(t)$ is the probability that the component has already failed by time $t$.
Survival functions
The survival function (also called the reliability function) is the complement of the CDF:
$S(t) = P(T > t) = 1 - F(t)$
This gives the probability that the system or component is still working at time $t$. It's non-increasing, starting at $S(0) = 1$ (the component works at time zero) and approaching $0$ as $t \to \infty$ (everything fails eventually). The survival function is the most direct measure of reliability over time.
Important lifetime distributions
Exponential distribution
The exponential distribution is the simplest lifetime model, defined by a single constant failure rate $\lambda > 0$:
$f(t) = \lambda e^{-\lambda t}, \quad t \ge 0$
Its survival function is $S(t) = e^{-\lambda t}$, and its mean time to failure (MTTF) is $1/\lambda$.
The defining feature is the memoryless property: the probability of surviving an additional $s$ units of time doesn't depend on how long the component has already been running. Formally, $P(T > t + s \mid T > t) = P(T > s)$. This makes it appropriate for modeling failures caused by random external shocks rather than gradual degradation.
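The memoryless property is easy to verify numerically. A minimal sketch (the failure rate and times are arbitrary illustrative values):

```python
import math

def exp_survival(t, lam):
    """Survival function S(t) = exp(-lam * t) of the exponential distribution."""
    return math.exp(-lam * t)

lam = 0.01           # assumed failure rate (failures per hour)
t, s = 500.0, 100.0  # already survived t hours; ask about s more

# P(T > t + s | T > t) = S(t + s) / S(t)
conditional = exp_survival(t + s, lam) / exp_survival(t, lam)
unconditional = exp_survival(s, lam)  # P(T > s)

print(conditional, unconditional)  # equal: past run time is irrelevant
```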
Weibull distribution
The Weibull distribution is far more flexible, with PDF:
$f(t) = \frac{\beta}{\eta}\left(\frac{t}{\eta}\right)^{\beta - 1} e^{-(t/\eta)^\beta}, \quad t \ge 0$
where $\eta > 0$ is the scale parameter and $\beta > 0$ is the shape parameter. The shape parameter controls the failure rate behavior:
- $\beta < 1$: decreasing failure rate (early-life failures dominate)
- $\beta = 1$: constant failure rate (reduces to the exponential distribution with $\lambda = 1/\eta$)
- $\beta > 1$: increasing failure rate (wear-out failures dominate)
This versatility makes the Weibull distribution one of the most widely used models in reliability engineering.
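The effect of the shape parameter is easy to see by evaluating the Weibull hazard rate $h(t) = (\beta/\eta)(t/\eta)^{\beta-1}$ at an early and a late time. A small sketch with arbitrary illustrative parameters:

```python
def weibull_hazard(t, beta, eta):
    """Weibull hazard rate: h(t) = (beta/eta) * (t/eta)**(beta - 1)."""
    return (beta / eta) * (t / eta) ** (beta - 1)

eta = 1000.0  # assumed scale parameter (hours)
for beta, label in [(0.5, "DFR"), (1.0, "CFR"), (2.0, "IFR")]:
    h_early = weibull_hazard(100.0, beta, eta)
    h_late = weibull_hazard(900.0, beta, eta)
    print(f"beta={beta}: h(100)={h_early:.6f}, h(900)={h_late:.6f}  ({label})")
```

For $\beta = 0.5$ the hazard falls with age, for $\beta = 1$ it is flat, and for $\beta = 2$ it rises, matching the three bullet cases.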
Gamma distribution
The gamma distribution provides another flexible lifetime model with PDF:
$f(t) = \frac{\lambda^k t^{k-1} e^{-\lambda t}}{\Gamma(k)}, \quad t \ge 0$
where $k > 0$ is the shape parameter, $\lambda > 0$ is the rate parameter, and $\Gamma(\cdot)$ is the gamma function. When $k = 1$, it reduces to the exponential distribution. For integer $k$, the gamma distribution represents the waiting time until the $k$-th event in a Poisson process, which makes it natural for modeling systems that fail after accumulating a certain number of shocks or damage increments.
Lognormal distribution
The lognormal distribution applies when $\ln T$ follows a normal distribution with parameters $\mu$ and $\sigma$:
$f(t) = \frac{1}{t\sigma\sqrt{2\pi}} \exp\left(-\frac{(\ln t - \mu)^2}{2\sigma^2}\right), \quad t > 0$
This distribution arises naturally when the failure process results from the multiplicative accumulation of many small random effects (by the central limit theorem applied to the log scale). It's commonly used for modeling fatigue life, crack growth, and degradation processes in materials.
Failure rate functions
Hazard rate vs failure rate
The hazard rate (or hazard function) captures the instantaneous risk of failure at time $t$, given survival up to that point:
$h(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t} = \frac{f(t)}{S(t)}$
Think of it this way: $h(t)\,\Delta t$ approximates the probability of failing in the tiny interval $[t, t + \Delta t)$, conditional on having survived to time $t$.
The term "failure rate" is sometimes used loosely as a synonym for the hazard rate, but strictly speaking, the failure rate often refers to the average number of failures per unit time (e.g., failures per hour across a population). For exam purposes, know the precise definition of $h(t)$ and how to derive it from $f(t)$ and $S(t)$.
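The identity $h(t) = f(t)/S(t)$ can be checked numerically against a known closed form. A quick sketch using the Weibull distribution (parameter values are arbitrary):

```python
import math

def weibull_pdf(t, beta, eta):
    """Weibull PDF with shape beta and scale eta."""
    return (beta / eta) * (t / eta) ** (beta - 1) * math.exp(-((t / eta) ** beta))

def weibull_sf(t, beta, eta):
    """Weibull survival function S(t) = exp(-(t/eta)**beta)."""
    return math.exp(-((t / eta) ** beta))

beta, eta = 2.0, 1000.0  # illustrative parameters
t = 500.0

# h(t) = f(t) / S(t) ...
h = weibull_pdf(t, beta, eta) / weibull_sf(t, beta, eta)
# ... matches the Weibull closed-form hazard (beta/eta) * (t/eta)**(beta - 1)
closed_form = (beta / eta) * (t / eta) ** (beta - 1)
print(h, closed_form)
```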
Bathtub curve
The bathtub curve describes the typical hazard rate trajectory over a product's lifetime. It has three phases:
- Infant mortality phase (decreasing hazard rate): Early failures caused by manufacturing defects, assembly errors, or weak components. Burn-in testing is used to screen these out.
- Useful life phase (approximately constant hazard rate): Failures are random and unpredictable. The exponential distribution models this phase well.
- Wear-out phase (increasing hazard rate): Failures due to aging, fatigue, corrosion, or cumulative damage. Preventive replacement becomes important here.
Identifying which phase a component is in tells you which maintenance strategy to apply and which distribution to use for modeling.
Monotone failure rates
A distribution has a monotone failure rate if its hazard function trends consistently in one direction:
- IFR (Increasing Failure Rate): $h(t)$ increases over time. The component degrades with age. Example: mechanical parts subject to wear. The Weibull distribution with $\beta > 1$ is IFR.
- DFR (Decreasing Failure Rate): $h(t)$ decreases over time. Surviving longer makes future failure less likely. Example: electronic components that pass the infant mortality phase. The Weibull with $\beta < 1$ is DFR.
- CFR (Constant Failure Rate): $h(t) = \lambda$ is constant. This is the exponential distribution (Weibull with $\beta = 1$).
IFR and DFR classifications have theoretical consequences for bounding system reliability and choosing maintenance policies.
Estimating lifetime distributions
Parametric methods
Parametric estimation assumes the failure times follow a known distributional form (exponential, Weibull, gamma, etc.) and estimates the parameters from observed data.
Common techniques:
- Maximum Likelihood Estimation (MLE): Finds parameter values that maximize the likelihood of the observed data. Generally the most efficient method when the model is correct.
- Method of Moments (MOM): Equates sample moments to theoretical moments and solves for parameters. Simpler but often less efficient than MLE.
- Bayesian estimation: Incorporates prior information about parameters and produces a posterior distribution.
Parametric methods are statistically efficient when the assumed distribution fits the data well. The risk is model misspecification: if you assume Weibull but the true distribution is lognormal, your estimates and predictions can be biased.
Nonparametric methods
Nonparametric methods estimate the survival or hazard function directly from data without assuming a distributional form. Two key estimators:
- Kaplan-Meier estimator: Estimates the survival function as a step function that drops at each observed failure time. Handles censored data (components that haven't failed yet by the end of the study).
- Nelson-Aalen estimator: Estimates the cumulative hazard function $H(t) = \int_0^t h(u)\,du$.
These methods are more robust to distributional assumptions but typically require larger sample sizes to achieve the same precision as parametric approaches.
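The Kaplan-Meier estimator can be sketched in a few lines. This is a minimal implementation for illustration (it uses the usual convention that observations censored at a failure time are still at risk at that time); the data below are hypothetical:

```python
def kaplan_meier(times, events):
    """Kaplan-Meier step estimate of the survival function S(t).

    times:  observed times (failure or right-censoring)
    events: 1 if a failure was observed at that time, 0 if censored
    Returns [(t, S(t))] with one step per distinct failure time.
    """
    n = len(times)
    pairs = sorted(zip(times, events))
    removed, s, steps = 0, 1.0, []
    for t in sorted(set(times)):
        at_risk = n - removed                                 # units still under observation
        d = sum(1 for tt, e in pairs if tt == t and e == 1)   # failures at t
        m = sum(1 for tt, _ in pairs if tt == t)              # everyone leaving the risk set at t
        if d > 0:
            s *= 1 - d / at_risk                              # survive this failure time
            steps.append((t, s))
        removed += m
    return steps

# Hypothetical sample: failures at 2, 3, 5; censored units at 3 and 8
print(kaplan_meier([2, 3, 3, 5, 8], [1, 1, 0, 1, 0]))
# steps at t = 2, 3, 5 with S(t) approximately 0.8, 0.6, 0.3
```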
System reliability

Series vs parallel systems
Series systems require every component to function. If any single component fails, the whole system fails. For independent components with reliabilities $R_1, \dots, R_n$, the system reliability is:
$R_s = \prod_{i=1}^{n} R_i$
Because you're multiplying probabilities less than 1, adding more components in series always decreases system reliability. A chain is only as strong as its weakest link.
Parallel systems require at least one component to function. The system fails only when all components fail:
$R_s = 1 - \prod_{i=1}^{n} (1 - R_i)$
Adding components in parallel increases system reliability. This is the basic principle behind redundancy.
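The two formulas are one-liners; a small sketch (component reliabilities are illustrative) makes the contrast concrete:

```python
from math import prod

def series_reliability(rs):
    """All components must work: R = product of the R_i."""
    return prod(rs)

def parallel_reliability(rs):
    """At least one must work: R = 1 - product of the (1 - R_i)."""
    return 1 - prod(1 - r for r in rs)

comps = [0.95, 0.95, 0.95]  # assumed component reliabilities
print(series_reliability(comps))    # 0.857375: three 0.95 parts in series lose reliability
print(parallel_reliability(comps))  # 0.999875: the same parts in parallel gain it
```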
k-out-of-n systems
A k-out-of-n system functions if and only if at least $k$ of its $n$ components work. This generalizes both series and parallel systems:
- $k = n$: series system (all must work)
- $k = 1$: parallel system (at least one must work)
When components are independent and identically distributed with reliability $p$, the system reliability is:
$R_s = \sum_{j=k}^{n} \binom{n}{j} p^j (1-p)^{n-j}$
A practical example: a flight control system that uses triple modular redundancy with majority voting is a 2-out-of-3 system.
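The binomial sum translates directly to code; a sketch with an assumed component reliability of 0.9:

```python
from math import comb

def k_out_of_n_reliability(k, n, p):
    """At least k of n i.i.d. components (each with reliability p) must work."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))

p = 0.9
print(k_out_of_n_reliability(2, 3, p))  # 2-out-of-3 voting: 0.972
# Sanity checks: k = n recovers the series formula, k = 1 the parallel one
print(k_out_of_n_reliability(3, 3, p))  # p**3 = 0.729
print(k_out_of_n_reliability(1, 3, p))  # 1 - (1-p)**3 = 0.999
```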
Redundancy in system design
Redundancy improves reliability by including extra components beyond the minimum needed. Three main types:
- Active redundancy: All redundant components operate simultaneously. If one fails, the others continue without interruption. Simple but means all components experience wear from the start.
- Standby redundancy: Backup components remain idle until a primary component fails, then switch in. This can extend component life (the standby unit doesn't degrade while idle), but introduces the risk of switching failures.
- Voting redundancy: Multiple components operate in parallel, and the system output is determined by majority vote. Protects against both failures and erroneous outputs.
Redundancy allocation is the optimization problem of deciding where to place redundant components to maximize system reliability under constraints like cost, weight, or volume.
Reliability of maintained systems
Preventive maintenance
Preventive maintenance (PM) involves scheduled actions to prevent or delay failures before they occur. Two main approaches:
- Time-based PM: Maintenance at fixed intervals (e.g., replace a filter every 1,000 hours) regardless of component condition.
- Condition-based PM: Maintenance triggered by monitored indicators (e.g., vibration levels, temperature readings) that signal degradation.
Condition-based PM is generally more efficient because it avoids unnecessary maintenance on healthy components, but it requires monitoring infrastructure and reliable diagnostic thresholds.
Corrective maintenance
Corrective maintenance (CM) is reactive: you repair or replace a component after it fails. The goal is to restore the system to operation as quickly as possible.
CM effectiveness depends on:
- Speed of failure detection
- Availability of spare parts and repair equipment
- Skill level of maintenance personnel
- Whether the repair restores the component to "as good as new" or just "as bad as old"
In reliability modeling, the distinction between perfect repair (component is renewed) and minimal repair (component returns to the state it was in just before failure) significantly affects the mathematical treatment.
Optimal maintenance policies
The goal is to balance PM costs against CM costs and downtime penalties. Common policies include:
- Age-based replacement: Replace a component when it reaches a specified age $T$, or upon failure, whichever comes first. The optimal $T^*$ minimizes expected cost per unit time.
- Block replacement: Replace all components of a given type at fixed calendar intervals, regardless of individual ages. Simpler to administer but can waste remaining useful life.
- Inspection-based maintenance: Perform periodic inspections to detect degradation and schedule repairs before failure.
Finding the optimal policy requires knowledge of the failure time distribution, the costs of PM and CM, and the cost of system downtime. For IFR distributions, age-based replacement policies are particularly effective because the increasing hazard rate means older components are increasingly risky to keep running.
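For age-based replacement, a common renewal-reward formulation gives the long-run expected cost per unit time as $C(T) = \frac{c_p S(T) + c_f F(T)}{\int_0^T S(t)\,dt}$, where $c_p$ and $c_f$ are the preventive and failure replacement costs. A sketch that minimizes $C(T)$ numerically for an IFR Weibull lifetime (all cost and distribution parameters are assumed for illustration):

```python
import math

def weibull_sf(t, beta, eta):
    """Weibull survival function S(t) = exp(-(t/eta)**beta)."""
    return math.exp(-((t / eta) ** beta))

def cost_rate(T, c_p, c_f, beta, eta, steps=2000):
    """C(T) = (c_p * S(T) + c_f * F(T)) / integral_0^T S(t) dt."""
    sf_T = weibull_sf(T, beta, eta)
    h = T / steps
    # trapezoidal integral of S(t) over [0, T] = expected cycle length
    integral = sum(0.5 * h * (weibull_sf(i * h, beta, eta) + weibull_sf((i + 1) * h, beta, eta))
                   for i in range(steps))
    return (c_p * sf_T + c_f * (1 - sf_T)) / integral

# Assumed numbers: preventive replacement costs 1, failure replacement costs 10,
# Weibull(beta=2, eta=1000) lifetimes (IFR, so a finite optimum exists).
best_T = min(range(100, 2001, 25), key=lambda T: cost_rate(T, 1.0, 10.0, 2.0, 1000.0))
print(best_T, cost_rate(best_T, 1.0, 10.0, 2.0, 1000.0))
```

The grid search finds an interior optimum: replacing too early wastes preventive cost, replacing too late pays the tenfold failure cost too often.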
Accelerated life testing
Acceleration factors
Accelerated life testing (ALT) subjects products to stress levels higher than normal operating conditions (elevated temperature, humidity, voltage, vibration, etc.) to induce failures faster. The key assumption is that the failure mechanism remains the same under accelerated and normal conditions; only the time scale changes.
Acceleration factors quantify how much faster failures occur under stress. Common life-stress models:
- Arrhenius model: For temperature-driven failures. The acceleration factor depends on $\exp(E_a / (k_B T))$, where $E_a$ is the activation energy, $k_B$ is Boltzmann's constant, and $T$ is absolute temperature.
- Inverse power law model: For non-thermal stresses (voltage, mechanical load). Relates life $L$ to stress $S$ as $L = C / S^n$.
- Eyring model: Generalizes the Arrhenius model to handle multiple stress types simultaneously.
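The Arrhenius acceleration factor between a use temperature and a stress temperature is $AF = \exp\left[\frac{E_a}{k_B}\left(\frac{1}{T_{\text{use}}} - \frac{1}{T_{\text{stress}}}\right)\right]$. A sketch with assumed values (0.7 eV activation energy, 25 °C use, 85 °C stress):

```python
import math

K_B = 8.617e-5  # Boltzmann's constant in eV/K

def arrhenius_af(e_a, t_use_c, t_stress_c):
    """Arrhenius acceleration factor; temperatures given in Celsius,
    converted to kelvin: AF = exp[(E_a / k_B) * (1/T_use - 1/T_stress)]."""
    t_use = t_use_c + 273.15
    t_stress = t_stress_c + 273.15
    return math.exp((e_a / K_B) * (1 / t_use - 1 / t_stress))

af = arrhenius_af(0.7, 25.0, 85.0)
print(af)  # roughly 96: one test hour at 85 C covers about 96 field hours at 25 C
```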
Extrapolation to use conditions
The whole point of ALT is to predict reliability under normal conditions from accelerated test data. The process:
- Run tests at multiple elevated stress levels.
- Fit a life-stress relationship (Arrhenius, inverse power law, etc.) to the failure data at each stress level.
- Extrapolate the fitted model down to the normal use stress level to estimate the failure time distribution under actual operating conditions.
Key challenges include choosing the right life-stress model, dealing with multiple competing failure modes (which may have different acceleration behaviors), and ensuring the extrapolation doesn't extend too far beyond the tested stress range, which would make predictions unreliable.
Software reliability
Software reliability growth models
Software reliability growth models (SRGMs) track how reliability improves as faults are found and fixed during testing. Unlike hardware, software doesn't wear out; failures come from latent faults in the code.
SRGMs fall into two categories:
- Concave models (e.g., Goel-Okumoto, Musa-Okumoto logarithmic): Assume the fault detection rate decreases over time as fewer faults remain. The cumulative number of detected faults follows a concave curve.
- S-shaped models (e.g., Yamada delayed S-shaped, Gompertz): Assume testers need a learning period before fault detection accelerates, producing an initial slow phase followed by rapid detection, then tapering off.
SRGMs are used to predict the number of remaining faults, estimate time to next failure, and decide when software is reliable enough to release.
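As a concrete concave example, the Goel-Okumoto model has mean value function $m(t) = a(1 - e^{-bt})$, where $a$ is the total expected fault count and $b$ the per-fault detection rate. The parameter values below are assumed for illustration, not fitted:

```python
import math

def goel_okumoto_mean(t, a, b):
    """Goel-Okumoto mean value function: expected cumulative faults found by time t.
    a = total expected faults, b = per-fault detection rate."""
    return a * (1 - math.exp(-b * t))

a, b = 120.0, 0.05  # assumed: 120 latent faults, detection rate 0.05 per week
found_by_20 = goel_okumoto_mean(20, a, b)
remaining = a - found_by_20  # predicted faults still in the code after 20 weeks
print(found_by_20, remaining)
```

The curve is concave: each additional week of testing uncovers fewer new faults, and $m(t)$ approaches $a$ but never exceeds it.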
Debugging vs testing
These are complementary but distinct activities:
- Testing executes the software to find failures and assess reliability. It includes unit testing, integration testing, system testing, and fault injection.
- Debugging is the process of locating and fixing the faults that caused observed failures.
Testing reveals that a problem exists; debugging resolves it. Both feed into SRGMs: testing generates the failure data, and debugging (ideally) reduces the remaining fault count. Effective strategies like code reviews, automated test suites, and regression testing are essential for driving reliability growth.
Reliability optimization
Reliability allocation
Reliability allocation assigns reliability targets to individual components so the overall system meets its reliability requirement. If you need system reliability of 0.99, how reliable does each subsystem need to be?
Common methods:
- Equal apportionment: Assign the same reliability target to every component. Simple but ignores differences in component complexity or criticality.
- AGREE method: Allocates reliability based on component complexity (number of parts) and criticality (importance to system function). More realistic.
- Feasibility of objectives: Allocates based on what's technically and economically achievable for each component.
Reliability allocation is an early design activity that guides component selection and identifies where engineering effort should be concentrated.
Redundancy allocation
Redundancy allocation determines the optimal number of redundant components at each position in the system to maximize reliability under resource constraints (budget, weight, volume).
This is formulated as a constrained optimization problem:
- Objective: Maximize system reliability
- Decision variables: Number of redundant units at each subsystem position
- Constraints: Total cost, weight, or volume limits
Solution methods range from exact approaches (integer programming, dynamic programming) to heuristic/metaheuristic methods (genetic algorithms, simulated annealing) for larger problems where exact solutions are computationally infeasible.
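For a small series system of parallel subsystems, exhaustive enumeration is feasible and shows the structure of the problem. This sketch assumes hypothetical unit reliabilities, unit costs, and a budget; real problems need the exact or metaheuristic methods above:

```python
from itertools import product
from math import prod

def system_reliability(alloc, comp_rel):
    """Series system of subsystems, each with n_i identical units in parallel."""
    return prod(1 - (1 - r) ** n for r, n in zip(comp_rel, alloc))

def best_allocation(comp_rel, unit_cost, budget, max_units=4):
    """Exhaustively search unit counts (small problems only) for the most
    reliable allocation whose total cost stays within the budget."""
    best = None
    for alloc in product(range(1, max_units + 1), repeat=len(comp_rel)):
        cost = sum(n * c for n, c in zip(alloc, unit_cost))
        if cost <= budget:
            r = system_reliability(alloc, comp_rel)
            if best is None or r > best[1]:
                best = (alloc, r)
    return best

# Hypothetical data: three subsystem positions with different unit reliabilities/costs
alloc, r = best_allocation(comp_rel=[0.90, 0.95, 0.80], unit_cost=[2, 3, 1], budget=12)
print(alloc, r)
```

The search spends the budget where redundancy buys the most reliability per unit cost, rather than spreading units evenly.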
Reliability-redundancy allocation
This extends redundancy allocation by simultaneously optimizing both the reliability level of each component (which you can improve by using higher-quality parts, at higher cost) and the number of redundant components at each position.
The decision variables now include:
- Component reliability levels (continuous)
- Number of redundant units (integer)
This creates a mixed-integer nonlinear optimization problem. The trade-off is clear: you can achieve a target system reliability by using fewer but more reliable (and expensive) components, or by using more but cheaper components with redundancy. Finding the cost-optimal combination requires balancing these two levers against system-level constraints.