Reliability theory gives you the mathematical tools to analyze how and when systems and components fail. It's central to engineering, manufacturing, and software development because predicting failure lets you design better systems, plan smarter maintenance, and optimize costs. These notes cover failure time distributions, failure rate functions, system reliability structures, maintenance strategies, and optimization techniques.
Basics of reliability theory
Reliability theory sits at the intersection of probability, statistics, and engineering. Its core question: given a system or component, what's the probability it keeps working for a specified period of time?
The three pillars you need to understand are:
- Failure time distributions describe when failures happen probabilistically
- Failure rate functions describe how the risk of failure changes over time
- System reliability describes how individual component reliabilities combine based on system architecture
Failure time distributions
Probability density functions
The probability density function (PDF) of a failure time random variable $T$ describes the relative likelihood of failure at different times. It's denoted $f(t)$ and must satisfy two properties:
- $f(t) \ge 0$ for all $t$
- $\int_0^\infty f(t)\,dt = 1$
You use the PDF to calculate probabilities of failure within specific time intervals. For example, $P(a \le T \le b) = \int_a^b f(t)\,dt$. The PDF also serves as the building block for deriving the CDF, survival function, and hazard rate.
Cumulative distribution functions
The cumulative distribution function (CDF) gives the probability that failure occurs at or before time $t$:
$F(t) = P(T \le t) = \int_0^t f(u)\,du$
The CDF is non-decreasing, with $F(0) = 0$ and $\lim_{t \to \infty} F(t) = 1$. In reliability terms, $F(t)$ is the probability that the component has already failed by time $t$.
Survival functions
The survival function (also called the reliability function) is the complement of the CDF:
$S(t) = P(T > t) = 1 - F(t)$
This gives the probability that the system or component is still working at time $t$. It's non-increasing, starting at $S(0) = 1$ (the component works at time zero) and approaching $0$ as $t \to \infty$ (everything fails eventually). The survival function is the most direct measure of reliability over time.
Important lifetime distributions
Exponential distribution
The exponential distribution is the simplest lifetime model, defined by a single constant failure rate $\lambda > 0$:
$f(t) = \lambda e^{-\lambda t}, \quad t \ge 0$
Its survival function is $S(t) = e^{-\lambda t}$, and its mean time to failure (MTTF) is $1/\lambda$.
The defining feature is the memoryless property: the probability of surviving an additional $s$ units of time doesn't depend on how long the component has already been running. Formally, $P(T > t + s \mid T > t) = P(T > s)$. This makes it appropriate for modeling failures caused by random external shocks rather than gradual degradation.
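The memoryless property is easy to verify numerically. A minimal sketch (the failure rate and times are arbitrary illustrative values):

```python
import math

def exp_survival(t, lam):
    """Survival function S(t) = exp(-lam * t) of the exponential distribution."""
    return math.exp(-lam * t)

lam = 0.01           # assumed failure rate (failures per hour)
t, s = 500.0, 100.0  # already survived t hours; ask about s more

# P(T > t + s | T > t) = S(t + s) / S(t)
conditional = exp_survival(t + s, lam) / exp_survival(t, lam)
unconditional = exp_survival(s, lam)  # P(T > s)

print(conditional, unconditional)  # equal: past run time is irrelevant
```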
Weibull distribution
The Weibull distribution is far more flexible, with PDF:
$f(t) = \frac{\beta}{\eta}\left(\frac{t}{\eta}\right)^{\beta - 1} e^{-(t/\eta)^\beta}, \quad t \ge 0$
where $\eta > 0$ is the scale parameter and $\beta > 0$ is the shape parameter. The shape parameter controls the failure rate behavior:
- $\beta < 1$: decreasing failure rate (early-life failures dominate)
- $\beta = 1$: constant failure rate (reduces to the exponential distribution with $\lambda = 1/\eta$)
- $\beta > 1$: increasing failure rate (wear-out failures dominate)
This versatility makes the Weibull distribution one of the most widely used models in reliability engineering.
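The effect of the shape parameter is easy to see by evaluating the Weibull hazard rate $h(t) = (\beta/\eta)(t/\eta)^{\beta-1}$ at an early and a late time. A small sketch with arbitrary illustrative parameters:

```python
def weibull_hazard(t, beta, eta):
    """Weibull hazard rate: h(t) = (beta/eta) * (t/eta)**(beta - 1)."""
    return (beta / eta) * (t / eta) ** (beta - 1)

eta = 1000.0  # assumed scale parameter (hours)
for beta, label in [(0.5, "DFR"), (1.0, "CFR"), (2.0, "IFR")]:
    h_early = weibull_hazard(100.0, beta, eta)
    h_late = weibull_hazard(900.0, beta, eta)
    print(f"beta={beta}: h(100)={h_early:.6f}, h(900)={h_late:.6f}  ({label})")
```

For $\beta = 0.5$ the hazard falls with age, for $\beta = 1$ it is flat, and for $\beta = 2$ it rises, matching the three bullet cases.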
Gamma distribution
The gamma distribution provides another flexible lifetime model with PDF:
$f(t) = \frac{\lambda^k t^{k-1} e^{-\lambda t}}{\Gamma(k)}, \quad t \ge 0$
where $k > 0$ is the shape parameter, $\lambda > 0$ is the rate parameter, and $\Gamma(\cdot)$ is the gamma function. When $k = 1$, it reduces to the exponential distribution. For integer $k$, the gamma distribution represents the waiting time until the $k$-th event in a Poisson process, which makes it natural for modeling systems that fail after accumulating a certain number of shocks or damage increments.
Lognormal distribution
The lognormal distribution applies when $\ln T$ follows a normal distribution with parameters $\mu$ and $\sigma$:
$f(t) = \frac{1}{t\sigma\sqrt{2\pi}} \exp\left(-\frac{(\ln t - \mu)^2}{2\sigma^2}\right), \quad t > 0$
This distribution arises naturally when the failure process results from the multiplicative accumulation of many small random effects (by the central limit theorem applied to the log scale). It's commonly used for modeling fatigue life, crack growth, and degradation processes in materials.
Failure rate functions
Hazard rate vs failure rate
The hazard rate (or hazard function) captures the instantaneous risk of failure at time $t$, given survival up to that point:
$h(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t} = \frac{f(t)}{S(t)}$
Think of it this way: $h(t)\,\Delta t$ approximates the probability of failing in the tiny interval $[t, t + \Delta t)$, conditional on having survived to time $t$.
The term "failure rate" is sometimes used loosely as a synonym for the hazard rate, but strictly speaking, the failure rate often refers to the average number of failures per unit time (e.g., failures per hour across a population). For exam purposes, know the precise definition of $h(t)$ and how to derive it from $f(t)$ and $S(t)$.
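The identity $h(t) = f(t)/S(t)$ can be checked numerically against a known closed form. A quick sketch using the Weibull distribution (parameter values are arbitrary):

```python
import math

def weibull_pdf(t, beta, eta):
    """Weibull PDF with shape beta and scale eta."""
    return (beta / eta) * (t / eta) ** (beta - 1) * math.exp(-((t / eta) ** beta))

def weibull_sf(t, beta, eta):
    """Weibull survival function S(t) = exp(-(t/eta)**beta)."""
    return math.exp(-((t / eta) ** beta))

beta, eta = 2.0, 1000.0  # illustrative parameters
t = 500.0

# h(t) = f(t) / S(t) ...
h = weibull_pdf(t, beta, eta) / weibull_sf(t, beta, eta)
# ... matches the Weibull closed-form hazard (beta/eta) * (t/eta)**(beta - 1)
closed_form = (beta / eta) * (t / eta) ** (beta - 1)
print(h, closed_form)
```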
Bathtub curve
The bathtub curve describes the typical hazard rate trajectory over a product's lifetime. It has three phases:
- Infant mortality phase (decreasing hazard rate): Early failures caused by manufacturing defects, assembly errors, or weak components. Burn-in testing is used to screen these out.
- Useful life phase (approximately constant hazard rate): Failures are random and unpredictable. The exponential distribution models this phase well.
- Wear-out phase (increasing hazard rate): Failures due to aging, fatigue, corrosion, or cumulative damage. Preventive replacement becomes important here.
Identifying which phase a component is in tells you which maintenance strategy to apply and which distribution to use for modeling.
Monotone failure rates
A distribution has a monotone failure rate if its hazard function trends consistently in one direction:
- IFR (Increasing Failure Rate): $h(t)$ increases over time. The component degrades with age. Example: mechanical parts subject to wear. The Weibull distribution with $\beta > 1$ is IFR.
- DFR (Decreasing Failure Rate): $h(t)$ decreases over time. Surviving longer makes future failure less likely. Example: electronic components that pass the infant mortality phase. The Weibull with $\beta < 1$ is DFR.
- CFR (Constant Failure Rate): $h(t) = \lambda$ is constant. This is the exponential distribution (Weibull with $\beta = 1$).
IFR and DFR classifications have theoretical consequences for bounding system reliability and choosing maintenance policies.
Estimating lifetime distributions
Parametric methods
Parametric estimation assumes the failure times follow a known distributional form (exponential, Weibull, gamma, etc.) and estimates the parameters from observed data.
Common techniques:
- Maximum Likelihood Estimation (MLE): Finds parameter values that maximize the likelihood of the observed data. Generally the most efficient method when the model is correct.
- Method of Moments (MOM): Equates sample moments to theoretical moments and solves for parameters. Simpler but often less efficient than MLE.
- Bayesian estimation: Incorporates prior information about parameters and produces a posterior distribution.
Parametric methods are statistically efficient when the assumed distribution fits the data well. The risk is model misspecification: if you assume Weibull but the true distribution is lognormal, your estimates and predictions can be biased.
Nonparametric methods
Nonparametric methods estimate the survival or hazard function directly from data without assuming a distributional form. Two key estimators:
- Kaplan-Meier estimator: Estimates the survival function as a step function that drops at each observed failure time. Handles censored data (components that haven't failed yet by the end of the study).
- Nelson-Aalen estimator: Estimates the cumulative hazard function $H(t) = \int_0^t h(u)\,du$.
These methods are more robust to distributional assumptions but typically require larger sample sizes to achieve the same precision as parametric approaches.
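The Kaplan-Meier estimator can be sketched in a few lines. This is a minimal implementation for illustration (it uses the usual convention that observations censored at a failure time are still at risk at that time); the data below are hypothetical:

```python
def kaplan_meier(times, events):
    """Kaplan-Meier step estimate of the survival function S(t).

    times:  observed times (failure or right-censoring)
    events: 1 if a failure was observed at that time, 0 if censored
    Returns [(t, S(t))] with one step per distinct failure time.
    """
    n = len(times)
    pairs = sorted(zip(times, events))
    removed, s, steps = 0, 1.0, []
    for t in sorted(set(times)):
        at_risk = n - removed                                 # units still under observation
        d = sum(1 for tt, e in pairs if tt == t and e == 1)   # failures at t
        m = sum(1 for tt, _ in pairs if tt == t)              # everyone leaving the risk set at t
        if d > 0:
            s *= 1 - d / at_risk                              # survive this failure time
            steps.append((t, s))
        removed += m
    return steps

# Hypothetical sample: failures at 2, 3, 5; censored units at 3 and 8
print(kaplan_meier([2, 3, 3, 5, 8], [1, 1, 0, 1, 0]))
# steps at t = 2, 3, 5 with S(t) approximately 0.8, 0.6, 0.3
```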
System reliability

Series vs parallel systems
Series systems require every component to function. If any single component fails, the whole system fails. For independent components with reliabilities $R_1, \dots, R_n$, the system reliability is:
$R_s = \prod_{i=1}^{n} R_i$
Because you're multiplying probabilities less than 1, adding more components in series always decreases system reliability. A chain is only as strong as its weakest link.
Parallel systems require at least one component to function. The system fails only when all components fail:
$R_s = 1 - \prod_{i=1}^{n} (1 - R_i)$
Adding components in parallel increases system reliability. This is the basic principle behind redundancy.
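The two formulas are one-liners; a small sketch (component reliabilities are illustrative) makes the contrast concrete:

```python
from math import prod

def series_reliability(rs):
    """All components must work: R = product of the R_i."""
    return prod(rs)

def parallel_reliability(rs):
    """At least one must work: R = 1 - product of the (1 - R_i)."""
    return 1 - prod(1 - r for r in rs)

comps = [0.95, 0.95, 0.95]  # assumed component reliabilities
print(series_reliability(comps))    # 0.857375: three 0.95 parts in series lose reliability
print(parallel_reliability(comps))  # 0.999875: the same parts in parallel gain it
```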
k-out-of-n systems
A k-out-of-n system functions if and only if at least $k$ of its $n$ components work. This generalizes both series and parallel systems:
- $k = n$: series system (all must work)
- $k = 1$: parallel system (at least one must work)
When components are independent and identically distributed with reliability $p$, the system reliability is:
$R_s = \sum_{j=k}^{n} \binom{n}{j} p^j (1-p)^{n-j}$
A practical example: a flight control system that uses triple modular redundancy with majority voting is a 2-out-of-3 system.
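The binomial sum translates directly to code; a sketch with an assumed component reliability of 0.9:

```python
from math import comb

def k_out_of_n_reliability(k, n, p):
    """At least k of n i.i.d. components (each with reliability p) must work."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))

p = 0.9
print(k_out_of_n_reliability(2, 3, p))  # 2-out-of-3 voting: 0.972
# Sanity checks: k = n recovers the series formula, k = 1 the parallel one
print(k_out_of_n_reliability(3, 3, p))  # p**3 = 0.729
print(k_out_of_n_reliability(1, 3, p))  # 1 - (1-p)**3 = 0.999
```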
Redundancy in system design
Redundancy improves reliability by including extra components beyond the minimum needed. Three main types:
- Active redundancy: All redundant components operate simultaneously. If one fails, the others continue without interruption. Simple but means all components experience wear from the start.
- Standby redundancy: Backup components remain idle until a primary component fails, then switch in. This can extend component life (the standby unit doesn't degrade while idle), but introduces the risk of switching failures.
- Voting redundancy: Multiple components operate in parallel, and the system output is determined by majority vote. Protects against both failures and erroneous outputs.
Redundancy allocation is the optimization problem of deciding where to place redundant components to maximize system reliability under constraints like cost, weight, or volume.
Reliability of maintained systems
Preventive maintenance
Preventive maintenance (PM) involves scheduled actions to prevent or delay failures before they occur. Two main approaches:
- Time-based PM: Maintenance at fixed intervals (e.g., replace a filter every 1,000 hours) regardless of component condition.
- Condition-based PM: Maintenance triggered by monitored indicators (e.g., vibration levels, temperature readings) that signal degradation.
Condition-based PM is generally more efficient because it avoids unnecessary maintenance on healthy components, but it requires monitoring infrastructure and reliable diagnostic thresholds.
Corrective maintenance
Corrective maintenance (CM) is reactive: you repair or replace a component after it fails. The goal is to restore the system to operation as quickly as possible.
CM effectiveness depends on:
- Speed of failure detection
- Availability of spare parts and repair equipment
- Skill level of maintenance personnel
- Whether the repair restores the component to "as good as new" or just "as bad as old"
In reliability modeling, the distinction between perfect repair (component is renewed) and minimal repair (component returns to the state it was in just before failure) significantly affects the mathematical treatment.
Optimal maintenance policies
The goal is to balance PM costs against CM costs and downtime penalties. Common policies include:
- Age-based replacement: Replace a component when it reaches a specified age $T$, or upon failure, whichever comes first. The optimal $T^*$ minimizes expected cost per unit time.
- Block replacement: Replace all components of a given type at fixed calendar intervals, regardless of individual ages. Simpler to administer but can waste remaining useful life.
- Inspection-based maintenance: Perform periodic inspections to detect degradation and schedule repairs before failure.
Finding the optimal policy requires knowledge of the failure time distribution, the costs of PM and CM, and the cost of system downtime. For IFR distributions, age-based replacement policies are particularly effective because the increasing hazard rate means older components are increasingly risky to keep running.
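For age-based replacement, a common renewal-reward formulation gives the long-run expected cost per unit time as $C(T) = \frac{c_p S(T) + c_f F(T)}{\int_0^T S(t)\,dt}$, where $c_p$ and $c_f$ are the preventive and failure replacement costs. A sketch that minimizes $C(T)$ numerically for an IFR Weibull lifetime (all cost and distribution parameters are assumed for illustration):

```python
import math

def weibull_sf(t, beta, eta):
    """Weibull survival function S(t) = exp(-(t/eta)**beta)."""
    return math.exp(-((t / eta) ** beta))

def cost_rate(T, c_p, c_f, beta, eta, steps=2000):
    """C(T) = (c_p * S(T) + c_f * F(T)) / integral_0^T S(t) dt."""
    sf_T = weibull_sf(T, beta, eta)
    h = T / steps
    # trapezoidal integral of S(t) over [0, T] = expected cycle length
    integral = sum(0.5 * h * (weibull_sf(i * h, beta, eta) + weibull_sf((i + 1) * h, beta, eta))
                   for i in range(steps))
    return (c_p * sf_T + c_f * (1 - sf_T)) / integral

# Assumed numbers: preventive replacement costs 1, failure replacement costs 10,
# Weibull(beta=2, eta=1000) lifetimes (IFR, so a finite optimum exists).
best_T = min(range(100, 2001, 25), key=lambda T: cost_rate(T, 1.0, 10.0, 2.0, 1000.0))
print(best_T, cost_rate(best_T, 1.0, 10.0, 2.0, 1000.0))
```

The grid search finds an interior optimum: replacing too early wastes preventive cost, replacing too late pays the tenfold failure cost too often.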
Accelerated life testing
Acceleration factors
Accelerated life testing (ALT) subjects products to stress levels higher than normal operating conditions (elevated temperature, humidity, voltage, vibration, etc.) to induce failures faster. The key assumption is that the failure mechanism remains the same under accelerated and normal conditions; only the time scale changes.
Acceleration factors quantify how much faster failures occur under stress. Common life-stress models:
- Arrhenius model: For temperature-driven failures. The acceleration factor depends on $\exp(E_a / (k_B T))$, where $E_a$ is the activation energy, $k_B$ is Boltzmann's constant, and $T$ is absolute temperature.
- Inverse power law model: For non-thermal stresses (voltage, mechanical load). Relates life $L$ to stress $S$ as $L = C / S^n$.
- Eyring model: Generalizes the Arrhenius model to handle multiple stress types simultaneously.
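The Arrhenius acceleration factor between a use temperature and a stress temperature is $AF = \exp\left[\frac{E_a}{k_B}\left(\frac{1}{T_{\text{use}}} - \frac{1}{T_{\text{stress}}}\right)\right]$. A sketch with assumed values (0.7 eV activation energy, 25 °C use, 85 °C stress):

```python
import math

K_B = 8.617e-5  # Boltzmann's constant in eV/K

def arrhenius_af(e_a, t_use_c, t_stress_c):
    """Arrhenius acceleration factor; temperatures given in Celsius,
    converted to kelvin: AF = exp[(E_a / k_B) * (1/T_use - 1/T_stress)]."""
    t_use = t_use_c + 273.15
    t_stress = t_stress_c + 273.15
    return math.exp((e_a / K_B) * (1 / t_use - 1 / t_stress))

af = arrhenius_af(0.7, 25.0, 85.0)
print(af)  # roughly 96: one test hour at 85 C covers about 96 field hours at 25 C
```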
Extrapolation to use conditions
The whole point of ALT is to predict reliability under normal conditions from accelerated test data. The process:
- Run tests at multiple elevated stress levels.
- Fit a life-stress relationship (Arrhenius, inverse power law, etc.) to the failure data at each stress level.
- Extrapolate the fitted model down to the normal use stress level to estimate the failure time distribution under actual operating conditions.
Key challenges include choosing the right life-stress model, dealing with multiple competing failure modes (which may have different acceleration behaviors), and ensuring the extrapolation doesn't extend too far beyond the tested stress range, which would make predictions unreliable.
Software reliability
Software reliability growth models
Software reliability growth models (SRGMs) track how reliability improves as faults are found and fixed during testing. Unlike hardware, software doesn't wear out; failures come from latent faults in the code.
SRGMs fall into two categories:
- Concave models (e.g., Goel-Okumoto, Musa-Okumoto logarithmic): Assume the fault detection rate decreases over time as fewer faults remain. The cumulative number of detected faults follows a concave curve.
- S-shaped models (e.g., Yamada delayed S-shaped, Gompertz): Assume testers need a learning period before fault detection accelerates, producing an initial slow phase followed by rapid detection, then tapering off.
SRGMs are used to predict the number of remaining faults, estimate time to next failure, and decide when software is reliable enough to release.
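As a concrete concave example, the Goel-Okumoto model has mean value function $m(t) = a(1 - e^{-bt})$, where $a$ is the total expected fault count and $b$ the per-fault detection rate. The parameter values below are assumed for illustration, not fitted:

```python
import math

def goel_okumoto_mean(t, a, b):
    """Goel-Okumoto mean value function: expected cumulative faults found by time t.
    a = total expected faults, b = per-fault detection rate."""
    return a * (1 - math.exp(-b * t))

a, b = 120.0, 0.05  # assumed: 120 latent faults, detection rate 0.05 per week
found_by_20 = goel_okumoto_mean(20, a, b)
remaining = a - found_by_20  # predicted faults still in the code after 20 weeks
print(found_by_20, remaining)
```

The curve is concave: each additional week of testing uncovers fewer new faults, and $m(t)$ approaches $a$ but never exceeds it.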
Debugging vs testing
These are complementary but distinct activities:
- Testing executes the software to find failures and assess reliability. It includes unit testing, integration testing, system testing, and fault injection.
- Debugging is the process of locating and fixing the faults that caused observed failures.
Testing reveals that a problem exists; debugging resolves it. Both feed into SRGMs: testing generates the failure data, and debugging (ideally) reduces the remaining fault count. Effective strategies like code reviews, automated test suites, and regression testing are essential for driving reliability growth.
Reliability optimization
Reliability allocation
Reliability allocation assigns reliability targets to individual components so the overall system meets its reliability requirement. If you need system reliability of 0.99, how reliable does each subsystem need to be?
Common methods:
- Equal apportionment: Assign the same reliability target to every component. Simple but ignores differences in component complexity or criticality.
- AGREE method: Allocates reliability based on component complexity (number of parts) and criticality (importance to system function). More realistic.
- Feasibility of objectives: Allocates based on what's technically and economically achievable for each component.
Reliability allocation is an early design activity that guides component selection and identifies where engineering effort should be concentrated.
Redundancy allocation
Redundancy allocation determines the optimal number of redundant components at each position in the system to maximize reliability under resource constraints (budget, weight, volume).
This is formulated as a constrained optimization problem:
- Objective: Maximize system reliability
- Decision variables: Number of redundant units at each subsystem position
- Constraints: Total cost, weight, or volume limits
Solution methods range from exact approaches (integer programming, dynamic programming) to heuristic/metaheuristic methods (genetic algorithms, simulated annealing) for larger problems where exact solutions are computationally infeasible.
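For a small series system of parallel subsystems, exhaustive enumeration is feasible and shows the structure of the problem. This sketch assumes hypothetical unit reliabilities, unit costs, and a budget; real problems need the exact or metaheuristic methods above:

```python
from itertools import product
from math import prod

def system_reliability(alloc, comp_rel):
    """Series system of subsystems, each with n_i identical units in parallel."""
    return prod(1 - (1 - r) ** n for r, n in zip(comp_rel, alloc))

def best_allocation(comp_rel, unit_cost, budget, max_units=4):
    """Exhaustively search unit counts (small problems only) for the most
    reliable allocation whose total cost stays within the budget."""
    best = None
    for alloc in product(range(1, max_units + 1), repeat=len(comp_rel)):
        cost = sum(n * c for n, c in zip(alloc, unit_cost))
        if cost <= budget:
            r = system_reliability(alloc, comp_rel)
            if best is None or r > best[1]:
                best = (alloc, r)
    return best

# Hypothetical data: three subsystem positions with different unit reliabilities/costs
alloc, r = best_allocation(comp_rel=[0.90, 0.95, 0.80], unit_cost=[2, 3, 1], budget=12)
print(alloc, r)
```

The search spends the budget where redundancy buys the most reliability per unit cost, rather than spreading units evenly.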
Reliability-redundancy allocation
This extends redundancy allocation by simultaneously optimizing both the reliability level of each component (which you can improve by using higher-quality parts, at higher cost) and the number of redundant components at each position.
The decision variables now include:
- Component reliability levels (continuous)
- Number of redundant units (integer)
This creates a mixed-integer nonlinear optimization problem. The trade-off is clear: you can achieve a target system reliability by using fewer but more reliable (and expensive) components, or by using more but cheaper components with redundancy. Finding the cost-optimal combination requires balancing these two levers against system-level constraints.