Dynamic programming fundamentals
Dynamic programming solves complex optimization problems by breaking them into simpler subproblems and storing those solutions so you never solve the same subproblem twice. In optimal control, this means finding the best sequence of decisions (control inputs) that minimizes or maximizes an objective function over time.
Two properties make a problem suitable for dynamic programming:
- Optimal substructure: the optimal solution to the whole problem can be built from optimal solutions to its subproblems.
- Overlapping subproblems: the same subproblems appear repeatedly during computation, so storing their solutions pays off.
If a problem lacks either property, dynamic programming probably isn't the right tool.
Bellman's principle of optimality
Bellman's principle states: whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from that first decision.
In practical terms, this means you can't improve an optimal trajectory by changing only its tail. If you're at some intermediate state along an optimal path, the remaining path from that state must itself be optimal. This insight lets you work backward from the final stage, solving smaller "tail" problems first and building up to the full solution.
Optimal substructure property
A problem has optimal substructure when you can divide it into smaller subproblems, solve each one independently, and combine those solutions to get the global optimum. Classic examples include shortest path problems (the shortest path from A to C through B contains the shortest path from B to C) and matrix chain multiplication.
In control, this shows up naturally: the optimal cost-to-go from any intermediate state doesn't depend on how you arrived at that state.
Overlapping subproblems
Subproblems overlap when the same computation appears multiple times during a naive recursive solution. Without dynamic programming, you'd solve these repeated subproblems from scratch each time.
Consider computing the Fibonacci sequence recursively: F(n) = F(n-1) + F(n-2), with F(0) = 0 and F(1) = 1. A naive approach recalculates the same values exponentially many times. Storing each result in a table reduces the complexity from exponential to linear. The same principle applies to control problems where many different decision sequences pass through the same intermediate states.
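As a concrete sketch, here is a memoized Fibonacci in Python; the dictionary is the table that turns the exponential recursion into a linear one:

```python
def fib(n, memo=None):
    """Fibonacci with memoization: each F(k) is computed once, O(n) total."""
    if memo is None:
        memo = {}
    if n < 2:
        return n
    if n not in memo:               # solve each subproblem only on first visit
        memo[n] = fib(n - 1, memo) + fib(n - 2, memo)
    return memo[n]

print(fib(50))  # 12586269025 -- the naive recursion would need ~2^50 calls
```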
Dynamic programming vs other optimization methods
Dynamic programming is one of several optimization strategies. Its strength lies in problems with both optimal substructure and overlapping subproblems, but it's not always the best choice.
Comparison to greedy algorithms
Greedy algorithms pick the locally best option at each step, hoping that local choices lead to a global optimum. Sometimes they do (Dijkstra's algorithm on non-negative graphs), but often they don't. A greedy controller might pick the cheapest action right now while steering into an expensive region later.
Dynamic programming avoids this trap by evaluating the full downstream consequences of each decision before committing.
Comparison to divide-and-conquer approach
Divide-and-conquer also breaks problems into subproblems and solves them recursively. The key difference is that divide-and-conquer assumes the subproblems are independent and doesn't store their solutions. When subproblems overlap, divide-and-conquer wastes time re-solving them. Dynamic programming adds a storage step (memoization or a table) to eliminate that redundancy.
Elements of dynamic programming
Every dynamic programming formulation in control has the same structural ingredients. Getting these right is the first step toward writing the recursive equations.
Stages and states
Stages represent the sequential decision points in the problem, often corresponding to discrete time steps k = 0, 1, ..., N. At each stage, you make a decision.
States capture all the information you need at a given stage to make optimal future decisions. For a mechanical system, the state might be position and velocity. For a resource allocation problem, it might be the remaining budget. The state x_{k+1} at stage k+1 depends on the state x_k at stage k and the decision made there.
Decisions and policies
A decision (or control input) u_k is the choice you make at stage k. A policy is a rule that maps every possible state to a decision at each stage: u_k = μ_k(x_k).
The goal is to find the optimal policy, the one that minimizes (or maximizes) the total objective function across all stages.
Recursive formulation
The core of dynamic programming is expressing the optimal cost-to-go at stage k in terms of the optimal cost-to-go at stage k+1. For a minimization problem with state x_k, decision u_k, stage cost g_k(x_k, u_k), and state transition x_{k+1} = f_k(x_k, u_k):

V_k(x_k) = min_{u_k} [ g_k(x_k, u_k) + V_{k+1}(f_k(x_k, u_k)) ]

This is Bellman's equation. You start from the terminal condition V_N(x_N) = g_N(x_N) (often a terminal cost) and work backward to k = 0.
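A minimal backward recursion illustrating this, on a made-up one-dimensional grid problem (five states, three actions, quadratic stage cost; all numbers are purely illustrative):

```python
# Backward recursion for a toy finite-horizon problem.
# States 0..4, actions {-1, 0, +1} (move left / stay / move right), horizon N = 5.
# Stage cost penalizes distance from state 2 plus control effort;
# the terminal cost penalizes distance from state 2.
N, states, actions = 5, range(5), (-1, 0, 1)

def f(x, u):                       # deterministic transition, clipped to the grid
    return min(max(x + u, 0), 4)

def g(x, u):                       # stage cost
    return (x - 2) ** 2 + abs(u)

V = [{x: (x - 2) ** 2 for x in states}]   # V_N: terminal cost
policy = []
for k in range(N - 1, -1, -1):            # work backward from stage N-1 to 0
    Vk, pk = {}, {}
    for x in states:
        # Bellman's equation: V_k(x) = min_u [ g(x,u) + V_{k+1}(f(x,u)) ]
        u_best = min(actions, key=lambda u: g(x, u) + V[0][f(x, u)])
        Vk[x] = g(x, u_best) + V[0][f(x, u_best)]
        pk[x] = u_best
    V.insert(0, Vk)
    policy.insert(0, pk)

print(V[0][0])       # optimal cost-to-go from state 0 at stage 0
print(policy[0][0])  # optimal first action from state 0
```

Note that the backward pass yields the whole optimal policy, not just one trajectory: `policy[k][x]` gives the best action from any state at any stage.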

Optimal value function
The optimal value function V*(x) gives the best achievable cost (or reward) starting from state x under an optimal policy. Once you've computed V*(x) for all states, extracting the optimal policy is straightforward: at each state, pick the action that achieves the minimum (or maximum) in Bellman's equation.
Solving dynamic programming problems
The two main implementation strategies differ in direction: start from the top and work down, or start from the bottom and build up.
Top-down vs bottom-up approaches
- Top-down (memoization): Start with the original problem. Recursively break it into subproblems. Before solving any subproblem, check whether its solution is already cached. If so, return the cached result; if not, solve it, cache it, and return.
- Bottom-up (tabulation): Identify the smallest subproblems (e.g., the final stage). Solve them first and store the results in a table. Iteratively solve larger subproblems using the stored results until you reach the original problem.
Bottom-up typically runs faster because it avoids recursive function call overhead. Top-down can be more memory-efficient when only a subset of all possible subproblems actually gets visited.
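A small bottom-up sketch on an illustrative coin-change problem (fewest coins summing to a target amount), filling the table from the smallest subproblem upward:

```python
def min_coins(amount, coins=(1, 3, 4)):
    """Bottom-up tabulation: table[a] = fewest coins summing to a."""
    INF = float("inf")
    table = [0] + [INF] * amount          # base case: amount 0 needs 0 coins
    for a in range(1, amount + 1):        # solve subproblems in increasing order
        for c in coins:
            if c <= a and table[a - c] + 1 < table[a]:
                table[a] = table[a - c] + 1
    return table[amount]

print(min_coins(6))  # 2  (3 + 3)
```

No recursion is involved: by the time `table[a]` is computed, every smaller amount it depends on is already in the table.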
Memoization techniques
Memoization is the caching strategy used in top-down dynamic programming:
- Set up a lookup table (hash map, array, etc.) indexed by the subproblem's state.
- When a recursive call is made, check the table first.
- If the result exists, return it immediately.
- If not, compute the result recursively, store it in the table, then return it.
This converts an exponential-time recursive algorithm into one that solves each unique subproblem only once.
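In Python, the standard library's functools.lru_cache implements exactly these steps, so the lookup table and the cache check come for free:

```python
from functools import lru_cache

@lru_cache(maxsize=None)   # the cache is the lookup table; keys are the arguments
def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(80))             # 23416728348467685
print(fib.cache_info())    # shows how many lookups hit the cache
```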
Time and space complexity
- Time complexity depends on the number of distinct subproblems multiplied by the work per subproblem. Dynamic programming often reduces exponential complexity to polynomial. For example, a problem with N stages, n states per stage, and m decisions per state might go from O(m^N) for naive enumeration of decision sequences to O(N · n · m).
- Space complexity depends on how many subproblem solutions you need to store. Sometimes you can reduce space by discarding solutions from stages you no longer need (e.g., if V_k depends only on V_{k+1}, you only need to keep two consecutive stages in memory at a time).
There's often a time-space tradeoff, and the right balance depends on your problem's constraints.
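As a sketch of that space reduction, here is a rolling-storage backward recursion on an illustrative one-dimensional grid problem: only the stage being computed and the one after it exist at any moment, instead of the full N-stage table.

```python
# Rolling storage: since V_k depends only on V_{k+1}, keep a single stage dict
# and overwrite it at each backward step. Dynamics and costs are illustrative.
states, actions, N = range(5), (-1, 0, 1), 5
f = lambda x, u: min(max(x + u, 0), 4)      # transition, clipped to the grid
g = lambda x, u: (x - 2) ** 2 + abs(u)      # stage cost

V = {x: (x - 2) ** 2 for x in states}       # terminal stage only
for _ in range(N):
    # build V_k from V_{k+1}, then discard V_{k+1}
    V = {x: min(g(x, u) + V[f(x, u)] for u in actions) for x in states}

print(V[0])  # stage-0 cost-to-go: O(|states|) memory instead of O(N * |states|)
```

The tradeoff: with only the value function retained, the policy must be re-extracted from V by a one-step minimization rather than read from a stored table.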
Types of dynamic programming
Deterministic dynamic programming
Here, state transitions are fully known: given state x_k and decision u_k, the next state x_{k+1} = f_k(x_k, u_k) is determined with certainty. Bellman's equation takes the standard form shown above. Examples include shortest path problems, the knapsack problem, and deterministic trajectory optimization.
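A sketch of deterministic DP on the 0/1 knapsack, using the usual textbook illustration for values, weights, and capacity:

```python
def knapsack(values, weights, capacity):
    """0/1 knapsack: best[c] = max value achievable with capacity c."""
    best = [0] * (capacity + 1)
    for v, w in zip(values, weights):
        # iterate capacity downward so each item is used at most once
        for c in range(capacity, w - 1, -1):
            best[c] = max(best[c], best[c - w] + v)
    return best[capacity]

print(knapsack(values=[60, 100, 120], weights=[10, 20, 30], capacity=50))  # 220
```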
Stochastic dynamic programming
When state transitions involve randomness, the next state depends on both the decision and a random disturbance w_k: x_{k+1} = f_k(x_k, u_k, w_k). Bellman's equation becomes an expectation:

V_k(x_k) = min_{u_k} E_{w_k}[ g_k(x_k, u_k, w_k) + V_{k+1}(f_k(x_k, u_k, w_k)) ]
Markov decision processes (MDPs) are the standard framework here. The Markov property (the future depends only on the current state, not the history) is what makes the recursive decomposition valid.
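A single stochastic Bellman backup might look like this sketch, with a two-valued disturbance; the dynamics, costs, and probabilities are all illustrative:

```python
# One stochastic backup: V_k(x) = min_u E_w[ g(x,u) + V_{k+1}(f(x,u,w)) ],
# where the disturbance w is +/-1 with equal probability.
states, actions = range(5), (-1, 0, 1)
disturbances = [(-1, 0.5), (1, 0.5)]              # (w, probability)

f = lambda x, u, w: min(max(x + u + w, 0), 4)     # noisy transition, clipped
g = lambda x, u: (x - 2) ** 2 + abs(u)            # stage cost

V_next = {x: (x - 2) ** 2 for x in states}        # cost-to-go at stage k+1

def backup(x):
    # expectation over w inside, minimization over u outside
    return min(
        sum(p * (g(x, u) + V_next[f(x, u, w)]) for w, p in disturbances)
        for u in actions
    )

V = {x: backup(x) for x in states}
print(V)
```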
Infinite-horizon dynamic programming
When there's no fixed terminal time, the problem extends indefinitely. The objective is typically a discounted sum of stage costs (or rewards) with discount factor γ ∈ [0, 1):

J = Σ_{k=0}^{∞} γ^k g(x_k, u_k)
Because there's no terminal stage to start from, you can't simply work backward. Instead, iterative algorithms are used:
- Value iteration: repeatedly apply the Bellman operator until V converges.
- Policy iteration: alternate between evaluating a fixed policy and improving it until the policy stabilizes.
Both are guaranteed to converge under standard conditions (bounded costs, γ < 1).
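A minimal value-iteration sketch on a made-up two-state MDP (transition probabilities and costs are illustrative):

```python
# Value iteration on a 2-state, 2-action MDP with discount gamma = 0.9.
# P[s][a] = list of (next_state, prob); R[s][a] = immediate cost.
gamma = 0.9
P = {0: {0: [(0, 1.0)], 1: [(1, 1.0)]},
     1: {0: [(0, 1.0)], 1: [(1, 1.0)]}}
R = {0: {0: 2.0, 1: 1.0},        # staying in state 0 costs 2, moving costs 1
     1: {0: 1.0, 1: 0.0}}        # staying in state 1 is free

V = {0: 0.0, 1: 0.0}
for _ in range(1000):
    # one application of the Bellman operator
    V_new = {s: min(R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a])
                    for a in P[s])
             for s in V}
    if max(abs(V_new[s] - V[s]) for s in V) < 1e-9:   # converged
        break
    V = V_new

print(V)  # state 1 is absorbing and free, so V[1] -> 0 and V[0] -> 1
```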
Applications of dynamic programming in control theory

Optimal control problems
Dynamic programming solves continuous-time optimal control problems through the Hamilton-Jacobi-Bellman (HJB) equation, the continuous-time counterpart of Bellman's equation. For a system with state dynamics ẋ = f(x, u) and running cost g(x, u), the HJB equation is:

-∂V/∂t = min_u [ g(x, u) + ∇V(x, t) · f(x, u) ]
Solving this yields both the optimal value function and the optimal control law. Applications include spacecraft trajectory optimization, energy management in hybrid vehicles, and industrial process control.
Adaptive control systems
Adaptive controllers adjust their parameters online as they observe the system's behavior. Dynamic programming provides a principled way to balance exploration (trying new actions to learn about the system) and exploitation (using current knowledge to act optimally). Dual control, for instance, explicitly accounts for the value of information gained through probing actions.
Reinforcement learning algorithms
Reinforcement learning (RL) applies dynamic programming principles without requiring a known system model. The agent learns the value function and policy through trial-and-error interaction with the environment.
- Q-learning: estimates the optimal action-value function Q*(s, a) directly, off-policy.
- SARSA: estimates Q(s, a) for the policy currently being followed, on-policy.
- Actor-critic methods: maintain separate estimates of the policy (actor) and value function (critic).
All of these are rooted in Bellman's equation; they just replace the model-based expectation with sampled experience.
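As a sketch, tabular Q-learning on a toy deterministic chain; the environment, step function, and hyperparameters here are all made up for illustration:

```python
import random

# 3-state chain: actions move left (0) or right (1); reaching state 2 gives
# reward 1 and ends the episode.
random.seed(0)
n_states, actions, alpha, gamma, eps = 3, (0, 1), 0.5, 0.9, 0.1
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}

def step(s, a):
    s2 = min(max(s + (1 if a == 1 else -1), 0), n_states - 1)
    return s2, (1.0 if s2 == n_states - 1 else 0.0), s2 == n_states - 1

for _ in range(500):                       # episodes
    s = 0
    while True:
        a = (random.choice(actions) if random.random() < eps
             else max(actions, key=lambda a: Q[(s, a)]))     # epsilon-greedy
        s2, r, done = step(s, a)
        target = r if done else r + gamma * max(Q[(s2, b)] for b in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])  # sampled Bellman (TD) update
        if done:
            break
        s = s2

print(max(actions, key=lambda a: Q[(0, a)]))  # learned greedy action at state 0
```

The update rule is Bellman's equation with the model-based expectation replaced by a single sampled transition, which is exactly the substitution described above.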
Limitations and challenges
Curse of dimensionality
This is the central challenge. As the number of state variables grows, the number of entries in the value function table grows exponentially. A system with n state variables, each discretized into d levels, requires d^n table entries. With, say, 10 state variables at 100 levels each, that's 100^10 = 10^20 entries, which is clearly infeasible.
This is why exact dynamic programming is practical only for low-dimensional problems. Higher-dimensional problems require approximation methods.
Numerical stability issues
Backward recursion can accumulate numerical errors, especially when:
- The value function spans many orders of magnitude.
- Stage costs involve small differences between large numbers.
- Discount factors are close to 1 in infinite-horizon problems.
Mitigation strategies include logarithmic scaling of the value function, relative value iteration (subtracting a reference value at each step), and careful choice of numerical precision.
Approximation methods
When exact solutions are intractable, approximation methods trade some accuracy for computational feasibility:
- Value function approximation: represent using a parameterized function (neural network, polynomial basis) instead of a table.
- Policy approximation: search directly over a parameterized class of policies.
- Model-free methods: learn from sampled trajectories without building an explicit model.
Designing the approximation architecture and tuning the learning algorithm requires care. Poor approximations can lead to instability or convergence to suboptimal policies.
Advanced topics in dynamic programming
Differential dynamic programming
Differential dynamic programming (DDP) is an iterative algorithm for continuous-state optimal control. Rather than discretizing the entire state space, DDP works along a nominal trajectory and approximates the value function locally using a second-order (quadratic) expansion.
Each iteration:
- Perform a backward pass along the current trajectory, computing local quadratic approximations of the value function.
- Perform a forward pass, applying the updated policy to generate an improved trajectory.
- Repeat until convergence.
DDP scales well to high-dimensional systems and has been widely used in robotics (manipulation, locomotion) and aerospace trajectory planning.
Approximate dynamic programming
Approximate dynamic programming (ADP) is a broad family of methods for tackling the curse of dimensionality. It includes:
- Function approximation: fitting or with neural networks, radial basis functions, or linear architectures.
- Sample-based learning: Q-learning, SARSA, and temporal-difference methods that update value estimates from observed transitions.
- Policy search: policy gradient methods and actor-critic algorithms that optimize the policy parameters directly.
ADP bridges classical dynamic programming and modern reinforcement learning, enabling application to large-scale and partially observable systems.
Robust dynamic programming
Standard dynamic programming assumes you know the transition model exactly. Robust dynamic programming relaxes this by optimizing against the worst case within an uncertainty set.
- Minimax formulation: the controller minimizes cost while "nature" (the uncertainty) maximizes it, leading to V(x) = min_u max_w [ g(x, u, w) + V(f(x, u, w)) ].
- Robust MDPs: define uncertainty sets around transition probabilities and optimize for the worst-case distribution.
- Distributionally robust optimization: considers ambiguity in the probability distribution itself.
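A single minimax backup might look like this sketch, with nature choosing the worst disturbance from a small illustrative set (dynamics, costs, and the uncertainty set are all made up):

```python
# One robust (minimax) Bellman backup: the controller picks u to minimize the
# worst-case cost over disturbances w in an uncertainty set.
states, actions, disturbances = range(5), (-1, 0, 1), (-1, 0, 1)

f = lambda x, u, w: min(max(x + u + w, 0), 4)   # disturbed transition, clipped
g = lambda x, u: (x - 2) ** 2 + abs(u)          # stage cost

V_next = {x: (x - 2) ** 2 for x in states}      # cost-to-go at the next stage

def robust_backup(x):
    # min over u of max over w: nature plays the worst disturbance
    return min(max(g(x, u) + V_next[f(x, u, w)] for w in disturbances)
               for u in actions)

V = {x: robust_backup(x) for x in states}
print(V[2])  # worst-case-aware cost-to-go at the center state
```

Compared with the expectation in the stochastic formulation, the inner `max` makes the resulting policy conservative: it guards against the single worst disturbance rather than the average one.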
These methods are valuable in finance, supply chain management, and any control application where model mismatch or worst-case performance matters.