Introduction to Stochastic Dynamic Programming

Stochastic programming is a framework for modeling and solving problems of decision making under uncertainty. Stochastic dynamic programming, originally introduced by Richard Bellman in his seminal book Dynamic Programming, is a branch of Stochastic programming that deals with multistage decision processes and takes a "functional equation" approach to the discovery of optimum policies.

For a comprehensive discussion on Stochastic dynamic programming please refer to Bellman's Dynamic Programming book.

General structure of a stochastic dynamic program

To model problems via stochastic dynamic programming one has to specify

A planning horizon comprising T periods
The finite set S[t] of possible states in which the system may be found in period t=1,...,T
The finite set A[s] of possible actions that may be taken in state s belonging to S[t]
The transition probability p[i,j,a] from state i to state j when action a is taken in state i
The immediate cost (resp. profit) incurred if action a belonging to A[s] is taken in state s
The functional equation f[t,s] denoting the minimum expected total cost (resp. maximum expected total profit) incurred over periods t,t+1,...,T when the system is in state s belonging to S[t] at the beginning of period t.

Let the initial state of the system at the beginning of period 1 be i, the goal in stochastic dynamic programming is to determine f[1,i].

A motivating example: stochastic inventory control

Consider a 3-period inventory control problem. At the beginning of each period the firm should decide how many units of a product should be produced. If production takes place for x units, where x > 0, we incur a production cost c(x). This cost comprises both a fix and a variable component: c(x) = 0, if x = 0; c(x) = 3+2x, otherwise.

Production in each period cannot exceed 4 units. Demand in each period takes two possible values: 1 or 2 units with equal probability (0.5). Demand is observed in each period only after production has occurred. After meeting current period's demand holding cost h=$1 per unit is incurred for any item that is carried over from one period to the next. Because of limited capacity the inventory at the end of each period cannot exceed 3 units. All demand should be met on time (no backorders). If at the end of the planning horizon (i.e. period 3) the firm still has units in stock, these can be salvaged at b=$2 per unit. The initial inventory is 1 unit.

Our modelling strategy neatly follows the steps presented for the previous example. Once more, we introduce a class to capture system states, as well as functional interfaces to capture actions, state transitions, and costs associated with a given state.

Stochastic inventory control as a stochastic dynamic program

We formulate the stochastic inventory control problem as a stochastic dynamic program.

There are 4 periods in the planning horizon
The state s in period t represents the initial inventory level at the beginning of period t, which takes values in 1,...,3
The action given state s in period t is the order quantity Q, which takes values in 0,...,4-s
The transition probability p[i,j,a] from state i to state j when action a is taken in state i is derived from the probability distribution of the demand d in period t
The immediate cost c(t,s,Q) incurred if action Q is taken in state s at the beginning of period t=1,..,3 is 3+2 Q +E[h max(s+Q-d,0)] if Q > 0, E[h max(s+Q-d,0)] otherwise; finally, in period 4 c(4,s,Q)=E[b max(s+Q-d,0)].

The functional equation is

f(t,s)=min c(t,s,Q) + E[f(t+1,s+Q-d)]

with boundary condition f(T,s)=min c(4,s,Q); the aim is to determine f(1,1).

Given the functional equation, an optimal betting policy can be obtained via forward recursion or backward recursion.

A motivating example: Gambler's ruin

This problem is adapted from W. L. Winston, Operations Research: Applications and Algorithms (7th Edition), Duxbury Press, 2003, chap. 19, example 3.

A gambler has $2, she is allowed to play a game of chance 4 times and her goal is to maximize her probability of ending up with a least $6.

If the gambler bets $b on a play of the game, then with probability 0.4 she wins the game, recoup the initial bet, and she increases her capital position by $b; with probability 0.6, she loses the bet amount $b.

On any play of the game, the gambler may not bet more money than she has available.

Determine a betting strategy that will maximize the gambler's probability of attaining a wealth of at least $6 by the end of the betting horizon.

Gambler's ruin as a stochastic dynamic program

We formulate Gambler's ruin as a stochastic dynamic program.

There are T=4 games in the planning horizon
The state s in period t represents the initial wealth at the beginning of period t
The action given state s in period t is the bet amount b
The transition probability p[i,j,a] from state i to state j when action a is taken in state i is easily derived from the probability of winning a game (0.4)
Let f[t,s] be the probability that, by the end of game 4, the gambler has at least $6, given that she has $s at the beginning of game t. The immediate profit incurred if action a is taken in state s is given by the expected value 0.4 f[t+1,s+b]+0.6 f[t+1,s-b]

To derive the functional equation define b[t,s] a bet that attains f[t,s]. At the beginning of game 4

if s<3 it is impossible to attain the goal, i.e. f[4,s]=0 for s<3
if s>5 the goal is attained, i.e f[4,s]=1 for s>5
if 2<s<6 the gambler should bet enough to attain the goal, i.e. f[4,s]=0.4 for 2<s<6.