#DynamicProgramming
[[Reinforcement Learning]] [[Markov Decision Process]]
Why is called Dynamic Programming?
dynamic - sequential or stepwise Programming in this case refers to a policy, its a tems from maths - so its an optimisation method for solving complex sequential problems.
It solves problems by breaking them down into subproblems
- Solve the subproblems
- Combine solutions to subproblems
Example: find the shortest path, can be broken into two pieces, find start to mid, find mid to end and combine.
Subproblems recur many times, so solutions can be cached and reused
DP uses an explicit model. DP requires that you know
There are two main methods of Dynamic Programming, [[Policy iteration]] and [[Value Iteration]]
In Policy Iteration - You randomly select a policy and find [[Value Function]] corresponding to it , then find a new (improved) policy based on the previous value function, and so on this will lead to optimal policy .
In Value Iteration - You randomly select a [[Value Function]] , then find a new (improved) value function in an iterative process, until reaching the optimal value function , then derive optimal policy from that optimal value function .
Both Policy Iteration and Value Iteration are control problems, and both tell you have to get the maximium reward from the mdp.
We can view the algorithms side by side:
![[Policy-vsValue-iter.png]]