
Markov decision process

A Markov decision process (MDP) is a discrete-time stochastic control process. It provides a mathematical framework for modeling decision making in situations where outcomes are partly random and partly under the control of a decision maker. MDPs are useful for studying optimization problems solved via dynamic programming and reinforcement learning. MDPs were known at least as early as the 1950s; a core body of research on Markov decision processes resulted from Ronald Howard's 1960 book, Dynamic Programming and Markov Processes. They are used in many disciplines, including robotics, automatic control, economics and manufacturing. The name of MDPs comes from the Russian mathematician Andrey Markov.

At each time step, the process is in some state s, and the decision maker may choose any action a that is available in state s. The process responds at the next time step by randomly moving into a new state s′ and giving the decision maker a corresponding reward R_a(s, s′). The probability that the process moves into its new state s′ is influenced by the chosen action: specifically, it is given by the state transition function P_a(s, s′). Thus, the next state s′ depends on the current state s and the decision maker's action a. But given s and a, it is conditionally independent of all previous states and actions; in other words, the state transitions of an MDP satisfy the Markov property.

Markov decision processes are an extension of Markov chains; the difference is the addition of actions (allowing choice) and rewards (giving motivation). Conversely, if only one action exists for each state (e.g. "wait") and all rewards are the same (e.g. "zero"), a Markov decision process reduces to a Markov chain.

A Markov decision process is a 4-tuple (S, A, P_a, R_a), where:

- S is a set of states called the state space;
- A is a set of actions called the action space (alternatively, A_s is the set of actions available from state s);
- P_a(s, s′) = Pr(s_{t+1} = s′ | s_t = s, a_t = a) is the probability that action a in state s at time t leads to state s′ at time t+1;
- R_a(s, s′) is the immediate reward (or expected immediate reward) received after transitioning from state s to state s′ due to action a.

(Note: the theory of Markov decision processes does not require S or A to be finite, but the basic algorithms below assume that they are finite.)

The core problem of MDPs is to find a "policy" for the decision maker: a function π that specifies the action π(s) that the decision maker will choose when in state s.
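As an illustration of these definitions, here is a minimal Python sketch of how a small finite MDP and a deterministic policy might be written down. The two-state "good/worn" machine example and all names (S, A, P, R, policy) are hypothetical, chosen only to make the 4-tuple concrete; they are not taken from the article.

    # Hypothetical two-state MDP encoded as the 4-tuple (S, A, P_a, R_a).
    S = ["good", "worn"]                      # state space
    A = ["wait", "repair"]                    # action space

    # P[a][s][s2] = probability of moving from state s to s2 when taking action a
    P = {
        "wait":   {"good": {"good": 0.9, "worn": 0.1},
                   "worn": {"good": 0.0, "worn": 1.0}},
        "repair": {"good": {"good": 1.0, "worn": 0.0},
                   "worn": {"good": 0.8, "worn": 0.2}},
    }

    # R[a][s][s2] = immediate reward for the transition s -> s2 under action a
    R = {
        "wait":   {"good": {"good": 5.0, "worn": 5.0},
                   "worn": {"good": 1.0, "worn": 1.0}},
        "repair": {"good": {"good": 3.0, "worn": 3.0},
                   "worn": {"good": -2.0, "worn": -2.0}},
    }

    # A deterministic policy simply maps each state to an action.
    policy = {"good": "wait", "worn": "repair"}

Because the policy picks exactly one action per state, combining it with P yields an ordinary Markov transition matrix over S, which is the observation developed next.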
Once a Markov decision process is combined with a policy in this way, the action for each state is fixed, and the resulting combination behaves like a Markov chain (since the action chosen in state s is completely determined by π(s), and Pr(s_{t+1} = s′ | s_t = s, a_t = a) reduces to Pr(s_{t+1} = s′ | s_t = s), a Markov transition matrix).

The goal is to choose a policy π that will maximize some cumulative function of the random rewards, typically the expected discounted sum over a potentially infinite horizon:

    E[ Σ_{t=0}^{∞} γ^t R_{a_t}(s_t, s_{t+1}) ],   where a_t = π(s_t),

where γ is the discount factor and satisfies 0 ≤ γ ≤ 1. (For example, γ = 1/(1 + r) when the discount rate is r.) γ is typically close to 1.
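To make this objective concrete, the following is a minimal sketch of iterative policy evaluation: with the policy fixed, the expected discounted return V(s) from each state satisfies a linear fixed-point equation, which the loop below solves by repeated substitution. It assumes the hypothetical S, P, R, and policy objects from the earlier sketch; gamma plays the role of the discount factor γ.

    def evaluate_policy(S, P, R, policy, gamma=0.9, tol=1e-8):
        """Compute V(s), the expected discounted return from each state s,
        for a fixed deterministic policy, by iterating the Bellman backup."""
        V = {s: 0.0 for s in S}
        while True:
            delta = 0.0
            for s in S:
                a = policy[s]                  # action is fixed by the policy
                v_new = sum(P[a][s][s2] * (R[a][s][s2] + gamma * V[s2]) for s2 in S)
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < tol:                    # stop once the values have converged
                return V

    # Example usage (with the hypothetical objects defined above):
    # evaluate_policy(S, P, R, policy)  ->  {"good": ..., "worn": ...}

Replacing the policy's fixed action with a maximum over all available actions in the same backup gives value iteration, one of the basic algorithms for finding an optimal policy in a finite MDP.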

[ "Markov process", "least squares temporal difference", "optimality equation", "bayesian reinforcement learning", "Markov strategy", "discounted cost" ]