Notes: UCL Course on RL Summary and Review (Lecture 2: Markov Decision Processes)
This post covers not only Lecture 2: Markov Decision Processes (davidsilver.uk) but also:
- Markov decision process - Wikipedia
- Markov Decision Processes and Exact Solution Methods
- The complete set of course materials can be found on the following page: Teaching - David Silver.
Lecture Goal
- Learn about state-value and action-value functions
- Learn about Bellman equations that describe the relationship between those two functions
- Bellman expectation equation
- Bellman optimality equation (a.k.a. Bellman equation)
(Image: Dr. Richard Bellman, via breves-de-maths.fr)
- The value function of a state (a.k.a. state-value function) is decomposed into two parts:
- Immediate reward
- Discounted value of successor state
- \(V(s) = R(s) + \sum_{s' \in S}(\gamma V(s')T(s'|s)) = \mathbb{E}[R_{t+1} + \gamma V(S_{t+1}) | S_t = s]\)
- Note: Conditional expectation
- \(\mathbb{E}[A|B=b] = \sum_{a \in A}{a Pr(A=a|B=b)}\)
- \(V(1) = R(1) + \sum_{i=1}^{n}\gamma V(i)T(i|1)\)
- \(\begin{bmatrix}V(1)\\ \vdots \\ V(n)\end{bmatrix} = \begin{bmatrix}R(1)\\ \vdots \\ R(n)\end{bmatrix} + \gamma \begin{bmatrix} T(1|1) & \cdots & T(n|1) \\ \vdots & \ddots & \vdots \\ T(1|n) & \cdots & T(n|n) \end{bmatrix} \begin{bmatrix}V(1)\\ \vdots \\ V(n)\end{bmatrix}\)
- \(v = c + \gamma\, Tv\), where \(c\) is the reward vector, a known constant once all rewards and transition probabilities are given.
- \(\therefore v = (I-\gamma\, T)^{-1}c\)
- Note: Solving a linear system over \(n\) states has time complexity \(O(n^3)\), so the direct solution is only feasible for small MRPs (a numpy sketch of both the direct and an iterative solve follows this list). There are many iterative methods for large MRPs:
- Jacobi method, Gauss-Seidel method, SOR method, etc.
- Dynamic programming
- Monte Carlo evaluation
- Temporal-difference learning
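To make the matrix form concrete, here is a minimal numpy sketch. The 3-state MRP, its rewards, and the discount factor are made-up toy values (not from the lecture); it solves \(v = (I-\gamma T)^{-1}c\) directly and also runs the fixed-point iteration \(v \leftarrow c + \gamma Tv\) for comparison.

```python
import numpy as np

# Toy 3-state MRP (made-up numbers): T[i, j] = T(j | i), c[i] = R(i)
T = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.0, 0.0, 1.0]])   # each row sums to 1
c = np.array([1.0, 0.5, 0.0])     # immediate rewards
gamma = 0.9

# Direct solution: v = (I - gamma * T)^{-1} c, O(n^3) in the number of states
v_direct = np.linalg.solve(np.eye(3) - gamma * T, c)

# Iterative (fixed-point) evaluation: v <- c + gamma * T v
v_iter = np.zeros(3)
for _ in range(500):
    v_iter = c + gamma * T @ v_iter

print(v_direct)   # the two results agree up to iteration error
print(v_iter)
```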
- Assumption:
- Given an MDP \((S, A, T, R, \gamma)\) and a policy \(\pi\),
- The policy depends on the current state, not the history. Thus, a policy is stationary (time-independent).
- MDPs reduce to MRPs:
- \(T(s'|s) := \sum_{a \in A}(T(s'|s,a)\pi(a|s))\)
- \(R(s) := \sum_{a \in A}(R(s,a)\pi(a|s))\)
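The reduction above fits in a few lines of numpy. The 2-state, 2-action MDP and the stochastic policy below are invented for illustration; the point is only the policy-weighted averaging of \(T(s'|s,a)\) and \(R(s,a)\) into an MRP.

```python
import numpy as np

# Toy MDP (made-up numbers): T_sa[s, a, s'] = T(s'|s,a), R_sa[s, a] = R(s,a)
T_sa = np.array([[[0.9, 0.1], [0.2, 0.8]],
                 [[0.5, 0.5], [0.0, 1.0]]])
R_sa = np.array([[1.0, 0.0],
                 [2.0, -1.0]])
pi = np.array([[0.6, 0.4],        # pi[s, a] = pi(a|s)
               [0.3, 0.7]])
gamma = 0.9

# Policy-induced MRP: T_pi(s'|s) = sum_a pi(a|s) T(s'|s,a), R_pi(s) = sum_a pi(a|s) R(s,a)
T_pi = np.einsum('sa,sap->sp', pi, T_sa)
R_pi = np.einsum('sa,sa->s', pi, R_sa)

# The MRP machinery now applies, e.g. v_pi = (I - gamma * T_pi)^{-1} R_pi
v_pi = np.linalg.solve(np.eye(2) - gamma * T_pi, R_pi)
print(T_pi, R_pi, v_pi)
```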
- Revisit: state-value function
- How much return we can expect to obtain (taking the future into account) by arriving at a given state
- The state-value function of an MDP is the expected return starting from the current state \(s\) and then following policy \(\pi\):
- \(v_{\pi}(s) = \mathbb{E}_{\pi}[G_t|S_t=s]\)
- Action-value function \(q_{\pi}(s,a)\)
- How much return we can expect to obtain by taking a given action
- The action-value function is the expected return starting from the current state \(s\), taking action \(a\), and then following policy \(\pi\):
- \(q_{\pi}(s,a) = \mathbb{E}_{\pi}[G_t|S_t=s, A_t=a]\)
- Note: \(s\) and \(a\) are constants whereas \(S_t\) and \(A_t\) are random variables.
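Since both value functions are just expected returns, they can be estimated by averaging sampled discounted returns, a preview of Monte Carlo evaluation. The toy MDP, the policy, the horizon, and the episode count below are all made-up choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
T_sa = np.array([[[0.9, 0.1], [0.2, 0.8]],
                 [[0.5, 0.5], [0.0, 1.0]]])
R_sa = np.array([[1.0, 0.0], [2.0, -1.0]])
pi = np.array([[0.6, 0.4], [0.3, 0.7]])
gamma, horizon, episodes = 0.9, 100, 2000

def sampled_return(s):
    """Roll out the policy from state s; return the discounted return (truncated at horizon)."""
    g, discount = 0.0, 1.0
    for _ in range(horizon):
        a = rng.choice(2, p=pi[s])          # A_t ~ pi(.|s)
        g += discount * R_sa[s, a]
        discount *= gamma
        s = rng.choice(2, p=T_sa[s, a])     # S_{t+1} ~ T(.|s,a)
    return g

# v_pi(0) ~ average of sampled returns starting from state 0
v0_estimate = np.mean([sampled_return(0) for _ in range(episodes)])
print(v0_estimate)
```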
Bellman Expectation Equation for State-Value
- \(v_{\pi}(s) = \mathbb{E}_{\pi}[R_{t+1} + \gamma\, v_{\pi}(S_{t+1}) \mid S_t = s]\)
- \(v_{\pi}(s) = \sum_{a \in A}\pi(a|s)\left( R(s,a) + \gamma \sum_{s' \in S} T(s'|s,a)\, v_{\pi}(s') \right)\)
- A recursive expression of how much value a state holds
- Follows directly from the recursive definition of the return
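A sketch of how this equation is used in practice: repeatedly applying the right-hand side as a backup converges to \(v_{\pi}\) (iterative policy evaluation; the toy MDP and policy are the same made-up example as above).

```python
import numpy as np

T_sa = np.array([[[0.9, 0.1], [0.2, 0.8]],
                 [[0.5, 0.5], [0.0, 1.0]]])
R_sa = np.array([[1.0, 0.0], [2.0, -1.0]])
pi = np.array([[0.6, 0.4], [0.3, 0.7]])
gamma = 0.9

# Backup: v(s) <- sum_a pi(a|s) [ R(s,a) + gamma * sum_s' T(s'|s,a) v(s') ]
v = np.zeros(2)
for _ in range(1000):
    v = np.einsum('sa,sa->s', pi, R_sa + gamma * np.einsum('sap,p->sa', T_sa, v))
print(v)   # fixed point = v_pi
```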
Bellman Expectation Equation for Action-Value
- \(q_{\pi}(s,a) = \mathbb{E}_{\pi}[R_{t+1} + \gamma\, q_{\pi}(S_{t+1},A_{t+1}) \mid S_t=s, A_t=a]\)
- \(q_{\pi}(s,a) = R(s,a) + \gamma \sum_{s' \in S}T(s'|s,a) \sum_{a' \in A}{\pi(a'|s')q_{\pi}(s',a')}\)
- Action-to-state
- The value of a state is the policy-weighted sum of the values of the actions available in that state
- \(v_{\pi}(s) = \sum_{a \in A}{\pi(a|s)q_{\pi}(s,a)}\)
- State-to-action
- The value of an action is its immediate reward plus the (discounted) values of the states it can lead to; which state the action actually reaches is determined by the environment's transition probabilities
- \(q_{\pi}(s,a) = R(s,a) + \gamma \sum_{s' \in S}T(s'|s,a) v_{\pi}(s')\)
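These two relations can be checked numerically: compute \(v_{\pi}\) with the linear solve, derive \(q_{\pi}\) from it via the state-to-action relation, and confirm that the action-to-state relation reproduces \(v_{\pi}\). Again, the MDP and policy are made-up toy values.

```python
import numpy as np

T_sa = np.array([[[0.9, 0.1], [0.2, 0.8]],
                 [[0.5, 0.5], [0.0, 1.0]]])
R_sa = np.array([[1.0, 0.0], [2.0, -1.0]])
pi = np.array([[0.6, 0.4], [0.3, 0.7]])
gamma = 0.9

# v_pi from the policy-induced MRP
T_pi = np.einsum('sa,sap->sp', pi, T_sa)
R_pi = np.einsum('sa,sa->s', pi, R_sa)
v_pi = np.linalg.solve(np.eye(2) - gamma * T_pi, R_pi)

# State-to-action: q_pi(s,a) = R(s,a) + gamma * sum_s' T(s'|s,a) v_pi(s')
q_pi = R_sa + gamma * np.einsum('sap,p->sa', T_sa, v_pi)

# Action-to-state: v_pi(s) = sum_a pi(a|s) q_pi(s,a)
assert np.allclose(v_pi, np.einsum('sa,sa->s', pi, q_pi))
print(v_pi, q_pi)
```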
Bellman Equations in Matrix Form (2)
- The above Bellman equations for a fixed policy \(\pi\) are nothing but linear systems if all rewards, transition probabilities, and the policy are given.
- Note: If the policy changes, the solutions (state- and action-values) change. How do we define a good policy? What would be the best (optimal) policy?
Def. Optimal Value Function
- The optimal state-value function is the maximum value function over all policies: \(v_{*}(s) = \underset{\pi}{\max}v_{\pi}(s) \forall s\).
- The optimal action-value function is the maximum action-value function over all policies: \(q_{*}(s,a) = \underset{\pi}{\max}q_{\pi}(s,a) \forall s,a\).
- Intuitively, therefore, once \(q_{*}\) is given, the optimal policy is obviously to take the action with the largest \(q_{*}(s,a)\).
Thm. Optimal Policy
- Define a partial ordering over policies: \(\pi \ge \pi'\) if \(v_{\pi}(s) \ge v_{\pi'}(s)\;\;\forall s\).
- For any Markov Decision Process,
- Existence of an optimal policy: \(\exists\, \pi_{*}\) such that \(\pi_{*} \ge \pi\;\;\forall \pi\)
- All optimal policies achieve the optimal state-value function, \(v_{\pi_{*}}(s)=v_{*}(s)\).
- (Immediate from the definitions of the partial ordering and the optimal value function)
- All optimal policies achieve the optimal action-value function, \(q_{\pi_{*}}(s,a)=q_{*}(s,a)\).
- An optimal policy can be found by maximizing over \(q_{*}\),
- \(\pi_{*}(a|s) = \begin{cases}1 & \mathrm{if}\; a=\underset{a' \in A}{\mathrm{argmax}}\; q_{*}(s,a') \\ 0 & \mathrm{otherwise}\end{cases}\)
- Note: In other words, the policy that deterministically picks only the single best action among those with optimal action-values is an optimal policy. So once the optimal action-value function is known, we can determine an optimal policy and thereby solve the MDP (see the sketch below).
- There is always a deterministic optimal policy for any MDP.
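Given \(q_{*}\) as a table, extracting the deterministic optimal policy above is a one-line argmax; the \(q_{*}\) values below are made-up placeholders standing in for a solved table.

```python
import numpy as np

# Placeholder q_* table: q_star[s, a] (made-up values standing in for a solved q_*)
q_star = np.array([[1.3, 2.1],
                   [0.4, 0.9]])

# Deterministic optimal policy: pick argmax_a q_*(s, a) in every state
greedy_actions = q_star.argmax(axis=1)              # best action index per state
pi_star = np.eye(q_star.shape[1])[greedy_actions]   # pi_*(a|s) as a one-hot table
print(greedy_actions, pi_star)
```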
- Action-to-state
- \(v_{*}(s) = \underset{a \in A}{\max}q_{*}(s,a)\)
- Proof)
- From the state-action relationship, \(v_{\pi}(s) = \sum_{a \in A}{q_{\pi}(s,a)\pi(a|s)}\)
- From the theorem of the optimal policy, \(\begin{aligned} v_{*}(s) &= v_{\pi_{*}}(s) \\ &= \sum_{a \in A}{q_{\pi_{*}}(s,a)\pi_{*}(a|s)} \\ &= \sum_{a \in A}{q_{*}(s,a)\pi_{*}(a|s)} \\ &= 1 \cdot \underset{a \in A}{\max}\,q_{*}(s,a) + 0 \cdot (\text{otherwise}) \end{aligned}\)
- \(\therefore v_{*}(s) = \underset{a \in A}{\max}q_{*}(s,a)\)
- Note: Even without a rigorous proof, intuitively the value of a state is the (policy-weighted) average of the values of the actions available in that state. Since the optimal policy picks only the single best action among the optimal actions, optimal state-value = maximum optimal action-value.
- State-to-action
- \(q_{*}(s,a) = R(s,a) + \gamma \sum_{s' \in S}T(s'|s,a) v_{*}(s')\)
- Proof)
- For any policy \(\pi\), \(q_{\pi}(s,a) = R(s,a) + \gamma \sum_{s' \in S}T(s'|s,a)\, v_{\pi}(s')\)
- Applying this to \(\pi_{*}\) and using \(q_{\pi_{*}} = q_{*}\) and \(v_{\pi_{*}} = v_{*}\) gives \(q_{*}(s,a) = R(s,a) + \gamma \sum_{s' \in S}T(s'|s,a)\, v_{*}(s')\)
- Intuitively, the value of an action is its immediate reward plus the values of the states it can lead to. Which state the action reaches is determined by the environment, and the policy cannot change the environment, so the expectation must be taken over all possible next states in any case.
Bellman Optimality Equation for State-Value
- \(v_{*}(s) = \underset{a \in A}{\max}\left(R(s,a) + \gamma \sum_{s' \in S}{T(s'|s,a)v_{*}(s')} \right)\)
- Note: Similar in form to the Bellman equation for MRPs, but note that it now involves a maximization.
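Iterating this backup is exactly value iteration (listed among the solution methods below); a minimal sketch on the same made-up toy MDP:

```python
import numpy as np

T_sa = np.array([[[0.9, 0.1], [0.2, 0.8]],
                 [[0.5, 0.5], [0.0, 1.0]]])
R_sa = np.array([[1.0, 0.0], [2.0, -1.0]])
gamma = 0.9

# Backup: v(s) <- max_a [ R(s,a) + gamma * sum_s' T(s'|s,a) v(s') ]
v = np.zeros(2)
for _ in range(1000):
    v = np.max(R_sa + gamma * np.einsum('sap,p->sa', T_sa, v), axis=1)
print(v)   # converges to v_*
```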
Bellman Optimality Equation for Action-Value
- \(q_{*}(s,a) = R(s,a) + \gamma \sum_{s' \in S}{T(s'|s,a)\underset{a' \in A}{\max}q_{*}(s',a')}\)
- Note: Similar in form to the Bellman equation above, but note that the maximization is over the next action \(a'\), not the current action.
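The same fixed-point idea applies directly to \(q\): iterating the backup below makes the table converge to \(q_{*}\), from which the greedy (optimal) actions can be read off. Same made-up toy MDP as before.

```python
import numpy as np

T_sa = np.array([[[0.9, 0.1], [0.2, 0.8]],
                 [[0.5, 0.5], [0.0, 1.0]]])
R_sa = np.array([[1.0, 0.0], [2.0, -1.0]])
gamma = 0.9

# Backup: q(s,a) <- R(s,a) + gamma * sum_s' T(s'|s,a) max_a' q(s',a')
q = np.zeros((2, 2))
for _ in range(1000):
    q = R_sa + gamma * np.einsum('sap,p->sa', T_sa, q.max(axis=1))
print(q, q.argmax(axis=1))   # q_* and the greedy (optimal) actions
```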
Solving the Bellman Optimality Equation
- Non-linear (due to the \(\max\) operators)
- No closed-form solution in general
- Iterative solution methods
- Value iteration
- Policy iteration
- Q-learning
- SARSA
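Value iteration was sketched above; policy iteration alternates exact policy evaluation (the linear solve) with greedy improvement, roughly as in the sketch below on the same made-up toy MDP. Q-learning and SARSA additionally require sampled experience and are covered later in the course.

```python
import numpy as np

T_sa = np.array([[[0.9, 0.1], [0.2, 0.8]],
                 [[0.5, 0.5], [0.0, 1.0]]])
R_sa = np.array([[1.0, 0.0], [2.0, -1.0]])
gamma = 0.9
n_states, n_actions = R_sa.shape

policy = np.zeros(n_states, dtype=int)   # start from an arbitrary deterministic policy
while True:
    # Policy evaluation: solve the linear system for the current deterministic policy
    T_pi = T_sa[np.arange(n_states), policy]   # T_pi[s, s'] = T(s'|s, policy(s))
    R_pi = R_sa[np.arange(n_states), policy]
    v = np.linalg.solve(np.eye(n_states) - gamma * T_pi, R_pi)

    # Policy improvement: act greedily with respect to q_pi
    q = R_sa + gamma * np.einsum('sap,p->sa', T_sa, v)
    new_policy = q.argmax(axis=1)
    if np.array_equal(new_policy, policy):
        break                                  # policy is stable, hence optimal
    policy = new_policy

print(policy, v)
```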
Extensions to MDPs
- Infinite and continuous MDPs
- Partially observable MDPs
- Undiscounted, average reward MDPs