Notes: UCL Course on RL Summary and Review (Lecture 2: Markov Decision Processes)
This post covers not only Lecture 2: Markov Decision Processes (davidsilver.uk) but also:
- Markov decision process - Wikipedia
- Markov Decision Processes and Exact Solution Methods
- The complete set of course materials can be found on the following page: Teaching - David Silver.
Lecture Goal
- Learn about state-value and action-value functions
- Learn about Bellman equations that describe the relationship between those two functions
- Bellman expectation equation
- Bellman optimality equation (a.k.a. Bellman equation)
(Image: Dr. Richard Bellman, via breves-de-maths.fr)
- The value function of a state (a.k.a. state-value function) is decomposed into two parts:
- Immediate reward
- Discounted value of successor state
- \(V(s) = R(s) + \sum_{s' \in S}(\gamma V(s')T(s'|s)) = \mathbb{E}[R_{t+1} + \gamma V(S_{t+1}) | S_t = s]\)
- Note: Conditional expectation
- \(\mathbb{E}[A|B=b] = \sum_{a \in A}{a Pr(A=a|B=b)}\)
- \(V(1) = R(1) + \sum_{i=1}^{n}\gamma V(i)T(i|1)\)
- \(\begin{bmatrix}V(1)\\ \vdots \\ V(n)\end{bmatrix} = \begin{bmatrix}R(1)\\ \vdots \\ R(n)\end{bmatrix} + \gamma \begin{bmatrix} T(1|1) & \cdots & T(n|1) \\ \vdots & \ddots & \vdots \\ T(1|n) & \cdots & T(n|n) \end{bmatrix} \begin{bmatrix}V(1)\\ \vdots \\ V(n)\end{bmatrix}\)
- \(v = c + \gamma\, Tv\), where \(c\) is the reward vector, a known constant once all rewards and transition probabilities are given.
- \(\therefore v = (I-\gamma\, T)^{-1}c\)
- Note: Solving a linear system over \(n\) states has time complexity \(O(n^3)\), so the direct solution is only feasible for small MRPs (a numpy sketch of both the direct and an iterative solve follows this list). There are many iterative methods for large MRPs:
- Jacobi method, Gauss-Seidel method, SOR method, etc.
- Dynamic programming
- Monte Carlo evaluation
- Temporal-difference learning
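To make the matrix form concrete, here is a minimal numpy sketch. The 3-state MRP, its rewards, and the discount factor are made-up toy values (not from the lecture); it solves \(v = (I-\gamma T)^{-1}c\) directly and also runs the fixed-point iteration \(v \leftarrow c + \gamma Tv\) for comparison.

```python
import numpy as np

# Toy 3-state MRP (made-up numbers): T[i, j] = T(j | i), c[i] = R(i)
T = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.0, 0.0, 1.0]])   # each row sums to 1
c = np.array([1.0, 0.5, 0.0])     # immediate rewards
gamma = 0.9

# Direct solution: v = (I - gamma * T)^{-1} c, O(n^3) in the number of states
v_direct = np.linalg.solve(np.eye(3) - gamma * T, c)

# Iterative (fixed-point) evaluation: v <- c + gamma * T v
v_iter = np.zeros(3)
for _ in range(500):
    v_iter = c + gamma * T @ v_iter

print(v_direct)   # the two results agree up to iteration error
print(v_iter)
```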
- Assumption:
- Given an MDP \((S, A, T, R, \gamma)\) and a policy \(\pi\),
- The policy depends on the current state, not the history. Thus, a policy is stationary (time-independent).
- MDPs reduce to MRPs:
- \(T(s'|s) := \sum_{a \in A}(T(s'|s,a)\pi(a|s))\)
- \(R(s) := \sum_{a \in A}(R(s,a)\pi(a|s))\)
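The reduction above fits in a few lines of numpy. The 2-state, 2-action MDP and the stochastic policy below are invented for illustration; the point is only the policy-weighted averaging of \(T(s'|s,a)\) and \(R(s,a)\) into an MRP.

```python
import numpy as np

# Toy MDP (made-up numbers): T_sa[s, a, s'] = T(s'|s,a), R_sa[s, a] = R(s,a)
T_sa = np.array([[[0.9, 0.1], [0.2, 0.8]],
                 [[0.5, 0.5], [0.0, 1.0]]])
R_sa = np.array([[1.0, 0.0],
                 [2.0, -1.0]])
pi = np.array([[0.6, 0.4],        # pi[s, a] = pi(a|s)
               [0.3, 0.7]])
gamma = 0.9

# Policy-induced MRP: T_pi(s'|s) = sum_a pi(a|s) T(s'|s,a), R_pi(s) = sum_a pi(a|s) R(s,a)
T_pi = np.einsum('sa,sap->sp', pi, T_sa)
R_pi = np.einsum('sa,sa->s', pi, R_sa)

# The MRP machinery now applies, e.g. v_pi = (I - gamma * T_pi)^{-1} R_pi
v_pi = np.linalg.solve(np.eye(2) - gamma * T_pi, R_pi)
print(T_pi, R_pi, v_pi)
```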
- Revisit: state-value function
- How much return we can expect to obtain (taking the future into account) by arriving at a given state
- The state-value function of an MDP is the expected return starting from the current state \(s\) and then following policy \(\pi\):
- \(v_{\pi}(s) = \mathbb{E}_{\pi}[G_t|S_t=s]\)
- Action-value function \(q_{\pi}(s,a)\)
- How much return we can expect to obtain by taking a given action
- The action-value function is the expected return starting from the current state \(s\), taking action \(a\), and then following policy \(\pi\):
- \(q_{\pi}(s,a) = \mathbb{E}_{\pi}[G_t|S_t=s, A_t=a]\)
- Note: \(s\) and \(a\) are constants whereas \(S_t\) and \(A_t\) are random variables.
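Since both value functions are just expected returns, they can be estimated by averaging sampled discounted returns, a preview of Monte Carlo evaluation. The toy MDP, the policy, the horizon, and the episode count below are all made-up choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
T_sa = np.array([[[0.9, 0.1], [0.2, 0.8]],
                 [[0.5, 0.5], [0.0, 1.0]]])
R_sa = np.array([[1.0, 0.0], [2.0, -1.0]])
pi = np.array([[0.6, 0.4], [0.3, 0.7]])
gamma, horizon, episodes = 0.9, 100, 2000

def sampled_return(s):
    """Roll out the policy from state s; return the discounted return (truncated at horizon)."""
    g, discount = 0.0, 1.0
    for _ in range(horizon):
        a = rng.choice(2, p=pi[s])          # A_t ~ pi(.|s)
        g += discount * R_sa[s, a]
        discount *= gamma
        s = rng.choice(2, p=T_sa[s, a])     # S_{t+1} ~ T(.|s,a)
    return g

# v_pi(0) ~ average of sampled returns starting from state 0
v0_estimate = np.mean([sampled_return(0) for _ in range(episodes)])
print(v0_estimate)
```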
Bellman Expectation Equation for State-Value
- \(v_{\pi}(s) = \mathbb{E}_{\pi}[R_{t+1} + \gamma\, v_{\pi}(S_{t+1}) \mid S_t = s]\)
- \(v_{\pi}(s) = \sum_{a \in A}\pi(a|s)\left( R(s,a) + \gamma \sum_{s' \in S} T(s'|s,a)\, v_{\pi}(s') \right)\)
- A recursive expression of how much value a state holds
- Follows directly from the recursive definition of the return
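A sketch of how this equation is used in practice: repeatedly applying the right-hand side as a backup converges to \(v_{\pi}\) (iterative policy evaluation; the toy MDP and policy are the same made-up example as above).

```python
import numpy as np

T_sa = np.array([[[0.9, 0.1], [0.2, 0.8]],
                 [[0.5, 0.5], [0.0, 1.0]]])
R_sa = np.array([[1.0, 0.0], [2.0, -1.0]])
pi = np.array([[0.6, 0.4], [0.3, 0.7]])
gamma = 0.9

# Backup: v(s) <- sum_a pi(a|s) [ R(s,a) + gamma * sum_s' T(s'|s,a) v(s') ]
v = np.zeros(2)
for _ in range(1000):
    v = np.einsum('sa,sa->s', pi, R_sa + gamma * np.einsum('sap,p->sa', T_sa, v))
print(v)   # fixed point = v_pi
```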
Bellman Expectation Equation for Action-Value
- \(q_{\pi}(s,a) = \mathbb{E}_{\pi}[R_{t+1} + \gamma\, q_{\pi}(S_{t+1},A_{t+1}) \mid S_t=s, A_t=a]\)
- \(q_{\pi}(s,a) = R(s,a) + \gamma \sum_{s' \in S}T(s'|s,a) \sum_{a' \in A}{\pi(a'|s')q_{\pi}(s',a')}\)
- Action-to-state
- The value of a state is the policy-weighted sum of the values of the actions available in that state
- \(v_{\pi}(s) = \sum_{a \in A}{\pi(a|s)q_{\pi}(s,a)}\)
- State-to-action
- The value of an action is its immediate reward plus the (discounted) values of the states it can lead to; which state the action actually reaches is determined by the environment's transition probabilities
- \(q_{\pi}(s,a) = R(s,a) + \gamma \sum_{s' \in S}T(s'|s,a) v_{\pi}(s')\)
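These two relations can be checked numerically: compute \(v_{\pi}\) with the linear solve, derive \(q_{\pi}\) from it via the state-to-action relation, and confirm that the action-to-state relation reproduces \(v_{\pi}\). Again, the MDP and policy are made-up toy values.

```python
import numpy as np

T_sa = np.array([[[0.9, 0.1], [0.2, 0.8]],
                 [[0.5, 0.5], [0.0, 1.0]]])
R_sa = np.array([[1.0, 0.0], [2.0, -1.0]])
pi = np.array([[0.6, 0.4], [0.3, 0.7]])
gamma = 0.9

# v_pi from the policy-induced MRP
T_pi = np.einsum('sa,sap->sp', pi, T_sa)
R_pi = np.einsum('sa,sa->s', pi, R_sa)
v_pi = np.linalg.solve(np.eye(2) - gamma * T_pi, R_pi)

# State-to-action: q_pi(s,a) = R(s,a) + gamma * sum_s' T(s'|s,a) v_pi(s')
q_pi = R_sa + gamma * np.einsum('sap,p->sa', T_sa, v_pi)

# Action-to-state: v_pi(s) = sum_a pi(a|s) q_pi(s,a)
assert np.allclose(v_pi, np.einsum('sa,sa->s', pi, q_pi))
print(v_pi, q_pi)
```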
Bellman Equations in Matrix Form (2)
- The above Bellman equations for a fixed policy \(\pi\) are nothing but linear systems if all rewards, transition probabilities, and the policy are given.
- Note: If the policy changes, the solutions (state- and action-values) change. How do we define a good policy? What would be the best (optimal) policy?
Def. Optimal Value Function
- The optimal state-value function is the maximum value function over all policies: \(v_{*}(s) = \underset{\pi}{\max}v_{\pi}(s) \forall s\).
- The optimal action-value function is the maximum action-value function over all policies: \(q_{*}(s,a) = \underset{\pi}{\max}q_{\pi}(s,a) \forall s,a\).
- Intuitively, therefore, once \(q_{*}\) is given, the optimal policy is obviously to take the action with the largest \(q_{*}(s,a)\).
Thm. Optimal Policy
- Define a partial ordering over policies: \(\pi \ge \pi'\) if \(v_{\pi}(s) \ge v_{\pi'}(s)\;\;\forall s\).
- For any Markov Decision Process,
- Existence of an optimal policy: \(\exists\, \pi_{*}\) such that \(\pi_{*} \ge \pi\;\;\forall \pi\)
- All optimal policies achieve the optimal state-value function, \(v_{\pi_{*}}(s)=v_{*}(s)\).
- (Immediate from the definitions of the partial ordering and the optimal value function)
- All optimal policies achieve the optimal action-value function, \(q_{\pi_{*}}(s,a)=q_{*}(s,a)\).
- An optimal policy can be found by maximizing over \(q_{*}\),
- \(\pi_{*}(a|s) = \begin{cases}1 & \mathrm{if}\; a=\underset{a' \in A}{\mathrm{argmax}}\; q_{*}(s,a') \\ 0 & \mathrm{otherwise}\end{cases}\)
- Note: In other words, the policy that deterministically picks only the single best action among those with optimal action-values is an optimal policy. So once the optimal action-value function is known, we can determine an optimal policy and thereby solve the MDP (see the sketch below).
- There is always a deterministic optimal policy for any MDP.
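Given \(q_{*}\) as a table, extracting the deterministic optimal policy above is a one-line argmax; the \(q_{*}\) values below are made-up placeholders standing in for a solved table.

```python
import numpy as np

# Placeholder q_* table: q_star[s, a] (made-up values standing in for a solved q_*)
q_star = np.array([[1.3, 2.1],
                   [0.4, 0.9]])

# Deterministic optimal policy: pick argmax_a q_*(s, a) in every state
greedy_actions = q_star.argmax(axis=1)              # best action index per state
pi_star = np.eye(q_star.shape[1])[greedy_actions]   # pi_*(a|s) as a one-hot table
print(greedy_actions, pi_star)
```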
- Action-to-state
- \(v_{*}(s) = \underset{a \in A}{\max}q_{*}(s,a)\)
- Proof)
- From the state-action relationship, \(v_{\pi}(s) = \sum_{a \in A}{q_{\pi}(s,a)\pi(a|s)}\)
- From the theorem of the optimal policy, \(\begin{aligned} v_{*}(s) &= v_{\pi_{*}}(s) \\ &= \sum_{a \in A}{q_{\pi_{*}}(s,a)\pi_{*}(a|s)} \\ &= \sum_{a \in A}{q_{*}(s,a)\pi_{*}(a|s)} \\ &= 1 \cdot \underset{a \in A}{\max}\,q_{*}(s,a) + 0 \cdot (\text{otherwise}) \end{aligned}\)
- \(\therefore v_{*}(s) = \underset{a \in A}{\max}q_{*}(s,a)\)
- Note: Even without a rigorous proof, intuitively the value of a state is the (policy-weighted) average of the values of the actions available in that state. Since the optimal policy picks only the single best action among the optimal actions, optimal state-value = maximum optimal action-value.
- State-to-action
- \(q_{*}(s,a) = R(s,a) + \gamma \sum_{s' \in S}T(s'|s,a) v_{*}(s')\)
- Proof)
- For any policy \(\pi\), \(q_{\pi}(s,a) = R(s,a) + \gamma \sum_{s' \in S}T(s'|s,a)\, v_{\pi}(s')\)
- Applying this to \(\pi_{*}\) and using \(q_{\pi_{*}} = q_{*}\) and \(v_{\pi_{*}} = v_{*}\) gives \(q_{*}(s,a) = R(s,a) + \gamma \sum_{s' \in S}T(s'|s,a)\, v_{*}(s')\)
- Intuitively, the value of an action is its immediate reward plus the values of the states it can lead to. Which state the action reaches is determined by the environment, and the policy cannot change the environment, so the expectation must be taken over all possible next states in any case.
Bellman Optimality Equation for State-Value
- \(v_{*}(s) = \underset{a \in A}{\max}\left(R(s,a) + \gamma \sum_{s' \in S}{T(s'|s,a)v_{*}(s')} \right)\)
- Note: Similar in form to the Bellman equation for MRPs, but note that it now involves a maximization.
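Iterating this backup is exactly value iteration (listed among the solution methods below); a minimal sketch on the same made-up toy MDP:

```python
import numpy as np

T_sa = np.array([[[0.9, 0.1], [0.2, 0.8]],
                 [[0.5, 0.5], [0.0, 1.0]]])
R_sa = np.array([[1.0, 0.0], [2.0, -1.0]])
gamma = 0.9

# Backup: v(s) <- max_a [ R(s,a) + gamma * sum_s' T(s'|s,a) v(s') ]
v = np.zeros(2)
for _ in range(1000):
    v = np.max(R_sa + gamma * np.einsum('sap,p->sa', T_sa, v), axis=1)
print(v)   # converges to v_*
```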
Bellman Optimality Equation for Action-Value
- \(q_{*}(s,a) = R(s,a) + \gamma \sum_{s' \in S}{T(s'|s,a)\underset{a' \in A}{\max}q_{*}(s',a')}\)
- Note: Similar in form to the Bellman equation above, but note that the maximization is over the next action \(a'\), not the current action.
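The same fixed-point idea applies directly to \(q\): iterating the backup below makes the table converge to \(q_{*}\), from which the greedy (optimal) actions can be read off. Same made-up toy MDP as before.

```python
import numpy as np

T_sa = np.array([[[0.9, 0.1], [0.2, 0.8]],
                 [[0.5, 0.5], [0.0, 1.0]]])
R_sa = np.array([[1.0, 0.0], [2.0, -1.0]])
gamma = 0.9

# Backup: q(s,a) <- R(s,a) + gamma * sum_s' T(s'|s,a) max_a' q(s',a')
q = np.zeros((2, 2))
for _ in range(1000):
    q = R_sa + gamma * np.einsum('sap,p->sa', T_sa, q.max(axis=1))
print(q, q.argmax(axis=1))   # q_* and the greedy (optimal) actions
```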
Solving the Bellman Optimality Equation
- Non-linear (due to the \(\max\) operators)
- No closed-form solution in general
- Iterative solution methods
- Value iteration
- Policy iteration
- Q-learning
- SARSA
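Value iteration was sketched above; policy iteration alternates exact policy evaluation (the linear solve) with greedy improvement, roughly as in the sketch below on the same made-up toy MDP. Q-learning and SARSA additionally require sampled experience and are covered later in the course.

```python
import numpy as np

T_sa = np.array([[[0.9, 0.1], [0.2, 0.8]],
                 [[0.5, 0.5], [0.0, 1.0]]])
R_sa = np.array([[1.0, 0.0], [2.0, -1.0]])
gamma = 0.9
n_states, n_actions = R_sa.shape

policy = np.zeros(n_states, dtype=int)   # start from an arbitrary deterministic policy
while True:
    # Policy evaluation: solve the linear system for the current deterministic policy
    T_pi = T_sa[np.arange(n_states), policy]   # T_pi[s, s'] = T(s'|s, policy(s))
    R_pi = R_sa[np.arange(n_states), policy]
    v = np.linalg.solve(np.eye(n_states) - gamma * T_pi, R_pi)

    # Policy improvement: act greedily with respect to q_pi
    q = R_sa + gamma * np.einsum('sap,p->sa', T_sa, v)
    new_policy = q.argmax(axis=1)
    if np.array_equal(new_policy, policy):
        break                                  # policy is stable, hence optimal
    policy = new_policy

print(policy, v)
```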
Extensions to MDPs
- Infinite and continuous MDPs
- Partially observable MDPs
- Undiscounted, average reward MDPs