기록: UCL Course on RL 요약 및 정리 (Lecture 3: Planning by Dynamic Programming)

This post covers not only Lecture 3: Planning by Dynamic Programming (davidsilver.uk). but also:

Lecture Goal

Learn how to solve a known MDP where all elements in the MDP tuple (e.g., reward, transition probabilities, etc.) are given

Solving an MDP = finding an optimal value function or policy
Policy iteration

Policy evaluation: estimate the value function
Policy improvement: optimize the value function

Value iteration
Banach Fixed Point Theorem (a.k.a. Contraction Mapping Theorem) to prove the convergence of policy iteration and value iteration

Requirements for Dynamic Programming

Optimal substructure

An optimal solution can be constructed from optimal solutions of its subproblems.

Overlapping subproblems

The same subproblems recur many times.
Solutions can be cached and reused for efficiency.

Markov Decision Process and Dynamic Programming

Recall:

\(v_{*}(s) = \underset{a \in A}{\max}\left(\mathcal{R}(s,a) + \gamma \sum_{s' \in S}{T(s'|s,a)v_{*}(s')} \right)\)
\(q_{*}(s,a) = \mathcal{R}(s,a) + \gamma \sum_{s' \in S}{T(s'|s,a)\underset{a' \in A}{\max}q_{*}(s',a')}\)
위 방정식은 \(\mathcal{R}\)과 \(T(s'|s,a)\), \(\pi_{*}\)가 주어졌을 때는 단지 linear system일 뿐이지만, 실제 문제에서 state의 개수가 매우 많은 linear system을 푸는 것은 매우 많은 컴퓨팅 자원을 요한다.
또한, 우리는 최적의 policy인 \(\pi_{*}\)를 모른다!

Optimal substructure

Bellman equation requires recursive evaluation of optimal values.

Overlapping subproblems

Value functions (state- and action-value) can be stored and reused.

Assumption: \(\mathcal{R}\) and \(T(s'|s,a)\) (a.k.a. model) are given.

Recall: A model is an agent's representation of the environment. Uknown in a typical RL setting (though approximation and learning may be an option)!!
Dynamic programming is used for planning in an MDP, not learning.

Policy Evaluation

Policy evaluation is a prediction problem

Input: an MDP and a policy
Output: value function of the policy

어떤 policy \(\pi\) 가 주어졌을 때 이 policy가 얼마나 좋은 policy인지 판단할 방법은 결국 \(v_{\pi}(s)\) 를 모든 \(s\)에 대해 구해보는 것. Bellman equation for state-value를 참고하면, \(v_{\pi}\) 는 다음을 만족해야 한다:
\(v_{\pi}(s) = \sum_{a \in A}\pi(a|s)\left (\mathcal{R}(s,a) + \gamma \sum_{s' \in S}{T(s'|s,a)v_{\pi}(s')} \right) \)
결국 위 linear system의 해를 구해야 한다.
Iterative method

\(x^{(k+1)} = b + Ax^{(k)}\) [simplified notation of a MDP]
MDP에서 수열 \(\{x^{(k)}\}_{k=1, \cdots}\)는 결국 수렴하여 \(x = b + Ax\) 를 만족하는 \(x\)를 찾게 된다.
초기 state-value를 (e.g., \(x^{(0)}\)) 이용해서 주어진 policy에 대응하는 state-value를 (e.g., \(x^{(\infty)}\)) iterative하게 찾는다. \(v_1 \rightarrow v_2 \rightarrow \cdots \rightarrow v_{\pi}\)
Note: MDP 에서 \(A\)의 infinity-norm은 그냥 \(\gamma \in [0,1]\)이다. 이 성질이 수렴성 증명의 열쇠다.

Note: 위 linear system의 해가 존재하고 그 해가 유일함은 Banach Fixed Point Theorem으로 증명 가능하다 (정리 노트 최하단 참고).

Policy Improvement and Policy Iteration

How to improve a policy?
Policy improvement is a control problem:

Input: an MDP
Output: the optimal value function or the optimal policy

Policy iteration

Given a policy \(\pi\),
Evaluate the policy \(v_{\pi}(s) \forall s\in S\)
Improve the policy by acting greedily with respect to \(v_{\pi}\)

Define the greedy policy as follow:
\(\pi'(s) := \underset{a \in A}{\mathrm{argmax}}\; q_{\pi}(s,a) \forall s \in S\)
This greedy policy \(\pi'(s)\) improves or at least retains the original policy \(\pi(s)\)'s value. See Thm. 1.

Repeat this process until the policy \(\pi\) converges
Then, we eventually get the optimal policy. See Thm. 2.

Summary

Policy evaluation

Estimate \(v_{\pi}\)
Iterative policy evaluation

Policy improvement

Generate \(\pi' \ge \pi\)
Greedy policy improvement

Thm. 1 The greedy policy improves or at least retains the original policy's value.

Proof)

Given a policy \(\pi(a|s)\) (doesn't matter whether it is deterministic or not),
\(\pi'(s) := \underset{a \in A}{\mathrm{argmax}}\; q_{\pi}(s,a) \forall s \in S\)
Then, \(q_{\pi}(s, \pi'(s))\)는 s에서 바로 취하는 액션은 \(\pi'\)를 따라 결정하고, 그 뒤 액션들은 \(\pi\)를 따라 결정할 때의 action-state이다.
\(q_{\pi}(s, \pi'(s)) = \underset{a \in A}{\max}q_{\pi}(s,a) \ge \sum_{a \in A}{q_{\pi}(s,a)\pi(a|s)}\)

최대값은 당연히 평균값보다 큼

이때, \(\sum_{a \in A}{q_{\pi}(s,a)\pi(a|s)} = v_{\pi}(s)\)
\(\therefore q_{\pi}(s, \pi'(s)) \ge v_{\pi}(s)\)
따라서, \(v_{\pi}(s) \le q_{\pi}(s, \pi'(s)) \le q_{\pi'}(s, \pi'(s))\)

Value는 미래의 각 state를 따라 recursive하게 정의되므로, 현재 액션 이후의 액션들도 \(\pi'\)를 따르면 value가 monotonic increasing

이때, \(q_{\pi'}(s, \pi'(s)) = v_{\pi'}(s)\)

\(\pi'\) is deterministic.

\(\therefore v_{\pi}(s) \le v_{\pi'}(s) \; \blacksquare \)

Thm. 2 If \(\pi \equiv \pi'\), then both policies are optimal.

Proof)

\(\pi(s) = \pi'(s) := \underset{a \in A}{\mathrm{argmax}}\; q_{\pi}(s,a) \forall s \in S\)
즉, Bellman optimality equation을 만족하는 \(\pi\)를 찾은 셈이다. \(\blacksquare\)

Thm. 3 (Principle of Optimality) A policy \(\pi\) achieves the optimal value from state \(s\), \(v_{\pi}(s) = v_{*}(s)\), if and only if:

For any state \(s'\) reachable from \(s\), \(\pi\) achieves the optimal value from state \(s'\), \(v_{\pi}(s') = v_{*}(s')\).
Proof)

Omitted.

Note: Bellman optimality equation has optimal substructure.

Value Iteration

Motivation: policy iteration에는 불필요한 연산이 많음

Policy iteration은 policy evaluation loop이 완전히 수렴하고 난 뒤 greedy하게 새로운 정책을 구성한다.
Policy improvment 과정 에서 greedy choice를 하기 때문에 action들의 절대적인 value보다는 상대적인 value, 특히 그 중 value가 최대인 게 무엇인지만 알면 된다.
따라서 policy evaluation를 위한 iterative method가 수렴하기 전에도 action-value 간의 상하 관계만 정확히 얻을 수 있다면, 굳이 policy evaluation이 수렴하기 까지 기다릴 필요 없다.

강의 중 소개된 Small Gridworld 예시에서도, policy evaluation loop를 세 번만 돌았는데도 이미 policy evaluation이 수렴한 것과 동일한 greedy policy를 (오른쪽 열) 얻을 수 있었다. 따라서 굳이 \(k=4\) 이상의 계산을 할 필요가 없다.
\(\Rightarrow\) modified policy iteration: policy evaluation loop을 딱 \(k\) 번만 돌고 바로 policy improvement 해버리자! (\(k\)는 hyperparameter)

Value iteration

\(\Rightarrow\) policy evaluation loop을 딱 한 번만 돌고 바로 policy improvement 해버리자!
Value iteration은 policy evaluation loop의 각 step마다 policy update를 한다 (실제로 policy 자체를 매번 계산하는 건 아니고, value를 update함).
Recall: \(v_{*}(s) = \underset{a \in A}{\max}\left(\mathcal{R}(s,a) + \gamma \sum_{s' \in S}{T(s'|s,a)v_{*}(s')} \right)\)
Given the initial state-values \(v_{0}(s)\; \forall s \in S\),
\(V^{(k+1)} = \underset{a \in A}{\max}\left(R^a + \gamma T^a V^{(k)}\right)\)
Note: Thm 3.에 의해 Bellman optimality equation도 expectation equation처럼 DP를 적용 가능하므로, 마치 policy evaluation을 하는 것처럼, Bellman optimality equation도 iterative하게 풀어내는 게 value iteration이다. (즉, 굳이 위 motivation처럼 policy iteration의 단점을 짚지 않아도 수학적으로 알고리즘의 유효성을 보일 수 있었음)
Note: 일반적으로 value iteration이 policy iteration보다 빠르게 수렴한다.

Policy iteration vs. value iteration

Policy evaluation에서는 초기 state-value를 이용해서 주어진 policy에 대응하는 state-value를 iterative하게 찾는다. \(v_1 \rightarrow v_2 \rightarrow \cdots \rightarrow v_{\pi} \rightarrow v_{1'} \rightarrow \cdots \rightarrow v_{\pi'} \rightarrow \cdots \rightarrow v_{*}\)
Value iteration에서는 초기 state-value를 이용해서 최적의 policy에 대응하는 state-value를 iterative하게 찾는다. \(v_1 \rightarrow v_2 \rightarrow \cdots \rightarrow v_{*}\)

Summary

Asynchronous Dynamic Programming

Schematic of synchronous DP: \(x^{(k+1)} = b + Ax^{(k)}\), where \(x^{(k)}, b \in \mathbb{R}^{|S|\times 1}\) and \(A \in \mathbb{R}^{|S|\times |S|}\)

즉, 벡터 \(x\)의 모든 요소들을 한 번에 업데이트 한다.
만약 해당 벡터가 너무 커서 메모리에 한 번에 못 담을 정도라면? Synchronous DP를 쓰기 힘들다.

Asynchronous DP

간단히 말해, 벡터 \(x\)의 요소들을 굳이 한 번에 업데이트할 필요가 없고, 아무런 순서로 업데이트해도 async. DP를 쓰면 수렴이 보장된다.
예를 들어, \(x^{(k+1)} = b + Ax^{(k)}\) 식에서, \(A\) 행렬의 첫 row와 \(x^{(k)}\) 벡터를 이용해 \(x^{(k+1)}\)의 첫 원소를 업데이트한다. 즉, 이 경우 \(x^{(k+1)}\)의 첫 원소를 제외한 나머지 원소는 \(x^{(k)}\)와 같이 내버려둔다. 이 과정을 반복해서 \(x\)의 첫 원소부터 수렴하게 만들 수 있다. 그 뒤에 \(x\)의 두번째 원소를 위 과정을 반복해서 또 수렴하게 만들 수 있다. 이게 async. DP의 기본적인 아이디어이다.
Can significantly reduce computation
Guaranteed to converge if all states continue to be selected
Three simple strategies:

In-place DP
Prioritized sweeping
Real-time DP

Proofs on the Convergence of Policy and Value Iterations via Banach Fixed Point Theorem (a.k.a. Contraction Mapping Theorem)

Policy evaluation, policy iteration과 value iteration의 수렴을 증명하기 위한 정리. 엄밀한 증명을 위해서는 해석학 지식 소요.

참고 자료: Lecture 16: Value Iteration, Policy Iteration and Policy Gradient

Note: 이 정리를 이해하기 위해서 굳이 Cauchy sequence와 complete space의 개념에 대해 수학적으로 엄밀하게 이해할 필요는 없다. Complete space는 대충 닫힌 (closed) 공간이고 유계인 (bounded) 공간이다.
Note: Metric space도 점들 사이의 거리가 정의된 공간이라고 보면 된다. 거리를 정의할 수 없는 공간도 있나?! 싶으면 그냥 자연스럽게 넘어가면 된다. 또는 해석학 공부를 추천한다.
Def. Cauchy Sequence

A sequence \(x_1, x_2, x_, \cdots\) in a metric space \((X, d)\) is called Cauchy if for every positive real number \(r > 0\) there is a positive interger \(N\) such that for all positive integers \(m, n > N\), \(d(x_m, x_n)\) < r.
대충 말하자면, 아무리 작은 수를 \(r\)로 택해도 수열이 계속 진행하다보면 \(r\)보다 서로 간의 차이가 줄어든다. 이런 수열은 처음에는 perturbation이 있을지라도 수열이 진행될 수록 서로 간의 차이가 0에 가까워진다.

Def. Complete Space

A metric space \((X, d)\) is called complete if every Cauchy sequence of points in \(X\) has a limit that is also in \(X\).
참고 자료: Complete metric space - Wikipedia

Def. Contraction Mapping

Let \((X, d)\) be a complete metric space. Then a map \(T: X \rightarrow X\) is called a contraction mapping on \(X\) if there exists \(q \in [0,1)\) such that \(d(T(x), T(y)) \le qd(x,y)\; \forall x, y \in X\)

Def. Banach Fixed Point Theorem

Let \((X, d)\) be a non-empty complete metric space with a contraction mapping \(T: X \rightarrow X\). Then \(T\) admits a unique fixed-point \(x^* \in X\) (i.e., \(T(x^*)=x^*\)). Furthermore, \(x^*\) can be founded as follows:

Start with an arbitrary element \(x_0 \in X\) and define a sequence \({x_n}\) by \(x_n = T(x_{n-1})\) for \(n \ge 1\). Then \(x_n \rightarrow x^*\).

Proof) Policy evaluation의 수렴성

Given a set of countable states \(S\) (i.e., state space), consider the vector space \(\mathcal{V}\) of state-values. That is, \(i\)-th element of a vector \(V \in \mathcal{V}\) is the state-value of \(i\)-th state \(s \in S\).
Then, we can define the infinity-norm on (\mathcal{V}\) as follow:

\(\left \| u-v \right \|_{\infty} = \underset{s\in S}{\max} |u(s)-v(s)|\)

Note that the infinity-norm of a matrix is the maximum absolute row sum.

\(\left \| T \right \|_{\infty} = \underset{1 \le i \le |S|}{\max} \left (\sum_{j=1}^{|S|}|T_{ij}| \right )\)

Policy evaluation requires the following mapping \(M: V \rightarrow \mathcal{R}^{\pi} + \gamma\, T^{\pi}V\) (대충 표기하면 이렇다는 말).
\(\left \| M(U) - M(V) \right \|_{\infty} = \left \|(\mathcal{R}^{\pi} + \gamma\, T^{\pi}U) - (\mathcal{R}^{\pi} + \gamma\, T^{\pi}V) \right \|_{\infty}
\\ = \left \|\gamma\, T^{\pi}(U - V) \right \|_{\infty} \\ \le \gamma \left \|T^{\pi}\right \|_{\infty} \left \|(U - V) \right \|_{\infty} \\ = \gamma \left \|(U - V) \right \|_{\infty}\)

확률은 다 더하면 1이다.

따라서, \(\gamma \in [0, 1)\)인 \(\gamma\)에 대해 policy evaluation은 수렴한다. \(\blacksquare\)

Proof) Value iteration의 수렴성

위와 비슷한 구조로 논의하면,
Value iteration requires the following mapping \(M: V \rightarrow \underset{a \in A}{\max}\mathcal{R}^{a} + \gamma\, T^{a}V\) (대충 표기하면 이렇다는 말).
\(\left \| M(U) - M(V) \right \|_{\infty}
= \left \|(\underset{a \in A}{\max}\mathcal{R}^{a} + \gamma\, T^{a}U) - (\underset{a \in A}{\max}\mathcal{R}^{a} + \gamma\, T^{a}V) \right \|_{\infty}
\\ \le \left \|\underset{a \in A}{\max}\left((\mathcal{R}^{a} + \gamma\, T^{a}U) - (\mathcal{R}^{a} + \gamma\, T^{a}V)\right) \right \|_{\infty}
\\ = \gamma\, \left\|\underset{a \in A}{\max}T^{a}(U-V) \right \|_{\infty}\)
Let \(a^* = \underset{a \in A}{\mathrm{argmax}}\, T^{a}(U-V)\;\)
\(\left \| M(U) - M(V) \right \|_{\infty} \le \gamma \left \|T^{a^*}\right \|_{\infty} \left \|(U - V) \right \|_{\infty} = \gamma \left \|(U - V) \right \|_{\infty}\)
따라서, \(\gamma \in [0, 1)\)인 \(\gamma\)에 대해 value iteration은 수렴한다. \(\blacksquare\)

Proof) Policy iteration의 수렴성

\(\pi'(s) := \underset{a \in A}{\mathrm{argmax}}\; q_{\pi}(s,a) \forall s \in S\) 라고 하자.
만약 \(\pi = \pi'\)라면 Thm. 2에 의해 optimal policy를 찾은 셈이다. 종료.
만약 \(\pi < \pi'\)라면 다시 policy iteration을 수행한다.
이제 policy iteration이 언젠간 끝남을 증명하면 된다. 즉, 계속 strict increasing하지 않음을 보이면 된다.
Monotone convergence theorem 을 쓰면 될 것 같지만 policy가 계속 좋아지긴 하면서도 optimal에 정말 도달하는지를 증명하기는 부족하다. \({\frac{1}{n}}\) 수열이 결국 0에 도달하지는 않는 것처럼.

이 블로그 검색

Log of Inspiration and Retrospection

기록: UCL Course on RL 요약 및 정리 (Lecture 3: Planning by Dynamic Programming)

댓글

댓글 쓰기

이 블로그의 인기 게시물

시작: 블로그의 방향성

CS 285 (Deep Reinforcement Learning) 요약 및 정리 (Lecture 2: Supervised Learning of Behaviors)

기록: 강화학습 독학을 위한 자료들 (Materials for Studying Reinforcement Learning from Scratch)