기록: UCL Course on RL 요약 및 정리 (Lecture 5: Model-Free Control)

8월 17, 2021

This post covers Lecture 5: Model-Free Control (davidsilver.uk)

The entire course materials can be found on the following page: Teaching - David Silver.

Lecture Goal

Solve an unknown MDP

All information of an MDP is not necessarily given (e.g., reward, transition probabilities, etc.)

Model-free prediction: estimate the value function of an unknown MDP (lecture 4)
Model-free control: optimize the value function of an unknown MDP (this lecture)

On-policy and off-policy learning
GLIE Monte-Carlo control, SARSA, Q-learning

On-Policy Learning vs. Off-Policy Learning

On-policy

"Learn on the job"
Learn about (how to reach) the optimal policy from experience sampled from the entire policy space from scratch

Off-policy

"Look over someone's (or oneself's) shoulder"
Learn about (how to reach) the optimal policy from experience sampled from the given exemplar policy (i.e., behaviors)
exemplar policy = guidance, teacher, advisor (optimal policy에 비해 꼭 낫다는 의미는 아님)
Monte Carlo path tracing 분야에서 비유하자면 importance sampling

Generalized View on Policy Iteration

Policy evaluation: estimate \(v_{\pi}\)

E.g., iterative policy evaluation

Policy improvement: generate better policy (\(\pi' \ge \pi\))

E.g. greedy policy improvement

We can choose any method for each step! What methods are more efficient and effective?

Generalized Policy Iteration with Monte-Carlo Evaluation

Policy evaluation = MC policy evaluation
Policy improvement = Greedy policy improvement?
Problems?

Greedy policy improvement over state-value function requires the model of MDP: Model-free한 문제에서 MC policy evaluation을 state-value function 측정에 적용하면, MDP의 reward와 transition probability를 모르기 때문에 policy improvement를 할 수 없다.
\(\pi'(s) = \underset{a \in A}{\mathrm{argmax}}\; (\mathcal{R}_s^a + \mathcal{T}_{ss'}^aV_{\pi}(s'))\)
Exploration issue on policy improvement: If you improve your policy greedily all the time, you don't guarantee that feasible trajectories explore the entire state space, and there might be a more interesting and better state that you never see.
DP의 경우 policy evaluation 과정에서 state-value나 action-value를 측정할 때 결국 모든 경우의 수를 다 탐방하면서 실제 value를 얻는다. 그래서 policy improvement에서 greedy choice를 해도 문제없다. 그러나 MC처럼 stochastic한 방법을 쓰면 policy evaluation을 할 때 유한한 갯수의 paths만 탐방하여 value를 estimate하기 때문에, 특정 paths들은 아예 value에 반영이 안된다. 따라서 greedy policy improvement를 쓰면 운이 나빠서 value에 반영되지 못한 paths들을 평생 놓치게 된다.

Model-Free Policy Iteration Using Action-Value Function

The action-value function enables policy improvement in a model-free MDP.
\(\pi'(s) = \underset{a \in A}{\mathrm{argmax}}\; q_{\pi}(s,a)\)

\(\epsilon\)-Greedy Policy Improvement

Remedy the exploration issue of the greedy method
Simplest idea for ensuring continual exploration
All actions are tried with non-zero probability
\(\pi'(a|s) = \left\{\begin{matrix} \frac{\epsilon}{m}+1-\epsilon \;\;\; \mathrm{if} \; a=\underset{a' \in A}{\mathrm{argmax}}\; q_{\pi}(s,a') \\ \frac{\epsilon}{m} \;\;\;\;\;\;\;\;\;\;\;\;\;\;\; \mathrm{otherwise} \hfill \end{matrix}\right.\)

GPI with MC Evaluation with Action-Value Function

Policy evaluation = MC policy evaluation with respect to \(Q = q_{\pi}\)
Policy improvement = \(\epsilon\)-greedy policy improvement

결론: \(\epsilon\)-greedy를 써도 어쨌든 policy가 개선된다.

Monte-Carlo Control

위의 방법론은 policy improvement까지 시간이 너무 많이 걸린다. Value iteration처럼, MC 방법으로 매 episode 마다 value estimates를 update할 때 policy도 같이 update하자.

Every episode (이전 episode를 끝내고 한 termination state에 닿을 때까지):
Policy evaluation = MC policy evaluation with respect to \(Q \approx q_{\pi}\)
Policy improvement = \(\epsilon\)-greedy policy improvement with respect to \(Q\)
Note: 그런데 순수하게 이 방법으로 하면 optimal policy에 도달하지 못한다. Optimality에서는 각 state에서 제일 좋은 action 하나만 취해야 하므로. \(\epsilon\)을 줄여 나가야 한다.

Greedy in the Limit with Infinite Exploration (GLIE)

\(\epsilon\)-greedy control converges to the optimal action-value function under the following conditions on \(\epsilon\) scheduling:

All state-action pairs are explored infinitely many times,

\(\lim_{k \rightarrow \infty} N_k(s,a) = \infty \;\;\; \forall s,a\)

The policy converges on a greedy policy,

\(\lim_{k \rightarrow \infty} \pi_k(a|s) = \mathbf{1}(a= \underset{a' \in A}{\mathrm{argmax}}\; Q_k(s,a'))\)

Note: \(\epsilon=1/k\)로 설정한 \(\epsilon\)-greedy는 GLIE이다.

GLIE Monte-Carlo Control

Note: 첫 줄에 episode는 원래 set이 아니라 순서쌍처럼 순서를 고려할 수 있는 개념으로 정의해야 한다.

MC vs. TD Control

Advantages of TD over MC

Lower variance
Online
Incomplete sequences

Natural idea: use TD instead of MC in our control loop

Apply TD to \(Q(s,a)\)
Use \(\epsilon\)-greedy policy improvement
Update at every time-step
\(\Rightarrow\) SARSA

\(Q(s,a) \leftarrow Q(s,a) + \alpha(R + \gamma Q(s',a') - Q(s,a))\)

SARSA for On-Policy Control

Convergence of SARSA

Sarsa converges to the optimal action-value function under the following conditions:

GLIE sequence of policies \(\pi_t(a|s)\), where \(t\) counts the number of total steps (i.e., 위 알고리즘에서 두번째 repeat에 닿을 때마다 1 추가)
Robbins-Monro sequence of step-sizes \(\alpha_t\), that is:

\(\sum_{t=1}^{\infty}\alpha_t = \infty\)
\(\sum_{t=1}^{\infty}\alpha_t^2 < \infty\)

SARSA(\(\lambda\)) Algorithm (Backward View)

Define the eligibility traces:

\(E_0(s,a) = 0 \\ E_t(s,a) = \gamma \lambda E_{t-1}(s,a) + \mathbf{1}(S_t = s, A_t = a)\)

Then, \(Q(s,a)\) is updated for every state \(s\) and action \(a\) as follow:

\(\delta_t = R_{t+1} + \gamma\, Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \\
Q(s,a) \leftarrow Q(s,a) + \alpha \delta_t E_t(s,a)\)

Off-Policy Learning

Evaluate target policy \(\pi(a|s)\) to compute \(v_{\pi}(s)\) or \(q_{\pi}(s,a)\)
While following exemplar 'behaviour' policy \(\mu(a|s)\)
Use cases

Learn from observing humans or other agents
Re-use experience generated from old policies \(\pi_1, \pi_2, \cdots, \pi_{t-1}\) (i.e., learn from observing itself)
Learn about optimal policy while following exploratory policy

Exploration-exploitation balance

Learn about multiple policies while following one policy

Importance Sampling for Off-Policy MC and TD

Importance sampling

주어진 확률 변수 \(X\)와 그 확률 변수를 추출하는 확률 (밀도 또는 질량) 함수 \(P(X)\)가 있다고 하자. 확률 함수가 복잡하여 확률 함수에서 곧바로 샘플을 추출하기는 (i.e., \(X \sim P\)) 계산적으로 어렵거나 불가능하지만 알고 있는 확률 변수 값 \(x\)이 어떤 확률로 샘플링될지는 (i.e., \(P(X=x)\)) 안다고 할 때, 해당 확률 함수를 이미 알고있는 다른 확률 함수로 (\(Q\)라 하자) 대체하여 확률 변수에 관한 임의의 통계량 \(\mathbb{E}_{X \sim P}[f(X)]\)을 구하는 방법.
\(\mathbb{E}_{X \sim P}[f(X)] = \mathbb{E}_{X \sim Q}[\frac{P(X)}{Q(X)}f(X)] \)
\(Q\)를 어떻게 선택하느냐에 따라 위 기댓값의 추정량(표본 평균)의 분산이 달라진다.

Off-policy

\(\mu\)를 따라 paths를 추출하지만 결국 \(\pi\)의 state-/action-value를 알고 싶은 것이므로, \(\pi\)를 따라 추출할 때의 \(G_t\)의 기댓값이 뭘지 보정해서 구해야 한다.
\(\mathbb{E}_{\pi}[G_t] = \mathbb{E}_{\mu}[\frac{\pi(\cdot)}{\mu(\cdot)}G_t] \)

Off-policy Monte-Carlo

Off-policy temporal difference

Q-Learning

No importance sampling
수식 해석

현재 state에서 action을 추출할 때는 behaviour policy를 따르고, \(G_t\)를 근사할 때는 target policy를 따라 액션을 추출하는 SARSA의 off-policy 버전 변형을 생각해보자. SARSA는 두 액션 모두 target policy에서 추출한다.
\(Q(s,a) \leftarrow Q(s,a\sim \mu) + \alpha(R + \gamma Q(s',a'\sim \pi) - Q(s,a))\)
이때, behaviour과 target policies 모두를 개선한다고 하고, target policy는 action-value greedy로, behaviour policy는 action-value \(\epsilon\)-greedy로 두자. 그러면,

Note: 이렇게 하면 exploration도 적당히 하면서 optimal policy를 찾을 수 있다.

\(R + \gamma Q(s', a'\sim\pi)
\\ = R + \gamma Q(s', \underset{a' \in A}{\mathrm{argmax}}\; Q(s', a'))
\\ = R + \gamma \underset{a' \in A}{\max}Q(s',a') \)
\(Q(s,a) \leftarrow Q(s,a) + \alpha(R + \gamma \underset{a' \in A}{\max}Q(s',a') - Q(s,a))\)

SARSA (revisited)

\(Q(s,a) \leftarrow Q(s,a) + \alpha(R + \gamma Q(s',a') - Q(s,a))\)

이 블로그 검색

Log of Inspiration and Retrospection

기록: UCL Course on RL 요약 및 정리 (Lecture 5: Model-Free Control)

댓글

댓글 쓰기

이 블로그의 인기 게시물

시작: 블로그의 방향성

CS 285 (Deep Reinforcement Learning) 요약 및 정리 (Lecture 2: Supervised Learning of Behaviors)

기록: 강화학습 독학을 위한 자료들 (Materials for Studying Reinforcement Learning from Scratch)