Notes: UCL Course on RL Summary (Lecture 6: Value Function Approximation)

This post covers Lecture 6: Value Function Approximation (davidsilver.uk) 


Lecture Goal
  • Learn how to deal with a large-scale (potentially continuous) state space with RL
  • Learn value function approximation
  • Learn Deep Q-Network (DQN) and batch methods
Credit: finding the order of a polynomial approximation and a least-squares fit for given data points (a minimal sketch follows below)
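
As a quick warm-up (not part of the lecture), here is a minimal sketch of that exercise, assuming NumPy is available: fit polynomials of increasing order to some toy data points by least squares and compare the training error.

```python
import numpy as np

# Toy data points: a noisy sine curve standing in for an unknown target function.
rng = np.random.default_rng(0)
xs = np.linspace(0.0, 2.0 * np.pi, 30)
ys = np.sin(xs) + 0.1 * rng.standard_normal(xs.shape)

# Least-squares polynomial fits of increasing order.
for order in (1, 3, 5, 9):
    coeffs = np.polyfit(xs, ys, deg=order)   # least-squares fit of the given order
    preds = np.polyval(coeffs, xs)           # evaluate the fitted polynomial
    print(f"order={order}: training MSE={np.mean((ys - preds) ** 2):.4f}")
```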

Value Function Approximation
  • Lookup table
    • A table stores an entry for each lookup index.
    • State-value
      • Index = each \(s \in S\), where \(S\) is a finite, discrete state space
      • Entry = \(V(s)\)
    • Action-value
      • Either every state-action pair \(s, a\) has an entry \(Q(s,a)\),
      • or every state \(s\) has entries \(Q(s,a)\) for all actions \(a \in A\)
    • So far, we have represented a value function by a lookup table.
  • Large MDPs
    • Memory inefficiency: too many states or actions to store in memory
    • Too slow to learn the value of each state individually
      • Note: neighboring states often correlate strongly with each other, and we can exploit this structure when learning a value function.
  • Solution for large MDPs
    • Estimate a value function with function approximation
      • \(\hat{v}(s,\mathbf{w}) \approx v_{\pi}(s)\)
        or \(\hat{q}(s,a,\mathbf{w}) \approx q_{\pi}(s,a)\)
      • where \(\mathbf{w}\) is the parameter vector of the function approximator
    • Generalize from seen states to unseen states
    • Update parameter \(\mathbf{w}\) using MC or TD learning

Requirements for Value Function Approximators
  • RL considers differentiable function approximators, e.g.
    • Linear combination of features (linear regressor);
    • Neural network (non-linear regressor).
  • Furthermore, we require a training method that is suitable for non-stationary and non-i.i.d. data.
    • I.i.d random variables = independent and identically distributed random variables
    • State/action values keep changing at every episode and every step.
    • Typically, classification models assume stationary and i.i.d data (e.g., CIFAR10).

Ideal Incremental Methods for Updating Value Function Approximators
  • Gradient descent
    • Goal: find parameter vector \(\mathbf{w}\) minimizing mean-squared error between \(\hat{v}(s,\mathbf{w})\) (i.e., value function approximator) and \(v_{\pi}(s)\) (i.e., true value)
      • \(J(\mathbf{w}) = \mathbb{E}_{\pi}[(v_{\pi}(S) - \hat{v}(S, \mathbf{w}))^2]\)
    • Gradient descent finds a local minimum by updating \(\mathbf{w}\) along \(\Delta \mathbf{w}\), where
      • \(\Delta \mathbf{w} = - \frac{1}{2}\alpha \nabla_{\mathbf{w}} J(\mathbf{w})
        \\ = \alpha \mathbb{E}_{\pi}[(v_{\pi}(S) - \hat{v}(S, \mathbf{w}))\nabla_{\mathbf{w}}\hat{v}(S,\mathbf{w})] \)
        \(\mathbf{w} \leftarrow \mathbf{w} + \Delta \mathbf{w}\)
      • and \(\alpha\) is a step-size hyperparameter.
      • Note: previously, policy evaluation updated each entry of the lookup table via \(Q(S,A) \leftarrow Q(S,A) + \alpha(R+\gamma Q(S',A') - Q(S,A))\). From now on we use a function approximator instead of a lookup table, and we update it in parameter space via \(\mathbf{w} \leftarrow \mathbf{w} + \Delta \mathbf{w}\). See Control with Value Function Approximation below.
      • Repeat this procedure until convergence
        • Expect that \(\hat{v}\) approaches \(v_{*}\) over the course of policy iteration
  • Stochastic gradient descent samples the gradient using a single state (or state-action) sample; here, a state \(s \in S\):
    • For every episode and every step in an episode:
      • \(\Delta \mathbf{w} = \alpha (v_{\pi}(s) - \hat{v}(s, \mathbf{w}))\nabla_{\mathbf{w}}\hat{v}(s,\mathbf{w}) \)
        \(\mathbf{w} \leftarrow \mathbf{w} + \Delta \mathbf{w}\)
    • Repeat this procedure until convergence
    • The expected update is equal to the full gradient update.
  • In practice, the true value function \(v_{\pi}\) is not given at all! (The sketch below assumes an oracle target purely for illustration.)
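
A minimal sketch of the single SGD step above, assuming purely for illustration that a target standing in for \(v_{\pi}(s)\) is available; `v_hat` and `grad_v_hat` are hypothetical callables for the approximator and its gradient, not anything from the lecture.

```python
def sgd_step(w, s, target, v_hat, grad_v_hat, alpha=0.01):
    """One SGD step: w <- w + alpha * (target - v_hat(s, w)) * grad_w v_hat(s, w).

    `target` stands in for v_pi(s), which in practice is not available.
    """
    return w + alpha * (target - v_hat(s, w)) * grad_v_hat(s, w)
```

With the linear model of the next section, `grad_v_hat(s, w)` is simply the feature vector \(\mathbf{x}(s)\).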

Function Approximator Example: Linear Model
  • Feature vector
    • Represents a state by a vector of features
    • For example:
      • Coordinates, velocity, and acceleration of a robot
      • Stock prices in the stock market
      • Piece and pawn configurations in chess
  • Linear model
    • \(\hat{v}(s,\mathbf{w}) = \mathbf{x}(s)^{T} \mathbf{w}\), where \(\mathbf{x}(s)\) is the feature vector
      \(\therefore \nabla_{\mathbf{w}} \hat{v}(s,\mathbf{w}) = \mathbf{x}(s)\)
    • Note: \(\Delta \mathbf{w} = \alpha (v_{\pi}(s) - \hat{v}(s, \mathbf{w}))\nabla_{\mathbf{w}}\hat{v}(s,\mathbf{w}) \)
  • Again, in practice the true value function \(v_{\pi}\) is not given at all! (A minimal linear-model sketch follows below.)
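
A minimal sketch of the linear model, assuming a hypothetical hand-crafted feature map `features`; it plugs directly into the `sgd_step` sketch above.

```python
import numpy as np

def features(s):
    """Hypothetical feature map x(s): a bias term plus simple polynomial features."""
    return np.array([1.0, s, s ** 2])

def v_hat(s, w):
    """Linear value estimate: v_hat(s, w) = x(s)^T w."""
    return features(s) @ w

def grad_v_hat(s, w):
    """For a linear model, the gradient w.r.t. w is just the feature vector x(s)."""
    return features(s)

# One update toward an (assumed-known, for illustration only) target v_pi(s):
w = np.zeros(3)
alpha, s, v_pi_s = 0.1, 0.5, 1.2
w = w + alpha * (v_pi_s - v_hat(s, w)) * grad_v_hat(s, w)
```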

Incremental Prediction Methods for Updating Value Function Approximators
  • RL is not supervised learning, unlike what we assumed above (i.e., the true value samples \(v_{\pi}\) are not available).
  • In practice, we substitute an MC/TD target for \(v_{\pi}\).
    • That is, the training targets are also noisy in the supervised-learning sense!
    • Even worse, an MC/TD target is non-stationary and non-i.i.d.
    • Note that the MC/TD target is treated as a constant with respect to the model parameters \(\mathbf{w}\) when computing the gradient, i.e., terms like \(\nabla_{\mathbf{w}}G_t\) are ignored. There are algorithms that also treat the target as a function of the model parameters, but naively differentiating through the target converges to the wrong answer.
    • If the model is not linear, simply put \(\nabla_{\mathbf{w}}\hat{v}(S_t,\mathbf{w})\) in place of \(\mathbf{x}(S_t)\).
    • The eligibility trace here differs from the per-state trace from the TD lecture; it is an eligibility trace over the model parameters \(\mathbf{w}\) only. The per-state trace raises a state's eligibility when that state is visited, via the indicator \(\mathbf{1}(S_t=s)\), but with a function approximator the whole parameter vector is updated regardless of which state we are in, so the indicator \(\mathbf{1}(\cdot)\) is not needed.
      • The per-state trace from the TD lecture: \(E_t(s) = \gamma\lambda E_{t-1}(s) + \mathbf{1}(S_t = s)\); with a linear approximator it becomes \(E_t = \gamma\lambda E_{t-1} + \mathbf{x}(S_t)\).
    • Comparing the forward and backward views, the total update \(\sum_{t=1}^{\infty}\Delta \mathbf{w}_t\) is the same.
  • Monte Carlo with value function approximation
    • An MC target, return \(G_t\), is an unbiased estimator of true value \(v_{\pi}\).
    • Though the return is unbiased, it is very noisy; still, MC evaluation converges to a local optimum even when using a non-linear value function approximator.
  • Temporal-difference with value function approximation
    • Though a TD target is biased, linear TD(0) converges (close) to the global optimum.
  • Action-value function approximation works analogously (a prediction sketch follows below).
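
A minimal sketch of the incremental MC and TD(0) prediction updates with the MC/TD target substituted for \(v_{\pi}\); the linear `features` map is the hypothetical one from the sketch above, and `episode` is an assumed list of (state, reward) pairs.

```python
import numpy as np

def mc_update(w, episode, features, alpha=0.01, gamma=1.0):
    """MC prediction: use the return G_t as the target for each visited state.
    `episode` is [(S_0, R_1), (S_1, R_2), ...] for one finished episode."""
    G = 0.0
    for s, r in reversed(episode):
        G = r + gamma * G                    # return G_t from state s onward
        x = features(s)
        w = w + alpha * (G - x @ w) * x      # G_t treated as constant w.r.t. w
    return w

def td0_update(w, s, r, s_next, done, features, alpha=0.01, gamma=1.0):
    """TD(0) prediction: bootstrap target R + gamma * v_hat(S', w)."""
    x = features(s)
    v_next = 0.0 if done else features(s_next) @ w
    target = r + gamma * v_next              # gradient of the target is ignored
    return w + alpha * (target - x @ w) * x
```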

Control with Value Function Approximation
  • Nothing special here: just do policy improvement with \(\epsilon\)-greedy action selection.
  • Repeat for each episode:
    • Choose a starting state \(S\) randomly
    • Choose an action from the state using policy derived from the (initialized) action-value approximator (\(\epsilon\)-greedy)
    • Repeat until terminal:
      • Take the action, observe a reward \(R\) and the next state \(S'\)
      • Choose an action from the approximator as above (\(\epsilon\)-greedy)
      • Calculate a TD/MC target
      • Update the approximator:
        • \(\mathbf{w} \leftarrow \mathbf{w} + \Delta \mathbf{w}\)
        • analogously to \(Q(S,A) \leftarrow Q(S,A) + \alpha(R+\gamma Q(S',A') - Q(S,A))\)
      • Proceed to the next state and action
  • Cf. the on-policy SARSA algorithm from the model-free control lecture; a sketch with function approximation follows below.
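
A minimal sketch of the control loop above as on-policy SARSA with an action-value approximator \(\hat{q}(s,a,\mathbf{w})\); the Gym-like `env` interface, `q_hat`, and `grad_q_hat` are assumptions for illustration, not the lecture's exact pseudocode.

```python
import numpy as np

def epsilon_greedy(w, s, actions, q_hat, eps=0.1):
    """With probability eps pick a random action, otherwise the greedy one."""
    if np.random.rand() < eps:
        return actions[np.random.randint(len(actions))]
    return max(actions, key=lambda a: q_hat(s, a, w))

def sarsa_episode(env, w, actions, q_hat, grad_q_hat,
                  alpha=0.01, gamma=0.99, eps=0.1):
    """One on-policy SARSA episode with q_hat(s, a, w) as the action-value model."""
    s = env.reset()                               # assumed Gym-like interface
    a = epsilon_greedy(w, s, actions, q_hat, eps)
    done = False
    while not done:
        s_next, r, done = env.step(a)             # assumed to return (S', R, done)
        if done:
            target = r                            # no bootstrapping at the terminal state
        else:
            a_next = epsilon_greedy(w, s_next, actions, q_hat, eps)
            target = r + gamma * q_hat(s_next, a_next, w)   # SARSA TD target
        w = w + alpha * (target - q_hat(s, a, w)) * grad_q_hat(s, a, w)
        if not done:
            s, a = s_next, a_next
    return w
```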

Parameter Convergence of Prediction Algorithms
  • For TD, there are known examples where the value-approximator update diverges during policy evaluation, in the on-policy + non-linear case and in the off-policy case. Empirically it is usually safe (MC is always safe in theory).
  • Gradient TD solves this problem completely.

Parameter Convergence of Control Algorithms

Batch Prediction Methods for Updating Value Function Approximators
  • With stochastic gradient descent, once we update \(\mathbf{w} \leftarrow \mathbf{w} + \Delta \mathbf{w}\), the information about that transition is thrown away. Batch RL is the idea of collecting transitions and reusing them repeatedly; a sketch follows below.
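
A minimal sketch of the batch idea under these assumptions: store (state, target) pairs as `experience` and keep resampling them for SGD updates instead of discarding each sample after one step. The linear `features` map is the hypothetical one from the earlier sketches.

```python
import numpy as np

def batch_fit(w, experience, features, alpha=0.01, epochs=50, batch_size=32, seed=0):
    """Repeatedly resample stored (state, target) pairs and apply SGD updates.
    With enough passes this approaches the least-squares fit to the stored targets."""
    rng = np.random.default_rng(seed)
    experience = list(experience)
    for _ in range(epochs):
        idx = rng.choice(len(experience), size=batch_size)  # sample with replacement
        for i in idx:
            s, target = experience[i]
            x = features(s)
            w = w + alpha * (target - x @ w) * x
    return w
```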

Deep Q-Networks (DQN)
  • Q-learning with function approximation
  • The two main reasons DQN learned a wide range of Atari games well:
    • Experience replay
      • Stochastic GD learns from a single sample at a time, so it works well only when that single-sample gradient is a good approximation of the expected gradient used by full GD. However, when learning incrementally as we have so far, adjacent transitions within an episode, such as \(S_t, S_{t+1}\), are too strongly correlated with each other, which hurts learning. Computing a batch gradient over diverse, randomly drawn samples gives a proper estimate of the expected gradient and improves learning efficiency.
    • Fixed Q-targets
      • Separate the network used for bootstrapping (estimating the value of the next state/action) from the network trained on action-values (i.e., separate target and behaviour networks).
      • As shown in the slide, bootstrapping computes the Q-target with a previously saved network (e.g., the model parameters from 1000 steps ago), while a different network (e.g., the current model parameters) is the one actually trained toward that Q-target.
      • Note: one could presumably also keep a replay buffer of model parameters and bootstrap with an ensemble of them.
    • Thanks to these, there is no oscillation or exploding loss during training; training is stabilized. A minimal sketch of both ingredients follows below.
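
A minimal sketch of the two ingredients together, using a linear \(\hat{q}(s,a,\mathbf{w}) = \mathbf{x}(s,a)^{T}\mathbf{w}\) for brevity (the real DQN uses a convolutional neural network over raw pixels); `q_features`, the replay contents, and all hyperparameters are illustrative assumptions.

```python
import random
import numpy as np

def dqn_style_training(w, replay, q_features, actions,
                       alpha=1e-3, gamma=0.99, batch_size=32,
                       num_steps=10_000, target_sync=1_000):
    """Experience replay + fixed Q-targets with a linear q_hat(s, a, w) = x(s, a)^T w.
    `replay` is a list of (s, a, r, s_next, done) transitions collected beforehand."""
    w_target = w.copy()                       # frozen copy used for bootstrapping
    for step in range(1, num_steps + 1):
        # Experience replay: a random minibatch breaks the correlation between
        # consecutive transitions within an episode.
        batch = random.sample(replay, k=min(batch_size, len(replay)))
        for s, a, r, s_next, done in batch:
            x = q_features(s, a)
            if done:
                target = r
            else:
                # Fixed Q-target: bootstrap with the frozen parameters, not the current w.
                target = r + gamma * max(q_features(s_next, a2) @ w_target
                                         for a2 in actions)
            w = w + alpha * (target - x @ w) * x
        if step % target_sync == 0:
            w_target = w.copy()               # periodically refresh the target copy
    return w
```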


