Notes: UCL Course on RL Summary (Lecture 6: Value Function Approximation)

This post covers Lecture 6: Value Function Approximation (davidsilver.uk) 


Lecture Goal
  • Learn how to deal with a large-scale (potentially continuous) state space with RL
  • Learn value function approximation
  • Learn Deep Q-Network (DQN) and batch methods
Credit: finding the order of a polynomial approximation and a least-squares fit for given data points (a minimal sketch follows below)
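
As a quick warm-up (not part of the lecture), here is a minimal sketch of that exercise, assuming NumPy is available: fit polynomials of increasing order to some toy data points by least squares and compare the training error.

```python
import numpy as np

# Toy data points: a noisy sine curve standing in for an unknown target function.
rng = np.random.default_rng(0)
xs = np.linspace(0.0, 2.0 * np.pi, 30)
ys = np.sin(xs) + 0.1 * rng.standard_normal(xs.shape)

# Least-squares polynomial fits of increasing order.
for order in (1, 3, 5, 9):
    coeffs = np.polyfit(xs, ys, deg=order)   # least-squares fit of the given order
    preds = np.polyval(coeffs, xs)           # evaluate the fitted polynomial
    print(f"order={order}: training MSE={np.mean((ys - preds) ** 2):.4f}")
```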

Value Function Approximation
  • Lookup table
    • A table stores an entry for each lookup index.
    • State-value
      • Index = each \(s \in S\), where \(S\) is a finite, discrete state space
      • Entry = \(V(s)\)
    • Action-value
      • Either every state-action pair \(s, a\) has an entry \(Q(s,a)\),
      • or every state \(s\) has entries \(Q(s,a)\) for all actions \(a \in A\)
    • So far, we have represented a value function by a lookup table.
  • Large MDPs
    • Memory inefficiency: too many states or actions to store in memory
    • Too slow to learn the value of each state individually
      • Note: neighboring states often correlate strongly with each other, and we can exploit this structure when learning a value function.
  • Solution for large MDPs
    • Estimate a value function with function approximation
      • \(\hat{v}(s,\mathbf{w}) \approx v_{\pi}(s)\)
        or \(\hat{q}(s,a,\mathbf{w}) \approx q_{\pi}(s,a)\)
      • where \(\mathbf{w}\) is the parameter vector of the function approximator
    • Generalize from seen states to unseen states
    • Update parameter \(\mathbf{w}\) using MC or TD learning

Requirements for Value Function Approximators
  • RL considers differentiable function approximators, e.g.
    • Linear combination of features (linear regressor);
    • Neural network (non-linear regressor).
  • Furthermore, we require a training method that is suitable for non-stationary and non-i.i.d. data.
    • I.i.d random variables = independent and identically distributed random variables
    • State/action values keep changing at every episode and every step.
    • Typically, classification models assume stationary and i.i.d data (e.g., CIFAR10).

Ideal Incremental Methods for Updating Value Function Approximators
  • Gradient descent
    • Goal: find parameter vector \(\mathbf{w}\) minimizing mean-squared error between \(\hat{v}(s,\mathbf{w})\) (i.e., value function approximator) and \(v_{\pi}(s)\) (i.e., true value)
      • \(J(\mathbf{w}) = \mathbb{E}_{\pi}[(v_{\pi}(S) - \hat{v}(S, \mathbf{w}))^2]\)
    • Gradient descent finds a local minimum by updating \(\mathbf{w}\) along \(\Delta \mathbf{w}\), where
      • \(\Delta \mathbf{w} = - \frac{1}{2}\alpha \nabla_{\mathbf{w}} J(\mathbf{w})
        \\ = \alpha \mathbb{E}_{\pi}[(v_{\pi}(S) - \hat{v}(S, \mathbf{w}))\nabla_{\mathbf{w}}\hat{v}(S,\mathbf{w})] \)
        \(\mathbf{w} \leftarrow \mathbf{w} + \Delta \mathbf{w}\)
      • and \(\alpha\) is a step-size hyperparameter.
      • Note: previously, policy evaluation updated each entry of the lookup table via \(Q(S,A) \leftarrow Q(S,A) + \alpha(R+\gamma Q(S',A') - Q(S,A))\). From now on we use a function approximator instead of a lookup table, and we update it in parameter space via \(\mathbf{w} \leftarrow \mathbf{w} + \Delta \mathbf{w}\). See Control with Value Function Approximation below.
      • Repeat this procedure until convergence
        • Expect that \(\hat{v}\) approaches \(v_{*}\) over the course of policy iteration
  • Stochastic gradient descent samples the gradient using a single state (or state-action) sample; here, a state \(s \in S\):
    • For every episode and every step in an episode:
      • \(\Delta \mathbf{w} = \alpha (v_{\pi}(s) - \hat{v}(s, \mathbf{w}))\nabla_{\mathbf{w}}\hat{v}(s,\mathbf{w}) \)
        \(\mathbf{w} \leftarrow \mathbf{w} + \Delta \mathbf{w}\)
    • Repeat this procedure until convergence
    • The expected update is equal to the full gradient update.
  • In practice, the true value function \(v_{\pi}\) is not given at all! (The sketch below assumes an oracle target purely for illustration.)
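
A minimal sketch of the single SGD step above, assuming purely for illustration that a target standing in for \(v_{\pi}(s)\) is available; `v_hat` and `grad_v_hat` are hypothetical callables for the approximator and its gradient, not anything from the lecture.

```python
def sgd_step(w, s, target, v_hat, grad_v_hat, alpha=0.01):
    """One SGD step: w <- w + alpha * (target - v_hat(s, w)) * grad_w v_hat(s, w).

    `target` stands in for v_pi(s), which in practice is not available.
    """
    return w + alpha * (target - v_hat(s, w)) * grad_v_hat(s, w)
```

With the linear model of the next section, `grad_v_hat(s, w)` is simply the feature vector \(\mathbf{x}(s)\).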

Function Approximator Example: Linear Model
  • Feature vector
    • Represents a state by a vector of features
    • For example:
      • Coordinates, velocity, and acceleration of a robot
      • Stock prices in the stock market
      • Piece and pawn configurations in chess
  • Linear model
    • \(\hat{v}(s,\mathbf{w}) = \mathbf{x}(s)^{T} \mathbf{w}\), where \(\mathbf{x}(s)\) is the feature vector
      \(\therefore \nabla_{\mathbf{w}} \hat{v}(s,\mathbf{w}) = \mathbf{x}(s)\)
    • Note: \(\Delta \mathbf{w} = \alpha (v_{\pi}(s) - \hat{v}(s, \mathbf{w}))\nabla_{\mathbf{w}}\hat{v}(s,\mathbf{w}) \)
  • Again, in practice the true value function \(v_{\pi}\) is not given at all! (A minimal linear-model sketch follows below.)
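
A minimal sketch of the linear model, assuming a hypothetical hand-crafted feature map `features`; it plugs directly into the `sgd_step` sketch above.

```python
import numpy as np

def features(s):
    """Hypothetical feature map x(s): a bias term plus simple polynomial features."""
    return np.array([1.0, s, s ** 2])

def v_hat(s, w):
    """Linear value estimate: v_hat(s, w) = x(s)^T w."""
    return features(s) @ w

def grad_v_hat(s, w):
    """For a linear model, the gradient w.r.t. w is just the feature vector x(s)."""
    return features(s)

# One update toward an (assumed-known, for illustration only) target v_pi(s):
w = np.zeros(3)
alpha, s, v_pi_s = 0.1, 0.5, 1.2
w = w + alpha * (v_pi_s - v_hat(s, w)) * grad_v_hat(s, w)
```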

Incremental Prediction Methods for Updating Value Function Approximators
  • RL is not supervised learning, unlike what we assumed above (i.e., the true value samples \(v_{\pi}\) are not available).
  • In practice, we substitute an MC/TD target for \(v_{\pi}\).
    • That is, the training targets are also noisy in the supervised-learning sense!
    • Even worse, an MC/TD target is non-stationary and non-i.i.d.
    • Note that the MC/TD target is treated as a constant with respect to the model parameters \(\mathbf{w}\) when computing the gradient, i.e., terms like \(\nabla_{\mathbf{w}}G_t\) are ignored. There are algorithms that also treat the target as a function of the model parameters, but naively differentiating through the target converges to the wrong answer.
    • If the model is not linear, simply put \(\nabla_{\mathbf{w}}\hat{v}(S_t,\mathbf{w})\) in place of \(\mathbf{x}(S_t)\).
    • The eligibility trace here differs from the per-state trace from the TD lecture; it is an eligibility trace over the model parameters \(\mathbf{w}\) only. The per-state trace raises a state's eligibility when that state is visited, via the indicator \(\mathbf{1}(S_t=s)\), but with a function approximator the whole parameter vector is updated regardless of which state we are in, so the indicator \(\mathbf{1}(\cdot)\) is not needed.
      • The per-state trace from the TD lecture: \(E_t(s) = \gamma\lambda E_{t-1}(s) + \mathbf{1}(S_t = s)\); with a linear approximator it becomes \(E_t = \gamma\lambda E_{t-1} + \mathbf{x}(S_t)\).
    • Comparing the forward and backward views, the total update \(\sum_{t=1}^{\infty}\Delta \mathbf{w}_t\) is the same.
  • Monte Carlo with value function approximation
    • An MC target, return \(G_t\), is an unbiased estimator of true value \(v_{\pi}\).
    • Though the return is unbiased, it is very noisy; still, MC evaluation converges to a local optimum even when using a non-linear value function approximator.
  • Temporal-difference with value function approximation
    • Though a TD target is biased, linear TD(0) converges (close) to the global optimum.
  • Action-value function approximation works analogously (a prediction sketch follows below).
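
A minimal sketch of the incremental MC and TD(0) prediction updates with the MC/TD target substituted for \(v_{\pi}\); the linear `features` map is the hypothetical one from the sketch above, and `episode` is an assumed list of (state, reward) pairs.

```python
import numpy as np

def mc_update(w, episode, features, alpha=0.01, gamma=1.0):
    """MC prediction: use the return G_t as the target for each visited state.
    `episode` is [(S_0, R_1), (S_1, R_2), ...] for one finished episode."""
    G = 0.0
    for s, r in reversed(episode):
        G = r + gamma * G                    # return G_t from state s onward
        x = features(s)
        w = w + alpha * (G - x @ w) * x      # G_t treated as constant w.r.t. w
    return w

def td0_update(w, s, r, s_next, done, features, alpha=0.01, gamma=1.0):
    """TD(0) prediction: bootstrap target R + gamma * v_hat(S', w)."""
    x = features(s)
    v_next = 0.0 if done else features(s_next) @ w
    target = r + gamma * v_next              # gradient of the target is ignored
    return w + alpha * (target - x @ w) * x
```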

Control with Value Function Approximation
  • Nothing special here: just do policy improvement with \(\epsilon\)-greedy action selection.
  • Repeat for each episode:
    • Choose a starting state \(S\) randomly
    • Choose an action from the state using policy derived from the (initialized) action-value approximator (\(\epsilon\)-greedy)
    • Repeat until terminal:
      • Take the action, observe a reward \(R\) and the next state \(S'\)
      • Choose an action from the approximator as above (\(\epsilon\)-greedy)
      • Calculate a TD/MC target
      • Update the approximator:
        • \(\mathbf{w} \leftarrow \mathbf{w} + \Delta \mathbf{w}\)
        • analogously to \(Q(S,A) \leftarrow Q(S,A) + \alpha(R+\gamma Q(S',A') - Q(S,A))\)
      • Proceed to the next state and action
  • Cf. the on-policy SARSA algorithm from the model-free control lecture; a sketch with function approximation follows below.
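
A minimal sketch of the control loop above as on-policy SARSA with an action-value approximator \(\hat{q}(s,a,\mathbf{w})\); the Gym-like `env` interface, `q_hat`, and `grad_q_hat` are assumptions for illustration, not the lecture's exact pseudocode.

```python
import numpy as np

def epsilon_greedy(w, s, actions, q_hat, eps=0.1):
    """With probability eps pick a random action, otherwise the greedy one."""
    if np.random.rand() < eps:
        return actions[np.random.randint(len(actions))]
    return max(actions, key=lambda a: q_hat(s, a, w))

def sarsa_episode(env, w, actions, q_hat, grad_q_hat,
                  alpha=0.01, gamma=0.99, eps=0.1):
    """One on-policy SARSA episode with q_hat(s, a, w) as the action-value model."""
    s = env.reset()                               # assumed Gym-like interface
    a = epsilon_greedy(w, s, actions, q_hat, eps)
    done = False
    while not done:
        s_next, r, done = env.step(a)             # assumed to return (S', R, done)
        if done:
            target = r                            # no bootstrapping at the terminal state
        else:
            a_next = epsilon_greedy(w, s_next, actions, q_hat, eps)
            target = r + gamma * q_hat(s_next, a_next, w)   # SARSA TD target
        w = w + alpha * (target - q_hat(s, a, w)) * grad_q_hat(s, a, w)
        if not done:
            s, a = s_next, a_next
    return w
```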

Parameter Convergence of Prediction Algorithms
  • For TD, there are known examples where the value-approximator update diverges during policy evaluation, in the on-policy + non-linear case and in the off-policy case. Empirically it is usually safe (MC is always safe in theory).
  • Gradient TD solves this problem completely.

Parameter Convergence of Control Algorithms

Batch Prediction Methods for Updating Value Function Approximators
  • With stochastic gradient descent, once we update \(\mathbf{w} \leftarrow \mathbf{w} + \Delta \mathbf{w}\), the information about that transition is thrown away. Batch RL is the idea of collecting transitions and reusing them repeatedly; a sketch follows below.
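
A minimal sketch of the batch idea under these assumptions: store (state, target) pairs as `experience` and keep resampling them for SGD updates instead of discarding each sample after one step. The linear `features` map is the hypothetical one from the earlier sketches.

```python
import numpy as np

def batch_fit(w, experience, features, alpha=0.01, epochs=50, batch_size=32, seed=0):
    """Repeatedly resample stored (state, target) pairs and apply SGD updates.
    With enough passes this approaches the least-squares fit to the stored targets."""
    rng = np.random.default_rng(seed)
    experience = list(experience)
    for _ in range(epochs):
        idx = rng.choice(len(experience), size=batch_size)  # sample with replacement
        for i in idx:
            s, target = experience[i]
            x = features(s)
            w = w + alpha * (target - x @ w) * x
    return w
```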

Deep Q-Networks (DQN)
  • Q-learning with function approximation
  • The two main reasons DQN learned a wide range of Atari games well:
    • Experience replay
      • Stochastic GD learns from a single sample at a time, so it works well only when that single-sample gradient is a good approximation of the expected gradient used by full GD. However, when learning incrementally as we have so far, adjacent transitions within an episode, such as \(S_t, S_{t+1}\), are too strongly correlated with each other, which hurts learning. Computing a batch gradient over diverse, randomly drawn samples gives a proper estimate of the expected gradient and improves learning efficiency.
    • Fixed Q-targets
      • Separate the network used for bootstrapping (estimating the value of the next state/action) from the network trained on action-values (i.e., separate target and behaviour networks).
      • As shown in the slide, bootstrapping computes the Q-target with a previously saved network (e.g., the model parameters from 1000 steps ago), while a different network (e.g., the current model parameters) is the one actually trained toward that Q-target.
      • Note: one could presumably also keep a replay buffer of model parameters and bootstrap with an ensemble of them.
    • Thanks to these, there is no oscillation or exploding loss during training; training is stabilized. A minimal sketch of both ingredients follows below.
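
A minimal sketch of the two ingredients together, using a linear \(\hat{q}(s,a,\mathbf{w}) = \mathbf{x}(s,a)^{T}\mathbf{w}\) for brevity (the real DQN uses a convolutional neural network over raw pixels); `q_features`, the replay contents, and all hyperparameters are illustrative assumptions.

```python
import random
import numpy as np

def dqn_style_training(w, replay, q_features, actions,
                       alpha=1e-3, gamma=0.99, batch_size=32,
                       num_steps=10_000, target_sync=1_000):
    """Experience replay + fixed Q-targets with a linear q_hat(s, a, w) = x(s, a)^T w.
    `replay` is a list of (s, a, r, s_next, done) transitions collected beforehand."""
    w_target = w.copy()                       # frozen copy used for bootstrapping
    for step in range(1, num_steps + 1):
        # Experience replay: a random minibatch breaks the correlation between
        # consecutive transitions within an episode.
        batch = random.sample(replay, k=min(batch_size, len(replay)))
        for s, a, r, s_next, done in batch:
            x = q_features(s, a)
            if done:
                target = r
            else:
                # Fixed Q-target: bootstrap with the frozen parameters, not the current w.
                target = r + gamma * max(q_features(s_next, a2) @ w_target
                                         for a2 in actions)
            w = w + alpha * (target - x @ w) * x
        if step % target_sync == 0:
            w_target = w.copy()               # periodically refresh the target copy
    return w
```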


