Notes: Summary of the UCL Course on RL (Lecture 6: Value Function Approximation)
This post covers Lecture 6: Value Function Approximation (davidsilver.uk)
- The entire course materials can be found on the following page: Teaching - David Silver.
Lecture Goal
- Learn how to deal with a large-scale (potentially continuous) state space with RL
- Learn value function approximation
- Learn Deep Q-Network (DQN) and batch methods
(Image credit: "Find order of polynomial approximation and least squares for given data points")
Value Function Approximation
- Lookup table
- The table stores one entry per lookup index.
- State-value
- Index = each state \(s \in S\), where \(S\) is a finite and discrete state space
- Entry = \(V(s)\)
- Action-value
- Either every state-action pair \(s, a\) has its own entry \(Q(s,a)\),
- or every state \(s\) has entries \(Q(s,a)\) for all actions \(a \in A\).
- So far, we have represented a value function by a lookup table.
- Large MDPs
- Memory inefficiency: too many states or actions to store in memory
- Too slow to learn the value of each state individually
- Note: neighboring states might strongly correlate with each other, and we can exploit this to learn a value function.
- Solution for large MDPs
- Estimate a value function with function approximation
- \(\hat{v}(s,\mathbf{w}) \approx v_{\pi}(s)\)
or \(\hat{q}(s,a,\mathbf{w}) \approx q_{\pi}(s,a)\), where \(\mathbf{w}\) is the parameter vector of the function approximator
- Generalize from seen states to unseen states (see the sketch below)
- Update parameter \(\mathbf{w}\) using MC or TD learning
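To make the contrast concrete, here is a minimal sketch (not from the lecture) of a tabular value function versus a parameterized linear approximator; the `features` encoding is a made-up placeholder.

```python
import numpy as np

# Tabular representation: one stored entry per state (only feasible for a small, discrete S).
V_table = {s: 0.0 for s in range(10)}   # V(s) lookup table
V_table[3] = 1.5                        # updating state 3 says nothing about state 4

# Approximate representation: a parameter vector w shared across all states.
def features(s):
    """Hypothetical feature map x(s): a coarse one-hot encoding of s mod 4."""
    x = np.zeros(4)
    x[s % 4] = 1.0
    return x

w = np.zeros(4)                         # parameters of v_hat(s, w)

def v_hat(s, w):
    return features(s) @ w              # linear approximator: x(s)^T w

# Updating w after visiting state 3 also changes the estimate for unseen states that
# share features with it (e.g., state 7) -- the generalization we want in large MDPs.
```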
Requirements for Value Function Approximators
- RL considers differentiable function approximators, e.g.
- Linear combination of features (linear regressor);
- Neural network (non-linear regressor).
- Furthermore, we require a training method that is suitable for non-stationary and non-i.i.d data.
- I.i.d random variables = independent and identically distributed random variables
- State/action values keep changing at every episode and every step.
- Typically, classification models assume stationary and i.i.d data (e.g., CIFAR10).
Ideal Incremental Methods for Updating Value Function Approximators
- Gradient descent
- Goal: find parameter vector \(\mathbf{w}\) minimizing mean-squared error between \(\hat{v}(s,\mathbf{w})\) (i.e., value function approximator) and \(v_{\pi}(s)\) (i.e., true value)
- \(J(\mathbf{w}) = \mathbb{E}_{\pi}[(v_{\pi}(S) - \hat{v}(S, \mathbf{w}))^2]\)
- Gradient descent finds a local minimum by adjusting \(\mathbf{w}\) by \(\Delta \mathbf{w}\), where
- \(\Delta \mathbf{w} = - \frac{1}{2}\alpha \nabla_{\mathbf{w}} J(\mathbf{w}) = \alpha \mathbb{E}_{\pi}[(v_{\pi}(S) - \hat{v}(S, \mathbf{w}))\nabla_{\mathbf{w}}\hat{v}(S,\mathbf{w})]\)
- \(\mathbf{w} \leftarrow \mathbf{w} + \Delta \mathbf{w}\), with step-size hyperparameter \(\alpha\).
- Note: Previously, during policy evaluation we updated each entry of the lookup table via \(Q(S,A) \leftarrow Q(S,A) + \alpha(R+\gamma Q(S',A') - Q(S,A))\). From now on we use a function approximator instead of a lookup table, and we update it in parameter space via \(\mathbf{w} \leftarrow \mathbf{w} + \Delta \mathbf{w}\). See Control with Value Function Approximation below.
- Repeat this procedure until convergence
- Expect \(\hat{v}\) to approach \(v_{*}\) over the course of policy iteration
- Stochastic gradient descent estimates the gradient from a single sampled state (here, \(s \in S\)):
- For every episode and every step in an episode:
- \(\Delta \mathbf{w} = \alpha (v_{\pi}(s) - \hat{v}(s, \mathbf{w}))\nabla_{\mathbf{w}}\hat{v}(s,\mathbf{w}) \)
- \(\mathbf{w} \leftarrow \mathbf{w} + \Delta \mathbf{w}\)
- Repeat this procedure until convergence
- The expected update is equal to the full gradient update.
- In practice, the true value function \(v_{\pi}\) is not given at all!
- Feature vector
- Represents a state by a vector of features
- For example:
- Coordinates, velocity, and acceleration of a robot
- Stock prices in the stock market
- Piece and pawn configurations in chess
- Linear model
- \(\hat{v}(s,\mathbf{w}) = \mathbf{x}(s)^T \mathbf{w}\), where \(\mathbf{x}(s)\) is the feature vector of state \(s\)
- \(\therefore \nabla_{\mathbf{w}} \hat{v}(s,\mathbf{w}) = \mathbf{x}(s)\)
- Note: \(\Delta \mathbf{w} = \alpha (v_{\pi}(s) - \hat{v}(s, \mathbf{w}))\nabla_{\mathbf{w}}\hat{v}(s,\mathbf{w}) = \alpha (v_{\pi}(s) - \hat{v}(s, \mathbf{w}))\mathbf{x}(s)\)
- In practice, the true value function \(v_{\pi}\) is not given at all! (A sketch of this "ideal" supervised update follows below.)
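A minimal sketch of this "ideal" setting with a linear model, pretending the true values \(v_{\pi}(s)\) were handed to us as supervised targets; the `features` map, state space, and step size are illustrative assumptions.

```python
import numpy as np

def features(s):
    """Hypothetical feature vector x(s) for a scalar state s (normalized polynomial features)."""
    s = s / 4.0                          # scale s into [0, 1] so the step size below is stable
    return np.array([1.0, s, s ** 2])

def v_hat(s, w):
    return features(s) @ w               # linear model: x(s)^T w

alpha = 0.1                              # step-size hyperparameter
w = np.zeros(3)

# Pretend the true values v_pi(s) are given (the supervised-learning setting above).
true_v = {s: 2.0 * s for s in range(5)}

for _ in range(5000):                    # repeat until (approximate) convergence
    s = np.random.randint(5)             # sample a single state -> stochastic gradient descent
    # delta_w = alpha * (v_pi(s) - v_hat(s, w)) * grad_w v_hat(s, w),
    # and for a linear model grad_w v_hat(s, w) is just the feature vector x(s).
    w += alpha * (true_v[s] - v_hat(s, w)) * features(s)
```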
Incremental Prediction Methods for Updating Value Function Approximators
- RL is not supervised learning as we assumed above (i.e., we don't know the true value samples \(v_{\pi}\)).
- In practice, we substitute an MC/TD target for \(v_{\pi}\).
- That is, the target dataset is also noisy in the sense of supervised learning!
- Even worse, an MC/TD target is non-stationary and non-i.i.d.
- Note that the MC/TD target is treated as a constant with respect to the model parameters \(\mathbf{w}\) when computing the gradient, i.e., terms such as \(\nabla_{\mathbf{w}}G_t\) are ignored. There are algorithms that also treat the target as a function of the model parameters, but naively differentiating through the target converges to the wrong answer.
- If the model is not linear, simply put \(\nabla_{\mathbf{w}}\hat{v}(S_t,\mathbf{w})\) in place of \(\mathbf{x}(S_t)\).
- The eligibility trace here is different from the per-state trace from the TD lecture; this one exists solely to update the model parameters \(\mathbf{w}\). The per-state trace we have used so far includes the indicator \(\mathbf{1}(S_t=s)\) to raise a state's eligibility when it is visited, but with a function approximator we always update the whole parameter vector regardless of which state we are in, so the indicator \(\mathbf{1}(\cdot)\) is not needed.
- The per-state trace from the TD lecture: \(E_t(s) = \gamma\lambda E_{t-1}(s) + \mathbf{1}(S_t = s)\)
- Summing the updates \(\sum_{t=1}^{\infty}\Delta \mathbf{w}_t\) over an episode, the forward and backward views are equal.
- Monte Carlo with value function approximation
- An MC target, return \(G_t\), is an unbiased estimator of true value \(v_{\pi}\).
- Though the return is unbiased, it is a noisy sample of the true value; still, MC evaluation converges to a local optimum, even when using a non-linear value function approximator.
- Temporal-difference with value function approximation
- Though a TD target is biased, linear TD(0) converges (close) to the global optimum.
- Action-value function approximation works analogously; a sketch of the state-value TD(0) case follows.
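A minimal sketch of TD(0) prediction with a linear approximator, substituting the TD target \(R + \gamma \hat{v}(S', \mathbf{w})\) for \(v_{\pi}\) and treating it as a constant w.r.t. \(\mathbf{w}\); the `env`, `policy`, and `features` interfaces are assumptions, not part of the lecture.

```python
import numpy as np

def td0_prediction(env, policy, features, n_features,
                   alpha=0.01, gamma=0.99, n_episodes=1000):
    """Evaluate `policy` with TD(0) and a linear v_hat(s, w) = x(s)^T w.

    Assumed interfaces: env.reset() -> state, env.step(a) -> (next_state, reward, done),
    policy(state) -> action, features(state) -> np.ndarray of length n_features.
    """
    w = np.zeros(n_features)
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            # TD target: bootstrap from the current approximator, but do NOT
            # differentiate through it (the gradient of the target w.r.t. w is ignored).
            target = r if done else r + gamma * features(s_next) @ w
            td_error = target - features(s) @ w
            w += alpha * td_error * features(s)   # grad_w v_hat = x(s) for a linear model
            s = s_next
    return w
```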
Control with Value Function Approximation
- Nothing fancy here: just do policy improvement with an \(\epsilon\)-greedy policy.
- Repeat for each episode:
- Choose a starting state \(S\) randomly
- Choose an action from that state using the policy derived from the (initialized) action-value approximator (\(\epsilon\)-greedy)
- Repeat until terminal:
- Take the action, observe a reward and the next state \(S'\)
- Choose an action from the approximator as above (\(\epsilon\)-greedy)
- Calculate a TD/MC target
- Update the approximator:
- \(\mathbf{w} \leftarrow \mathbf{w} + \Delta \mathbf{w}\)
- analogous to the tabular update \(Q(S,A) \leftarrow Q(S,A) + \alpha(R+\gamma Q(S',A') - Q(S,A))\)
- Proceed to the next state and action
- Compare with tabular on-policy SARSA from the previous lecture; a code sketch of this control loop is given below.
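A minimal sketch of the control loop above as SARSA with a linear \(\hat{q}(s,a,\mathbf{w})\) and an \(\epsilon\)-greedy policy; the `env` interface and the `features(s, a)` encoding are illustrative assumptions.

```python
import numpy as np

def sarsa_vfa(env, features, n_features, n_actions,
              alpha=0.01, gamma=0.99, epsilon=0.1, n_episodes=1000):
    """SARSA control with a linear action-value approximator q_hat(s, a, w).

    Assumed interfaces: env.reset() -> state, env.step(a) -> (next_state, reward, done),
    features(state, action) -> np.ndarray of length n_features.
    """
    w = np.zeros(n_features)

    def q_hat(s, a):
        return features(s, a) @ w

    def epsilon_greedy(s):
        if np.random.rand() < epsilon:
            return np.random.randint(n_actions)
        return int(np.argmax([q_hat(s, a) for a in range(n_actions)]))

    for _ in range(n_episodes):
        s, done = env.reset(), False
        a = epsilon_greedy(s)
        while not done:
            s_next, r, done = env.step(a)
            a_next = None if done else epsilon_greedy(s_next)
            # TD target, analogous to the tabular SARSA update rule.
            target = r if done else r + gamma * q_hat(s_next, a_next)
            w += alpha * (target - q_hat(s, a)) * features(s, a)
            s, a = s_next, a_next
    return w
```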
Parameter Convergence of Prediction Algorithms
- For TD, there are known examples where the value-approximator update diverges during policy-evaluation iterations: the on-policy + non-linear case and the off-policy cases. Empirically it is still safe (MC is always safe in theory).
- Gradient TD solves this problem completely.
Parameter Convergence of Control Algorithms
Batch Prediction Methods for Updating Value Function Approximators
- In stochastic gradient descent, once we apply \(\mathbf{w} \leftarrow \mathbf{w} + \Delta \mathbf{w}\), the information in that transition is thrown away. Batch RL instead collects transitions and keeps reusing them (see the sketch below).
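One way to reuse stored experience, sketched under the assumption of a linear \(\hat{v}\): keep (feature vector, target) pairs and repeatedly re-fit on the whole batch, here with a closed-form regularized least-squares solve. The function and variable names are illustrative, not from the lecture.

```python
import numpy as np

# Experience gathered while following the policy: (x(s), target) pairs.
# Incremental SGD uses each pair once and discards it; batch RL keeps and reuses them.
experience = []

def add_transition(x, target):
    experience.append((x, target))

def least_squares_fit(experience, reg=1e-3):
    """Fit w minimizing sum_i (target_i - x_i^T w)^2 over ALL stored experience."""
    X = np.stack([x for x, _ in experience])
    y = np.array([t for _, t in experience])
    n = X.shape[1]
    # Regularized normal equations: w = (X^T X + reg * I)^-1 X^T y
    return np.linalg.solve(X.T @ X + reg * np.eye(n), X.T @ y)
```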
Deep Q-Networks (DQN)
- Q-learning with function approximation
- The two main factors behind DQN learning a variety of Atari games well:
- Experience replay
- Plain stochastic GD learns from a single sample at a time, so it works well only if that single-sample gradient is a good approximation of the expected gradient used in full GD. But when we learn incrementally as we have so far, adjacent transitions within an episode, such as \(S_t, S_{t+1}\), are so strongly correlated with each other that learning is hindered. Computing a batch gradient over many diverse samples at once approximates the expected gradient properly and improves learning efficiency.
- Fixed Q-targets
- Separate the neural network used for bootstrapping (estimating the value of the next state/action) from the network that learns the action-values (i.e., separate the target and behavior networks).
- As the slide below shows, the Q-target is computed by bootstrapping from a previously learned network (e.g., the model parameters from 1000 steps ago), while a different network (e.g., the current model parameters) is the one actually trained on that Q-target.
- Note: one could presumably also keep a replay buffer of model parameters and bootstrap from an ensemble of them.
- Thanks to this, there is no oscillation or exploding loss during training; training is stabilized. (A sketch of both ingredients follows below.)
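A minimal sketch of the two ingredients (a replay buffer plus a periodically synced target network), written here with PyTorch; the network shape, hyperparameters, and data handling are assumptions for illustration, not the original DQN setup.

```python
import random
import numpy as np
import torch
import torch.nn as nn

def make_q_net(n_obs, n_actions):
    return nn.Sequential(nn.Linear(n_obs, 64), nn.ReLU(), nn.Linear(64, n_actions))

def dqn_update(q_net, target_net, optimizer, replay, batch_size=32, gamma=0.99):
    """One learning step: sample decorrelated transitions, regress onto fixed Q-targets."""
    batch = random.sample(replay, batch_size)             # experience replay
    states, actions, rewards, next_states, dones = zip(*batch)
    states = torch.as_tensor(np.array(states), dtype=torch.float32)
    next_states = torch.as_tensor(np.array(next_states), dtype=torch.float32)
    actions = torch.as_tensor(actions, dtype=torch.int64)
    rewards = torch.as_tensor(rewards, dtype=torch.float32)
    dones = torch.as_tensor(dones, dtype=torch.float32)

    q = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():                                  # fixed Q-targets: frozen, older parameters
        target = rewards + gamma * (1 - dones) * target_net(next_states).max(1).values
    loss = nn.functional.mse_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# `replay` is a list/deque of (s, a, r, s_next, done) tuples appended as the agent acts.
# Every C steps the target network is re-synced: target_net.load_state_dict(q_net.state_dict()).
```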