기록: UCL Course on RL 요약 및 정리 (Lecture 7: Policy Gradient)

This post covers not only Lecture 7: Policy Gradient (davidsilver.uk) but also:

Policy Gradient · Fundamental of Reinforcement Learning (gitbooks.io) (모두의 연구소)
Policy Gradient Algorithms (lilianweng.github.io)
Exponential family - Wikipedia
Control variates - Wikipedia
Deterministic Policy Gradient Algorithms (mlr.press)
The entire course materials can be found on the following page: Teaching - David Silver.

Lecture Goal

Learn policy gradient and actor-critic method
*이번 정리 노트는 David Silver 오리지널 강의 노트와 순서 많이 다름

Why Learn Policy?: Policy-Based Reinforcement Learning

In the last lecture (value function approximation), we approximated the state-/action-value function using model parameters.

E.g., linear model, non-linear neural network, etc.
A policy is implicitly generated and updated from the value function.

E.g., greedy policy improvement

Policy-based RL

Directly parametrize the policy using parameters

\(\pi_{\theta}(s,a)= \mathbb{P}[a|s,\theta]\)w
Note: 이전 강의에서 value function은 policy가 주어졌을 때 정확한 값을 계산해야 하지만 그게 힘드니까 function model을 통해 approximate했던 거고, 이번에는 function model를 policy로 사용하는 거다. 근사하는 게 아니다. Function model의 표현력에 따라 policy의 자유도(state 별로 얼마나 촘촘하게 action의 구성을 바꿀 수 있는가)가 결정된다.
Note: we parametrize a model for value function approximation by \(\mathbf{w}\) in this lecture.

Requirements to learn a policy

How to define a loss function? How to measure the quality of a policy?
How to get/estimate the gradient?

Why learn policy?

Stable convergence

Value-based methods usually oscillate.
예를 들어, 이때까지 배워온 RL 방법론들은 action-value 등을 기반으로 가장 좋은 액션 하나를 선택하는 policy를 썼었다. 따라서 values가 약간 바뀌어서 대소 관계가 역전되면 policy는 크게 바뀐다.
Policy-based methods에서는 policy를 continuous하게 parametrize하여 사용하고 그걸 직접 최적화하므로 부드러운 학습이 가능하다.

Effective in high-dimensional or continuous action spaces

Value는 넓고 연속적인 범위를 배우는 regression이지만, policy는 value보다 compact한 경우가 많다. (e.g., 상하좌우로만 움직이는 로봇의 action space 원소는 고작 4개)

Can learn stochastic policies

Optimal policy는 액션 중 가장 좋은 액션만 택하는 것이지만, feature vector가 environment state를 나타내기에 불충하다면 (일반적인 partially-observable MDP 강화학습 상황, 센서 데이터가 Markov state를 구성하기에 충분하지 않거나, 인간이 임의로 feature 선택해서 넣어주는 경우) 이런 stochastic한 policy는 더욱 중요해진다.

아래와 같이 두 RL agents가 가위바위보를 겨룰 때도 uniform random policy가 최적이다.

Cons

Typically converge to a local rather than the global optimum.
Evaluating a (stochastic) policy is typically inefficient and involves high variance.

We will learn how to reduce the variance.

Discrete action space에서는 효율이 떨어진다.

어느 정도 해결한 논문이 있다: Discrete Action On-Policy Learning with Action-Value Critic

RL in a Nutshell (*는 곧 본 강의에서 다룰 예정)

Value-based

Learned value function
Implicit policy (e.g., greedy)

Policy-based*

No value function
Learned policy

Actor-critic*

Learned value function
Learned policy

Policy Models

Let \(\phi(s,a)\) denotes a feature vector.
Softmax policy

State와 action 모두에 관한 feature vector를 이용해 선형 모델을 세운다 가정하면,
\(\pi_{\theta}(s,a) \propto \exp(\phi(s,a)^T\theta)\)
실제로 action space가 finite하다면 softmax classification처럼 최대값을 주는 action만 취하면 된다.
Action space가 아주 커지면 모델의 파라미터도 매우 커지는 단점이 있다.

Gaussian policy

Softmax와는 다르게 state에 관한 feature vector로 모델을 세우고
\(\pi_{\theta}(s,a) \propto \exp(-\frac{1}{2}(\frac{a - \phi(s)^T\theta}{\sigma})^2)\)
Continuous한 action space를 구축하기 쉽다.

Policy Objective Functions

Goal: given policy \(\pi_{\theta}(s,a)\) with paramaters \(\theta\), find the best \(\theta\).
여러가지 방법으로 policy의 질을 측정할 수 있다.
시작하는 state가 항상 고정된 경우의 objective function (예: 특정 게임):

\(J_1(\theta) = V^{\pi_{\theta}}(s_1) = \mathbb{E}[v_1]\)

첫 state \(s_1\)의 value

아무런 state에서 시작해도 좋은 결과를 줘야하는 상황이라면:

Suppose \(d^{\pi_{\theta}}(s)\) is stationary distribution of Markov chain for \(\pi_{\theta}\).

현재 위치가 어떤 state일 확률이라고 보면 쉽다. 어차피 방문 확률이 높은 state일 수록 해당 state에서 구한 gradient가 stochastic gradient ascent시 영향을 많이 준다.

\(J_{avV}(\theta) = \sum_s{d^{\pi_{\theta}}(s)V^{\pi_{\theta}}(s)}\)

State에 방문할 확률을 고려한 state-value의 평균

또는,
\(J_{avR}(\theta) = \sum_s{d^{\pi_{\theta}}(s) \sum_a{\pi_{\theta}(s,a)\mathcal{R}(s,a)}} \)

State에 방문할 확률과 action을 고를 확률을 고려한 immediate reward의 평균

위 공식들은 그냥 아이디어 차원에서 설명한 것이고, 본 강의에서는 이외에 여러가지 실전용 variants를 배울 예정

Policy Optimization

이전 강의에서와 같이, objective function의 미분 가능성에 따라서 여러 가지 방법을 쓸 수 있다.
Non-differentiable

Hill climbing
Simplex / amoeba / Nelder Mead
Genetic algorithms

Differentiable (또는 미분 가능한 loss를 쓰면 더 좋은 방법)

Gradient descent
Conjugate gradient
Quasi-newton (a root-finding algorithm)

우리가 다룰 것은 gradient descent

Policy Gradient Theorem

PG에서 쓰이는 objective functions의 gradients를 구해보자.
Consider a simple class of one-step MDPs

Starting in state \(s \sim d(s) \)
Terminating after on time-step with reward \(r = \mathcal{R}(s,a)\)
Then,
\(J(\theta) = \mathbb{E}_{\pi_{\theta}}[r] = \sum_s{ d(s) \sum_a{\pi_{\theta}(s,a)\mathcal{R}(s,a)}}\)
\(\nabla_{\theta}J(\theta) = \sum_s{ d(s) \sum_a{\nabla_{\theta}(\pi_{\theta}(s,a))\mathcal{R}(s,a)}}\)

Policy에 따라 starting distribution \(d(s)\)과 immediate reward \(\mathcal{R}(s,a)\)이 바뀌는 건 아니므로

Note that \(\nabla_{\theta}\pi_{\theta}(s,a) = \pi_{\theta}(s,a) \nabla_{\theta} \log{\pi_{\theta}(s,a)}\)
\(\therefore \nabla_{\theta}J(\theta) = \sum_s{ d(s) \sum_a{\pi_{\theta}(s,a) \nabla_{\theta} (\log{\pi_{\theta}(s,a)}) \mathcal{R}(s,a)}}
\\ = \mathbb{E}_{\pi_{\theta}}[\nabla_{\theta} \log{\pi_{\theta}(s,a)} \cdot r]\)
즉, policy objective function의 미분은 log-policy의 미분에 value를 곱한 것의 기대값으로 표현 가능하다.

Note: Value function approximator의 gradient는 굳이 log 관련 공식을 유도하지 않아도 쉽게 구할 수 있다(linear, fully-connected network, etc.). 반면 policy models은 일반적으로 확률 밀도 함수로 표현되고, 확률 밀도 함수는 많은 경우 exponential family에 속하기 때문에 log를 취해 미분하면 미분이 더 쉬워진다.

일반적인 multi-step MDPs에서도 위 공식이 비슷하게 성립한다.

\(r = Q^{\pi_{\theta}}(s,a)\) 등 적당한 (long-term) value로 치환하면,
\(\nabla_{\theta}J(\theta) = \mathbb{E}_{\pi_{\theta}}[\nabla_{\theta} \log{\pi_{\theta}(s,a)} \cdot Q^{\pi_{\theta}}(s,a)] \)

참고로, 위의 one-step MDPs 케이스와는 다르게, 이 등식의 성립은 전혀 자명하지 않다. \(Q^{\pi_{\theta}}(s,a)\)도 \(\pi_{\theta}\)의 함수이므로 policy에 따라 바뀌기 때문에. 증명은 아주 길다.

Note: PG 정리의 의의는 기술 면접 단골 문제니 꼭 외워두도록 하자.
Proof)

Policy Gradient Algorithms (lilianweng.github.io) 참고

실제로 본 강의에서 배울 알고리즘들의 policy gradients는 다음과 같은 형태로 모두 비슷하다.

Policy Gradient Methods

PG methods measure the quality of policies and their gradients to update parameters of policy approximation models.
Finite-difference PG

Non-differentiable policy model
별로 중요하지 않다. 미분 불가능한 함수의 기울기를 구하기 위해 finite-difference \(\frac{f(a)-f(b)}{a-b}\)를 직접 구하는 방법.

Monte-Carlo PG

Differentiable policy model

Actor-critic PG

Differentiable policy model
Utilize both value and policy approximators
Most practical approach

Monte-Carlo Policy Gradient (REINFORCE)

Note: policy objective function을 정의할 때 쓰이는 \(d(s), \pi_{\theta}(s,a)\)에 관한 항은 어디로 간건지 궁금할 수 있지만, 실은 탐색될 확률 \(d(s)\pi_{\theta}(s,a)\)이 높은 state-action pair는 각 episode와 step에 관한 for 구문을 돌면서 더욱 빈번하게 \(\theta\)의 update에 공헌하므로 이미 \(d(s), \pi_{\theta}(s,a)\)가 고려된 알고리즘이라 볼 수 있다.

Actor-Critic Policy Gradient

Cons of MCPG

기존 MC 방법이 품고있는 문제점 동일
즉, unbiased estimator를 쓰는 대신 variance가 높다.

Critic: action-value function approximator

\(Q_{\mathbf{w}}(s,a) = \phi(s,a)^T\mathbf{w} \approx Q^{\pi_{\theta}}(s,a)\)
Value function approximation 챕터에서 배운 것과 똑같다.
Policy evaluation을 수행하는 꼴

Actor: update policy model's parameters as suggested by the critic

Policy improvement

Policy iteration에서 evaluation과 improvement를 반복해서 진행했던 것처럼, Critic과 Actor를 반복하여 진행

Reducing Variance Using a Baseline

기댓값이 0인 확률 변수를 policy gradient estimator에 가감하여 분산을 낮춘다.

수식의 의미:

State-value fuction \(V^{\pi_{\theta}}(s)\)은 state의 평균적인 value (action에 무관하게)
\(Q^{\pi_{\theta}}\)는 state에서 특정 action의 value
Advantange function \(A^{\pi_{\theta}} \)은 한 action \(a\)가 해당 state에서 취할 수 있는 평균적인 actions보다 얼마나 나은가를 수치화

이론적 배경 참고 자료: Control variates - Wikipedia

Estimating the advantage function

방법 1) State-value function과 action-value function을 각기 다른 parameter를 이용해서 approximation

방법 2) 또는, Bellman equation로 state와 action 간의 관계를 활용하여 둘 중 하나만 approximation (일반적으로 state-value function, 이 예시는 TD(0) 이용)

Actor-Critic at Different Time-Scales

여기서 \(\theta\)는 (linear) state-value function approximator의 parameters이다.

여기서 \(\theta\)는 policy model의 parameters이고, \(v\)는 state-value function approximator의 parameters이다.

여기서 \(\theta\)는 policy model의 parameters이고, \(v\)는 state-value function approximator의 parameters이다.

Bias in Actor-Critic Algorithms

MC PG에서는 unbiased estimator를 사용해서 action-value를 추정했었다.
그러나, AC PG에서는 function approximation을 사용해서 action-value를 근사한다. 그러면 policy gradient가 bias된다.
이런 경우 과연 AC PG가 제대로 된 gradient (true gradient) 방향으로 나아가는 게 맞을까?
다행히도, 아래의 Compatible Function Approximation Theorem에 따르면, value function approximator를 잘 잡고, 그 approximator가 학습할 loss만 잘 설정한다면 진짜 value function을 쓰지 않고 approximator를 쓰더라도 (이상적인 경우) unbiased gradient estimator가 된다는 걸 보여준다 [David Silver et al.].

Note: 다만 실제로 2번의 경우가 잘 성립하지 않는다. Policy iteration 동안 에러 최적화를 시도한다는 거지, 모든 step에서 다 에러가 최적의 상태인 건 아니니까.

Deterministic Policy Gradient

Deterministic policy gradient는 continuous action space나 high-dimensional action space에서 stochastic PG를 능가한다.
Note: 실제 위 정리는 David Silver의 논문인 Deterministic Policy Gradient Algorithms에 나오는 내용이다.
Vanilla PG는 일반적으로 action distribution이 continuous한 것을 가정하여, stochastic한 확률 분포를 활용한다(e.g., Gaussian).
그러나 어떤 액션을 취해야 할지 판단이 힘든 학습 초반에는 이런 stochastic한 특성 때문에 학습이 noisy하다. 이런 단점은 high-dimensional action spaces에서 더욱 두드러진다.
이렇게 policy 자체가 noisy하기 때문에 policy의 gradient를 구하기는 더욱 어려울 수 있다.
그러나 가위바위보 게임 등 몇몇 특수한 케이스를 제외하면 optimal policy는 어차피 각 state에서 택할 수 있는 action 중 가장 좋은 action 하나를 고르는 deterministic function이다 (가위바위보는 가장 좋은 action이 여러 개).
그렇다면 처음부터 그냥 determinstic policy로 시작한다면 어떨까?

Note: 기존 stochastic policy는 예를 들어 Gaussian 분포의 \(\sigma\)를 줄여가면서 deterministic한 policy로 점점 수렴하지만 DPG는 처음부터 deterministic

이게 deterministic policy의 직관이고, deterministic policy를 학습 가능하게 만든 게 해당 논문의 주된 공헌이다.

Natural Actor-Critic

David Silver의 말을 그대로 옮기면, "The natural AC is an alternative descent or ascent direction which has some nice properties, and it falls out very nicely in the actor-critic framework."
요지는, 그냥 NAC 써봐라.
NAC updates actor parameters in direction of critic parameters:

이 블로그 검색

Log of Inspiration and Retrospection

기록: UCL Course on RL 요약 및 정리 (Lecture 7: Policy Gradient)

댓글

댓글 쓰기

이 블로그의 인기 게시물

시작: 블로그의 방향성

CS 285 (Deep Reinforcement Learning) 요약 및 정리 (Lecture 2: Supervised Learning of Behaviors)

기록: 강화학습 독학을 위한 자료들 (Materials for Studying Reinforcement Learning from Scratch)