기록: UCL Course on RL 요약 및 정리 (Lecture 9: Exploration and Exploitation)

This post covers not only Lecture 9: Exploration and Exploitation (davidsilver.uk) but also:

'Reinforcement Learning' 카테고리의 글 목록 (yjjo)
The Multi-Armed Bandit Problem and Its Solutions (lilianweng.github.io)
Bayesian Bandits - Thompson Sampling (tistory.com)
The entire course materials can be found on the following page: Teaching - David Silver.

Lecture Goal

MDP를 풀 때 exploration과 exploitation 중 어느 한 쪽만 사용하면 optimality에 도달하지 못한다 (trade-off). 본 강의에서는 둘 사이의 균형을 맞추는 알고리즘을 소개한다.

Credit: UC Berkeley Intro to AI course lecture 11

Takeaway

Use (Bayes-)UCB algorithms

Exploration vs. Exploitation Dilemma

Exploitation: make the best decision given current information

E.g., go to your favorite restaurant

Exploration: gather more information

E.g., try a new restaurant

Dilemma

The best long-term strategy may involve short-term sacrifices and risks

How does an agent obtain enough information to make the best overall decisions with the lowest sacrifices?

[In-A-Nutshell] Key Principles and Intuitions for Balancing Exploration and Exploitation

Random exploration

E.g., $ϵ$ -greedy: naive exploration adding noise to greedy policy

Optimism in the face of uncertainty

Method 1: optimistic initialization

Initialize action-value function $Q (s, a)$ to some large value
처음부터 initial value를 과하게 큰 값으로 주면, experience를 통해 이미 방문해본 action의 value는 initial value보다 작아진다.
이렇게 아직 선택해보지 않은 액션의 value가 선택해본 액션의 value보다 항상 크다면 exploration을 장려할 수 있다.

Method 2: upper confidence bounds

자주 방문해본 action은 return을 상당히 정확하게 추정할 수 있는 반면, 몇 번 방문해보지 않은 action은 return 추정이 불확실하다. 따라서, 불확실성(uncertainty)을 정량화 한 뒤에, 불확실성이 더 큰 action을 취하도록 하면 exploration을 장려할 수 있다.
불확실성을 정량화하는 도구가 confidence bounds/interval
$\Rightarrow$ UCB algorithm, Bayesian UCB

Method 3: probability matching

Select an action according to the probability that the action is best among other actions
최종 보상을 높이기 위한 아주 직관적인 방법
이것도 upper confidence bounds를 고려한 exploration처럼, 얼마 방문해보지 못한 action은 return의 분산이 크므로, 다른 action보다 평균적인 return이 더 클 수도 있다는 직관을 활용
Bayesian bandit
$\Rightarrow$ Thompson sampling

Cons of the optimism

로봇이나 헬리콥터 등 심각한 음의 reward를 갖는 (접촉사고, agent 파손 등) RL 문제에서는 오히려 아주 안전한 선택을 위해 uncertainty에 비관적인 선택을 해야 하기도 한다.
또한, action space가 무한해서 아무리 exploration을 해도 더 explore해야 하는 space가 남아 있다면 영원히 알고리즘이 수렴하지 못한다.

Information state space search

Exploration is useful because it gains information
Can we quantify the value of information?
If we quantify the value of information, we can trade-off exploration and exploitation optimally.
Bernoulli bandit
$\Rightarrow$ Bayes-adaptive MDP, Gittins indices

Simplified Setting: Bandit

We first develop each of the above principles in several simplified settings, called bandit problems.
But the same principles also apply to the MDP settings.

Multi-Armed Bandit Problem

Great study material (Korean): Introduction to Multi-Armed Bandits (1) (tistory.com)

다섯번째 bullet point는 각 액션이 받을 reward가 static하지 않고 어떤 확률 분포에 따라 샘플링된다는 걸 뜻한다.

Def. Regret

세번째 bullet point에서 기댓값은 (당연히도) 각 time-step에서 어떤 action을 선택할 확률에 (즉, policy) 의해 결정된다: $π (a) = P [a_{t} = a]$ .

문제는, optimal value를 모르기 때문에 regret도 정확히 구할 수 없으므로, 결국 각 액션의 gap을 추정하는 문제로 귀결된다.
직관적으로, 안 좋은 (gap이 큰) 액션을 최대한 적게 뽑고 좋은 액션을 최대한 자주 뽑아야 totol regret이 줄어든다.

Linear or Sublinear Total Regret

$ϵ$ -greedy policy은 대표적인 exploration 방법이다. 그러나 끊임없이 일정한 빈도로 무작위한 액션을 선택하기 때문에, 최고의 액션 하나만을 선택하지 못하고 끊임없이 sub-optimal 액션 혹은 아주 나쁜 액션을 선택하게 된다. 따라서 선형적으로 total regret이 증가한다.
Greedy policy는 처음에 선택한 (또는 각 액션마다 몇 회의 exploration을 한 뒤 얻은 value의 평균값으로 선택한) 액션이 운좋게 best action이 아닌 이상에야, 계속 total regret이 증가할 수 밖에 없다.
일반적으로 $ϵ$ -greedy의 $ϵ$ 값을 시간이 지남에 따라 점차 줄여 나가면 sublinear total regret을 얻을 수 있다.

Optimistic Initialization

Initialize action-value function $Q (s, a)$ (e.g., approximation function, tabular data structure, etc.) larger than the maximum possible reward.

E.g., $Q (s, a) \leftarrow \frac{r_{max}}{1 - γ} \forall s, a$
Note: $r + γ r + γ^{2} r + \dots = \frac{r}{1 - γ}$ if $0 \leq γ < 1$

It encourages an agent to explore unproven actions.
아직 선택해보지 않은 액션의 value는 선택해본 액션의 value보다 항상 크므로, exploration을 장려할 수 있다.
But can still lock onto suboptimal action

초반에만 exploration을 하고, 시간이 지난 후에는 exploitation만 수행하게 됨
Greedy + optimistic init. $\to$ linear total regret
$ϵ$ -greedy + optimistic init. $\to$ linear total regret

Lower Bound of Total Regret over Time

$\Rightarrow$ 의미: 잘해봐야 regret은 시간에 대해 지수적으로 증가한다. 혹은, 지수적으로 증가하는 exploitation-exploration 알고리즘을 만들었다면 꽤 잘하는 축에 드는 것이다.
분자: action끼리 서로 reward 평균값이 다를수록 total regret 상승
분모: action끼리 서로 reward 분포가 비슷할수록 total regret 상승
$\Rightarrow$ (직관적으로도) 차선과 최선의 action이 모양은 비슷하나 평균이 크게 다른 reward 분포를 가지면, 차선과 최선의 action을 구분하기 어려워 total regret을 줄이기 힘들다.

Decaying $ϵ_{t}$ -Greedy Algorithm

결국 실전에서는 위 식을 정확히 활용할 수는 없다.
대충 $ϵ_{t} = 1 / t$ 로 설정하면 GLIE를 만족하여 Monte-Carlo control이 optimal에 수렴 (기록: UCL Course on RL 요약 및 정리 (Lecture 5: Model-Free Control)).

Upper Confidence Bounds

자주 방문해본 action은 return을 상당히 정확하게 추정할 수 있는 반면, 몇 번 방문해보지 않은 action은 return 추정이 불확실하다. 따라서, 불확실성(uncertainty)을 정량화 한 뒤에, 불확실성이 더 큰 action을 취하도록 하면 exploration을 장려할 수 있다.
이를 수식으로 보충 설명하면 아래와 같다:

Multi-armed bandit 문제에서 action $a$ 를 $N_{t} (a)$ 회 선택했다고 하고 그 return을 ${\hat{Q}}_{t} (a) = \frac{1}{N_{t} (a)} \sum_{τ = 1}^{t} r_{τ} 1 (a_{τ} = a)$ 로 추정하는 상황을 가정하자. Reward 분포는 0과 1사이에서만 정의된다 가정하자 (따라서, bounded distribution에만 적용 가능).
즉, ${\hat{Q}}_{t} (a)$ 는 모평균 $Q (a) = E [r | a]$ 를 근사하는 표본 평균이다.
그러면 아래 Hoeffding의 정리에 따라, 실제 알고자 하는 $Q (a)$ 가 지금까지 샘플링으로 알아낸 ${\hat{Q}}_{t} (a)$ 보다 $U_{t} (a)$ 만큼 클 가능성은 $\exp (- 2 N_{t} (a) U_{t} (a)^{2})$ 보다 낮거나 같다.
달리 말하면, 실제 $Q (a)$ 가 존재할 가능성이 최소 $1 - \exp (- 2 N_{t} (a) U_{t} (a)^{2})$ 만큼 되는 구간을 찾자면, ${\hat{Q}}_{t} (a) + U_{t} (a)$ 이하의 구간이다. 즉, $Q (a) \leq {\hat{Q}}_{t} (a) + U_{t} (a)$ 임을 최소 $1 - \exp (- 2 N_{t} (a) U_{t} (a)^{2})$ 만큼의 자신감으로 주장할 수 있다. 따라서 ${\hat{Q}}_{t} (a) + U_{t} (a)$ 를 upper confidence bound라고 한다.

UCB1 algorithm

Select an action maximizing ${\hat{Q}}_{t} (a) + \sqrt{\frac{2 \log t}{N_{t} (a)}}$ (아래 수식 hat 틀림)

UCB vs. $ϵ$ -greedy

$ϵ$ 을 잘 선택하면 UCB만큼 잘할 수 있다. 그러나 $ϵ$ 을 잘 선택하는 게 어렵다.

Others than Hoeffding's inequality

Hoeffding의 부등식은 확률 변수 $X_{i}$ 에 거의 아무런 조건도 가정하지 않는다. 범용성이 높다는 장점도 있지만, 모든 케이스에서 다 적용되는 부등식이다 보니 bound $\exp (- 2 N_{t} (a) U_{t} (a)^{2})$ 가 너무 헐겁다는 단점도 있다. Reward $R_{t}$ 의 분포에 관해 이미 알고 있는 쓸만한 정보가 있다면 (e.g., Bernoulli), UCB에 다른 부등식 공식을 적용해 더 엄격한 bound를 구할 수 있다:
Bernstein's inequality
Empirical Bernstein's inequality
Chernoff's inequality
Azuma's inequality

Model-Free RL: Upper Confidence Bounds

MDP에도 UCB를 거의 그대로 적용할 수 있느나, 실은 미래에 policy improvement가 진행되면서 얼마나 각각의 action-value가 향상될 것인지도 고려해야 한다. 즉, 아직 탐색하지 않은 action은 향후 policy improvement가 진행되면서 더욱 향상될 가능성이 있다고 보는 것이다.

Model-Based RL: Bayesian UCB

Disclaimer: UCL 강의에서 Bayes 관련 내용들을 (BayesUCB 등) 제대로 설명하지 않고 굉장히 추상적이고 비유적으로만 해석하고 넘어간다.
위 UCB 알고리즘에서는 단순히 시점 $t$ 까지 action $a$ 를 선택한 횟수 $N_{t} (a)$ 만 저장해서 upper confidence bound를 계산하였다.
그리고 Hoeffding의 정리는 확률 변수의 범위를 제외하고는 아무것도 가정하지 않기 때문에 confidence interval이 상당히 성기다. 확률 변수의 모양이나 분포에 관해 좀 더 정확한 정보를 RL experience를 통해 얻을 수 있다면, confidence interval을 좀 더 정확하게 파악할 수 있다.

Csaba Szepesv´ari

Model-based RL에서는 transition probabilities와 rewards 정보를 학습 과정(experience)에서 저장하므로 이걸 이용해보자는 게 Bayesian bandit과 Bayesian UCB
Bayesian UCB

Likelihood: Transition probabilites와 rewards의 모수가 주어질 때 둘은 어떤 분포(e.g., Gaussian, Bernoulli)를 따른다고 가정하자.

Prior: likelihood가 특정 분포라고 할 때, 해당 분포의 모수가 어떤 분포를 띌지도 가정해야 한다. 예를 들어, likelihood를 카테고리 분포로 둔다면, 이 분포의 모수는 0부터 1사이의 값을 가지므로 prior를 디리클레 분포라고 가정할 수 있다.
Posterior: 우리가 원하는 건, model이 추적한 현재까지의 history에 근거했을 때 rewards(와 transition)가 어떤 모수를 가질지 알아내는 것이다. 가장 그럴 듯한 모수를 찾아내면 likelihood의 가정에 따라 reward 분포를 알아낼 수 있다. 베이즈 정리에 따르면, $(posterior) \propto (likelihood) \cdot (prior)$ 이다.
이런 과정을 통해, 현재까지 학습한 정보인 history를 관찰함으로써 rewards와 transition probabilities 분포를 추정할 수 있다.

Bayesian UCB example: independent Gaussians

Suppose reward distribution is Gaussian $R^{a} (r) = N (r; μ_{a}, σ_{a}^{2})$ .

따라서 모수만 정확히 알면 $R^{a} (r) = P [r | a]$ 의 분포를 알 수 있고 그럼 기댓값을 계산할 수 있으므로 $Q (a) = E [r | a]$ 를 알 수 있다. 문제는 모수를 모른다는 것. 따라서 현재까지의 history와 Bayes law를 근거로 이 모수를 추정해보자.

Compute Gaussian posterior by Bayes law:

$(posterior) \propto (likelihood) \cdot (prior)$
$P [μ_{a}, σ_{a}^{2} | h_{t}] \propto P [h_{t} | μ_{a}, σ_{a}^{2}] \cdot P [μ_{a}, σ_{a}^{2}]$
$P [μ_{a}, σ_{a}^{2} | h_{t}] \propto \prod_{t | a_{t} = a} N (r_{t}; μ_{a}, σ_{a}^{2}) \cdot P [μ_{a}, σ_{a}^{2}]$
$h_{t}$ 는 이때까지 어떤 액션과 어떤 reward를 얻었는지에 관한 기록 (action $a$ 에 관한 reward 기록만 있으면 되긴 함)

$P [μ_{a}, σ_{a}^{2} | h_{t}]$ 는 $μ_{a}, σ_{a}^{2}$ 만의 함수 이므로 미분한다거나 해서 posterior를 최대화하는 (즉, 가장 가능성 높아보이는) 모수를 찾는다 (maximum a posteriori).
Pick action that maximize UCB:

$a_{t} = argmax μ_{a} + c σ_{a} / \sqrt{N_{t} (a)}$

Cons of Bayesian UCB

Prior나 likelihood에 관한 가정을 하기 힘들 때, 그냥 UCB를 사용하는 게 낫다.

이 블로그 검색

Log of Inspiration and Retrospection

기록: UCL Course on RL 요약 및 정리 (Lecture 9: Exploration and Exploitation)

댓글

댓글 쓰기

이 블로그의 인기 게시물

기록: UCL Course on RL 요약 및 정리 (Lecture 5: Model-Free Control)

기록: UCL Course on RL 요약 및 정리 (Lecture 8: Integrating Learning and Planning)