CS 285 (Deep Reinforcement Learning) Summary and Notes (Lecture 2: Supervised Learning of Behaviors)

* Most of UC Berkeley's CS 285 is covered by David Silver's UCL Course on RL. These notes focus only on the parts that the UCL Course on RL does not cover.

This post covers Lecture 2: Supervised Learning of Behaviors.

Terminology

In a classification problem in the deep learning setting, $a_{t}$ is simply the ground-truth label of the data, and $a_{t}$ does not affect $o_{t + 1}$.
In contrast, in RL and other sequential decision-making problems, the current action affects the next observation.

Imitation Learning

Nothing but supervised learning
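If the training dataset contains only good examples, the policy cannot follow the good state trajectories perfectly, because of errors in the policy network and similar factors. It also never learns how to respond to new cases that do not appear in the good examples, so it ends up failing to reach the goal.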
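The point above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the one-dimensional observation, the hypothetical expert `expert_action`, and the linear policy fitted by least squares. It only shows that a policy cloned from a narrow set of good examples can match the expert there and still be badly wrong on an observation the examples never covered.

```python
import numpy as np

rng = np.random.default_rng(0)

def expert_action(obs):
    # Hypothetical expert: steer back toward the center, saturating at +/-1.
    return -np.tanh(3.0 * obs)

# "Good examples" only: the expert never strays far from the center.
train_obs = rng.uniform(-0.2, 0.2, size=(200, 1))
train_act = expert_action(train_obs)

# Behavioral cloning is plain supervised regression on (o_t, a_t) pairs.
# Here the policy is linear, a = o @ w, fitted by least squares.
w, *_ = np.linalg.lstsq(train_obs, train_act, rcond=None)

def policy_action(obs):
    return obs @ w

# Error on observations like the training data vs. one far outside it.
in_dist_error = np.abs(policy_action(train_obs) - train_act).max()
far_obs = np.array([[2.0]])
out_dist_error = np.abs(policy_action(far_obs) - expert_action(far_obs)).item()

print(f"max error on training-like observations: {in_dist_error:.3f}")
print(f"error at an unseen observation o = 2.0:  {out_dist_error:.3f}")
```

In a sequential problem this matters more than in ordinary supervised learning, because the policy's own small errors are what move it toward such unseen observations.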
DAgger: dataset aggregation

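A human manually labels the desirable action for each state or observation that the learned policy actually reached.
However, with this method the labeling in step (3) is too costly, and training a self-driving car this way would mean that in step (2), where the learned policy is run, the car actually has to crash over and over. It is also hard for a human to label correctly when the observations are very different from what they would see while driving themselves.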
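As a rough sketch, the DAgger loop can be written as follows. The step numbers in the comments follow the usual four-step statement of the algorithm (train, run the policy, label, aggregate). The toy one-dimensional dynamics, the feature map, and the function `expert_action`, which stands in for the human labeler, are all assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def expert_action(obs):
    # Hypothetical expert labeler (stands in for the human).
    return -np.tanh(3.0 * obs)

def features(obs):
    # Small fixed feature map for the toy policy (an assumption).
    return np.stack([obs, obs**3], axis=1)

def fit_policy(obs, act):
    # Step (1): supervised regression on the current dataset.
    w, *_ = np.linalg.lstsq(features(obs), act, rcond=None)
    return w

def policy_action(w, obs):
    return features(np.atleast_1d(obs)) @ w

def rollout(w, horizon=20):
    # Step (2): run the learned policy and record the observations it visits.
    obs = rng.uniform(-2.0, 2.0)
    visited = []
    for _ in range(horizon):
        visited.append(obs)
        act = policy_action(w, obs)[0]
        # Toy dynamics: the action shifts the observation, plus noise.
        obs = float(np.clip(obs + act + 0.1 * rng.normal(), -3.0, 3.0))
    return np.array(visited)

# Initial dataset: expert demonstrations with narrow coverage.
data_obs = rng.uniform(-0.2, 0.2, size=50)
data_act = expert_action(data_obs)
w = fit_policy(data_obs, data_act)

for _ in range(5):
    new_obs = rollout(w)                            # step (2)
    new_act = expert_action(new_obs)                # step (3): expert labels
    data_obs = np.concatenate([data_obs, new_obs])  # step (4): aggregate
    data_act = np.concatenate([data_act, new_act])
    w = fit_policy(data_obs, data_act)              # back to step (1)
```

Each call to `expert_action` inside the loop corresponds to a human labeling request, which is where the labeling cost comes from; each call to `rollout` is a real run of a still-imperfect policy.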

Why might we fail to fit the expert?

Non-Markovian behavior

Storing the entire history and using it for model training is too costly.
Causal confusion: the model can end up confusing cause and effect.

Causal Confusion in Imitation Learning [de Haan et al. 2019]

Multimodal behavior: multiple optimal actions for an observation

In this case, we must not simply take the average action.
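A small numerical sketch of why averaging fails, using made-up data: suppose that at one observation (say, an obstacle straight ahead) half of the demonstrations steer left and half steer right. A single prediction trained with mean squared error converges to the mean of the demonstrated actions, which here is roughly "go straight", an action no demonstration took.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical demonstrations at one observation:
# half steer left (-1), half steer right (+1), with a little noise.
modes = rng.choice([-1.0, 1.0], size=1000)
actions = modes + 0.05 * rng.normal(size=1000)

# The MSE-optimal single prediction is the sample mean ...
mse_optimal_action = actions.mean()

# ... which lies between the two modes, far from every demonstrated action.
distance_to_nearest_demo = np.abs(actions - mse_optimal_action).min()

print(f"MSE-optimal action: {mse_optimal_action:+.3f}")
print(f"distance to the nearest demonstrated action: {distance_to_nearest_demo:.3f}")
```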
$\Rightarrow$ solutions

Output mixture of Gaussians (easy to implement)
Latent variable models (theoretically rigorous, hard to train)

VAE, flow-based model (RealNVP), etc.

Autoregressive discretization (good balance)

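Discretizing the action space by splitting each action dimension into bins has a cost that grows exponentially with the number of dimensions, so it becomes impractical as the action space gets higher-dimensional. An autoregressive model mitigates this.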
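To make the cost argument concrete, here is a small sketch with arbitrarily chosen sizes (10 bins, 6 action dimensions). The function `conditional_logits` is a hypothetical placeholder for a network head that models one action dimension given the observation and the dimensions sampled so far; here it just returns random logits of the right shape.

```python
import numpy as np

n_bins, action_dim = 10, 6

# Joint discretization: one categorical over every combination of bins.
joint_outputs = n_bins ** action_dim

# Autoregressive: one n_bins-way categorical per dimension,
# each conditioned on the dimensions sampled so far.
autoregressive_outputs = n_bins * action_dim

rng = np.random.default_rng(0)

def conditional_logits(obs, previous_bins, dim):
    # Hypothetical stand-in for a learned head
    # p(a_dim | obs, a_0, ..., a_{dim-1}); returns random logits.
    return rng.normal(size=n_bins)

def sample_action_bins(obs):
    # Sample one dimension at a time, feeding earlier choices forward.
    bins = []
    for dim in range(action_dim):
        logits = conditional_logits(obs, bins, dim)
        probs = np.exp(logits - logits.max())  # numerically stable softmax
        probs /= probs.sum()
        bins.append(int(rng.choice(n_bins, p=probs)))
    return bins

sampled = sample_action_bins(obs=np.zeros(4))
print(joint_outputs, autoregressive_outputs, sampled)
```

The output size drops from `n_bins ** action_dim` to `n_bins * action_dim`, at the price of sampling the dimensions sequentially.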
