Notes: Summary of the UCL Course on RL (Lecture 1: Introduction to Reinforcement Learning)
This post covers not only Lecture 1: Introduction to Reinforcement Learning (davidsilver.uk) but also:
- Markov property - Wikipedia
- Markov chain - Wikipedia
- Markov decision process - Wikipedia
- Partially observable Markov decision process - Wikipedia
- Reinforcement learning - Wikipedia
- Lecture 2: Markov Decision Processes (davidsilver.uk)
- Also, the complete set of course materials can be found on the following page: Teaching - David Silver.
Lecture Goal
- Reinforcement Learning: An Introduction, Sutton and Barto
- Readable, providing good overview and intuitions
- 445 pages
- Algorithms for Reinforcement Learning, Szepesvari
- Concise, rigorous, mathematical
- 98 pages
- No supervisor, only a reward signal
- Feedback (reward) would be delayed, not instantaneous
- The final reward would be determined after several decisions have been already made.
- Actions may have long-term consequences.
- In supervised learning, feedback comes immediately because each data point comes with a label, but this is not the case in RL.
- Time series (sequential, non i.i.d data)
- Supervised learning assumes that data samples are independent and identically distributed (i.i.d.)
- An action affects the subsequent results (agent states, environment states, eventual reward)
- Sometimes, looking at examples first helps in understanding the concept more than concrete formulas do.
- Defeat the world champion at Go or Chess
- Manage an investment portfolio (quantitative finance)
- Make a humanoid robot walk
- Play many different Atari games better than humans
- An immediate reward \(R_t\) is a scalar feedback signal or a random variable for the feedback.
- For example, in Go, the immediate reward of every move can be regarded as 0, except for the move that ends the game.
- For an autonomous car, the immediate reward at the time step \(t\) when it hits a guardrail can be set to a negative value.
- Def. Reward Hypothesis
- All goals can be described by the maximization of (expected) net reward, which is accumulated across a time interval
- Note: This is less a definition than a statement that RL can only be applied when this condition holds.
- Note: Thus, you must first define a scalar reward to apply RL to your (real-world) problem. If a reward metric cannot be well-defined, RL would fail.
- Goal of RL: select actions to maximize total future reward
- The problem is that the future reward is hard to figure out right away.
- Actions may have long term consequences
- Reward may be delayed (zero immediate rewards)
- It may be better to sacrifice immediate rewards to gain more long-term rewards.
- \(\therefore\) Prediction or learning is required.
- Note: these are fancy words, but they are nothing but data structures
- Observation: a.k.a. the input of an RL network
- Action: a.k.a. the output of an RL network
- Reward: a.k.a. an essential element of the objective function
- I.e., objective = maximize a weighted sum of rewards across a time interval
- History: a complete sequence of observations, actions, and rewards
- A lot of redundant information
- For example, suppose you are playing Go and remember the exact order of every move played so far. Is that meaningful in Go? In Go, you can plan a strategy knowing only the current positions of the stones on the board. So remembering all observations, actions, and rewards is most likely unnecessary; the information needed to choose the next move is enough.
- State: the information used to determine what happens next
- \(S_t=f(H_t)\)
- Environment state: data representing the environment
- E.g., the positions of all black and white stones on the Go board, the number of remaining stones of each color, etc.
- Based on this, the environment takes an action as input, updates its own state, and provides a reward and an observation.
- The agent cannot always observe the environment state directly. In real-world problems, the environment state is often unknown (e.g., quant trading, autonomous driving, etc.).
- The agent does not look at the environment state directly; it makes decisions based only on the observations and rewards it receives.
- Agent state: data representing the agent
- What information would help the agent decide the next stone position? That information constitutes the agent state.
- Markov property
- A sequence of random variables \(X_0, X_1, X_2, ...\), whose possible values form a countable set \(S\) called the state space of the chain, satisfies the Markov property if and only if
- \(\mathrm{Pr}(X_{n+1}=x | X_1=x_1, X_2=x_2, ..., X_n=x_n) = \mathrm{Pr}(X_{n+1}=x|X_n=x_n)\), where both conditional probabilities are well-defined.
- Note: Recalling the difference between history and state explained above, the environment/agent state is Markov. That is, the future is independent of the past given the present, so the past history can be thrown away. The present is a sufficient statistic of the future.
- Note: Put differently, a state can only be formed once enough information has been gathered to satisfy the Markov property. For example, you cannot steer a car knowing only its current velocity; you also need its acceleration.
- Note: The history up to the present is itself, of course, Markov.
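The idea that the state is a compressed function of the history, \(S_t = f(H_t)\), can be made concrete with a tiny sketch. This is not from the lecture; the Go-like move encoding below is a made-up illustration of keeping only the current board occupancy and discarding the order of play.

```python
# A minimal sketch (not from the lecture) of S_t = f(H_t):
# a Go-like history of moves is compressed into the current board occupancy,
# which is all the agent needs to pick the next move. The Move encoding is made up.
from typing import Dict, List, Tuple

Move = Tuple[str, int, int]  # (player, row, col) -- hypothetical encoding

def state_from_history(history: List[Move]) -> Dict[Tuple[int, int], str]:
    """Discard the order of past moves and keep only who occupies which point."""
    board: Dict[Tuple[int, int], str] = {}
    for player, row, col in history:
        board[(row, col)] = player
    return board

history = [("black", 3, 3), ("white", 15, 15), ("black", 16, 3)]
print(state_from_history(history))
# {(3, 3): 'black', (15, 15): 'white', (16, 3): 'black'}
```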
Def. Markov Chain (Process)
- A Markov process is a tuple \((S, T)\), where:
- \(S\) is a countable set of states with the Markov property,
- \(T(s'|s) = \mathrm{Pr}(S_{t+1}=s'|S_t=s)\) is a state transition probability.
- Note: A conditional probability \(\mathrm{Pr}(A|B)=\frac{\mathrm{Pr}(A\cap B)}{\mathrm{Pr}(B)}\) is well-defined only when the denominator \(\mathrm{Pr}(B)\) is well-defined and positive, obviously.
- The possible values of \(X_i\) form a countable set \(S\) called the state space of the chain.
- For more on chains with a finite state space, look into the stationary distribution and its characterization as an eigenvector of the transition matrix.
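As a small illustration of the definition above, here is a sketch with made-up numbers of a two-state Markov chain: a trajectory is sampled using only the current state, and the stationary distribution mentioned above is read off as the eigenvector of the transition matrix for eigenvalue 1.

```python
# A minimal sketch of a Markov chain with made-up numbers.
# T[i, j] = Pr(next state = j | current state = i).
import numpy as np

states = ["sunny", "rainy"]
T = np.array([[0.9, 0.1],
              [0.5, 0.5]])
rng = np.random.default_rng(0)

def sample_chain(start: int, n_steps: int) -> list:
    """The next state depends only on the current state (Markov property)."""
    s, path = start, [start]
    for _ in range(n_steps):
        s = rng.choice(len(states), p=T[s])
        path.append(int(s))
    return path

print(sample_chain(0, 10))

# Stationary distribution: the left eigenvector of T for eigenvalue 1, normalized.
eigvals, eigvecs = np.linalg.eig(T.T)
pi = np.real(eigvecs[:, np.argmax(np.isclose(eigvals, 1.0))])
print(pi / pi.sum())  # approximately [0.833, 0.167]
```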
Def. Markov Reward Process
- A Markov reward process is a tuple \((S, T, R, \gamma)\), where:
- The expected immediate reward at state \(S_t=s\) is \(R(s)=\mathbb{E}[R_{t+1}|S_t=s]\)
- Note: The reward received at \(S_t\) is written with the subscript \(t+1\) because, after the agent moves its state to \(S_t\), one time step is considered to have passed by the time that signal reaches the environment. So when the environment sends the reward signal back to the agent, it is time \(t+1\). Conventions vary slightly between authors.
- Note: In this course, the immediate reward is also treated as a stochastic random variable, so the reward can differ on each visit to the same state; that is why the expectation is taken.
- Note: For reference, the definition of the reward function also differs slightly between authors. If leaving the current state is regarded as completing it (e.g., finishing a course) and rewarded accordingly, then \(R\) is a function of the current state only. If, however, meaning is attached to moving from one particular state to another, it must be a function of both states.
- E.g., \(R(s,s')\) is the immediate reward received after transitioning from state \(s\) to state \(s'\),
- \(\gamma\) is a discount factor, where \(\gamma \in [0,1]\).
- Def. Return
- The return \(G_t\) is the total discounted reward from time-step \(t\).
- \(G_t = R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty}\gamma^k R_{t+k+1}\)
- Why discount?
- \(\gamma\) close to 1 leads to far-sighted evaluation.
- Intuitively, the discount factor is used to account for the uncertainty about rewards to be received in the future.
- Mathematically, the discount factor prevents the return from diverging as the number of time steps grows very large (this is the real reason in practice; e.g., a game of Go can take a very long time to finish). Also, if the discount factor is 1, additional conditions are needed for the MRP to be well-defined, such as having no positive-reward cycles.
- Financially, time itself has value, so a reward received now is worth more than the same reward received later (think compound interest).
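To make the effect of \(\gamma\) concrete, here is a small numeric sketch (with a made-up reward sequence) of the return \(G_t = \sum_{k=0}^{\infty}\gamma^k R_{t+k+1}\), comparing a near-sighted and a far-sighted discount factor.

```python
# A minimal sketch of the discounted return for a made-up reward sequence.
def discounted_return(rewards, gamma):
    """rewards[k] plays the role of R_{t+k+1} in G_t = sum_k gamma^k R_{t+k+1}."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

rewards = [0, 0, 0, 10]                   # delayed reward: nothing until the last step
print(discounted_return(rewards, 0.1))    # 0.01   -> near-sighted, barely values the 10
print(discounted_return(rewards, 0.99))   # ~9.70  -> far-sighted
```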
- Goal of RL: select actions to maximize the expected return
- So let's take action selection into account:
Def. Markov Decision Process (MDP) and Fully Observable Environments
- Note: RL uses MDPs where the probabilities or rewards are unknown.
- A Markov decision process is a 5-tuple \((S, A, T, R, \gamma)\), where:
- \(S\) is a set of states called the state space,
- \(T(s'|s,a) = \mathrm{Pr}(S_{t+1}=s'|S_t=s, A_t=a)\) is the conditional transition probability that action \(a\) in state \(s\) at time \(t\) will lead to state \(s'\) at time \(t+1\),
- Note: State transition is a stochastic process described by the probability function \(T\). Taking an action does not guarantee that things go as intended, so the state transition caused by an action can be stochastic. Of course, depending on the problem setting, it can also be deterministic (e.g., Go or Chess).
- \(A\) is a set of actions called the action space,
- \(R(s,a)\) is the expected immediate reward received at state \(s\) after taking action \(a\).
- Note: \(S, T, R\) belong to the environment. A typical RL agent is not aware of them beforehand. That is the main distinction between RL and planning/dynamic programming.
- Def. Full observability
- \(S\) is the environment state, and the agent's action directly depends on it. Thus, an MDP is also said to have a fully observable (to the agent) environment.
- Note: Different actions in a state potentially induce the same state transition (e.g., \(s\) to \(s'\)).
- Note: Some processes with countably infinite state and action spaces can be reduced to ones with finite state and action spaces [1].
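To tie the ingredients \(S, A, T, R\) together, here is a minimal sketch with made-up states, actions, and numbers (a small battery-powered robot); it only illustrates the data involved, not an algorithm from the lecture.

```python
# A toy MDP with made-up numbers: T(s'|s,a) as nested dicts, R(s,a) as a table.
import random

T = {  # T[s][a] = {s': Pr(s' | s, a)}
    "high_battery": {"search":   {"high_battery": 0.7, "low_battery": 0.3}},
    "low_battery":  {"search":   {"low_battery": 0.6, "dead": 0.4},
                     "recharge": {"high_battery": 1.0}},
}
R = {("high_battery", "search"): 1.0,   # expected immediate reward R(s, a)
     ("low_battery", "search"): 1.0,
     ("low_battery", "recharge"): 0.0}

rng = random.Random(0)

def step(s: str, a: str):
    """Sample s' ~ T(.|s,a) and return it together with R(s,a)."""
    next_states, probs = zip(*T[s][a].items())
    s_next = rng.choices(next_states, weights=probs)[0]
    return s_next, R[(s, a)]

print(step("high_battery", "search"))
```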
Def. Partially Observable Markov Decision Process (POMDP) and Partially Observable Environments
- A POMDP is a 7-tuple \((S, A, T, R, \gamma, \Omega, O)\), where:
- \(\Omega\) is a set of observations,
- \(O\) is a set of conditional observation probabilities.
- Observation
- An agent receives an observation \(o \in \Omega\), which depends on the new state of the environment \(s_{t+1}\) and on the action taken \(a_t\), with probability \(O(o|s_{t+1},a_t)\).
- The agent must make decisions under uncertainty of the true environment state.
- A robot with camera vision isn’t told its absolute location.
- A poker-playing agent only observes public cards.
- Nonetheless, the agent may be aware of previous environment states.
- The POMDP framework is general enough to model a variety of real-world sequential decision processes.
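The only new ingredient relative to the MDP is the observation function \(O(o|s_{t+1}, a_t)\). Below is a minimal sketch with made-up numbers, in the spirit of the classic tiger problem: the agent never sees the true state, only a noisy observation of it.

```python
# A minimal sketch of sampling an observation o ~ O(. | s_{t+1}, a_t); numbers are made up.
import random

O = {  # O[(s_next, a)] = {o: Pr(o | s_next, a)}
    ("tiger_left",  "listen"): {"hear_left": 0.85, "hear_right": 0.15},
    ("tiger_right", "listen"): {"hear_left": 0.15, "hear_right": 0.85},
}

rng = random.Random(0)

def observe(s_next: str, a: str) -> str:
    """The agent never sees s_next itself, only a noisy observation of it."""
    obs, probs = zip(*O[(s_next, a)].items())
    return rng.choices(obs, weights=probs)[0]

print(observe("tiger_left", "listen"))  # usually 'hear_left', sometimes 'hear_right'
```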
Reinforcement Learning In A Nutshell
- Major computational problems in RL
- Prediction
- It predicts how much net reward will be obtained as a result of taking actions.
- So-called policy evaluation (what exactly a policy is will be explained right below)
- Optimization (better known as control in RL contexts)
- It optimizes the strategy for selecting actions so as to maximize the expected net reward.
- So-called policy improvement:
- Function approximation
- It approximates unknown probability mass/density functions such as \(T, R,\) and \(O\).
- So-called value function approximation:
- Typical problem settings of RL
- A model \(T(s'|s,a), R(s,a)\) of the environment is known, but an analytic solution is not available.
- Classic games like Go, Chess, Checkers, etc. (the rules are known but not trivial to solve)
- Only a simulation model of the environment is given.
- A typical scenario of physically-based tasks (e.g., autonomous driving can be guided by Newtonian dynamics)
- Quantitative investment (e.g., Brownian simulation)
- The only way to collect information about the environment is to interact with it.
- A baby learning the world!
- Policy: Agent's behavior function (i.e., it decides which action to take in a given state)
- Deterministic policy: \(a = \pi(s)\)
- Stochastic policy: \(\pi(a|s)=\mathrm{Pr}(A_t=a|S_t=s)\)
- The order in which action selection and state transition happen in an actual MDP:
- \(s \rightarrow\) policy (what we need to optimize) \(\rightarrow a \rightarrow\) conditional transition probability (what the environment determines) \(\rightarrow s'\)
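Continuing the toy MDP sketch from the MDP section (it assumes the `T`, `R`, `rng`, and `step()` defined there are in scope), one interaction step under a made-up stochastic policy looks like this:

```python
# One step of s -> a ~ pi(.|s) -> s' ~ T(.|s,a), reusing step() from the MDP sketch above.
pi = {  # pi[s] = {a: Pr(a | s)} -- made-up numbers
    "high_battery": {"search": 1.0},
    "low_battery":  {"recharge": 0.8, "search": 0.2},
}

def policy_step(s: str):
    """The agent controls pi; the environment controls the transition inside step()."""
    actions, probs = zip(*pi[s].items())
    a = rng.choices(actions, weights=probs)[0]   # rng from the MDP sketch above
    s_next, r = step(s, a)
    return a, s_next, r

print(policy_step("low_battery"))
```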
- Value: How good is each state and/or action
- Expected future reward
- State-value: \(v(s) = \mathbb{E}_{a \sim \pi(s)}[G_t|S_t=s] = \mathbb{E}_{a \sim \pi(s)}[R_{t+1}+\gamma v(S_{t+1})|S_t=s]\)
- The expected sum of the immediate reward \(R_{t+1}\) and the discounted future value \(v(S_{t+1})\), considering all actions that can be taken in state \(s\) and all the next states those actions may lead to.
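A minimal sketch of what that expectation means in practice: estimate \(v(s)\) for the toy MDP and policy sketched above by averaging sampled discounted returns over many rollouts (a Monte Carlo estimate; it assumes the earlier sketches have been run).

```python
# Monte Carlo estimate of v(s): average the sampled discounted returns of many rollouts.
def rollout_return(s: str, gamma: float = 0.9, horizon: int = 50) -> float:
    g, discount = 0.0, 1.0
    for _ in range(horizon):
        if s == "dead":              # terminal state of the toy MDP above
            break
        _, s, r = policy_step(s)     # policy_step() from the policy sketch above
        g += discount * r
        discount *= gamma
    return g

n = 1000
print(sum(rollout_return("low_battery") for _ in range(n)) / n)  # estimate of v(low_battery)
```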
- Model: Agent's representation of the environment
- \(T(s'|s,a)\), \(R(s,a)\)
- In some games such as Chess and Go, the model is already given.
- Not every RL agent has to know or learn the model.
- A.k.a. model-free RL (most of the upcoming lectures deal with the model-free setting)
- Sometimes it is advisable for the RL agent to learn (an approximation of) the model.
- A.k.a. model-based RL
- RL is like trial-and-error learning.
- Exploration finds more information about the environment.
- Exploitation exploits known information to maximize reward.
- It is usually essential to explore as well as to exploit.
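One standard way to balance the two (not specific to this lecture) is epsilon-greedy action selection: with a small probability explore a random action, otherwise exploit the action that currently looks best. The action-value estimates below are made up.

```python
# Epsilon-greedy action selection over (made-up) estimated action values.
import random

def epsilon_greedy(action_values: dict, epsilon: float) -> str:
    """Explore with probability epsilon, otherwise exploit the greedy action."""
    if random.random() < epsilon:
        return random.choice(list(action_values))         # explore: uniform random action
    return max(action_values, key=action_values.get)      # exploit: best current estimate

q_estimates = {"recharge": 0.2, "search": 0.5}
print(epsilon_greedy(q_estimates, epsilon=0.1))            # usually 'search'
```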