Notes: Summary of the UCL Course on RL (Lecture 1: Introduction to Reinforcement Learning)
This post covers not only Lecture 1: Introduction to Reinforcement Learning (davidsilver.uk) but also:
- Markov property - Wikipedia
- Markov chain - Wikipedia
- Markov decision process - Wikipedia
- Partially observable Markov decision process - Wikipedia
- Reinforcement learning - Wikipedia
- Lecture 2: Markov Decision Processes (davidsilver.uk)
- Also, the complete set of course materials can be found on the following page: Teaching - David Silver.
Lecture Goal
- Reinforcement Learning: An Introduction, Sutton and Barto
- Readable, providing good overview and intuitions
- 445 pages
- Algorithms for Reinforcement Learning, Szepesvari
- Concise, rigorous, mathematical
- 98 pages
- No supervisor, only a reward signal
- Feedback (reward) would be delayed, not instantaneous
- The final reward would be determined after several decisions have been already made.
- Actions may have long-term consequences.
- In supervised learning, feedback comes immediately because each data point comes with a label, but this is not the case in RL.
- Time series (sequential, non i.i.d data)
- Supervised learning assumes that data samples are independent and identically distributed (i.i.d.)
- An action affects the subsequent results (agent states, environment states, eventual reward)
- Sometimes, looking at examples first helps in understanding the concept more than concrete formulas do.
- Defeat the world champion at Go or Chess
- Manage an investment portfolio (quantitative finance)
- Make a humanoid robot walk
- Play many different Atari games better than humans
- An immediate reward \(R_t\) is a scalar feedback signal or a random variable for the feedback.
- For example, in Go, the immediate reward of every move can be regarded as 0, except for the move that ends the game.
- For an autonomous car, the immediate reward at the time step \(t\) when it hits a guardrail can be set to a negative value.
- Def. Reward Hypothesis
- All goals can be described by the maximization of (expected) net reward, which is accumulated across a time interval
- Note: This is less a definition than a statement that RL can only be applied when this condition holds.
- Note: Thus, you must first define a scalar reward to apply RL to your (real-world) problem. If a reward metric cannot be well-defined, RL would fail.
- Goal of RL: select actions to maximize total future reward
- The problem is that the future reward is hard to figure out right away.
- Actions may have long term consequences
- Reward may be delayed (zero immediate rewards)
- It may be better to sacrifice immediate rewards to gain more long-term rewards.
- \(\therefore\) Prediction or learning is required.
- Note: these are fancy words, but they are nothing but data structures
- Observation: a.k.a. the input of an RL network
- Action: a.k.a. the output of an RL network
- Reward: a.k.a. an essential element of the objective function
- I.e., objective = maximize a weighted sum of rewards across a time interval
- History: a complete sequence of observations, actions, and rewards
- A lot of redundant information
- For example, suppose you are playing Go and remember the exact order of every move played so far. Is that meaningful in Go? In Go, you can plan a strategy knowing only the current positions of the stones on the board. So remembering all observations, actions, and rewards is most likely unnecessary; the information needed to choose the next move is enough.
- State: the information used to determine what happens next
- \(S_t=f(H_t)\)
- Environment state: data representing the environment
- E.g., the positions of all black and white stones on the Go board, the number of remaining stones of each color, etc.
- Based on this, the environment takes an action as input, updates its own state, and provides a reward and an observation.
- The agent cannot always observe the environment state directly. In real-world problems, the environment state is often unknown (e.g., quant trading, autonomous driving, etc.).
- The agent does not look at the environment state directly; it makes decisions based only on the observations and rewards it receives.
- Agent state: data representing the agent
- What information would help the agent decide the next stone position? That information constitutes the agent state.
- Markov property
- A sequence of random variables \(X_0, X_1, X_2, ...\), whose possible values form a countable set \(S\) called the state space of the chain, satisfies the Markov property if and only if
- \(\mathrm{Pr}(X_{n+1}=x | X_1=x_1, X_2=x_2, ..., X_n=x_n) = \mathrm{Pr}(X_{n+1}=x|X_n=x_n)\), where both conditional probabilities are well-defined.
- Note: Recalling the difference between history and state explained above, the environment/agent state is Markov. That is, the future is independent of the past given the present, so the past history can be thrown away. The present is a sufficient statistic of the future.
- Note: Put differently, a state can only be formed once enough information has been gathered to satisfy the Markov property. For example, you cannot steer a car knowing only its current velocity; you also need its acceleration.
- Note: The history up to the present is itself, of course, Markov.
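The idea that the state is a compressed function of the history, \(S_t = f(H_t)\), can be made concrete with a tiny sketch. This is not from the lecture; the Go-like move encoding below is a made-up illustration of keeping only the current board occupancy and discarding the order of play.

```python
# A minimal sketch (not from the lecture) of S_t = f(H_t):
# a Go-like history of moves is compressed into the current board occupancy,
# which is all the agent needs to pick the next move. The Move encoding is made up.
from typing import Dict, List, Tuple

Move = Tuple[str, int, int]  # (player, row, col) -- hypothetical encoding

def state_from_history(history: List[Move]) -> Dict[Tuple[int, int], str]:
    """Discard the order of past moves and keep only who occupies which point."""
    board: Dict[Tuple[int, int], str] = {}
    for player, row, col in history:
        board[(row, col)] = player
    return board

history = [("black", 3, 3), ("white", 15, 15), ("black", 16, 3)]
print(state_from_history(history))
# {(3, 3): 'black', (15, 15): 'white', (16, 3): 'black'}
```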
Def. Markov Chain (Process)
- A Markov process is a tuple \((S, T)\), where:
- \(S\) is a countable set of states with the Markov property,
- \(T(s'|s) = \mathrm{Pr}(S_{t+1}=s'|S_t=s)\) is a state transition probability.
- Note: A conditional probability \(\mathrm{Pr}(A|B)=\frac{\mathrm{Pr}(A\cap B)}{\mathrm{Pr}(B)}\) is well-defined only when the denominator \(\mathrm{Pr}(B)\) is well-defined and positive, obviously.
- The possible values of \(X_i\) form a countable set \(S\) called the state space of the chain.
- For more on chains with a finite state space, look into the stationary distribution and its characterization as an eigenvector of the transition matrix.
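As a small illustration of the definition above, here is a sketch with made-up numbers of a two-state Markov chain: a trajectory is sampled using only the current state, and the stationary distribution mentioned above is read off as the eigenvector of the transition matrix for eigenvalue 1.

```python
# A minimal sketch of a Markov chain with made-up numbers.
# T[i, j] = Pr(next state = j | current state = i).
import numpy as np

states = ["sunny", "rainy"]
T = np.array([[0.9, 0.1],
              [0.5, 0.5]])
rng = np.random.default_rng(0)

def sample_chain(start: int, n_steps: int) -> list:
    """The next state depends only on the current state (Markov property)."""
    s, path = start, [start]
    for _ in range(n_steps):
        s = rng.choice(len(states), p=T[s])
        path.append(int(s))
    return path

print(sample_chain(0, 10))

# Stationary distribution: the left eigenvector of T for eigenvalue 1, normalized.
eigvals, eigvecs = np.linalg.eig(T.T)
pi = np.real(eigvecs[:, np.argmax(np.isclose(eigvals, 1.0))])
print(pi / pi.sum())  # approximately [0.833, 0.167]
```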
Def. Markov Reward Process
- A Markov reward process is a tuple \((S, T, R, \gamma)\), where:
- The expected immediate reward at state \(S_t=s\) is \(R(s)=\mathbb{E}[R_{t+1}|S_t=s]\)
- Note: The reward received at \(S_t\) is written with the subscript \(t+1\) because, after the agent moves its state to \(S_t\), one time step is considered to have passed by the time that signal reaches the environment. So when the environment sends the reward signal back to the agent, it is time \(t+1\). Conventions vary slightly between authors.
- Note: In this course, the immediate reward is also treated as a stochastic random variable, so the reward can differ on each visit to the same state; that is why the expectation is taken.
- Note: For reference, the definition of the reward function also differs slightly between authors. If leaving the current state is regarded as completing it (e.g., finishing a course) and rewarded accordingly, then \(R\) is a function of the current state only. If, however, meaning is attached to moving from one particular state to another, it must be a function of both states.
- E.g., \(R(s,s')\) is the immediate reward received after transitioning from state \(s\) to state \(s'\),
- \(\gamma\) is a discount factor, where \(\gamma \in [0,1]\).
- Def. Return
- The return \(G_t\) is the total discounted reward from time-step \(t\).
- \(G_t = R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty}\gamma^k R_{t+k+1}\)
- Why discount?
- \(\gamma\) close to 1 leads to far-sighted evaluation.
- Intuitively, the discount factor is used to account for the uncertainty about rewards to be received in the future.
- Mathematically, the discount factor prevents the return from diverging as the number of time steps grows very large (this is the real reason in practice; e.g., a game of Go can take a very long time to finish). Also, if the discount factor is 1, additional conditions are needed for the MRP to be well-defined, such as having no positive-reward cycles.
- Financially, time itself has value, so a reward received now is worth more than the same reward received later (think compound interest).
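To make the effect of \(\gamma\) concrete, here is a small numeric sketch (with a made-up reward sequence) of the return \(G_t = \sum_{k=0}^{\infty}\gamma^k R_{t+k+1}\), comparing a near-sighted and a far-sighted discount factor.

```python
# A minimal sketch of the discounted return for a made-up reward sequence.
def discounted_return(rewards, gamma):
    """rewards[k] plays the role of R_{t+k+1} in G_t = sum_k gamma^k R_{t+k+1}."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

rewards = [0, 0, 0, 10]                   # delayed reward: nothing until the last step
print(discounted_return(rewards, 0.1))    # 0.01   -> near-sighted, barely values the 10
print(discounted_return(rewards, 0.99))   # ~9.70  -> far-sighted
```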
- Goal of RL: select actions to maximize the expected return
- So let's take action selection into account:
Def. Markov Decision Process (MDP) and Fully Observable Environments
- Note: RL uses MDPs where the probabilities or rewards are unknown.
- A Markov decision process is a 5-tuple \((S, A, T, R, \gamma)\), where:
- \(S\) is a set of states called the state space,
- \(T(s'|s,a) = \mathrm{Pr}(S_{t+1}=s'|S_t=s, A_t=a)\) is the conditional transition probability that action \(a\) in state \(s\) at time \(t\) will lead to state \(s'\) at time \(t+1\),
- Note: State transition is a stochastic process described by the probability function \(T\). Taking an action does not guarantee that things go as intended, so the state transition caused by an action can be stochastic. Of course, depending on the problem setting, it can also be deterministic (e.g., Go or Chess).
- \(A\) is a set of actions called the action space,
- \(R(s,a)\) is the expected immediate reward received at state \(s\) after taking action \(a\).
- Note: \(S, T, R\) belong to the environment. A typical RL agent is not aware of them beforehand. That is the main distinction between RL and planning/dynamic programming.
- Def. Full observability
- \(S\) is the environment state, and the agent's action directly depends on it. Thus, an MDP is also said to have a fully observable (to the agent) environment.
- Note: Different actions in a state potentially induce the same state transition (e.g., \(s\) to \(s'\)).
- Note: Some processes with countably infinite state and action spaces can be reduced to ones with finite state and action spaces [1].
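To tie the ingredients \(S, A, T, R\) together, here is a minimal sketch with made-up states, actions, and numbers (a small battery-powered robot); it only illustrates the data involved, not an algorithm from the lecture.

```python
# A toy MDP with made-up numbers: T(s'|s,a) as nested dicts, R(s,a) as a table.
import random

T = {  # T[s][a] = {s': Pr(s' | s, a)}
    "high_battery": {"search":   {"high_battery": 0.7, "low_battery": 0.3}},
    "low_battery":  {"search":   {"low_battery": 0.6, "dead": 0.4},
                     "recharge": {"high_battery": 1.0}},
}
R = {("high_battery", "search"): 1.0,   # expected immediate reward R(s, a)
     ("low_battery", "search"): 1.0,
     ("low_battery", "recharge"): 0.0}

rng = random.Random(0)

def step(s: str, a: str):
    """Sample s' ~ T(.|s,a) and return it together with R(s,a)."""
    next_states, probs = zip(*T[s][a].items())
    s_next = rng.choices(next_states, weights=probs)[0]
    return s_next, R[(s, a)]

print(step("high_battery", "search"))
```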
Def. Partially Observable Markov Decision Process (POMDP) and Partially Observable Environments
- A POMDP is a 7-tuple \((S, A, T, R, \gamma, \Omega, O)\), where:
- \(\Omega\) is a set of observations,
- \(O\) is a set of conditional observation probabilities.
- Observation
- An agent receives an observation \(o \in \Omega\), which depends on the new state of the environment \(s_{t+1}\) and on the action taken \(a_t\), with probability \(O(o|s_{t+1},a_t)\).
- The agent must make decisions under uncertainty of the true environment state.
- A robot with camera vision isn’t told its absolute location.
- A poker-playing agent only observes public cards.
- Nonetheless, the agent may be aware of previous environment states.
- The POMDP framework is general enough to model a variety of real-world sequential decision processes.
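The only new ingredient relative to the MDP is the observation function \(O(o|s_{t+1}, a_t)\). Below is a minimal sketch with made-up numbers, in the spirit of the classic tiger problem: the agent never sees the true state, only a noisy observation of it.

```python
# A minimal sketch of sampling an observation o ~ O(. | s_{t+1}, a_t); numbers are made up.
import random

O = {  # O[(s_next, a)] = {o: Pr(o | s_next, a)}
    ("tiger_left",  "listen"): {"hear_left": 0.85, "hear_right": 0.15},
    ("tiger_right", "listen"): {"hear_left": 0.15, "hear_right": 0.85},
}

rng = random.Random(0)

def observe(s_next: str, a: str) -> str:
    """The agent never sees s_next itself, only a noisy observation of it."""
    obs, probs = zip(*O[(s_next, a)].items())
    return rng.choices(obs, weights=probs)[0]

print(observe("tiger_left", "listen"))  # usually 'hear_left', sometimes 'hear_right'
```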
Reinforcement Learning In A Nutshell
- Major computational problems in RL
- Prediction
- It predicts how much net reward will be obtained as a result of taking actions.
- So-called policy evaluation (what exactly a policy is will be explained right below)
- Optimization (better known as control in RL contexts)
- It optimizes the strategy for selecting actions so as to maximize the expected net reward.
- So-called policy improvement:
- Function approximation
- It approximates unknown probability mass/density functions such as \(T, R,\) and \(O\).
- So-called value function approximation:
- Typical problem settings of RL
- A model \(T(s'|s,a), R(s,a)\) of the environment is known, but an analytic solution is not available.
- Classic games like Go, Chess, Checkers, etc. (the rules are known but not trivial to solve)
- Only a simulation model of the environment is given.
- A typical scenario of physically-based tasks (e.g., autonomous driving can be guided by Newtonian dynamics)
- Quantitative investment (e.g., Brownian simulation)
- The only way to collect information about the environment is to interact with it.
- A baby learning the world!
- Policy: Agent's behavior function (i.e., it decides which action to take in a given state)
- Deterministic policy: \(a = \pi(s)\)
- Stochastic policy: \(\pi(a|s)=\mathrm{Pr}(A_t=a|S_t=s)\)
- The order in which action selection and state transition happen in an actual MDP:
- \(s \rightarrow\) policy (what we need to optimize) \(\rightarrow a \rightarrow\) conditional transition probability (what the environment determines) \(\rightarrow s'\)
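Continuing the toy MDP sketch from the MDP section (it assumes the `T`, `R`, `rng`, and `step()` defined there are in scope), one interaction step under a made-up stochastic policy looks like this:

```python
# One step of s -> a ~ pi(.|s) -> s' ~ T(.|s,a), reusing step() from the MDP sketch above.
pi = {  # pi[s] = {a: Pr(a | s)} -- made-up numbers
    "high_battery": {"search": 1.0},
    "low_battery":  {"recharge": 0.8, "search": 0.2},
}

def policy_step(s: str):
    """The agent controls pi; the environment controls the transition inside step()."""
    actions, probs = zip(*pi[s].items())
    a = rng.choices(actions, weights=probs)[0]   # rng from the MDP sketch above
    s_next, r = step(s, a)
    return a, s_next, r

print(policy_step("low_battery"))
```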
- Value: How good is each state and/or action
- Expected future reward
- State-value: \(v(s) = \mathbb{E}_{a \sim \pi(s)}[G_t|S_t=s] = \mathbb{E}_{a \sim \pi(s)}[R_{t+1}+\gamma v(S_{t+1})|S_t=s]\)
- The expected sum of the immediate reward \(R_{t+1}\) and the discounted future value \(v(S_{t+1})\), considering all actions that can be taken in state \(s\) and all the next states those actions may lead to.
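A minimal sketch of what that expectation means in practice: estimate \(v(s)\) for the toy MDP and policy sketched above by averaging sampled discounted returns over many rollouts (a Monte Carlo estimate; it assumes the earlier sketches have been run).

```python
# Monte Carlo estimate of v(s): average the sampled discounted returns of many rollouts.
def rollout_return(s: str, gamma: float = 0.9, horizon: int = 50) -> float:
    g, discount = 0.0, 1.0
    for _ in range(horizon):
        if s == "dead":              # terminal state of the toy MDP above
            break
        _, s, r = policy_step(s)     # policy_step() from the policy sketch above
        g += discount * r
        discount *= gamma
    return g

n = 1000
print(sum(rollout_return("low_battery") for _ in range(n)) / n)  # estimate of v(low_battery)
```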
- Model: Agent's representation of the environment
- \(T(s'|s,a)\), \(R(s,a)\)
- In some games such as Chess and Go, the model is already given.
- Not every RL agent has to know or learn the model.
- A.k.a. model-free RL (most of the upcoming lectures deal with the model-free setting)
- Sometimes it is advisable for the RL agent to learn (an approximation of) the model.
- A.k.a. model-based RL
- RL is like trial-and-error learning.
- Exploration finds more information about the environment.
- Exploitation exploits known information to maximize reward.
- It is usually essential to explore as well as to exploit.
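One standard way to balance the two (not specific to this lecture) is epsilon-greedy action selection: with a small probability explore a random action, otherwise exploit the action that currently looks best. The action-value estimates below are made up.

```python
# Epsilon-greedy action selection over (made-up) estimated action values.
import random

def epsilon_greedy(action_values: dict, epsilon: float) -> str:
    """Explore with probability epsilon, otherwise exploit the greedy action."""
    if random.random() < epsilon:
        return random.choice(list(action_values))         # explore: uniform random action
    return max(action_values, key=action_values.get)      # exploit: best current estimate

q_estimates = {"recharge": 0.2, "search": 0.5}
print(epsilon_greedy(q_estimates, epsilon=0.1))            # usually 'search'
```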