I use this blog to record my notes from the Deep Reinforcement Learning course by Hugging Face.
Unit 1: Introduction to Deep Reinforcement Learning
Concepts:
- Reinforcement Learning is a computational approach to learning from actions. An agent learns by interacting with its environment through trial and error, receiving rewards (positive or negative) as feedback.
- The objective function is to maximize the expected cumulative reward.
- The RL process is a sequence of state, action, reward, and next state.
- The rewards can be discounted: earlier rewards are more probable and predictable than long-term future rewards (see the sketch after this list).
- To solve an RL problem we need an optimal policy, which decides what action to take given a state, for example the one that maximizes the expected return.
- Policy-based methods: Train the policy directly.
- Value-based methods: Train a value function that indirectly defines the policy.
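To make the discounting concrete, here is a minimal sketch (not from the course notebook) of computing a discounted return for a list of rewards; the reward values and gamma below are made up for illustration:

# Minimal sketch: discounted return G = r_1 + gamma*r_2 + gamma^2*r_3 + ...
def discounted_return(rewards, gamma=0.99):
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

# Rewards far in the future are weighted less than immediate ones.
print(discounted_return([1.0, 1.0, 1.0], gamma=0.99))  # 1 + 0.99 + 0.9801 = 2.9701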
Related Glossary:
- Markov Property
- Observations: Partial description of the state of the environment.
- State: Complete description of the state of the world.
- Actions: Discrete/Continuous Actions.
- Tasks: Episodic/Continuous.
Solution:
# Start a virtual display so the environment can be rendered in the notebook
from pyvirtualdisplay import Display
virtual_display = Display(visible=0, size=(1400, 900))
virtual_display.start()
import gym
from huggingface_sb3 import load_from_hub, package_to_hub, push_to_hub
from huggingface_hub import notebook_login # To log to our Hugging Face account to be able to upload models to the Hub.
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.env_util import make_vec_env
# Create the environment
env = make_vec_env('LunarLander-v2', n_envs=16)
# We added some parameters to accelerate the training
model = PPO(
    policy='MlpPolicy',
    env=env,
    n_steps=1024,
    batch_size=64,
    n_epochs=4,
    gamma=0.999,
    gae_lambda=0.98,
    ent_coef=0.01,
    verbose=1)
# Train it for 1,000,000 timesteps
model.learn(total_timesteps=1000000, progress_bar=True, log_interval=100000)
# Save the model
model_name = "LunarLander-v2"
model.save(model_name)
# Evaluate the trained agent on a separate evaluation environment
eval_env = gym.make("LunarLander-v2")
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10, deterministic=True)
print(f"mean_reward={mean_reward:.2f} +/- {std_reward}")
import gym
from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3.common.env_util import make_vec_env
from huggingface_sb3 import package_to_hub
## repo_id is the id of the model repository from the Hugging Face Hub (repo_id = {organization}/{repo_name}, for instance ThomasSimonini/ppo-LunarLander-v2)
repo_id = "" # TODO: fill in with your own {username}/{repo_name}
# TODO: Define the name of the environment
env_id = "LunarLander-v2"
# Create the evaluation env
eval_env = DummyVecEnv([lambda: gym.make(env_id)])
# TODO: Define the model architecture we used
model_architecture = "PPO"
## TODO: Define the commit message
commit_message = "initialization"
# method save, evaluate, generate a model card and record a replay video of your agent before pushing the repo to the hub
package_to_hub(model=model, # Our trained model
model_name=model_name, # The name of our trained model
model_architecture=model_architecture, # The model architecture we used: in our case PPO
env_id=env_id, # Name of the environment
eval_env=eval_env, # Evaluation Environment
repo_id=repo_id, # id of the model repository from the Hugging Face Hub (repo_id = {organization}/{repo_name}, for instance ThomasSimonini/ppo-LunarLander-v2)
commit_message=commit_message)
# Note: if running the package_to_hub function gives a rebasing issue, run the following commands:
# cd <path_to_repo> && git add . && git commit -m "Add message" && git pull
# And don't forget to run "git push" at the end to push the changes to the Hub.
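As a follow-up, a trained agent can also be downloaded back from the Hub with load_from_hub (imported above). This is a minimal sketch; the repo_id and filename are placeholders, not values from my run:

# Minimal sketch: load a trained agent back from the Hugging Face Hub.
# The repo_id and filename below are placeholders; replace them with your own.
from huggingface_sb3 import load_from_hub
from stable_baselines3 import PPO

checkpoint = load_from_hub(repo_id="your-username/ppo-LunarLander-v2", filename="LunarLander-v2.zip")
model = PPO.load(checkpoint)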
Unit 2: Introduction to Q-Learning
To solve an RL problem, we need a policy. There are two ways to obtain it:
- Policy-based: training the policy directly.
- Value-based: Find an optimal value function.
- Most of the time, an Epsilon-Greedy Policy is used to handle the exploration/exploitation trade-off.
- The state-value function: the expected return if the agent starts in a given state and follows the policy afterwards.
- The action-value function: the expected return if the agent starts in a given state, takes a given action, and then follows the policy.
- The value is thus an expected return. One way to simplify the value estimation is the Bellman Equation (it reminds me of the Bellman-Ford algorithm). The main idea is to compute the value as the immediate reward plus the discounted value of the state that follows (see the equation below).
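Written out, the Bellman equation for the state-value function under a policy $\pi$ is:
\(V_{\pi}(s) = E_{\pi}[R_{t+1} + \gamma V_{\pi}(S_{t+1}) \mid S_t = s]\)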
Two learning strategies:
- Monte Carlo: generate a whole episode (a sequence of state, action, reward, ...), compute the return at the end of the episode, and use it as the target for the update.
- Temporal Difference Learning: update the value function after a single step, using the immediate reward plus the discounted value of the next state as the target.
Q-Learning
An off-policy, value-based method that uses a Temporal Difference approach to train its action-value function. The epsilon-greedy policy: instead of always taking the greedy (locally optimal) action, take a random action with probability epsilon and the greedy action otherwise.
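A minimal tabular Q-Learning sketch with an epsilon-greedy policy; the state/action sizes and hyperparameters below are illustrative, not from the course:

import numpy as np

# Minimal tabular Q-Learning sketch; sizes and hyperparameters are illustrative.
n_states, n_actions = 16, 4
q_table = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.99, 0.1

def epsilon_greedy(state):
    # With probability epsilon explore (random action), otherwise exploit.
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(q_table[state]))

def q_update(state, action, reward, next_state):
    # Off-policy TD target: use the greedy (max) action of the next state.
    td_target = reward + gamma * np.max(q_table[next_state])
    q_table[state, action] += alpha * (td_target - q_table[state, action])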
Unit 3: Deep Q-Learning with Atari Games 👾 using RL Baselines3 Zoo
Use a neural network that takes a state and approximates the Q-value of each action at that state.
The Deep Q-Network: DQN
Input: several stacked frames of the game. Output: a vector of Q-values, one for each possible action at that state.
The Deep Q Algorithm
It uses a deep neural network to approximate the different Q-values for each possible action at a state (value-function estimation). It has two phases:
- Sampling: Store the observed experience tuples in a replay memory.
- Training: Select a small batch of tuples randomly and learn from it using a gradient descent update step.
One main problem of Deep Q-Learning is instability. There are three solutions (a sketch follows this list):
- Experience Replay, to make more efficient use of experiences.
  - Make more efficient use of the experiences during training.
  - Avoid forgetting previous experiences and reduce the correlation between experiences.
- Fixed Q-Target, to stabilize the training.
  - Use a separate network with fixed parameters for estimating the TD target.
  - Copy the parameters from our Deep Q-Network every C steps to update the target network.
- Double Deep Q-Learning, to handle the problem of the overestimation of Q-values.
  - Use the DQN network to select the best action to take for the next state.
  - Use the Target network to calculate the target Q-value of taking that action at the next state.
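Here is a minimal sketch of training a DQN agent with Stable-Baselines3 directly (the course uses RL Baselines3 Zoo config files instead; the environment and hyperparameter values below are illustrative):

import gym
from stable_baselines3 import DQN

# Minimal DQN sketch; environment and hyperparameters are illustrative.
env = gym.make("LunarLander-v2")
model = DQN(
    policy="MlpPolicy",
    env=env,
    buffer_size=100_000,           # replay memory used for Experience Replay
    learning_starts=10_000,        # collect experience before training starts
    target_update_interval=1_000,  # copy weights to the fixed Q-Target every C steps
    exploration_final_eps=0.05,    # final epsilon of the epsilon-greedy exploration
    verbose=1)
model.learn(total_timesteps=100_000)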
Bonus Unit 2: Automatic Hyperparameter Tuning with Optuna
Optuna is a library to search for the best hyperparameters automatically.
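A minimal sketch of an Optuna study tuning hyperparameters such as gamma; the objective below is a placeholder, in practice it would train an agent with the sampled values and return its mean evaluation reward:

import optuna

def objective(trial):
    # Sample candidate hyperparameters.
    gamma = trial.suggest_float("gamma", 0.9, 0.9999, log=True)
    learning_rate = trial.suggest_float("learning_rate", 1e-5, 1e-2, log=True)
    # Placeholder score; replace with the agent's mean evaluation reward.
    return -(gamma - 0.99) ** 2 - (learning_rate - 1e-3) ** 2

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print(study.best_params)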
Unit 4: Policy Gradient with PyTorch
The policy-based method:
- The idea is to parameterize the policy and maximize the performance of the parameterized policy using gradient ascent.
- The policy-gradient method is a subclass of the policy-based method.
The advantages and disadvantages of policy-gradient methods
Adv:
- The simplicity of integration.
- Policy-gradient methods can learn a stochastic policy.
- Policy-gradient methods are more effective in high-dimensional action spaces and continuous action spaces.
- Policy-gradient methods have better convergence properties.
Dis:
- It converges to a local maximum instead of a global optimum.
- Training is often slow and sample-inefficient.
- It has high variance.
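A minimal PyTorch sketch of the policy-gradient (REINFORCE-style) update; the network size and the way states/actions/returns are collected are illustrative:

import torch
import torch.nn as nn

# Minimal policy-gradient sketch; dimensions and hyperparameters are illustrative.
obs_dim, n_actions = 8, 4
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def reinforce_update(states, actions, returns):
    # states: (T, obs_dim), actions: (T,), returns: (T,) discounted returns of one episode.
    logits = policy(states)
    log_probs = torch.distributions.Categorical(logits=logits).log_prob(actions)
    # Gradient ascent on the expected return = gradient descent on the negated objective.
    loss = -(log_probs * returns).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()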
Unit 5: Introduction to Unity ML-Agents
Unity ML-Agents is a toolkit for the Unity game engine for creating environments in which to train agents.
Unit 6: Actor Critic Methods with Robotics Environments
Policy-gradient methods estimate the weights of the optimal policy using gradient ascent, i.e. they step in the direction of the steepest increase in return.
Actor-Critic methods
A hybrid architecture combining value-based and policy-based methods to stabilize the training by reducing the variance.
- An actor to control how the agent behaves.
- A critic to measure how good the action is.
So there are two function approximations (see the sketch after this list):
- A policy that controls how our agent acts: $\pi_\theta(s)$.
- A value function to assist the policy update by measuring how good the action taken is: $q_w(s,a)$.
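A minimal sketch of training an Actor-Critic agent with Stable-Baselines3's A2C; the environment choice is illustrative (the course unit uses a robotics environment):

import gym
from stable_baselines3 import A2C

# Minimal Actor-Critic sketch; A2C trains both an actor (the policy) and a critic (the value function).
env = gym.make("LunarLander-v2")
model = A2C(policy="MlpPolicy", env=env, verbose=1)
model.learn(total_timesteps=100_000)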
Unit 7: Introduction to Multi-Agents And AI vs AI
Decentralized approach:
- Treat all agents independently without considering the existence of the other agents.
- Each agent considers the other agents as part of the environment.
- No guarantee of convergence.
Centralized approach:
- A single policy is learned from all the agents.
- The policy takes as input the present state of the environment and outputs joint actions.
- The reward is global.
Unit 8: Proximal Policy Optimization with Doom
Proximal Policy Optimization (PPO): An architecture that improves the agent's training stability by avoiding policy updates that are too large. Two reasons to avoid large policy updates:
- Smaller updates are more likely to converge to an optimal solution.
- A step that is too big can make the policy fall "off the cliff" and take a long time to recover.
Clipped Surrogate Objective Function
\(L^{CLIP}(\theta)=E_t[\min(r_t(\theta)A_t,\ \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)A_t)]\) where $r_t(\theta)$ denotes the probability ratio between the current and the old policy.
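A minimal PyTorch sketch of the clipped surrogate objective; the clip range and tensor shapes are illustrative:

import torch

# Minimal sketch of the PPO clipped surrogate objective; inputs are illustrative.
def clipped_surrogate(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    # Probability ratio between the current and the old policy.
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # Take the element-wise minimum so overly large policy updates are not rewarded.
    return torch.min(unclipped, clipped).mean()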
Bonus Unit 3: Advanced Topics In Reinforcement Learning
- Model-based Reinforcement Learning: learning a model of the environment, and then leveraging the model for control (making decisions).
- Offline vs Online Reinforcement Learning
- Reinforcement Learning from Human Feedback (RLHF): a methodology for integrating human data labels into an RL-based optimization process. It is motivated by the challenge of modeling human preferences.
  - Useful link: https://huggingface.co/blog/rlhf
- Decision Transformers: instead of training a policy using RL methods, such as fitting a value function, that will tell us what action to take to maximize the return (cumulative reward), we use a sequence modeling algorithm (Transformer) that, given a desired return, past states, and actions, will generate future actions to achieve this desired return.
- Language models in RL
- Curriculum Learning for RL