I use this blog to record my learning notes from the Deep Reinforcement Learning course by Hugging Face.
Unit 1: Introduction to Deep Reinforcement Learning
Concepts:
- Reinforcement Learning is a computational approach to learning from actions. An agent learns by interacting with its environment through trial and error, receiving rewards (positive or negative) as feedback.
- The objective is to maximize the expected cumulative reward.
- The RL process is a loop of state, action, reward, and next state (see the sketch after this list).
- Rewards can be discounted: earlier rewards are more probable and predictable than long-term future rewards.
- To solve an RL problem, we need to find the optimal policy, which decides the action to take given a state, for example the policy that maximizes the expected return. There are two ways to find it:
  - Policy-based methods: train the policy directly.
  - Value-based methods: train a value function and derive the policy from it.
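To make the state/action/reward/next-state loop concrete, here is a minimal sketch of the agent-environment interaction in gym. It assumes the classic gym API (reset() returns only the observation and step() returns a 4-tuple; newer gym/gymnasium releases return (obs, info) and a 5-tuple), and the random agent is just a placeholder for a learned policy.

import gym

env = gym.make("LunarLander-v2")
state = env.reset()
done = False
total_reward = 0.0
while not done:
    action = env.action_space.sample()                  # placeholder policy: random action
    next_state, reward, done, info = env.step(action)   # environment returns reward and next state
    total_reward += reward                               # cumulative (undiscounted) reward
    state = next_state
print(f"Episode return: {total_reward:.2f}")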
 
 
Related Glossary:
- Markov Property: the agent needs only the current state to decide what action to take, not the whole history of previous states and actions.
- Observations: partial description of the state of the environment.
- State: complete description of the state of the world.
- Actions: the action space can be discrete or continuous.
- Tasks: episodic or continuing.
 
Solution:
from pyvirtualdisplay import Display
virtual_display = Display(visible=0, size=(1400, 900))
virtual_display.start()
import gym
from huggingface_sb3 import load_from_hub, package_to_hub, push_to_hub
from huggingface_hub import notebook_login # To log to our Hugging Face account to be able to upload models to the Hub.
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.env_util import make_vec_env
# Create the environment
env = make_vec_env('LunarLander-v2', n_envs=16)
# We added some parameters to accelerate the training
model = PPO(
    policy = 'MlpPolicy',
    env = env,
    n_steps = 1024,
    batch_size = 64,
    n_epochs = 4,
    gamma = 0.999,
    gae_lambda = 0.98,
    ent_coef = 0.01,
    verbose=1)
# Train it for 1,000,000 timesteps
model.learn(total_timesteps=1000000, progress_bar=True, log_interval=100000)
# Save the model
model_name = "LunarLander-v2"
model.save(model_name)
eval_env = gym.make("LunarLander-v2")
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10, deterministic=True)
print(f"mean_reward={mean_reward:.2f} +/- {std_reward}")
import gym
from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3.common.env_util import make_vec_env
from huggingface_sb3 import package_to_hub
## repo_id is the id of the model repository on the Hugging Face Hub (repo_id = {organization}/{repo_name}, for instance ThomasSimonini/ppo-LunarLander-v2)
repo_id = ""  # TODO: replace with your own repo id
# TODO: Define the name of the environment
env_id = "LunarLander-v2"
# Create the evaluation env
eval_env = DummyVecEnv([lambda: gym.make(env_id)])
# TODO: Define the model architecture we used
model_architecture = "PPO"
## TODO: Define the commit message
commit_message = "initialization"
# method save, evaluate, generate a model card and record a replay video of your agent before pushing the repo to the hub
package_to_hub(model=model, # Our trained model
               model_name=model_name, # The name of our trained model
               model_architecture=model_architecture, # The model architecture we used: in our case PPO
               env_id=env_id, # Name of the environment
               eval_env=eval_env, # Evaluation Environment
               repo_id=repo_id, # id of the model repository from the Hugging Face Hub (repo_id = {organization}/{repo_name}, for instance ThomasSimonini/ppo-LunarLander-v2)
               commit_message=commit_message)
# Note: if running the package_to_hub function gives a rebasing issue, run the following commands:
# cd <path_to_repo> && git add . && git commit -m "Add message" && git pull
# And don't forget to "git push" at the end to push the changes to the Hub.
Unit 2: Introduction to Q-Learning
To solve an RL problem, we need a policy. There are two ways to obtain it:
- Policy-based: train the policy directly.
- Value-based: find an optimal value function and use it to derive the policy.
- Most of the time, an Epsilon-Greedy Policy is used to handle the exploration/exploitation trade-off.

There are two kinds of value functions:
- The state-value function.
- The action-value function.

The value of a state is the expected return the agent gets when starting from that state. One way to simplify value estimation is the Bellman Equation (it reminds me of the Bellman-Ford algorithm). The main idea is that the value of a state equals the immediate reward plus the discounted value of the state that follows, as written in the equation below.
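In my notation, the Bellman equation for the state-value function under a policy $\pi$ (with discount factor $\gamma$) can be written as:

\(V_\pi(s) = \mathbb{E}_\pi\left[R_{t+1} + \gamma V_\pi(S_{t+1}) \mid S_t = s\right]\)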
        
 
Two learning strategies:
- Monte Carlo: run a whole episode (a sequence of state, action, reward, ...), calculate the return at the end of the episode, and use it as the target for updating.
- Temporal Difference Learning: update the value function after a single step, using the immediate reward plus the discounted value of the next state as the target.
 
Q-Learning
An off-policy value-based method that uses a Temporal Difference approach to train its action-value function. The epsilon-greedy policy: instead of always taking the locally optimal (greedy) action, take a random action with probability epsilon and the greedy action otherwise.
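Below is a minimal sketch of tabular Q-learning with an epsilon-greedy policy. It assumes a small discrete environment (FrozenLake-v1 from gym, with the classic 4-tuple step API), and the hyperparameters alpha, gamma and epsilon are illustrative choices rather than the course's exact values.

import numpy as np
import gym

env = gym.make("FrozenLake-v1")
q_table = np.zeros((env.observation_space.n, env.action_space.n))
alpha, gamma, epsilon = 0.1, 0.99, 0.1      # illustrative hyperparameters

for episode in range(5000):
    state = env.reset()
    done = False
    while not done:
        # Epsilon-greedy: explore with probability epsilon, otherwise exploit.
        if np.random.rand() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(q_table[state]))
        next_state, reward, done, info = env.step(action)
        # TD update: move Q(s, a) toward the target r + gamma * max_a' Q(s', a').
        td_target = reward + gamma * np.max(q_table[next_state]) * (not done)
        q_table[state, action] += alpha * (td_target - q_table[state, action])
        state = next_state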
Unit 3: Deep Q-Learning with Atari Games 👾 using RL Baselines3 Zoo
Use a neural network that takes a state and approximates the Q-values for each possible action.
The Deep Q-Network: DQN
Input: a stack of several game frames. Output: a vector of Q-values, one for each possible action at that state (a rough sketch of such a network is given at the end of this unit).
The Deep Q Algorithm
It uses a deep neural network to approximate the different Q-values for each possible action at a state (value-function estimation). It has two phases:
- Sampling: store the observed experience tuples in a replay memory.
- Training: select a small batch of tuples randomly and learn from this batch using a gradient descent update step.

One main problem of Deep Q-Learning is instability. There are three solutions:
- Experience Replay, to make more efficient use of experiences:
  - Make more efficient use of the experiences during training.
  - Avoid forgetting previous experiences and reduce the correlation between experiences.
- Fixed Q-Target, to stabilize the training:
  - Use a separate network with fixed parameters for estimating the TD target.
  - Copy the parameters from our Deep Q-Network every C steps to update the target network.
- Double Deep Q-Learning, to handle the problem of the overestimation of Q-values:
  - Use the DQN network to select the best action to take at the next state.
  - Use the target network to calculate the target Q-value of taking that action at the next state.
 
Bonus Unit 2: Automatic Hyperparameter Tuning with Optuna
Optuna is a library to search for the best hyperparameters automatically.
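A minimal sketch of how Optuna is typically used: define an objective that samples hyperparameters from the trial, trains and evaluates something with them, and returns a score to maximize. The objective below is a toy function standing in for training an agent and returning its mean reward; the parameter names and ranges are illustrative.

import optuna

def objective(trial):
    # Sample candidate hyperparameters (names and ranges are illustrative).
    learning_rate = trial.suggest_float("learning_rate", 1e-5, 1e-2, log=True)
    gamma = trial.suggest_float("gamma", 0.9, 0.9999)
    # In the real workflow we would train an agent with these values and
    # return its mean evaluation reward; here we return a dummy score.
    return -((learning_rate - 3e-4) ** 2) - (1.0 - gamma)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)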
Unit 4: Policy Gradient with PyTorch
The policy-based method:
- The idea is to parameterize the policy and maximize the performance of the parameterized policy using gradient ascent (a small REINFORCE loss sketch follows after the pros and cons below).
- Policy-gradient methods are a subclass of policy-based methods.
 
The advantages and disadvantages of policy-gradient methods
Advantages:
- The simplicity of integration.
- Policy-gradient methods can learn a stochastic policy.
- Policy-gradient methods are more effective in high-dimensional action spaces and continuous action spaces.
- Policy-gradient methods have better convergence properties.
Disadvantages:
- They converge to a local maximum instead of a global optimum.
- Training is slow and sample-inefficient.
- They have high variance.
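To show what "maximize performance with gradient ascent" looks like in code, here is a rough PyTorch sketch of the REINFORCE loss (a Monte Carlo policy-gradient method). The tensors and their values are dummy assumptions for illustration; minimizing the negative objective with an optimizer is equivalent to gradient ascent on the objective itself.

import torch

def reinforce_loss(log_probs: torch.Tensor, returns: torch.Tensor) -> torch.Tensor:
    # log_probs: log pi_theta(a_t | s_t) for each step of an episode
    # returns:   discounted return G_t computed from each step onward
    return -(log_probs * returns).sum()

# Dummy 3-step episode
log_probs = torch.tensor([-0.5, -1.2, -0.2], requires_grad=True)
returns = torch.tensor([1.0, 0.9, 0.5])
loss = reinforce_loss(log_probs, returns)
loss.backward()     # gradients flow into the policy parameters via log_probs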
 
Unit 5: Introduction to Unity ML-Agents
Unity ML-Agents is a toolkit for the Unity game engine that lets us create custom environments to train agents.
Unit 6: Actor Critic Methods with Robotics Environments
Policy-gradient methods estimate the weights of the optimal policy using gradient ascent, which means they follow the direction of steepest increase in return.
Actor-Critic methods
A hybrid architecture combining value-based and policy-based methods to stabilize training by reducing the variance (a minimal Stable-Baselines3 example follows below):
- An actor that controls how the agent behaves.
- A critic that measures how good the taken action is.

So there are two function approximations:
- A policy that controls how our agent acts: $\pi_\theta(s)$.
- A value function that assists the policy update by measuring how good the taken action is: $q_w(s,a)$.
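Since the course trains these agents with Stable-Baselines3, here is a minimal sketch using its A2C implementation (an actor-critic algorithm). The environment id is a placeholder rather than the course's robotics environments, and the hyperparameters are left at their defaults.

import gym
from stable_baselines3 import A2C
from stable_baselines3.common.evaluation import evaluate_policy

# A2C: the policy network is the actor, the value network is the critic.
env = gym.make("CartPole-v1")               # placeholder environment
model = A2C("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=10_000)         # short run, just to illustrate the API
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10)
print(f"mean_reward={mean_reward:.2f} +/- {std_reward:.2f}")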
 
Unit 7: Introduction to Multi-Agents And AI vs AI
Decentralized approach:
- Treat all agents independently, without considering the existence of the other agents.
- Each agent considers the other agents as part of the environment.
- No guarantee of convergence.

Centralized approach:
- A single policy is learned from the experience of all agents.
- The policy takes the present state of the environment as input and outputs joint actions.
- The reward is global.
 
Unit 8: Proximal Policy Optimization with Doom
Proximal Policy Optimization (PPO): An architecture that improves the agent’s training stability by avoiding large policy updates. Two reasons why we avoid large policy updates:
- Smaller updates are more likely to converge to an optimal solution.
- Too big a step can make the policy fall "off the cliff", and it can take a long time to recover.
 
Clipped Surrogate Objective Function
\(L^{CLIP}(\theta)=\mathbb{E}_t\left[\min\left(r_t(\theta)A_t,\ \mathrm{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)A_t\right)\right]\) where $r_t(\theta)$ denotes the probability ratio between the current and the old policy, and $A_t$ is the advantage estimate at timestep $t$.
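A rough PyTorch sketch of this clipped surrogate objective, written as a loss to minimize; the tensors and the clip range are illustrative assumptions, not the exact values used in the course.

import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps: float = 0.2):
    # r_t(theta) = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t), computed in log space.
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Take the pessimistic (minimum) objective and negate it to get a loss.
    return -torch.min(unclipped, clipped).mean()

# Dummy example with 4 timesteps.
new_log_probs = torch.tensor([-0.9, -1.1, -0.4, -2.0])
old_log_probs = torch.tensor([-1.0, -1.0, -0.5, -1.5])
advantages = torch.tensor([0.5, -0.2, 1.0, 0.3])
print(ppo_clip_loss(new_log_probs, old_log_probs, advantages))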
Bonus Unit 3: Advanced Topics In Reinforcement Learning
- Model-based Reinforcement Learning: learning a model of the environment, and then leveraging that model for control (making decisions).
- Offline vs. Online Reinforcement Learning.
- Reinforcement Learning from Human Feedback (RLHF): a methodology for integrating human data labels into an RL-based optimization process. It is motivated by the challenge of modeling human preferences. Useful link: https://huggingface.co/blog/rlhf
- Decision Transformers: instead of training a policy with RL methods (such as fitting a value function) that tell us what action to take to maximize the return (cumulative reward), we use a sequence modeling algorithm (a Transformer) that, given a desired return, past states, and actions, generates the future actions needed to achieve this desired return.
- Language models in RL.
- Curriculum Learning for RL.