I use this blog to record my learning notes from the Deep Reinforcement Learning course by Hugging Face.
Unit 1: Introduction to Deep Reinforcement Learning
Concepts:
- Reinforcement Learning is a computational approach to learning from actions. An agent learns by interacting with its environment through trial and error, receiving rewards (positive or negative) as feedback.
- The objective is to maximize the expected cumulative reward.
- The RL process is a loop of state, action, reward, and next state (see the sketch after this list).
- Rewards can be discounted: earlier rewards are more probable and predictable than long-term future rewards.
- To solve an RL problem, we need to find the optimal policy, which decides the action to take given a state, for example the policy that maximizes the expected return. There are two ways to find it:
  - Policy-based methods: train the policy directly.
  - Value-based methods: train a value function and derive the policy from it.
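To make the state/action/reward/next-state loop concrete, here is a minimal sketch of the agent-environment interaction in gym. It assumes the classic gym API (reset() returns only the observation and step() returns a 4-tuple; newer gym/gymnasium releases return (obs, info) and a 5-tuple), and the random agent is just a placeholder for a learned policy.

import gym

env = gym.make("LunarLander-v2")
state = env.reset()
done = False
total_reward = 0.0
while not done:
    action = env.action_space.sample()                  # placeholder policy: random action
    next_state, reward, done, info = env.step(action)   # environment returns reward and next state
    total_reward += reward                               # cumulative (undiscounted) reward
    state = next_state
print(f"Episode return: {total_reward:.2f}")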
 
 
Related Glossary:
- Markov Property: the agent needs only the current state to decide what action to take, not the whole history of previous states and actions.
- Observations: partial description of the state of the environment.
- State: complete description of the state of the world.
- Actions: the action space can be discrete or continuous.
- Tasks: episodic or continuing.
 
Solution:
from pyvirtualdisplay import Display
virtual_display = Display(visible=0, size=(1400, 900))
virtual_display.start()
import gym
from huggingface_sb3 import load_from_hub, package_to_hub, push_to_hub
from huggingface_hub import notebook_login # To log to our Hugging Face account to be able to upload models to the Hub.
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.env_util import make_vec_env
# Create the environment
env = make_vec_env('LunarLander-v2', n_envs=16)
# We added some parameters to accelerate the training
model = PPO(
    policy = 'MlpPolicy',
    env = env,
    n_steps = 1024,
    batch_size = 64,
    n_epochs = 4,
    gamma = 0.999,
    gae_lambda = 0.98,
    ent_coef = 0.01,
    verbose=1)
# Train it for 1,000,000 timesteps
model.learn(total_timesteps=1000000, progress_bar=True, log_interval=100000)
# Save the model
model_name = "LunarLander-v2"
model.save(model_name)
eval_env = gym.make("LunarLander-v2")
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10, deterministic=True)
print(f"mean_reward={mean_reward:.2f} +/- {std_reward}")
import gym
from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3.common.env_util import make_vec_env
from huggingface_sb3 import package_to_hub
## repo_id is the id of the model repository on the Hugging Face Hub (repo_id = {organization}/{repo_name}, for instance ThomasSimonini/ppo-LunarLander-v2)
repo_id = ""  # TODO: replace with your own repo id
# TODO: Define the name of the environment
env_id = "LunarLander-v2"
# Create the evaluation env
eval_env = DummyVecEnv([lambda: gym.make(env_id)])
# TODO: Define the model architecture we used
model_architecture = "PPO"
## TODO: Define the commit message
commit_message = "initialization"
# method save, evaluate, generate a model card and record a replay video of your agent before pushing the repo to the hub
package_to_hub(model=model, # Our trained model
               model_name=model_name, # The name of our trained model
               model_architecture=model_architecture, # The model architecture we used: in our case PPO
               env_id=env_id, # Name of the environment
               eval_env=eval_env, # Evaluation Environment
               repo_id=repo_id, # id of the model repository from the Hugging Face Hub (repo_id = {organization}/{repo_name}, for instance ThomasSimonini/ppo-LunarLander-v2)
               commit_message=commit_message)
# Note: if running the package_to_hub function gives a rebasing issue, run the following commands:
# cd <path_to_repo> && git add . && git commit -m "Add message" && git pull
# And don't forget to "git push" at the end to push the changes to the Hub.
Unit 2: Introduction to Q-Learning
To solve an RL problem, we need a policy. There are two ways to obtain it:
- Policy-based: train the policy directly.
- Value-based: find an optimal value function and use it to derive the policy.
- Most of the time, an Epsilon-Greedy Policy is used to handle the exploration/exploitation trade-off.

There are two kinds of value functions:
- The state-value function.
- The action-value function.

The value of a state is the expected return the agent gets when starting from that state. One way to simplify value estimation is the Bellman Equation (it reminds me of the Bellman-Ford algorithm). The main idea is that the value of a state equals the immediate reward plus the discounted value of the state that follows, as written in the equation below.
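In my notation, the Bellman equation for the state-value function under a policy $\pi$ (with discount factor $\gamma$) can be written as:

\(V_\pi(s) = \mathbb{E}_\pi\left[R_{t+1} + \gamma V_\pi(S_{t+1}) \mid S_t = s\right]\)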
        
 
Two learning strategies:
- Monte Carlo: run a whole episode (a sequence of state, action, reward, ...), calculate the return at the end of the episode, and use it as the target for updating.
- Temporal Difference Learning: update the value function after a single step, using the immediate reward plus the discounted value of the next state as the target.
 
Q-Learning
An off-policy value-based method that uses a Temporal Difference approach to train its action-value function. The epsilon-greedy policy: instead of always taking the locally optimal (greedy) action, take a random action with probability epsilon and the greedy action otherwise.
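Below is a minimal sketch of tabular Q-learning with an epsilon-greedy policy. It assumes a small discrete environment (FrozenLake-v1 from gym, with the classic 4-tuple step API), and the hyperparameters alpha, gamma and epsilon are illustrative choices rather than the course's exact values.

import numpy as np
import gym

env = gym.make("FrozenLake-v1")
q_table = np.zeros((env.observation_space.n, env.action_space.n))
alpha, gamma, epsilon = 0.1, 0.99, 0.1      # illustrative hyperparameters

for episode in range(5000):
    state = env.reset()
    done = False
    while not done:
        # Epsilon-greedy: explore with probability epsilon, otherwise exploit.
        if np.random.rand() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(q_table[state]))
        next_state, reward, done, info = env.step(action)
        # TD update: move Q(s, a) toward the target r + gamma * max_a' Q(s', a').
        td_target = reward + gamma * np.max(q_table[next_state]) * (not done)
        q_table[state, action] += alpha * (td_target - q_table[state, action])
        state = next_state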
Unit 3: Deep Q-Learning with Atari Games 👾 using RL Baselines3 Zoo
Use a neural network that takes a state and approximates the Q-values for each possible action.
The Deep Q-Network: DQN
Input: a stack of several game frames. Output: a vector of Q-values, one for each possible action at that state (a rough sketch of such a network is given at the end of this unit).
The Deep Q Algorithm
It uses a deep neural network to approximate the different Q-values for each possible action at a state (value-function estimation). It has two phases:
- Sampling: store the observed experience tuples in a replay memory.
- Training: select a small batch of tuples randomly and learn from this batch using a gradient descent update step.

One main problem of Deep Q-Learning is instability. There are three solutions:
- Experience Replay, to make more efficient use of experiences:
  - Make more efficient use of the experiences during training.
  - Avoid forgetting previous experiences and reduce the correlation between experiences.
- Fixed Q-Target, to stabilize the training:
  - Use a separate network with fixed parameters for estimating the TD target.
  - Copy the parameters from our Deep Q-Network every C steps to update the target network.
- Double Deep Q-Learning, to handle the problem of the overestimation of Q-values:
  - Use the DQN network to select the best action to take at the next state.
  - Use the target network to calculate the target Q-value of taking that action at the next state.
 
Bonus Unit 2: Automatic Hyperparameter Tuning with Optuna
Optuna is a library to search for the best hyperparameters automatically.
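A minimal sketch of how Optuna is typically used: define an objective that samples hyperparameters from the trial, trains and evaluates something with them, and returns a score to maximize. The objective below is a toy function standing in for training an agent and returning its mean reward; the parameter names and ranges are illustrative.

import optuna

def objective(trial):
    # Sample candidate hyperparameters (names and ranges are illustrative).
    learning_rate = trial.suggest_float("learning_rate", 1e-5, 1e-2, log=True)
    gamma = trial.suggest_float("gamma", 0.9, 0.9999)
    # In the real workflow we would train an agent with these values and
    # return its mean evaluation reward; here we return a dummy score.
    return -((learning_rate - 3e-4) ** 2) - (1.0 - gamma)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)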
Unit 4: Policy Gradient with PyTorch
The policy-based method:
- The idea is to parameterize the policy and maximize the performance of the parameterized policy using gradient ascent (a small REINFORCE loss sketch follows after the pros and cons below).
- Policy-gradient methods are a subclass of policy-based methods.
 
The advantages and disadvantages of policy-gradient methods
Advantages:
- The simplicity of integration.
- Policy-gradient methods can learn a stochastic policy.
- Policy-gradient methods are more effective in high-dimensional action spaces and continuous action spaces.
- Policy-gradient methods have better convergence properties.
Disadvantages:
- They converge to a local maximum instead of a global optimum.
- Training is slow and sample-inefficient.
- They have high variance.
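To show what "maximize performance with gradient ascent" looks like in code, here is a rough PyTorch sketch of the REINFORCE loss (a Monte Carlo policy-gradient method). The tensors and their values are dummy assumptions for illustration; minimizing the negative objective with an optimizer is equivalent to gradient ascent on the objective itself.

import torch

def reinforce_loss(log_probs: torch.Tensor, returns: torch.Tensor) -> torch.Tensor:
    # log_probs: log pi_theta(a_t | s_t) for each step of an episode
    # returns:   discounted return G_t computed from each step onward
    return -(log_probs * returns).sum()

# Dummy 3-step episode
log_probs = torch.tensor([-0.5, -1.2, -0.2], requires_grad=True)
returns = torch.tensor([1.0, 0.9, 0.5])
loss = reinforce_loss(log_probs, returns)
loss.backward()     # gradients flow into the policy parameters via log_probs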
 
Unit 5: Introduction to Unity ML-Agents
Unity ML-Agents is a toolkit for the Unity game engine that lets us create custom environments to train agents.
Unit 6: Actor Critic Methods with Robotics Environments
Policy-gradient methods estimate the weights of the optimal policy using gradient ascent, which means they follow the direction of steepest increase in return.
Actor-Critic methods
A hybrid architecture combining value-based and policy-based methods to stabilize training by reducing the variance (a minimal Stable-Baselines3 example follows below):
- An actor that controls how the agent behaves.
- A critic that measures how good the taken action is.

So there are two function approximations:
- A policy that controls how our agent acts: $\pi_\theta(s)$.
- A value function that assists the policy update by measuring how good the taken action is: $q_w(s,a)$.
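Since the course trains these agents with Stable-Baselines3, here is a minimal sketch using its A2C implementation (an actor-critic algorithm). The environment id is a placeholder rather than the course's robotics environments, and the hyperparameters are left at their defaults.

import gym
from stable_baselines3 import A2C
from stable_baselines3.common.evaluation import evaluate_policy

# A2C: the policy network is the actor, the value network is the critic.
env = gym.make("CartPole-v1")               # placeholder environment
model = A2C("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=10_000)         # short run, just to illustrate the API
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10)
print(f"mean_reward={mean_reward:.2f} +/- {std_reward:.2f}")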
 
Unit 7: Introduction to Multi-Agents And AI vs AI
Decentralized approach:
- Treat all agents independently, without considering the existence of the other agents.
- Each agent considers the other agents as part of the environment.
- No guarantee of convergence.

Centralized approach:
- A single policy is learned from the experience of all agents.
- The policy takes the present state of the environment as input and outputs joint actions.
- The reward is global.
 
Unit 8: Proximal Policy Optimization with Doom
Proximal Policy Optimization (PPO): An architecture that improves the agent’s training stability by avoiding large policy updates. Two reasons why we avoid large policy updates:
- Smaller updates are more likely to converge to an optimal solution.
- Too big a step can make the policy fall "off the cliff", and it can take a long time to recover.
 
Clipped Surrogate Objective Function
\(L^{CLIP}(\theta)=\mathbb{E}_t\left[\min\left(r_t(\theta)A_t,\ \mathrm{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)A_t\right)\right]\) where $r_t(\theta)$ denotes the probability ratio between the current and the old policy, and $A_t$ is the advantage estimate at timestep $t$.
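A rough PyTorch sketch of this clipped surrogate objective, written as a loss to minimize; the tensors and the clip range are illustrative assumptions, not the exact values used in the course.

import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps: float = 0.2):
    # r_t(theta) = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t), computed in log space.
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Take the pessimistic (minimum) objective and negate it to get a loss.
    return -torch.min(unclipped, clipped).mean()

# Dummy example with 4 timesteps.
new_log_probs = torch.tensor([-0.9, -1.1, -0.4, -2.0])
old_log_probs = torch.tensor([-1.0, -1.0, -0.5, -1.5])
advantages = torch.tensor([0.5, -0.2, 1.0, 0.3])
print(ppo_clip_loss(new_log_probs, old_log_probs, advantages))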
Bonus Unit 3: Advanced Topics In Reinforcement Learning
- Model-based Reinforcement Learning: learning a model of the environment, and then leveraging that model for control (making decisions).
- Offline vs. Online Reinforcement Learning.
- Reinforcement Learning from Human Feedback (RLHF): a methodology for integrating human data labels into an RL-based optimization process. It is motivated by the challenge of modeling human preferences. Useful link: https://huggingface.co/blog/rlhf
- Decision Transformers: instead of training a policy with RL methods (such as fitting a value function) that tell us what action to take to maximize the return (cumulative reward), we use a sequence modeling algorithm (a Transformer) that, given a desired return, past states, and actions, generates the future actions needed to achieve this desired return.
- Language models in RL.
- Curriculum Learning for RL.