
Reinforcement Learning Course from Hugging Face

I use this blog to record my notes while working through the reinforcement learning course from Hugging Face.

Unit 1: Introduction to Deep Reinforcement Learning

Concepts:

  • Reinforcement Learning is a computational approach to learning from actions. An agent learns by interacting with its environment through trial and error, receiving rewards (positive or negative) as feedback.
  • The objective is to maximize the expected cumulative reward.
  • The RL process is a sequence of state, action, reward, and next state.
  • Rewards can be discounted: early rewards are more probable and predictable than long-term future rewards (see the short sketch after this list).
  • Solving an RL problem requires finding the optimal policy, which decides the action to take given a state, for example the one that maximizes the expected return.
    • Policy-based methods: Train the policy directly.
    • Value-based methods: Train a value function from which the policy is derived.
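
As a quick illustration of discounting, here is a minimal sketch of computing the discounted return $G_t = \sum_k \gamma^k r_{t+k+1}$ (the reward sequence and the gamma value below are made up for the example):

# Hypothetical reward sequence and discount factor, for illustration only.
rewards = [1.0, 0.0, 2.0, 3.0]
gamma = 0.99

# Discounted return: G_t = r_{t+1} + gamma * r_{t+2} + gamma^2 * r_{t+3} + ...
discounted_return = sum(gamma ** k * r for k, r in enumerate(rewards))
print(discounted_return)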

Related Glossary:

  • Markov Property: the agent needs only the current state to decide which action to take, not the full history of past states and actions.
  • Observations: Partial description of the state of the environment.
  • State: Complete description of the state of the world.
  • Actions: Discrete/Continuous Actions.
  • Tasks: Episodic/Continuous.

Solution:

# Virtual display so the environment can render in a headless session (e.g. Colab)
from pyvirtualdisplay import Display
virtual_display = Display(visible=0, size=(1400, 900))
virtual_display.start()

import gym
from huggingface_sb3 import load_from_hub, package_to_hub, push_to_hub
from huggingface_hub import notebook_login # To log to our Hugging Face account to be able to upload models to the Hub.
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.env_util import make_vec_env

# Create a vectorized environment with 16 parallel environments
env = make_vec_env('LunarLander-v2', n_envs=16)
# We added some parameters to accelerate the training
model = PPO(
    policy = 'MlpPolicy',
    env = env,
    n_steps = 1024,
    batch_size = 64,
    n_epochs = 4,
    gamma = 0.999,
    gae_lambda = 0.98,
    ent_coef = 0.01,
    verbose=1)
# Train it for 1,000,000 timesteps
model.learn(total_timesteps=1000000, progress_bar=True, log_interval=100000)
# Save the model
model_name = "LunarLander-v2"
model.save(model_name)

eval_env = gym.make("LunarLander-v2")
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10, deterministic=True)
print(f"mean_reward={mean_reward:.2f} +/- {std_reward}")

import gym
from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3.common.env_util import make_vec_env

from huggingface_sb3 import package_to_hub



## repo_id is the id of the model repository from the Hugging Face Hub (repo_id = {organization}/{repo_name}, for instance ThomasSimonini/ppo-LunarLander-v2)
repo_id = "" # TODO: replace with your own repo id

# TODO: Define the name of the environment
env_id = "LunarLander-v2"

# Create the evaluation env
eval_env = DummyVecEnv([lambda: gym.make(env_id)])


# TODO: Define the model architecture we used
model_architecture = "PPO"

## TODO: Define the commit message
commit_message = "initialization"

# This method saves and evaluates the model, generates a model card and records a replay video of your agent before pushing the repo to the Hub
package_to_hub(model=model, # Our trained model
               model_name=model_name, # The name of our trained model
               model_architecture=model_architecture, # The model architecture we used: in our case PPO
               env_id=env_id, # Name of the environment
               eval_env=eval_env, # Evaluation Environment
               repo_id=repo_id, # id of the model repository from the Hugging Face Hub (repo_id = {organization}/{repo_name} for instance ThomasSimonini/ppo-LunarLander-v2
               commit_message=commit_message)

# Note: if the package_to_hub function raises a rebasing issue, please run the following commands
# cd <path_to_repo> && git add . && git commit -m "Add message" && git pull
# And don't forget to do a "git push" at the end to push the change to the hub.

Unit 2: Introduction to Q-Learning

To solve an RL problem, a policy is required. There are two ways to obtain it:

  • Policy-based: training the policy directly.
  • Value-based: find an optimal value function and derive the policy from it.
    • There are two value functions: the state-value function and the action-value function.
    • Most of the time, an Epsilon-Greedy Policy is used to handle the exploration/exploitation trade-off.
    • The value of a state is the expected return the agent can get starting from that state. One way to simplify the value estimation is the Bellman Equation (it is similar to the Bellman-Ford algorithm to me). The main idea is to compute the value as the sum of the immediate reward plus the discounted value of the state that follows, written out below.
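
Written out with the standard notation for the state-value function (just restating the idea above, nothing new):

\(V_\pi(s) = E_\pi[R_{t+1} + \gamma V_\pi(S_{t+1}) \mid S_t = s]\)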

Two learning strategies:

  • Monte Carlo: run a whole episode (a sequence of states, actions, rewards, ...), compute the return at the end of the episode, and use it as the target for the update.
  • Temporal Difference Learning: update the value function after every single step, without waiting for the end of the episode (the two targets are compared below).
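
Side by side, the two update rules for the state-value function (\(\alpha\) is the learning rate):

  • Monte Carlo: \(V(S_t) \leftarrow V(S_t) + \alpha\,[G_t - V(S_t)]\), where \(G_t\) is the return computed at the end of the episode.
  • TD(0): \(V(S_t) \leftarrow V(S_t) + \alpha\,[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)]\), which only needs a single step.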

Q-Learning

Q-Learning is an off-policy value-based method that uses a Temporal Difference approach to train its action-value function. The epsilon-greedy policy: instead of always taking the (locally optimal) greedy action, take a random action with probability epsilon and the greedy action otherwise. A minimal sketch of the training loop follows.
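
A minimal tabular sketch of this loop, assuming the classic gym API (reset() returns the observation, step() returns a 4-tuple; newer Gymnasium versions differ) and a small discrete environment. FrozenLake and the hyperparameter values are only illustrative:

import numpy as np
import gym

env = gym.make("FrozenLake-v1")
q_table = np.zeros((env.observation_space.n, env.action_space.n))
alpha, gamma, epsilon = 0.1, 0.99, 0.1  # illustrative values, not tuned

for episode in range(5000):
    state = env.reset()
    done = False
    while not done:
        # Epsilon-greedy: explore with probability epsilon, otherwise act greedily.
        if np.random.rand() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(q_table[state]))
        next_state, reward, done, info = env.step(action)
        # TD update: move Q(s, a) toward the target r + gamma * max_a' Q(s', a').
        td_target = reward + gamma * np.max(q_table[next_state]) * (not done)
        q_table[state, action] += alpha * (td_target - q_table[state, action])
        state = next_state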

Unit 3: Deep Q-Learning with Atari Games 👾 using RL Baselines3 Zoo

Use Neural Network to take a state and approximate Q-values.

The Deep Q-Network: DQN

Input: a stack of several game frames. Output: a vector of Q-values, one for each possible action at that state.

The Deep Q Algorithm

It uses a deep neural network to approximate the different Q-values for each possible action at a state (value-function estimation). It has two phases:

  • Sampling: Store the observed experience tuples in a replay memory.
  • Training: Select a small batch of tuples randomly and learn from this batch using a gradient descent update step.

One main problem of Deep Q-Learning is instability. There are three solutions (a sketch of how the Fixed Q-Target and Double DQN ideas combine follows this list):

  • Experience Replay, to make more efficient use of experiences.
    • Make more efficient use of the experiences during training.
    • Avoid forgetting previous experiences and reduce the correlation between experiences.
  • Fixed Q-Target, to stabilize the training.
    • Use a separate network with fixed parameters for estimating the TD target.
    • Copy the parameters from our Deep Q-Network to the target network every C steps.
  • Double Deep Q-Learning, to handle the problem of the overestimation of Q-values.
    • Use the DQN network to select the best action to take at the next state.
    • Use the target network to calculate the target Q-value of taking that action at the next state.
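
As a rough sketch of how the fixed target network and the Double DQN target fit together (PyTorch-style pseudocode; rewards, next_states and dones are assumed to be tensors sampled from the replay memory, and q_network / target_network are two networks with the same architecture):

import torch

def double_dqn_targets(rewards, next_states, dones, q_network, target_network, gamma=0.99):
    # The online network selects the best action for the next state...
    next_actions = q_network(next_states).argmax(dim=1, keepdim=True)
    # ...while the fixed target network evaluates that action.
    next_q = target_network(next_states).gather(1, next_actions).squeeze(1)
    # TD target: immediate reward plus discounted next-state value (zero if the episode ended).
    return rewards + gamma * next_q * (1 - dones)

# Every C steps, the target network is refreshed from the online network:
# target_network.load_state_dict(q_network.state_dict())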

Bonus Unit 2: Automatic Hyperparameter Tuning with Optuna

Optuna is a library to search for the best hyperparameters automatically.
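
A minimal sketch of how such a search could look with Stable-Baselines3 (the objective simply trains a short PPO run with sampled hyperparameters and returns the mean evaluation reward; the search space, budgets and environment choice are illustrative, not taken from the course notebook):

import gym
import optuna
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

def objective(trial):
    # Sample candidate hyperparameters for this trial.
    gamma = trial.suggest_float("gamma", 0.9, 0.9999, log=True)
    learning_rate = trial.suggest_float("learning_rate", 1e-5, 1e-3, log=True)

    env = gym.make("LunarLander-v2")
    model = PPO("MlpPolicy", env, gamma=gamma, learning_rate=learning_rate, verbose=0)
    model.learn(total_timesteps=50_000)  # short budget, for illustration only

    mean_reward, _ = evaluate_policy(model, env, n_eval_episodes=5)
    return mean_reward

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print(study.best_params)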

Unit 4: Policy Gradient with PyTorch

The policy-based method:

  • The idea is to parameterize the policy and maximize the performance of the parameterized policy using gradient ascent (see the sketch after this list).
  • The policy-gradient method is a subclass of the policy-based method.
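
A minimal PyTorch sketch of this gradient-ascent idea (a generic REINFORCE-style loss rather than the exact notebook code; log_probs and returns are assumed to have been collected over one episode):

import torch

def reinforce_loss(log_probs, returns):
    # log_probs: list of log pi_theta(a_t | s_t) tensors, one per step
    # returns:   list of discounted returns G_t, one per step
    # Gradient ascent on the expected return == gradient descent on the negated objective.
    return -(torch.stack(log_probs) * torch.as_tensor(returns)).sum()

# Typical usage, with an optimizer over the policy network's parameters:
# loss = reinforce_loss(log_probs, returns)
# optimizer.zero_grad(); loss.backward(); optimizer.step()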

The advantages and disadvantages of policy-gradient methods

Advantages:

  • The simplicity of integration.
  • Policy-gradient methods can learn a stochastic policy.
  • Policy-gradient methods are more effective in high-dimensional action spaces and continuous action spaces.
  • Policy-gradient methods have better convergence properties.

Disadvantages:

  • They converge to a local maximum instead of the global optimum.
  • Training is slow and sample-inefficient.
  • They have high variance.

Unit 5: Introduction to Unity ML-Agents

Unity ML-Agents is a toolkit for the Unity game engine that lets you create environments to train agents in.

Unit 6: Actor Critic Methods with Robotics Environments

Policy-gradient methods estimate the weights of the optimal policy using gradient ascent, i.e. they follow the direction of the steepest increase in return.

Actor-Critic methods

A hybrid architecture combining value-based and policy-based methods to stabilize the training by reducing the variance.

  • An actor that controls how the agent behaves.
  • A critic that measures how good the taken action is.

So there are two function approximations (their update rules are sketched below):

  • A policy that controls how our agent acts: $\pi_\theta(s)$.
  • A value function that assists the policy update by measuring how good the taken action is: $q_w(s,a)$.
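
Roughly restating how the two approximations are trained in the actor-critic setup (\(\alpha\) and \(\beta\) are the two learning rates):

  • Critic update (TD error on the action-value function): \(w \leftarrow w + \beta\,\big(r_{t+1} + \gamma\, q_w(s_{t+1}, a_{t+1}) - q_w(s_t, a_t)\big)\,\nabla_w q_w(s_t, a_t)\)
  • Actor update (policy gradient weighted by the critic's estimate): \(\theta \leftarrow \theta + \alpha\,\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, q_w(s_t, a_t)\)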

Unit 7: Introduction to Multi-Agents And AI vs AI

Decentralized approach:

  • Treat each agent independently, without considering the existence of the other agents.
  • Each agent considers the other agents as part of the environment.
  • No guarantee of convergence.

Centralized approach:

  • A single policy is learned from all the agents.
  • The policy takes the current state of the environment as input and outputs joint actions.
  • The reward is global.

Unit 8: Proximal Policy Optimization with Doom

Proximal Policy Optimization (PPO): An architecture that improves the agent’s training stability by avoiding large policy updates. Two reasons to avoid large policy updates:

  • Smaller updates are more likely to converge to an optimal solution.
  • A step that is too big can send the policy "off the cliff", and recovering can take a long time.

Clipped Surrogate Objective Function

\(L^{CLIP}(\theta)=E_t[\min(r_t(\theta)A_t,\ \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)A_t)]\) where $r_t(\theta)$ denotes the probability ratio between the current and the old policy.
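
A small PyTorch sketch of that objective (a generic illustration, not the course's implementation; log_probs, old_log_probs and advantages are assumed to be precomputed 1-D tensors):

import torch

def clipped_surrogate_loss(log_probs, old_log_probs, advantages, clip_epsilon=0.2):
    # Probability ratio r_t(theta) = pi_theta(a_t | s_t) / pi_theta_old(a_t | s_t)
    ratio = torch.exp(log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_epsilon, 1 + clip_epsilon) * advantages
    # Negate because optimizers minimize: minimizing this loss maximizes the surrogate objective.
    return -torch.min(unclipped, clipped).mean()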

Bonus Unit 3: Advanced Topics In Reinforcement Learning

  1. Model-based Reinforcement Learning: learning a model of the environment, and then leveraging that model for control (making decisions).

  2. Offline vs Online Reinforcement Learning

  3. Reinforcement Learning from Human Feedback (RLHF): a methodology for integrating human data labels into an RL-based optimization process. It is motivated by the challenge of modeling human preferences. Useful link: https://huggingface.co/blog/rlhf
  4. Decision Transformers: instead of training a policy using RL methods, such as fitting a value function, that will tell us what action to take to maximize the return (cumulative reward), we use a sequence modeling algorithm (a Transformer) that, given a desired return, past states, and actions, will generate future actions to achieve this desired return.
  5. Language models in RL.
  6. Curriculum Learning for RL.
This post is licensed under CC BY 4.0 by the author.
