Fixing Reinforcement Learning

PufferLib is a library for sane and simple reinforcement learning. Its key features are fast vectorization, an emulation layer that makes complex environments easy to work with, first-party high-performance environments (Ocean), and clean training code based on CleanRL.

This tutorial contains everything you need to start doing RL 100x faster than most of the field. We also highly encourage you to read the PufferLib source. It's not like other RL libraries: PufferLib does what it does in a few thousand lines of very simple code. It is also much easier to get help when you're stuck. Our Discord community is active and helpful.

Installation

CONDA USERS: Be warned, Conda's C compiler is fundamentally broken. We have hacked around it so Ocean envs will at least run, but most will be several times slower than if you install without Conda.

PufferTank (Recommended)

PufferTank is a prebuilt GPU Docker image with PufferLib and dependencies for all environments in the registry, including some that are slow and tricky to install. If you have not used containers before and just want everything to work, clone the repository and open it in VSCode. You will need to install the Dev Container plugin as well as Docker Desktop. VSCode will then detect the settings in .devcontainer and set up the container for you. Neovim (btw) is also included.

git clone https://github.com/pufferai/puffertank
cd puffertank
bash docker.sh test # Run image. Downloads ~12GB on the first run.

Pip Installation (core only, no training demo)

pip install pufferlib
pip install pufferlib[cleanrl,atari] # Extras for this tutorial

Source Installation

git clone https://github.com/pufferai/pufferlib
cd pufferlib
pip install -e .
pip install -e .[cleanrl,atari] # Extras for this tutorial

Training Demo

PufferLib includes a few different training scripts. You will notice that they are in the base repository, not in the pip package. This is because they are based on CleanRL, which is not meant to be packaged. Rather, it is a white-box library that is meant to be copied and edited to suit your needs. Integrating PufferLib only requires changing a few lines:
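As a rough sketch (the exact lines in cleanrl_ppo_atari.py may differ), the main change is swapping the environment setup over to PufferLib's vectorization; the policy can then be wrapped with pufferlib.cleanrl.Policy, covered later in this tutorial, to keep CleanRL's get_action_and_value interface:

# Sketch of the CleanRL-side changes; see cleanrl_ppo_atari.py for the real diff
import pufferlib.vector
from pufferlib.environments import atari

# Replaces CleanRL's gym.vector.SyncVectorEnv(...) setup
envs = pufferlib.vector.make(
    atari.env_creator('breakout'),
    num_envs=8,
    backend=pufferlib.vector.Multiprocessing,
)

With those changes in place, run the ported script: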

python cleanrl_ppo_atari.py

This is already much faster than the original CleanRL code, but it is still several times slower than our main training demo. Some basics:

python demo.py
    --mode [train, eval, sweep-carbs]
    --env [env_name]
    --vec [serial, multiprocessing, native, ray]
    --track # Track on Weights and Biases. Set your username in demo.py
    --help # display a full list of options

# Get help. One important arg: --train.device cpu if you don't have a GPU
python demo.py --help

# Get help on a specific environment
python demo.py --help --env puffer_snake

# Train breakout with multiprocessing (24 cores):
python demo.py --mode train --env breakout --vec multiprocessing

# Run a hyperparameter sweep on Ocean pong. Requires carbs (github pufferai/carbs):
python demo.py --mode sweep-carbs --env puffer_pong

# Train Ocean snake with wandb logs:
python demo.py --env puffer_snake --mode train --track

# Set train and env params from cli:
python demo.py --env puffer_snake --mode train --train.learning-rate 0.001 --env.vision 3

# Eval a pretrained baseline model:
python demo.py --env puffer_snake --mode eval --baseline

# Eval an uninitialized policy:
python demo.py --env puffer_snake --mode eval

# Eval a local checkpoint:
python demo.py --env puffer_snake --mode eval --eval-model-path your_model.pt

# Useful for finding your latest checkpoint:
ls -lt experiments | head

Compared to the original CleanRL code, our demo file (which loads clean_pufferl.py) supports asynchronous on-policy vectorization, better multi-agent training, a convenient CLI dashboard, better WandB logging and sweeps integration, and more. It's only around 1000 lines of code, most of which is logging.

Vectorization

In RL, vectorization refers to simulating multiple copies of an environment in parallel. Our Multiprocessing backend is fast -- much faster than Gymnasium's in most cases. By our latest benchmark, Atari runs 50-60% faster synchronous and 5x faster async, and some environments like NetHack can be 10x faster even synchronous, with no API changes. Here's how to create a vectorized environment. Note to Mac users: your OS won't run subprocesses outside of a __main__ guard, as shown in the sketch below.
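A minimal sketch of that guard, using the Atari binding from the registry:

import pufferlib.vector
from pufferlib.environments import atari

def main():
    vecenv = pufferlib.vector.make(
        atari.env_creator('breakout'),
        num_envs=4,
        backend=pufferlib.vector.Multiprocessing,
    )
    vecenv.reset()
    vecenv.close()

# Subprocesses re-import this module on macOS/Windows (spawn), so the
# vectorization call must not run at import time
if __name__ == '__main__':
    main()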



from pufferlib.environments import atari
env_creator = atari.env_creator('breakout')

import pufferlib.vector
vecenv = pufferlib.vector.make(
    env_creator, # A callable (class or function) that returns an env
    env_args=None, # A list of arguments to pass to each environment
    env_kwargs=None, # A list of dictionary keyword arguments to pass to each environment
    backend=pufferlib.vector.Serial, # pufferlib.vector.[Serial|Multiprocessing|Native|Ray]
    num_envs=1, # The total number of environments to create
    # Plus extra backend-specific kwargs, e.g. num_workers, batch_size
)

# Make 4 copies of Breakout on the current process
vecenv = pufferlib.vector.make(env_creator, num_envs=4,
    backend=pufferlib.vector.Serial)

# Make 4 copies of Breakout, each on a separate process
vecenv = pufferlib.vector.make(env_creator, num_envs=4,
    backend=pufferlib.vector.Multiprocessing)

# Make 4 copies of Breakout, 2 on each of 2 processes
vecenv = pufferlib.vector.make(env_creator, num_envs=4,
    backend=pufferlib.vector.Multiprocessing, num_workers=2)

# Make 4 copies of Breakout, 2 on each of 2 processes,
# but only get two observations per step
vecenv = pufferlib.vector.make(env_creator, num_envs=4,
    backend=pufferlib.vector.Multiprocessing, num_workers=2,
    batch_size=2)

# Make 1024 instances of Ocean breakout on the current process
from pufferlib.ocean import Breakout
vecenv = pufferlib.vector.make(Breakout,
    backend=pufferlib.vector.Native,
    env_kwargs={'num_envs': 1024},
)

# Notice that Native envs handle multiple instances internally.
# You can still multiprocess/async, but don't make multiple external
# copies per process.
vecenv = pufferlib.vector.make(Breakout, num_envs=2,
    backend=pufferlib.vector.Multiprocessing, batch_size=1)

# Synchronous API - reset/step
import time
vecenv = pufferlib.vector.make(Breakout, num_envs=2,
    backend=pufferlib.vector.Multiprocessing)
vecenv.reset()
start, steps, TIMEOUT = time.time(), 0, 3
while time.time() - start < TIMEOUT:
    vecenv.step(vecenv.action_space.sample())
    steps += 1

vecenv.close()
print('Puffer FPS:', steps*vecenv.num_envs/TIMEOUT)

# Async API - async_reset, send/recv
# Call your model between recv() and send()
vecenv = pufferlib.vector.make(Breakout, num_envs=2,
    backend=pufferlib.vector.Multiprocessing, batch_size=1)
vecenv.async_reset()
start, steps, TIMEOUT = time.time(), 0, 3
while time.time() - start < TIMEOUT:
    vecenv.recv()
    vecenv.send(vecenv.action_space.sample())
    steps += 1

vecenv.close()
print('Puffer Async FPS:', steps*vecenv.num_envs/TIMEOUT)

Our vectorization works on almost any Gymnasium/PettingZoo environment, not just the ones we have bound manually. All you have to do is wrap your environment with our Emulation layer, covered in the next section. PufferLib outperforms other vectorization implementations through a number of low-level optimizations, including shared-memory buffers and asynchronous batching.

Emulation

Complex environments may have hierarchical observations and actions, variable numbers of agents, and other quirks that make them difficult to work with and incompatible with standard reinforcement learning libraries. PufferLib's emulation layer makes every environment look like it has flat observations/actions and a constant number of agents. Here's how it works with NetHack and Neural MMO, two notoriously complex environments.

pip install -e .[nethack,nmmo]
import pufferlib.emulation
import pufferlib.wrappers

import nle, nmmo

def nmmo_creator():
    env = nmmo.Env()
    env = pufferlib.wrappers.PettingZooTruncatedWrapper(env)
    return pufferlib.emulation.PettingZooPufferEnv(env=env)

def nethack_creator():
    env = nle.env.NLE()
    import shimmy # Gym versioning
    env = shimmy.GymV21CompatibilityV0(env=env)
    return pufferlib.emulation.GymnasiumPufferEnv(env=env)

env = nethack_creator()
obs, _ = env.reset()
structured_obs = obs.view(env.obs_dtype)
print('NetHack observation space:', structured_obs.dtype)
print('Packed shape:', obs.shape)

env = nmmo_creator()
obs, _ = env.reset()
structured_obs = obs[1].view(env.obs_dtype)
print('NMMO observation space:', structured_obs.dtype)
print('Packed shape:', obs[1].shape)

The wrappers give you back a Gymnasium/PettingZoo compliant environment. There is no loss of generality and no change to the underlying environment. You can wrap environments by class, creator function, or object, with or without additional arguments. These wrappers enable us to make some optimizations to vectorization code that would be difficult to implement otherwise.
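In practice, the emulated creator functions plug straight into the vectorizer. A quick sketch reusing nmmo_creator from above (assuming nmmo is installed):

import pufferlib.vector

# Emulated envs work with any backend, including Multiprocessing
vecenv = pufferlib.vector.make(
    nmmo_creator, # The creator function defined above
    num_envs=2,
    backend=pufferlib.vector.Multiprocessing,
)
obs, _ = vecenv.reset()
print('Batched obs shape:', obs.shape)
vecenv.close()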

Policies

You don't want another policy API, so we don't have one. Just write normal PyTorch code. We do provide a Default policy, shown below:

import numpy as np
import torch
from torch import nn

import pufferlib.pytorch
import pufferlib.spaces


class Default(nn.Module):
    '''Default PyTorch policy. Flattens obs and applies a linear layer.

    PufferLib is not a framework. It does not enforce a base class.
    You can use any PyTorch policy that returns actions and values.
    We structure our forward methods as encode_observations and decode_actions
    to make it easier to wrap policies with LSTMs. You can do that and use
    our LSTM wrapper or implement your own. To port an existing policy
    for use with our LSTM wrapper, simply put everything from forward() before
    the recurrent cell into encode_observations and put everything after
    into decode_actions.
    '''
    def __init__(self, env, hidden_size=128):
        super().__init__()
        self.hidden_size = hidden_size
        self.is_multidiscrete = isinstance(env.single_action_space,
                pufferlib.spaces.MultiDiscrete)
        self.is_continuous = isinstance(env.single_action_space,
                pufferlib.spaces.Box)
        try:
            self.is_dict_obs = isinstance(env.env.observation_space, pufferlib.spaces.Dict)
        except AttributeError:
            self.is_dict_obs = isinstance(env.observation_space, pufferlib.spaces.Dict)

        if self.is_dict_obs:
            self.dtype = pufferlib.pytorch.nativize_dtype(env.emulated)
            input_size = sum(np.prod(v.shape) for v in env.env.observation_space.values())
            self.encoder = nn.Linear(input_size, self.hidden_size)
        else:
            self.encoder = nn.Linear(np.prod(env.single_observation_space.shape), hidden_size)

        if self.is_multidiscrete:
            action_nvec = env.single_action_space.nvec
            self.decoder = nn.ModuleList([pufferlib.pytorch.layer_init(
                nn.Linear(hidden_size, n), std=0.01) for n in action_nvec])
        elif not self.is_continuous:
            self.decoder = pufferlib.pytorch.layer_init(
                nn.Linear(hidden_size, env.single_action_space.n), std=0.01)
        else:
            self.decoder_mean = pufferlib.pytorch.layer_init(
                nn.Linear(hidden_size, env.single_action_space.shape[0]), std=0.01)
            self.decoder_logstd = nn.Parameter(torch.zeros(
                1, env.single_action_space.shape[0]))

        self.value_head = nn.Linear(hidden_size, 1)

    def forward(self, observations):
        hidden, lookup = self.encode_observations(observations)
        actions, value = self.decode_actions(hidden, lookup)
        return actions, value

    def encode_observations(self, observations):
        '''Encodes a batch of observations into hidden states. Assumes
        no time dimension (handled by LSTM wrappers).'''
        batch_size = observations.shape[0]
        if self.is_dict_obs:
            observations = pufferlib.pytorch.nativize_tensor(observations, self.dtype)
            observations = torch.cat([v.view(batch_size, -1) for v in observations.values()], dim=1)
        else: 
            observations = observations.view(batch_size, -1)
        return torch.relu(self.encoder(observations.float())), None

    def decode_actions(self, hidden, lookup, concat=True):
        '''Decodes a batch of hidden states into (multi)discrete actions.
        Assumes no time dimension (handled by LSTM wrappers).'''
        value = self.value_head(hidden)
        if self.is_multidiscrete:
            actions = [dec(hidden) for dec in self.decoder]
            return actions, value
        elif self.is_continuous:
            mean = self.decoder_mean(hidden)
            logstd = self.decoder_logstd.expand_as(mean)
            std = torch.exp(logstd)
            probs = torch.distributions.Normal(mean, std)
            return probs, value

        actions = self.decoder(hidden)
        return actions, value

import pufferlib.vector
from pufferlib.ocean import Breakout
vecenv = pufferlib.vector.make(Breakout, backend=pufferlib.vector.Native)
policy = Default(vecenv.driver_env)
obs, _ = vecenv.reset()
obs = torch.as_tensor(obs)

# Use the PyTorch policy raw
actions, value = policy(obs)

# Use our LSTM compatibility layer
from pufferlib.models import LSTMWrapper
lstm_policy = LSTMWrapper(vecenv.driver_env, policy)
state = (
    torch.zeros(1, 1, lstm_policy.hidden_size),
    torch.zeros(1, 1, lstm_policy.hidden_size),
)
actions, value, state = lstm_policy(obs, state)

# Use our CleanRL API compatibility layer
import pufferlib.cleanrl
cleanrl_policy = pufferlib.cleanrl.Policy(policy)
actions = cleanrl_policy.get_action_and_value(obs)[0].numpy()

# Use our CleanRL LSTM API compatibility layer
cleanrl_lstm_policy = pufferlib.cleanrl.RecurrentPolicy(lstm_policy)
actions = cleanrl_lstm_policy.get_action_and_value(obs)[0].numpy()

obs, rewards, terminals, truncateds, infos = vecenv.step(actions)
vecenv.close()

Remember the obs.view(env.obs_dtype) unflatten operation from Emulation? Notice our usage of pufferlib.pytorch.nativize_dtype and pufferlib.pytorch.nativize_tensor to unpack structured data in the forward pass. You only need to worry about this if your environment has structured observation data.

Puffer Ocean: First Party Envs & API

Gymnasium and PettingZoo are great, but they impose fundamental performance limitations that cap environments far below 1M steps/second.

PufferLib ships with Ocean, our first-party suite of high-performance environments. They are written with our native PufferEnv API, which eliminates traditional bottlenecks. For debugging new code, Ocean also includes specially designed sanity environments that train in seconds and will let you catch 90% of implementation bugs.

PufferEnv is a vector format that easily scales to several million steps per second. It is very similar to Gymnasium's VectorEnv, but with some elements preserved from PettingZoo to better support multi-agent envs without inconveniencing single-agent use. The Python binding is included below. The full source is available in ocean/squared.

'''A simple sample environment. Use this as a template for your own envs.'''

import gymnasium
import numpy as np

import pufferlib
from pufferlib.ocean.squared.cy_squared import CySquared


class Squared(pufferlib.PufferEnv):
    def __init__(self, num_envs=1, render_mode=None, size=11, buf=None):
        self.single_observation_space = gymnasium.spaces.Box(low=0, high=1,
            shape=(size*size,), dtype=np.uint8)
        self.single_action_space = gymnasium.spaces.Discrete(5)
        self.render_mode = render_mode
        self.num_agents = num_envs

        super().__init__(buf)
        self.c_envs = CySquared(self.observations, self.actions,
            self.rewards, self.terminals, num_envs, size)
 
    def reset(self, seed=None):
        self.c_envs.reset()
        return self.observations, []

    def step(self, actions):
        self.actions[:] = actions
        self.c_envs.step()

        episode_returns = self.rewards[self.terminals]

        info = []
        if len(episode_returns) > 0:
            info = [{
                'reward': np.mean(episode_returns),
            }]

        return (self.observations, self.rewards,
            self.terminals, self.truncations, info)

    def render(self):
        self.c_envs.render()

    def close(self):
        self.c_envs.close()

The key feature here is the allocation of data buffers for observations, rewards, etc. When multiprocessing, PufferLib will place these into a single shared memory tensor, so your data will be available on the main process immediately. In other words, your environment computes observations directly into shared memory with no redundant copies or slow Python glue. The base class is less than 100 lines. We suggest reading it just to see that there is no magic.

The logic for this environment is imported. It is written in C with an intermediate binding in Cython. This is not part of the API: you can implement environment logic in Python or any other language you like. The only hard requirement is that you write data into the provided self.observations, self.actions, self.rewards, self.terminals, and self.truncations buffers. Additionally, there are two soft requirements that we plan to revisit. First, PufferEnvs must handle multiple environment copies internally. This one is easy to eliminate, but you probably don't want Python running this loop anyway. Second, your environment must handle resets internally. This one is just easier for now, and we will revisit it if it becomes an issue.
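For example, instead of stacking external copies, you scale a PufferEnv by passing num_envs through to the environment itself. A sketch using the Squared env above (assuming it is exported from pufferlib.ocean like Breakout):

import pufferlib.vector
from pufferlib.ocean import Squared

# Two worker processes, each running one Squared instance
# that simulates 512 environments internally
vecenv = pufferlib.vector.make(
    Squared,
    num_envs=2,
    num_workers=2,
    backend=pufferlib.vector.Multiprocessing,
    env_kwargs={'num_envs': 512},
)
obs, _ = vecenv.reset()
obs, rewards, terminals, truncateds, infos = vecenv.step(vecenv.action_space.sample())
vecenv.close()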

A key goal of PufferLib is to give you the maximum performance for the amount of engineering effort you are willing to put into your environment. From fastest to slowest, your options are: a native PufferEnv with its logic written in C (or another compiled language), as in Ocean; a native PufferEnv written in pure Python; and a standard Gymnasium/PettingZoo environment wrapped with the emulation layer.

Third Party Environments

You can use any well-behaved environment with PufferLib via a 1-line wrapper. These are just the environments we have gotten around to binding manually. A lot of environments are not well behaved and subtly deviate from the Gymnasium/PettingZoo API, or they use obscure features that most libraries are not designed to handle. Our bindings just help clean this up a bit. Plus, we add any tricky system package dependencies to PufferTank. Feel free to PR new bindings or fixes for existing ones!
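As a reminder, the 'one-line wrapper' is just the emulation layer from earlier. A minimal sketch for a standard Gymnasium environment (assuming gymnasium is installed):

import gymnasium
import pufferlib.emulation

# One line: a well-behaved Gymnasium env becomes a PufferLib-compatible env
env = pufferlib.emulation.GymnasiumPufferEnv(env=gymnasium.make('CartPole-v1'))
obs, info = env.reset()
print('Flat observation shape:', obs.shape)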

OpenAI Gym

OpenAI Gym is the standard API for single-agent reinforcement learning environments. It also contains some built-in environments. We include Box2D in our registry.

Pokemon Red

Pokemon Red is one of the original Pokemon games for the Game Boy. This project uses the game as an environment for reinforcement learning. We are actively supporting development on this one!

PettingZoo

PettingZoo is the standard API for multi-agent reinforcement learning environments. It also contains some built-in environments. We include Butterfly in our registry.

Arcade Learning Environment

Arcade Learning Environment provides a Gym interface for classic Atari games. This is the most popular benchmark for reinforcement learning algorithms.

Minigrid

Minigrid is a 2D grid-world environment engine and a collection of built-in environments, targeting flexible and computationally efficient RL research.

MAgent

MAgent is a platform for large-scale agent simulation.

Neural MMO

Neural MMO is a massively multiagent environment for reinforcement learning. It combines large agent populations with high per-agent complexity and is the most actively maintained (by me) project on this list.

Procgen

Procgen is a suite of arcade games for reinforcement learning with procedurally generated levels. It is one of the most computationally efficient environments on this list.

NLE

The NetHack Learning Environment is a port of the classic game NetHack to the Gym API. It combines extreme complexity with high simulation efficiency.

MiniHack

The MiniHack Learning Environment is a stripped-down version of NetHack with support for level editing and custom procedural generation.

Crafter

Crafter is a top-down 2D Minecraft clone for RL research. It provides pixel observations and relatively long time horizons.

GPUDrive

GPUDrive is a GPU-accelerated, multi-agent driving simulator that runs at 1 million FPS.

Griddly

Griddly is an extremely optimized platform for building reinforcement learning environments. It also includes a large suite of built-in environments.

MicroRTS-Py

MicroRTS-Py (formerly Gym-MicroRTS) is a real-time strategy engine for reinforcement learning research. The Java configuration is a bit finicky -- we're still debugging this.