
Augmented Constrained RL

Constrained Reinforcement Learning using State Augmentation

Supported Platforms

This package has been tested on CentOS 7 and Ubuntu 20.04 LTS and should work on most recent Linux distributions.

Requires Python 3.6.x

Installation

Apart from Python standard-library modules (pickle, datetime, sys, os) and packages installable via pip (torch, matplotlib, numpy), our code depends on mujoco-py, safety-gym and spinningup by OpenAI. This section runs you through a quick installation of the required Python packages.
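If they are not already present, the pip-installable packages above can be set up in one step, for example:

pip install torch matplotlib numpy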

Installing MuJoCo

MuJoCo is a physics engine for detailed, efficient rigid body simulations with contacts. mujoco-py allows using MuJoCo from Python 3. Further details about mujoco-py can be found here.

  1. Obtain a MuJoCo trial license by visiting the MuJoCo website. Students can request a free license for personal projects.
  2. Download the mujoco200 package from this link.
  3. Unzip the downloaded mujoco200 directory into ~/.mujoco/mujoco200, and place your license key (the mjkey.txt file from your license email) at ~/.mujoco/mjkey.txt.
  4. Before installing mujoco-py on Ubuntu, make sure you have the following libraries installed:
sudo apt install libosmesa6-dev libgl1-mesa-glx libglfw3
  5. Now install mujoco-py using pip:
pip install -U 'mujoco-py<2.1,>=2.0'
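To verify the installation, you can load a minimal model through mujoco-py. The snippet below is only a sanity check and is not part of the repository:

import mujoco_py

# Inline MJCF model: a single free-floating box.
MODEL_XML = """
<mujoco>
  <worldbody>
    <body pos="0 0 1">
      <geom type="box" size="0.1 0.1 0.1"/>
      <joint type="free"/>
    </body>
  </worldbody>
</mujoco>
"""

# Build a simulation from the inline model and advance it one step.
model = mujoco_py.load_model_from_xml(MODEL_XML)
sim = mujoco_py.MjSim(model)
sim.step()
print("mujoco-py is working, simulation time:", sim.data.time)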

Installing Safety Gym

Safety Gym is a suite of environments and tools for measuring progress towards reinforcement learning agents that respect safety constraints while training. More information can be found in OpenAI's Safety Gym repository.

  1. First install openai-gym:
pip install gym
  2. Afterwards, simply install Safety Gym by running:
git clone https://github.com/openai/safety-gym.git
cd safety-gym
pip install -e .
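As a quick sanity check (not part of the repository), you can create one of the Safety Gym environments and confirm that per-step costs are reported:

import gym
import safety_gym  # importing this package registers the Safexp-* environments

env = gym.make('Safexp-PointGoal1-v0')
obs = env.reset()
# Take one random action; Safety Gym reports the step's constraint cost in info.
obs, reward, done, info = env.step(env.action_space.sample())
print("observation shape:", obs.shape, "| cost this step:", info.get('cost', 0.0))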

Installing SpinningUp

SpinningUp provides reference implementations of key reinforcement learning algorithms, including the Soft Actor-Critic, Proximal Policy Optimization and Twin Delayed DDPG algorithms used in this project.

We use a fork of the original SpinningUp repository in which we implement the changes required for state-augmented constrained RL, as well as the more robust adaptive entropy penalty for SAC.

  1. First install OpenMPI:
sudo apt-get update && sudo apt-get install libopenmpi-dev
  2. Now install the forked spinningup:
git clone https://github.com/flodorner/spinningup.git
cd spinningup
pip install -e .
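To confirm the install, you can run one of the standard SpinningUp entry points for a single short epoch. This smoke test uses the stock sac_pytorch call on a plain Gym environment and does not exercise the fork's constrained-RL additions; the epoch and step counts are purely illustrative:

import gym
from spinup import sac_pytorch

# Train SAC on a small control task for one short epoch as a smoke test.
# 'Pendulum-v0' assumes the older Gym version that SpinningUp targets.
sac_pytorch(lambda: gym.make('Pendulum-v0'), epochs=1, steps_per_epoch=1000)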

Code Structure

wrapper.py: Defines the constraint_wrapper class, which serves as a wrapper around the safety-gym env class. The step method of the constraint_wrapper class returns a cost-augmented state observation and a cost-modified reward (an illustrative sketch follows at the end of this section).

test_constraints.py: Creates an instance of the constraint_wrapper class and starts an experiment with given arguments.

test_noconstraints.py: Creates an instance of the safety-gym env class and starts training an unconstrained agent with default arguments.

run.py: Runs an experiment from the set of experiments listed in the proposal.
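The following sketch illustrates the idea behind wrapper.py. It is a simplified stand-in under assumed parameter names (cost_limit, add_penalty) showing only the additive penalty, not the repository's actual implementation:

import numpy as np

class ConstraintWrapperSketch:
    """Illustrative only: append accumulated cost to the observation and
    penalize the reward once the cost budget is exceeded."""

    def __init__(self, env, cost_limit=25.0, add_penalty=1.0):
        self.env = env
        self.cost_limit = cost_limit
        self.add_penalty = add_penalty
        self.acc_cost = 0.0

    def reset(self):
        self.acc_cost = 0.0
        obs = self.env.reset()
        # Augment the state with the (normalized) accumulated cost.
        return np.append(obs, self.acc_cost / self.cost_limit)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.acc_cost += info.get('cost', 0.0)
        if self.acc_cost > self.cost_limit:
            reward -= self.add_penalty  # cost-modified reward after violation
        aug_obs = np.append(obs, self.acc_cost / self.cost_limit)
        return aug_obs, reward, done, info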

Running Experiments

Predefined Experiments

The experiments can simply be run by:

python run.py --id {exp_id}

where {exp_id} is the id of the experiment you wish to run.

To reproduce the results in the main report, only experiments 3, 4, 5, 6, 10, 11, 12, 14, 15 and 51 need to be run. We kept the experiments grouped in the same combinations as they were originally run to avoid introducing mistakes, but calls to run_exp that use a name not mentioned in the first five plots in Plot_results.ipynb can safely be omitted to speed things up if you only want to replicate the plots from the main report.

Custom Experiments

You can also run your custom experiments by passing runtime arguments to test_constraints.py. For example, start an experiment with Soft Actor-Critic by running:

python test_constraints.py --alg 'sac'

The following arguments are available:

--alg: alg determines whether sac, ppo or td3 is used.
--alpha: alpha is the exploration parameter in sac.
--add_penalty: add_penalty is Beta from the proposal.
--mult_penalty: If mult_penalty is not None, all rewards get multiplied by it once the constraint is violated (1-alpha from the proposal).
--cost_penalty: cost_penalty is equal to zeta from the proposal.
--buckets: buckets determines how the accumulated cost is discretized for the agent.
--epochs: epochs indicates how many epochs to train for.
--start_step: start_steps indicates how many random exploratory actions to perform before using the trained policy. 
--split_policy: split_policy changes the network architecture such that a second network is used for the policy and q-values when the constraint is violated. 
--safe_policy: safe_policy indicates the saving location for a trained safe policy. If provided, the safe policy will take over whenever the constraint is violated.
--name: name determines where in the results folder the result files and trained policy get saved.
--env_name: env_name indicates the name of the environment the agent trains on. Can be chosen from one of the safety-gym environments.
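For instance, the following call (flag values chosen purely for illustration) trains a state-augmented SAC agent with an additive penalty and discretized accumulated cost:

python test_constraints.py --alg 'sac' --add_penalty 1.0 --buckets 10 --epochs 50 --name my_experiment --env_name 'Safexp-PointGoal1-v0'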

Results

The results of all the experiments listed in run.py can be found at this polybox link. Per-episode rewards and costs are stored in order in pickle files ({experiment_name}_rews.pkl and {experiment_name}_costs.pkl). If the training loop uses test episodes, these need to be excluded from the analysis, as done in Plot_results.ipynb for our experiments.
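A minimal sketch for loading and summarizing one experiment's per-episode results; the results directory and experiment name below are assumptions, so adjust them to wherever your files were saved:

import pickle

name = "experiment_name"  # placeholder: use the --name of your run
with open(f"results/{name}_rews.pkl", "rb") as f:   # assumed results folder
    rews = pickle.load(f)
with open(f"results/{name}_costs.pkl", "rb") as f:
    costs = pickle.load(f)

print(len(rews), "episodes | mean reward:", sum(rews) / len(rews),
      "| mean cost:", sum(costs) / len(costs))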

Known Issues
