Case Study: A Dive Into LeggedGym and RSL-RL Framework

Code Organization

The majority of the logic is implemented in the legged_gym/envs/base/legged_robot.py file. The code can be partitioned into the following five sections:

1. Environment creation

How to create the physics environment

2. Adding the agent asset

How to initialize the agent asset (URDF) in the environment, dealing with the initial position, joint characteristics, etc.

3. Reward design

How to formulate the reward to enable efficient learning

4. Exploration mechanism

How to control how aggressively the policy explores the policy space

5. Driving the simulation

How to record the training process and keep the best-performing policy

To understand the code, we can begin by following the code execution flow and seeing, step by step, how the environment and agent are set up and how the simulation is driven.

The entry script

We start by looking at the scripts that the user invokes: either legged_gym/scripts/train.py for training, or legged_gym/scripts/play.py for policy inference.

Both scripts interact with the task_registry module.
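In abridged form, train.py does roughly the following, covering all of the steps described below (paraphrased, with imports and logging details trimmed; the flow in play.py is analogous up to the learning call):

import isaacgym  # must be imported before torch
from legged_gym.envs import *  # importing the envs package triggers task registration
from legged_gym.utils import get_args, task_registry
import torch

def train(args):
    # create the vectorized environment and the training runner from the registry
    env, env_cfg = task_registry.make_env(name=args.task, args=args)
    ppo_runner, train_cfg = task_registry.make_alg_runner(env=env, name=args.task, args=args)
    # run the training loop
    ppo_runner.learn(num_learning_iterations=train_cfg.runner.max_iterations,
                     init_at_random_ep_len=True)

if __name__ == '__main__':
    args = get_args()
    train(args)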

Step 1: Parse command line arguments

The first immediate step is to parse the command line arguments with the get_args() helper function.

Table 1 gives an overview of the default supported command-line arguments.

The arguments are then parsed by Isaac Gym's gymutil.parse_arguments() function.

The input to the function is our command-line options as a list of dicts (custom_parameters):

[
  {'name': '--task', 'type': <class 'str'>, 'default': 'go2', 'help': 'Resume training or start testing from a checkpoint. Overrides config file if provided.'}, 
  {'name': '--resume', 'action': 'store_true', 'default': False, 'help': 'Resume training from a checkpoint'}, 
  {'name': '--experiment_name', 'type': <class 'str'>, 'help': 'Name of the experiment to run or load. Overrides config file if provided.'}, 
  ...
  {'name': '--max_iterations', 'type': <class 'int'>, 'help': 'Maximum number of training iterations. Overrides config file if provided.'}
  ]

and the function returns a config Namespace as output (args):

Namespace(
  checkpoint=None, 
  compute_device_id=0, 
  experiment_name=None, 
  flex=False, 
  graphics_device_id=0, 
  headless=False, 
  horovod=False, 
  load_run=None, 
  max_iterations=None, 
  num_envs=4, 
  num_threads=0, 
  physics_engine=SimType.SIM_PHYSX, 
  physx=False, 
  pipeline='gpu', 
  resume=False, 
  rl_device='cuda:0', 
  run_name=None, 
  seed=None, 
  sim_device='cuda:0', 
  sim_device_type='cuda', 
  slices=0, 
  subscenes=0, 
  task='g1_leg', 
  use_gpu=True, 
  use_gpu_pipeline=True
  )

Step 2: Make environment

Then, the task_registry.make_env() function is used to create the simulation environment.

This function looks up the corresponding VecEnv class and its configurations (LeggedRobotCfg for the environment and LeggedRobotCfgPPO for training) among the registered environments.

It can be a bit confusing to keep track of the different classes defined by legged_gym and rsl_rl.

VecEnv is an abstract environment class defined by rsl_rl. It specifies the generic environment interface that the learning algorithm expects, with the following attributes and methods:

  • num_envs: int - Number of environments

  • num_obs: int - Number of observations

  • num_privileged_obs: int - Number of privileged observations

  • num_actions: int - Number of actions

  • max_episode_length: int - Maximum episode length

  • privileged_obs_buf: torch.Tensor - Buffer for privileged observations

  • obs_buf: torch.Tensor - Buffer for observations

  • rew_buf: torch.Tensor - Buffer for rewards

  • reset_buf: torch.Tensor - Buffer for resets

  • episode_length_buf: torch.Tensor - Buffer for current episode lengths

  • extras: dict - Extra information (metrics), containing metrics such as the episode reward, episode length, etc. Additional information can be stored in the dictionary such as observations for the critic network, etc

  • device: torch.device - Device to use.

  • get_observations(self) -> tuple[torch.Tensor, dict]

  • reset(self) -> tuple[torch.Tensor, dict]

  • step(self, actions: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor, dict]

The LeggedRobot environment, a subclass of BaseTask, is defined by legged_gym; it satisfies the VecEnv requirements and provides the implementation of the three methods above.

There is no direct inheritance relation between these two classes, but they are defined to match and can thus be used interchangeably.

A custom environment can be registered using the function

task_registry.register(name: str, task_class: VecEnv, env_cfg: LeggedRobotCfg, train_cfg: LeggedRobotCfgPPO)

The registration happens in the envs/__init__.py file.
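For example, a hypothetical robot could be registered as follows (the MyRobot config classes and module are made up here; the pattern mirrors the existing registrations in envs/__init__.py):

from legged_gym.utils.task_registry import task_registry
from .base.legged_robot import LeggedRobot
from .my_robot.my_robot_config import MyRobotCfg, MyRobotCfgPPO  # hypothetical configs

# the name passed here is what the --task command line argument selects
task_registry.register("my_robot", LeggedRobot, MyRobotCfg(), MyRobotCfgPPO())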

The base environment

By default, the environment class itself does not need to be changed: every robot uses the LeggedRobot environment, and only the configuration classes differ between robot training setups.
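A robot-specific configuration typically subclasses the base config and overrides only the relevant fields. The sketch below is a hypothetical example: the class and joint names are made up, but the field names follow LeggedRobotCfg and LeggedRobotCfgPPO:

from legged_gym.envs.base.legged_robot_config import LeggedRobotCfg, LeggedRobotCfgPPO

class MyRobotCfg(LeggedRobotCfg):
    class env(LeggedRobotCfg.env):
        num_envs = 4096
        num_observations = 48
        num_actions = 12
    class control(LeggedRobotCfg.control):
        stiffness = {'joint': 20.0}   # P gains, matched against joint names
        damping = {'joint': 0.5}      # D gains
        action_scale = 0.25
        decimation = 4                # physics steps per control step

class MyRobotCfgPPO(LeggedRobotCfgPPO):
    class runner(LeggedRobotCfgPPO.runner):
        experiment_name = 'my_robot'
        max_iterations = 1500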

The base environment LeggedRobot is defined in envs/base/legged_robot.py

It provides all the necessary functions to perform the following tasks:

  • defines how rewards and observations are computed, including the definition of each reward term in _reward_<term>().

  • defines the joint-level PD controller that maps actions to raw torques in _compute_torques() (see the sketch after this list)

  • initializes the tensor buffers that can be loaded on the GPU.

  • configures and loads the URDF asset into the environment.

  • implements the step() function.
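For the default position-control mode, the torque computation is essentially a PD law around a target joint position (abridged sketch paraphrased from _compute_torques(); the velocity and torque control modes are omitted):

def _compute_torques(self, actions):
    # target joint position = default pose + scaled action
    actions_scaled = actions * self.cfg.control.action_scale
    torques = self.p_gains * (actions_scaled + self.default_dof_pos - self.dof_pos) \
              - self.d_gains * self.dof_vel
    # never command more torque than the actuator limits allow
    return torch.clip(torques, -self.torque_limits, self.torque_limits)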

Environment initialization logic

When the LeggedRobot class is instantiated, it initializes the environment with the following procedure:

  1. Parses the configuration class, which converts its subclasses into dictionaries, and sets up the simulation time delta.

  2. Initializes the parent BaseTask class, which performs several tasks:

    1. acquires the gym handle

    2. sets up the simulation device

    3. initializes base parameters, including num_envs, num_obs, num_privileged_obs, and num_actions, from the config dict

    4. creates the tensor buffers required by VecEnv, such as obs_buf, rew_buf, etc.

    5. creates the gym simulation environment

    6. creates the viewer if not in headless mode

  3. Sets up the camera.

  4. Creates tensor buffers that interface with the gym simulation tensor data on the GPU, such as root_states and dof_pos, as well as tensor buffers for reward calculations, such as torques and feet_air_time.

    Additionally, this step also initializes the default joint positions and PD parameters to the values specified in the config class.

  5. Initializes the reward functions as a list (see the sketch after this list).
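The reward-function list is built by matching the non-zero scales in the config against the _reward_<term>() methods (abridged sketch paraphrased from _prepare_reward_function(); details may differ between versions):

def _prepare_reward_function(self):
    # drop terms with a zero scale and pre-multiply the remaining scales by dt
    for name in list(self.reward_scales.keys()):
        if self.reward_scales[name] == 0:
            self.reward_scales.pop(name)
        else:
            self.reward_scales[name] *= self.dt
    # collect the corresponding _reward_<name>() methods into a list
    self.reward_functions = []
    self.reward_names = []
    for name in self.reward_scales.keys():
        if name == "termination":
            continue  # the termination penalty is handled separately
        self.reward_names.append(name)
        self.reward_functions.append(getattr(self, '_reward_' + name))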

Environment stepping logic

The step() function is abstract in the BaseTask class and is implemented in the LeggedRobot class. It performs the following operations for each environment step (an abridged sketch follows the list):

  1. Clips the input actions argument according to the limit specified in the config and sends it to the GPU

  2. Renders the current scene in self.render(). The render function also checks keyboard input events.

  3. At each control step (which also sets the rendering rate), performs decimation physics updates:

    1. computes torques from the input actions and sets them in the simulator

    2. steps the physics for one iteration

    3. updates the dof_state tensor

  4. Then, does the following in self.post_physics_step():

    1. refreshes the actor root state and net contact force tensors

    2. calculates the root pose and velocity

    3. performs self._post_physics_step_callback(), which includes

      1. resamples commands for the next step

      2. calculates the surrounding terrain height, if applicable

      3. pushes the robot, if applicable

    4. checks the termination conditions (collision or timeout)

    5. computes the reward from the reward function list

    6. resets the environment instances that have terminated

    7. computes observations for the next step

    8. updates the action, dof_vel, and root_vel history buffers

    9. finally, draws the debug visualization, if applicable
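Putting the above together, an abridged sketch of step() looks roughly as follows (paraphrased; observation clipping, the CPU-pipeline branch, and the exact return signature are simplified):

def step(self, actions):
    clip = self.cfg.normalization.clip_actions
    self.actions = torch.clip(actions, -clip, clip).to(self.device)
    self.render()  # also polls keyboard/viewer events
    for _ in range(self.cfg.control.decimation):
        # PD controller maps actions to torques, then one physics step is taken
        self.torques = self._compute_torques(self.actions).view(self.torques.shape)
        self.gym.set_dof_actuation_force_tensor(self.sim, gymtorch.unwrap_tensor(self.torques))
        self.gym.simulate(self.sim)
        self.gym.refresh_dof_state_tensor(self.sim)
    self.post_physics_step()  # rewards, terminations, resets, observations
    # the observation, reward, and reset buffers are returned per the VecEnv interface
    # (the exact return signature depends on the rsl_rl/legged_gym version)
    return self.obs_buf, self.rew_buf, self.reset_buf, self.extras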

Step 3: Make algorithm runner

This step uses the task_registry.make_alg_runner() function.

It sets up TensorBoard logging, then initializes and returns an OnPolicyRunner instance from the rsl_rl library.

The rsl_rl library currently only supports the PPO algorithm.

PPO algorithm

Proximal Policy Optimization (PPO) is an actor-critic method, meaning it utilizes both a policy network (the actor) and a value network (the critic). The actor is responsible for selecting actions, while the critic estimates the value function (expected reward). PPO aims to improve the policy by taking small steps to ensure stable and reliable updates.
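The key idea is the clipped surrogate objective (shown here in its standard form; rsl_rl's PPO combines it with a value loss and an entropy bonus):

$$
L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_t\right)\right],
\qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)}
$$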

Step 4: Learn

By invoking the ppo_runner.learn(num_learning_iterations, init_at_random_ep_len) function, the rsl_rl framework automatically performs the policy training iterations.

PPO training flow

  1. Initialize networks and storage buffer.

  2. Collect trajectories using the current policy.

  3. Compute advantages and targets.

  4. Update policy (actor) network using the clipped objective.

  5. Update value (critic) network by minimizing value loss.

  6. Repeat until convergence.

1: Initialize Networks and Storage Buffer

Upon creation of the OnPolicyRunner instance, it looks up and initializes the policy class specified by self.cfg["policy_class_name"] and the algorithm class specified by self.cfg["algorithm_class_name"].

By default, the ActorCritic policy is used, which contains an actor MLP that computes actions from the observations, and a critic MLP that estimates the value function from all observations, including the privileged ones.
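A minimal sketch of the network structure is shown below (the hidden sizes are the legged_gym defaults; the real ActorCritic in rsl_rl additionally builds the Gaussian action distribution from the actor output and the learnable standard deviation):

import torch
import torch.nn as nn

def mlp(in_dim, hidden_dims, out_dim, act=nn.ELU):
    layers, last = [], in_dim
    for h in hidden_dims:
        layers += [nn.Linear(last, h), act()]
        last = h
    layers.append(nn.Linear(last, out_dim))
    return nn.Sequential(*layers)

class ActorCriticSketch(nn.Module):
    def __init__(self, num_obs, num_critic_obs, num_actions, hidden_dims=(512, 256, 128)):
        super().__init__()
        self.actor = mlp(num_obs, hidden_dims, num_actions)   # observation -> action mean
        self.critic = mlp(num_critic_obs, hidden_dims, 1)     # (privileged) observation -> value
        self.std = nn.Parameter(torch.ones(num_actions))      # learnable exploration noise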

The storage buffer RolloutStorage is also initialized at this stage; it records transitions of the following form:

class Transition:
    def __init__(self):
        self.observations = None
        self.critic_observations = None
        self.actions = None
        self.rewards = None
        self.dones = None
        self.values = None
        self.actions_log_prob = None
        self.action_mean = None
        self.action_sigma = None
        self.hidden_states = None

Lastly, it also performs an environment reset to prepare for the first training step.

2: Collect data

In each training iteration, a policy rollout is performed first.

The current policy is run for num_steps_per_env steps to collect trajectories using the act() function, as sketched below.
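In abridged form, the rollout loop inside OnPolicyRunner.learn() looks roughly like this (paraphrased; the exact env.step() return signature depends on the rsl_rl/legged_gym version):

for _ in range(self.num_steps_per_env):
    # act() samples actions and records observations, values, and log-probs in the storage
    actions = self.alg.act(obs, critic_obs)
    obs, rewards, dones, infos = self.env.step(actions)
    # store rewards/dones and reset the per-environment bookkeeping for finished episodes
    self.alg.process_env_step(rewards, dones, infos)
# bootstrap the value of the last state, then run GAE over the stored rollout
self.alg.compute_returns(critic_obs)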

3: Compute advantages and targets

The Generalized Advantage Estimation (GAE) method is used to calculate the advantage:
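$$
\delta_t = r_t + \gamma\,(1 - d_t)\,V(s_{t+1}) - V(s_t),
\qquad
\hat{A}_t = \delta_t + \gamma\lambda\,(1 - d_t)\,\hat{A}_{t+1}
$$

Here $d_t$ is the done flag, $\gamma$ the discount factor, and $\lambda$ the GAE parameter, matching the arguments of compute_returns() below.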

Then, the target value is computed:
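$$
R_t = \hat{A}_t + V(s_t)
$$

which is stored in the returns buffer; the advantages are then normalized across the batch: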

def compute_returns(self, last_values, gamma, lam):
    advantage = 0
    for step in reversed(range(self.num_transitions_per_env)):
        if step == self.num_transitions_per_env - 1:
            next_values = last_values
        else:
            next_values = self.values[step + 1]
        next_is_not_terminal = 1.0 - self.dones[step].float()
        delta = self.rewards[step] + next_is_not_terminal * gamma * next_values - self.values[step]
        advantage = delta + next_is_not_terminal * gamma * lam * advantage
        self.returns[step] = advantage + self.values[step]

    # Compute and normalize the advantages
    self.advantages = self.returns - self.values
    self.advantages = (self.advantages - self.advantages.mean()) / (self.advantages.std() + 1e-8)

4: Update actor network

The actor is updated in PPO's update() function by minimizing the clipped surrogate loss over minibatches sampled from the rollout storage.

5: Update value (critic) network

The critic is updated in the same update() call by minimizing the error between the predicted values and the computed returns; the combined loss is sketched below.
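In abridged form, the loss computed per minibatch in rsl_rl's PPO.update() looks roughly like this (paraphrased; the adaptive learning-rate schedule and the clipped value-loss variant are omitted):

# probability ratio between the new and the old policy for the sampled actions
ratio = torch.exp(actions_log_prob_batch - old_actions_log_prob_batch)
surrogate = -advantages_batch * ratio
surrogate_clipped = -advantages_batch * torch.clamp(ratio, 1.0 - self.clip_param, 1.0 + self.clip_param)
surrogate_loss = torch.max(surrogate, surrogate_clipped).mean()   # actor (policy) loss

value_loss = (returns_batch - value_batch).pow(2).mean()          # critic loss

loss = surrogate_loss + self.value_loss_coef * value_loss - self.entropy_coef * entropy_batch.mean()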

Reward design

LeggedGym provides a number of reward terms by default.
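Each term is implemented as a _reward_<term>() method that returns one value per environment. compute_reward() multiplies each term by its scale from cfg.rewards.scales and sums them up (abridged sketch, paraphrased from legged_robot.py):

def compute_reward(self):
    self.rew_buf[:] = 0.
    for i in range(len(self.reward_functions)):
        name = self.reward_names[i]
        # per-environment term value times its (signed) scale from the config
        rew = self.reward_functions[i]() * self.reward_scales[name]
        self.rew_buf += rew
        self.episode_sums[name] += rew
    if self.cfg.rewards.only_positive_rewards:
        # optionally clip the total reward at zero so early termination is never attractive
        self.rew_buf[:] = torch.clip(self.rew_buf[:], min=0.)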

lin_vel_z

Penalize the z axis base linear velocity. Prevents the robot from shaking up and down.

The reward coefficient of this term should be negative.
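The implementation is a simple squared penalty on the vertical base velocity (as in legged_robot.py; the negative sign comes from the scale in the config, not from the term itself):

def _reward_lin_vel_z(self):
    # squared z component of the base linear velocity, one value per environment
    return torch.square(self.base_lin_vel[:, 2])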

ang_vel_xy

Penalize the xy-axes base angular velocity. Prevents the robot from vibrating and rotating sideways.

The reward coefficient of this term should be negative.

base_orientation

Penalize non-flat base orientation.

The reward coefficient of this term should be negative.

base_height

Penalize the base height tracking error.

The reward coefficient of this term should be negative.

torques

Penalize large torque and energy consumption.

The reward coefficient of this term should be negative.

dof_vel

Penalize joint velocities.

The reward coefficient of this term should be negative.

dof_acc

Penalize joint accelerations.

The reward coefficient of this term should be negative.

action_rate

Penalize change in actions. Prevents glitches.

The reward coefficient of this term should be negative.

collision

Penalize collisions on selected bodies.

The reward coefficient of this term should be negative.

termination

Penalize for termination

The reward coefficient of this term should be negative.

dof_pos_limits

Penalize joint positions that violate the joint limits.

The reward coefficient of this term should be negative.

dof_vel_limits

Penalize joint velocities that violate the velocity limits.

The reward coefficient of this term should be negative.

torque_limits

Penalize joint torques that violate the torque limits.

The reward coefficient of this term should be negative.

tracking_lin_vel

Rewards for tracking the command xy velocity goals.

The reward coefficient of this term should be positive.
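Tracking terms use an exponential kernel, so the reward approaches 1 as the tracking error approaches zero (as in legged_robot.py; tracking_sigma is set in cfg.rewards):

def _reward_tracking_lin_vel(self):
    # squared xy velocity tracking error, mapped through an exponential kernel
    lin_vel_error = torch.sum(torch.square(self.commands[:, :2] - self.base_lin_vel[:, :2]), dim=1)
    return torch.exp(-lin_vel_error / self.cfg.rewards.tracking_sigma)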

tracking_ang_vel

Rewards for tracking the command yaw angular velocity goals.

The reward coefficient of this term should be positive.

feet_air_time

Reward for how long the robot's feet stay in the air between steps. Encourages longer steps and prevents the feet from dragging on the ground. The reward for each leg is only granted upon contact with the ground after a swing phase.

The reward coefficient of this term should be positive.

stumble

Penalizes feet hitting vertical surfaces.

The reward coefficient of this term should be negative.

stand_still

Penalize motion at zero commands

The reward coefficient of this term should be negative.

feet_contact_forces

Penalizes high contact forces

The reward coefficient of this term should be negative.

Step 5: Inference

To perform inference, the ppo_runner.get_inference_policy(device) function returns the forward() method of the actor network, which can then be called on observations as sketched below.
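In abridged form, play.py uses it roughly like this (paraphrased; the exact get_observations()/step() signatures depend on the rsl_rl/legged_gym version):

policy = ppo_runner.get_inference_policy(device=env.device)
obs, _ = env.get_observations()
for _ in range(1000):
    # deterministic actor forward pass; detach to avoid building a computation graph
    actions = policy(obs.detach())
    obs, rewards, dones, infos = env.step(actions.detach())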
