Case Study: A Dive Into LeggedGym and RSL-RL Framework
Code Organization
The majority of the logic is implemented in the legged_gym/envs/base/legged_robot.py
file. The code can be partitioned into the following five sections:
1. Environment creation
How to create the physics environment
2. Adding the agent asset
How to initialize the agent asset (URDF) in the environment, including the initial position, joint characteristics, etc.
3. Reward design
How to formulate rewards to enable efficient learning
4. Exploration mechanism
How to control how aggressively the policy explores new regions of the policy space
5. Driving the simulation
How to record the training process and keep the best-performing policy
To understand the code, we can begin by following the execution flow and seeing, step by step, how the environment and agent are set up and how the simulation is driven.
The entry script
We start by looking at the scripts that the user invokes: legged_gym/scripts/train.py for training, or legged_gym/scripts/play.py for policy inference.
Both scripts interact with the task_registry module.
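Before diving into the individual steps, here is a rough sketch of how these pieces fit together, paraphrasing the structure of train.py rather than quoting it verbatim. The four steps it wires together are examined one by one below.

```python
# Approximate structure of legged_gym/scripts/train.py (simplified).
import isaacgym                                     # must be imported before torch
from legged_gym.envs import *                       # importing registers the built-in tasks
from legged_gym.utils import get_args, task_registry

def train(args):
    # Step 2: create the vectorized simulation environment for the requested task
    env, env_cfg = task_registry.make_env(name=args.task, args=args)
    # Step 3: create the algorithm runner (OnPolicyRunner from rsl_rl)
    ppo_runner, train_cfg = task_registry.make_alg_runner(env=env, name=args.task, args=args)
    # Step 4: run the policy training iterations
    ppo_runner.learn(num_learning_iterations=train_cfg.runner.max_iterations,
                     init_at_random_ep_len=True)

if __name__ == '__main__':
    args = get_args()                               # Step 1: parse command line arguments
    train(args)
```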
Step 1: Parse command line arguments
The first immediate step is to parse the command line arguments with the get_args() helper function.
Table 1 shows an overview of the command line arguments supported by default.
The arguments are then handed to Isaac Gym to parse with the gymutil.parse_arguments() function: the input to the function is our list of command line option definitions (custom_parameters), and the function returns a parsed config Namespace as output (args).
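A minimal sketch of this pattern is shown below. The specific options and defaults are illustrative examples rather than the full set defined by legged_gym.

```python
# Sketch of how get_args() defines custom_parameters and delegates parsing to Isaac Gym.
# The option list here is abbreviated and illustrative; see Table 1 for the real defaults.
from isaacgym import gymutil

def get_args():
    custom_parameters = [
        {"name": "--task", "type": str, "default": "anymal_c_flat",
         "help": "Name of the registered task to run."},
        {"name": "--headless", "action": "store_true", "default": False,
         "help": "Run without creating a viewer window."},
        {"name": "--num_envs", "type": int,
         "help": "Override the number of parallel environments."},
    ]
    # parse_arguments merges these options with Isaac Gym's own ones
    # (simulation device, physics engine, ...) and returns an argparse-style Namespace.
    args = gymutil.parse_arguments(description="RL Policy", custom_parameters=custom_parameters)
    return args
```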
Step 2: Make environment
Then, the task_registry.make_env() function is used to create the simulation environment.
This function looks up the corresponding VecEnv-compatible environment class and its configurations (LeggedRobotCfg for the environment, and LeggedRobotCfgPPO for training) among the registered environments.
It might be a bit hard to keep track of the different classes defined by legged_gym and rsl_rl.
VecEnv is an abstract environment class in rsl_rl. It defines the generic environment interface that the learning algorithm relies on, requiring the following attributes and methods:
num_envs: int - Number of environments
num_obs: int - Number of observations
num_privileged_obs: int - Number of privileged observations
num_actions: int - Number of actions
max_episode_length: int - Maximum episode length
privileged_obs_buf: torch.Tensor - Buffer for privileged observations
obs_buf: torch.Tensor - Buffer for observations
rew_buf: torch.Tensor - Buffer for rewards
reset_buf: torch.Tensor - Buffer for resets
episode_length_buf: torch.Tensor - Buffer for current episode lengths
extras: dict - Extra information (metrics), containing metrics such as the episode reward, episode length, etc. Additional information, such as observations for the critic network, can also be stored in this dictionary.
device: torch.device - Device to use.
get_observations(self) -> tuple[torch.Tensor, dict]
reset(self) -> tuple[torch.Tensor, dict]
step(self, actions: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor, dict]
The LeggedRobot environment, which is a subclass of BaseTask, is an environment defined by legged_gym that satisfies the VecEnv requirements and provides implementations of the three methods.
There is no direct inheritance relationship between these two classes, but their interfaces are defined to match, so they can be used interchangeably.
A custom environment can be registered using the task_registry.register() function.
The registration happens in the envs/__init__.py file.
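For illustration, a registration call looks roughly like the following. The task name, config classes, and import path here are placeholders for a hypothetical custom robot; the built-in robots follow the same pattern.

```python
# Sketch of a task registration in legged_gym/envs/__init__.py.
# MyRobotCfg / MyRobotCfgPPO are hypothetical config classes for a custom robot.
from legged_gym.utils.task_registry import task_registry
from legged_gym.envs.base.legged_robot import LeggedRobot
from .my_robot.my_robot_config import MyRobotCfg, MyRobotCfgPPO

task_registry.register("my_robot", LeggedRobot, MyRobotCfg(), MyRobotCfgPPO())
```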
The base environment
By default, the environment class itself does not need to be changed; every robot uses the LeggedRobot environment. Only the configurations need to change for different robot training setups.
The base environment LeggedRobot is defined in envs/base/legged_robot.py.
It provides all the necessary functions to perform the following tasks:
- defines how rewards and observations are computed, including the definition of each reward term in a _reward_<term>() method
- defines the joint-level PD controller that maps actions to raw torques in _compute_torques() (sketched after this list)
- initializes the tensor buffers that can be loaded on the GPU
- configures and loads the URDF asset into the environment
- implements the step() function
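As a rough illustration of that PD mapping, a simplified version of _compute_torques() for the default position ("P") control type could look like the following. Attribute names follow the description in this section and may not match the source exactly.

```python
import torch

# Method of the LeggedRobot class (simplified sketch).
def _compute_torques(self, actions):
    # Actions are interpreted as scaled joint-position offsets around the default pose.
    actions_scaled = actions * self.cfg.control.action_scale
    # PD law: torque = Kp * (q_desired - q) - Kd * q_dot
    torques = self.p_gains * (actions_scaled + self.default_dof_pos - self.dof_pos) \
              - self.d_gains * self.dof_vel
    # Clip to the actuator limits before sending the torques to the simulator.
    return torch.clip(torques, -self.torque_limits, self.torque_limits)
```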
Environment initialization logic
When a LeggedRobot instance is created, it initializes the environment with the following procedure:
1. Parses the configuration class, converting its subclasses into dictionaries, and sets up the simulation time step.
2. Initializes the parent BaseTask class, which performs several tasks:
   - acquires the gym handle
   - sets up the simulation device
   - initializes base parameters, including num_envs, num_obs, num_privileged_obs, and num_actions, from the config dict
   - creates the tensor buffers required by VecEnv, such as obs_buf, rew_buf, etc.
   - creates the gym simulation environment
   - creates the viewer if not in headless mode
3. Sets up the camera.
4. Creates the tensor buffers that interface with the gym simulation tensor data on the GPU, such as root_states and dof_pos, plus tensor buffers for reward calculations, such as torques and feet_air_time. Additionally, this part initializes the default joint positions and PD parameters to the values specified in the config class.
5. Initializes the reward functions as a list (see the sketch after this list).
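A simplified sketch of how that list could be assembled from the configured reward scales, following the _reward_<term>() naming convention described earlier (the real implementation in legged_gym also rescales the terms by the time step and treats the termination reward specially):

```python
# Method of the LeggedRobot class (simplified sketch).
def _prepare_reward_function(self):
    # Collect one callable per configured reward term: a term "name" with a
    # nonzero scale maps to the corresponding self._reward_<name>() method.
    self.reward_names, self.reward_functions = [], []
    for name, scale in list(self.reward_scales.items()):
        if scale == 0:
            self.reward_scales.pop(name)        # disabled term, skip it
            continue
        self.reward_names.append(name)
        self.reward_functions.append(getattr(self, '_reward_' + name))
```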
Environment stepping logic
The step() function is abstract in the BaseTask class and is implemented in the LeggedRobot class. It performs the following operations for each environment step (a simplified sketch follows the list):
1. Clips the input actions according to the value specified in the config and sends them to the GPU.
2. Renders the current scene in self.render(). The render function also checks keyboard input events.
3. For each control step (the rendering rate is also the control frequency), performs decimation number of physics updates:
   - computes torques from the input actions and sets them in the simulator
   - steps the physics for one iteration
   - updates the dof_state tensor
4. Then performs the following in self.post_physics_step():
   - refreshes the actor root state and net contact force tensors
   - calculates the root pose and velocity
   - calls self._post_physics_step_callback(), which includes:
     - resampling commands for the next step
     - calculating the surrounding terrain height, if applicable
     - pushing the robots, if applicable
   - checks the termination condition (either collision or timeout)
   - computes the reward from the reward function list
   - resets the environment instances that have terminated
   - computes observations for the next step
   - updates the action, dof_vel, and root_vel history buffers
   - finally, draws the debug visualization, if applicable
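Putting this together, a heavily simplified sketch of step() is shown below. It follows the structure described above; the return signature shown here follows the legged_gym implementation (which also returns the privileged observation buffer) and differs slightly between rsl_rl versions.

```python
# Method of the LeggedRobot class (simplified sketch).
# Assumes module-level imports: torch, and gymtorch from isaacgym.
def step(self, actions):
    # 1. Clip the policy actions and move them to the simulation device.
    clip_actions = self.cfg.normalization.clip_actions
    self.actions = torch.clip(actions, -clip_actions, clip_actions).to(self.device)
    # 2. Render the scene and poll keyboard events.
    self.render()
    # 3. Run `decimation` physics sub-steps per control step.
    for _ in range(self.cfg.control.decimation):
        self.torques = self._compute_torques(self.actions)
        self.gym.set_dof_actuation_force_tensor(self.sim, gymtorch.unwrap_tensor(self.torques))
        self.gym.simulate(self.sim)
        self.gym.refresh_dof_state_tensor(self.sim)
    # 4. Rewards, observations, termination checks, resets, history buffers.
    self.post_physics_step()
    return self.obs_buf, self.privileged_obs_buf, self.rew_buf, self.reset_buf, self.extras
```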
Step 3: Make algorithm runner
The script then uses the task_registry.make_alg_runner() function.
It sets up TensorBoard logging, then initializes and returns an OnPolicyRunner class instance from the rsl_rl library.
The rsl_rl library currently only supports the PPO algorithm.
PPO algorithm
Proximal Policy Optimization (PPO) is an actor-critic method, meaning it utilizes both a policy network (the actor) and a value network (the critic). The actor is responsible for selecting actions, while the critic estimates the value function (the expected return). PPO aims to improve the policy by taking small steps to ensure stable and reliable updates.
Step 4: Learn
By invoking the ppo_runner.learn(num_learning_iterations, init_at_random_ep_len)
function, the rsl_rl framework automatically performs the policy training iterations.
PPO training flow
Initialize networks and storage buffer.
Collect trajectories using the current policy.
Compute advantages and targets.
Update policy (actor) network using the clipped objective.
Update value (critic) network by minimizing value loss.
Repeat until convergence.
1: Initialize Networks and Storage Buffer
Upon creation of the OnPolicyRunner instance, it instantiates the policy class specified by self.cfg["policy_class_name"] and the learning algorithm class specified by self.cfg["algorithm_class_name"].
By default, the ActorCritic policy is used, which contains the actor MLP, which computes the action from the observation, and the critic MLP, which computes the value function from all observations, including the privileged ones.
The storage buffer RolloutStorage is also initialized at this stage; it records the collected Transitions.
Lastly, it also performs an environment reset to prepare for the first training step.
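A condensed sketch of this initialization, loosely following the OnPolicyRunner constructor in the rsl_rl version bundled with legged_gym. Argument lists are abbreviated and the exact signatures should be treated as approximate.

```python
# Inside OnPolicyRunner.__init__ (condensed and approximate).
from rsl_rl.modules import ActorCritic
from rsl_rl.algorithms import PPO

num_critic_obs = env.num_privileged_obs if env.num_privileged_obs is not None else env.num_obs

# Actor-critic networks: the actor sees the regular observations,
# the critic may additionally see the privileged ones.
actor_critic = ActorCritic(env.num_obs, num_critic_obs, env.num_actions, **policy_cfg)

# PPO owns the networks, the optimizer, and the update logic.
self.alg = PPO(actor_critic, device=self.device, **alg_cfg)

# Rollout storage sized for num_envs * num_steps_per_env transitions.
self.alg.init_storage(env.num_envs, self.num_steps_per_env,
                      [env.num_obs], [env.num_privileged_obs], [env.num_actions])

# Reset the environments to prepare for the first rollout.
env.reset()
```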
2: Collect data
For each training iteration, a policy rollout is first performed: the current policy is run for num_steps_per_env steps to collect trajectories using the act() function.
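A simplified sketch of this collection loop, following the structure of OnPolicyRunner.learn() in the bundled rsl_rl (device transfers and logging omitted; torch is assumed to be imported):

```python
# Rollout phase of one training iteration (simplified).
with torch.inference_mode():
    for _ in range(self.num_steps_per_env):
        # Query the current policy and step all environments in parallel.
        actions = self.alg.act(obs, critic_obs)
        obs, privileged_obs, rewards, dones, infos = self.env.step(actions)
        critic_obs = privileged_obs if privileged_obs is not None else obs
        # Record the transition and handle per-environment resets.
        self.alg.process_env_step(rewards, dones, infos)
    # Bootstrap from the value of the last state before computing advantages.
    self.alg.compute_returns(critic_obs)
```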
3: Compute advantages and targets
The Generalized Advantage Estimation (GAE) method is used to calculate the advantage.
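In its standard form (the general GAE recursion, written from the definition rather than quoted from the rsl_rl source), the per-step TD error and the advantage estimate are:

$$
\delta_t = r_t + \gamma V(s_{t+1})(1 - d_t) - V(s_t), \qquad
\hat{A}_t = \delta_t + \gamma \lambda (1 - d_t)\, \hat{A}_{t+1}
$$

where $\gamma$ is the discount factor, $\lambda$ is the GAE parameter, and $d_t$ indicates episode termination. The recursion is evaluated backwards over the collected rollout.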
Then, the target value is computed.
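Under the same convention, the value target used to regress the critic is simply the advantage plus the current value estimate:

$$
\hat{V}^{\text{target}}_t = \hat{A}_t + V(s_t)
$$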
4: Update actor network
The actor is updated in the update() function using the clipped surrogate objective.
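For reference, the standard clipped surrogate objective from the PPO paper (rsl_rl additionally applies an entropy bonus and, optionally, a KL-based adaptive learning rate) is:

$$
L^{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\Big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_t\Big)\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}
$$

where $\epsilon$ is the clip parameter and $\hat{A}_t$ is the advantage computed above.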
5: Update value (critic) network
The critic is updated in the same update() function by minimizing the value loss.
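In its plain (unclipped) form, the value loss is the squared error between the critic's prediction and the targets computed above; rsl_rl can also use a clipped variant of this loss:

$$
L^{V}(\phi) = \mathbb{E}_t\!\left[\big(V_\phi(s_t) - \hat{V}^{\text{target}}_t\big)^2\right]
$$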
Reward design
LeggedGym provides a number of reward terms by default, each implemented as a _reward_<term>() method and weighted by a coefficient set in the config.
lin_vel_z
Penalize the z axis base linear velocity. Prevents the robot from shaking up and down.
The reward coefficient of this term should be negative.
ang_vel_xy
Penalize the xy axes base angular velocity. Prevents the robot from vibrating and rotating sideways.
The reward coefficient of this term should be negative.
base_orientation
Penalize non-flat base orientation.
The reward coefficient of this term should be negative.
base_height
Penalize the base height tracking error.
The reward coefficient of this term should be negative.
torques
Penalize large torque and energy consumption.
The reward coefficient of this term should be negative.
dof_vel
Penalize joint velocities.
The reward coefficient of this term should be negative.
dof_acc
Penalize joint accelerations.
The reward coefficient of this term should be negative.
action_rate
Penalize changes in actions between consecutive steps. Prevents jittery motions.
The reward coefficient of this term should be negative.
collision
Penalize collisions on selected bodies.
The reward coefficient of this term should be negative.
termination
Penalize episode termination.
The reward coefficient of this term should be negative.
dof_pos_limits
Penalize joint positions that violate the joint limits.
The reward coefficient of this term should be negative.
dof_vel_limits
Penalize joint velocities that violate the velocity limits.
The reward coefficient of this term should be negative.
torque_limits
Penalize joint torques that violate the torque limits.
The reward coefficient of this term should be negative.
tracking_lin_vel
Rewards for tracking the command xy velocity goals.
The reward coefficient of this term should be positive.
tracking_ang_vel
Rewards for tracking the command yaw angular velocity goals.
The reward coefficient of this term should be positive.
feet_air_time
Reward for how long the robot's feet stay in the air during steps. Encourages longer steps and prevents dragging the feet on the ground. The reward for each leg is only granted upon contact with the ground after the air time.
The reward coefficient of this term should be positive.
stumble
Penalizes feet hitting vertical surfaces.
The reward coefficient of this term should be negative.
stand_still
Penalize motion when the command is zero.
The reward coefficient of this term should be negative.
feet_contact_forces
Penalizes high contact forces.
The reward coefficient of this term should be negative.
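To make the pattern concrete, here is a sketch of how one of these terms, lin_vel_z, can be implemented following the _reward_<term>() convention (the actual code in legged_gym may differ in details):

```python
# Method of the LeggedRobot class (simplified sketch); assumes torch is imported.
def _reward_lin_vel_z(self):
    # Penalize vertical (z-axis) base linear velocity.
    # The returned per-environment value is squared, so a negative
    # coefficient in the config turns it into a penalty.
    return torch.square(self.base_lin_vel[:, 2])
```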
Step 5: Inference
To perform inference, the ppo_runner.get_inference_policy(device) function returns a callable that runs the actor network's forward pass (without sampling) to produce actions from observations.
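A simplified sketch of an inference loop, modeled on legged_gym's play.py (checkpoint loading and logging omitted; args and the imports from the training sketch above are assumed):

```python
# Sketch of a policy inference loop (simplified from legged_gym/scripts/play.py).
env, env_cfg = task_registry.make_env(name=args.task, args=args)
ppo_runner, train_cfg = task_registry.make_alg_runner(env=env, name=args.task, args=args)
policy = ppo_runner.get_inference_policy(device=env.device)

obs = env.get_observations()
for _ in range(1000):
    actions = policy(obs.detach())                       # actor forward pass, no sampling
    obs, _, rews, dones, infos = env.step(actions.detach())
```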