Case Study: A Dive Into LeggedGym and RSL-RL Framework

Code Organization

The majority of the logic is implemented in the legged_gym/envs/base/legged_robot.py file. The code can be partitioned into the following five sections:

1. Environment creation

How to create the physics environment

2. Adding the agent asset

How to initialize the agent asset (URDF) in the environment, dealing with the initial position, joint characteristics, etc.

3. Reward design

How to formulate the reward to enable efficient learning

4. Exploration mechanism

How to control how aggressively the policy explores new regions of the policy space

5. Driving the simulation

How to drive the simulation loop, record the training process, and keep the best-performing policy

To understand the code, we can follow the execution flow and see, step by step, how the environment and agent are set up and how the simulation is driven.

The entry script

We start by looking at the scripts that the user invokes: legged_gym/scripts/train.py for training, or legged_gym/scripts/play.py for policy inference.

Both scripts interact with the task_registry module.
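
As a reference point, here is a condensed sketch of the training entry point (paraphrasing legged_gym/scripts/train.py; details may differ between versions):

# condensed sketch of legged_gym/scripts/train.py
from legged_gym.envs import *                    # importing this triggers the task_registry.register() calls
from legged_gym.utils import get_args, task_registry

def train(args):
    # create the vectorized simulation environment and its config
    env, env_cfg = task_registry.make_env(name=args.task, args=args)
    # create the PPO runner (OnPolicyRunner from rsl_rl) and its training config
    ppo_runner, train_cfg = task_registry.make_alg_runner(env=env, name=args.task, args=args)
    # run the training loop
    ppo_runner.learn(num_learning_iterations=train_cfg.runner.max_iterations, init_at_random_ep_len=True)

if __name__ == '__main__':
    train(get_args())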

Step 1: Parse command line arguments

The first step is to parse the command-line arguments with the get_args() helper function.

Table 1 gives an overview of the default supported options.

| name | type | default | help |
| --- | --- | --- | --- |
| task | str | go2 | Resume training or start testing from a checkpoint. Overrides config file if provided. |
| resume | bool | False | Resume training from a checkpoint |
| experiment_name | str | | Name of the experiment to run or load. Overrides config file if provided. |
| run_name | str | | Name of the run. Overrides config file if provided. |
| load_run | str | | Name of the run to load when resume=True. If -1: will load the last run. Overrides config file if provided. |
| checkpoint | int | | Saved model checkpoint number. If -1: will load the last checkpoint. Overrides config file if provided. |
| headless | bool | False | Force display off at all times |
| horovod | bool | False | Use horovod for multi-gpu training |
| rl_device | str | cuda:0 | Device used by the RL algorithm (cpu, gpu, cuda:0, cuda:1, etc.) |
| num_envs | int | | Number of environments to create. Overrides config file if provided. |
| seed | int | | Random seed. Overrides config file if provided. |
| max_iterations | int | | Maximum number of training iterations. Overrides config file if provided. |

These argument definitions are then passed to Isaac Gym's gymutil.parse_arguments() function, which combines them with the simulator's own standard options and parses the command line.

The input to the function is our command-line options as a list of dicts (custom_parameters):

[
  {'name': '--task', 'type': <class 'str'>, 'default': 'go2', 'help': 'Resume training or start testing from a checkpoint. Overrides config file if provided.'}, 
  {'name': '--resume', 'action': 'store_true', 'default': False, 'help': 'Resume training from a checkpoint'}, 
  {'name': '--experiment_name', 'type': <class 'str'>, 'help': 'Name of the experiment to run or load. Overrides config file if provided.'}, 
  ...
  {'name': '--max_iterations', 'type': <class 'int'>, 'help': 'Maximum number of training iterations. Overrides config file if provided.'}
  ]

and the function returns a config Namespace as output (args):

Namespace(
  checkpoint=None, 
  compute_device_id=0, 
  experiment_name=None, 
  flex=False, 
  graphics_device_id=0, 
  headless=False, 
  horovod=False, 
  load_run=None, 
  max_iterations=None, 
  num_envs=4, 
  num_threads=0, 
  physics_engine=SimType.SIM_PHYSX, 
  physx=False, 
  pipeline='gpu', 
  resume=False, 
  rl_device='cuda:0', 
  run_name=None, 
  seed=None, 
  sim_device='cuda:0', 
  sim_device_type='cuda', 
  slices=0, 
  subscenes=0, 
  task='g1_leg', 
  use_gpu=True, 
  use_gpu_pipeline=True
  )

Step 2: Make environment

Then, the task_registry.make_env() function is used to create the simulation environment.

This function looks up the corresponding environment class (a VecEnv-compatible task) and its configurations (LeggedRobotCfg for the environment and LeggedRobotCfgPPO for training) among the registered environments, and instantiates the environment.

It might be a bit hard to keep track of the different classes defined by legged_gym and rsl_rl, so here is an overview.

VecEnv is an abstract environment class defined by rsl_rl. It specifies the generic interface that the learning algorithm expects an environment to provide, with the following attributes and methods:

  • num_envs: int - Number of environments

  • num_obs: int - Number of observations

  • num_privileged_obs: int - Number of privileged observations

  • num_actions: int - Number of actions

  • max_episode_length: int - Maximum episode length

  • privileged_obs_buf: torch.Tensor - Buffer for privileged observations

  • obs_buf: torch.Tensor - Buffer for observations

  • rew_buf: torch.Tensor - Buffer for rewards

  • reset_buf: torch.Tensor - Buffer for resets

  • episode_length_buf: torch.Tensor - Buffer for current episode lengths

  • extras: dict - Extra information (metrics), containing metrics such as the episode reward, episode length, etc. Additional information can be stored in the dictionary such as observations for the critic network, etc

  • device: torch.device - Device to use.

  • get_observations(self) -> tuple[torch.Tensor, dict]

  • reset(self) -> tuple[torch.Tensor, dict]

  • step(self, actions: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor, dict]

The LeggedRobot environment, a subclass of BaseTask defined by legged_gym, satisfies the VecEnv requirements and provides implementations of these three methods.

There is no direct inheritance relationship between the two classes, but their interfaces are defined to match, so they can be used interchangeably.

A custom environment can be registered using the function

task_registry.register(name: str, task_class: VecEnv, env_cfg: LeggedRobotCfg, train_cfg: LeggedRobotCfgPPO)

The registration happens in the envs/__init__.py file.
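
For example, registering a custom task could look like this (the robot-specific config class names here are illustrative, not actual legged_gym classes):

from legged_gym.utils.task_registry import task_registry
from legged_gym.envs.base.legged_robot import LeggedRobot
# hypothetical config classes for a custom robot
from legged_gym.envs.my_robot.my_robot_config import MyRobotCfg, MyRobotCfgPPO

task_registry.register("my_robot", LeggedRobot, MyRobotCfg(), MyRobotCfgPPO())

After this, the task can be selected with --task my_robot on the command line.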

The base environment

By default, the environment class itself does not need to be changed; every robot uses the same LeggedRobot environment. Only the configuration classes change between different robot training setups.

The base environment LeggedRobot is defined in envs/base/legged_robot.py

It provides all the necessary functions to perform the following tasks:

  • defines how rewards and observations are computed, including the definition of each reward term in _reward_<term>().

  • defines the joint-level PD controller that maps actions to raw torques in _compute_torques() (see the sketch after this list).

  • initializes the tensor buffers that can be loaded on GPU.

  • configures and loads the URDF asset into the environment.

  • implements the step() function.
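
As a reference, here is a rough sketch of the position-control branch of _compute_torques() (simplified; the real function also supports velocity and direct-torque control types):

def _compute_torques(self, actions):
    # scale the policy actions and treat them as joint position offsets around the default pose
    actions_scaled = actions * self.cfg.control.action_scale
    # PD law: stiffness on the position error, damping on the joint velocity
    torques = self.p_gains * (actions_scaled + self.default_dof_pos - self.dof_pos) - self.d_gains * self.dof_vel
    # never exceed the actuator torque limits
    return torch.clip(torques, -self.torque_limits, self.torque_limits)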

Environment initialization logic

When a LeggedRobot instance is created, it initializes the environment with the following procedure:

  1. Parses the configuration class, converting its nested subclasses into dictionaries and setting up the simulation time step.

  2. Initializes the parent BaseTask class, which performs several tasks:

    1. acquires the gym handle

    2. sets up the simulation device

    3. initializes base parameters, including num_envs, num_obs, num_privileged_obs, and num_actions, from the config

    4. creates the tensor buffers required by VecEnv, such as obs_buf, rew_buf, etc.

    5. creates the gym simulation environment

    6. creates the viewer if not in headless mode

  3. Sets up the camera.

  4. Creates tensor buffers that interface with the gym simulation tensor data on the GPU, such as root_states and dof_pos, as well as tensor buffers for reward calculations, such as torques and feet_air_time (see the sketch after this list).

    Additionally, this step initializes the default joint positions and PD parameters to the values specified in the config class.

  5. Initializes the reward functions as a list.
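
The buffer creation in step 4 roughly works as follows (a simplified excerpt in the spirit of _init_buffers(); the acquire/refresh/wrap calls are Isaac Gym's standard tensor API):

# inside LeggedRobot._init_buffers()
# acquire raw simulation state tensors that live on the GPU
actor_root_state = self.gym.acquire_actor_root_state_tensor(self.sim)
dof_state_tensor = self.gym.acquire_dof_state_tensor(self.sim)
self.gym.refresh_actor_root_state_tensor(self.sim)
self.gym.refresh_dof_state_tensor(self.sim)

# wrap them as PyTorch tensors; these views share memory with the simulator
self.root_states = gymtorch.wrap_tensor(actor_root_state)
self.dof_state = gymtorch.wrap_tensor(dof_state_tensor)
self.dof_pos = self.dof_state.view(self.num_envs, self.num_dof, 2)[..., 0]
self.dof_vel = self.dof_state.view(self.num_envs, self.num_dof, 2)[..., 1]

# additional buffers used for reward computation
self.torques = torch.zeros(self.num_envs, self.num_actions, device=self.device)
self.feet_air_time = torch.zeros(self.num_envs, len(self.feet_indices), device=self.device)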

Environment stepping logic

The step() function is abstract in the BaseTask class and is implemented in the LeggedRobot class. It performs the following operations for each environment step (a condensed sketch follows the list):

  1. Clips the input actions according to the limits specified in the config, and sends them to the GPU

  2. Renders the current scene in self.render(). The render function also checks keyboard input events.

  3. For each control step (one rendered frame, at the control frequency), performs decimation physics updates:

    1. computes torques from the input actions and sets them in the sim

    2. steps the physics for one iteration

    3. updates the dof_state tensor

  4. Then, does the following in self.post_physics_step():

    1. refreshes the actor root state and net contact force tensors

    2. calculates the root pose and velocity

    3. performs self._post_physics_step_callback(), which:

      1. resamples commands for the next step

      2. calculates the surrounding terrain height, if applicable

      3. pushes the robot, if applicable

    4. checks the termination conditions (collision or timeout)

    5. computes rewards from the reward function list

    6. resets the environment instances that have terminated

    7. computes observations for the next step

    8. updates the action, dof_vel, and root_vel history buffers

    9. finally, draws the debug visualization, if applicable
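
A condensed sketch of this flow (paraphrasing LeggedRobot.step(); buffer and config names follow the original code):

def step(self, actions):
    # clip the actions and move them to the simulation device
    clip_actions = self.cfg.normalization.clip_actions
    self.actions = torch.clip(actions, -clip_actions, clip_actions).to(self.device)

    self.render()
    # one control step = `decimation` physics steps
    for _ in range(self.cfg.control.decimation):
        self.torques = self._compute_torques(self.actions).view(self.torques.shape)
        self.gym.set_dof_actuation_force_tensor(self.sim, gymtorch.unwrap_tensor(self.torques))
        self.gym.simulate(self.sim)
        self.gym.refresh_dof_state_tensor(self.sim)

    # termination checks, reward computation, observation update, resets
    self.post_physics_step()

    clip_obs = self.cfg.normalization.clip_observations
    self.obs_buf = torch.clip(self.obs_buf, -clip_obs, clip_obs)
    return self.obs_buf, self.privileged_obs_buf, self.rew_buf, self.reset_buf, self.extras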

Step 3: Make algorithm runner

This step uses the task_registry.make_alg_runner() function, which sets up TensorBoard logging, then initializes and returns an OnPolicyRunner instance from the rsl_rl library.

The rsl_rl library currently only supports the PPO algorithm.

PPO algorithm

Proximal Policy Optimization (PPO) is an actor-critic method, meaning it utilizes both a policy network (the actor) and a value network (the critic). The actor is responsible for selecting actions, while the critic estimates the value function (expected reward). PPO aims to improve the policy by taking small steps to ensure stable and reliable updates.
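
For reference, the clipped surrogate objective that PPO maximizes (the standard formulation, not something specific to rsl_rl) is:

$$L^{CLIP}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta) A_t,\ \mathrm{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\, A_t\right)\right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}$$

where A_t is the advantage estimate and epsilon is the clip range (clip_param in rsl_rl).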

Step 4: Learn

By invoking the ppo_runner.learn(num_learning_iterations, init_at_random_ep_len) function, the rsl_rl framework automatically performs the policy training iterations.

PPO training flow

  1. Initialize networks and storage buffer.

  2. Collect trajectories using the current policy.

  3. Compute advantages and targets.

  4. Update policy (actor) network using the clipped objective.

  5. Update value (critic) network by minimizing value loss.

  6. Repeat until convergence.

1: Initialize Networks and Storage Buffer

Upon creation of the OnPolicyRunner instance, it initializes the policy class specified by self.cfg["policy_class_name"] and the algorithm class specified by self.cfg["algorithm_class_name"].

By default, the ActorCritic policy is used, which contains the actor MLP, for computing actions from observations, and the critic MLP, for computing the value function from all observations, including the privileged ones.

The RolloutStorage buffer, which records Transitions, is also initialized at this stage:

class Transition:
    def __init__(self):
        self.observations = None
        self.critic_observations = None
        self.actions = None
        self.rewards = None
        self.dones = None
        self.values = None
        self.actions_log_prob = None
        self.action_mean = None
        self.action_sigma = None
        self.hidden_states = None

Lastly, it also performs an environment reset to prepare for the first training step.

2: Collect data

For each training loop, a policy rollout is first performed.

The current policy is run for num_steps_per_env steps to collect trajectories using the act() function.

Note that the algorithm's act() function invokes both actor_critic.act(obs), to get the desired actions, and actor_critic.evaluate(critic_obs), to get the value V(s) of the current state. The value is also stored in the transition buffer, which saves a second critic call when calculating the advantage.
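
A rough sketch of the rollout loop inside OnPolicyRunner.learn() (simplified; variable names follow rsl_rl, but details differ between versions):

with torch.inference_mode():
    for i in range(self.num_steps_per_env):
        # query the policy; this also evaluates the critic and caches the transition
        actions = self.alg.act(obs, critic_obs)
        # step the vectorized environment
        obs, privileged_obs, rewards, dones, infos = self.env.step(actions)
        critic_obs = privileged_obs if privileged_obs is not None else obs
        # store rewards/dones in the transition and handle finished episodes
        self.alg.process_env_step(rewards, dones, infos)

    # bootstrap from the value of the last state and compute advantages
    self.alg.compute_returns(critic_obs)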

3: Compute advantages and targets

The advantage A(s, a) is calculated in the compute_returns() function provided by the rollout buffer.

The Generalized Advantage Estimation (GAE) method is used to calculate the advantage:

$$\delta = r_t + \gamma V(s_{t+1}) - V(s_t)$$
$$A(s, a) = A_t = \delta + \gamma \lambda A_{t+1}$$

Then, the target value is computed:

$$R_t = A_t + V(s_t)$$

def compute_returns(self, last_values, gamma, lam):
    advantage = 0
    for step in reversed(range(self.num_transitions_per_env)):
        if step == self.num_transitions_per_env - 1:
            next_values = last_values
        else:
            next_values = self.values[step + 1]
        next_is_not_terminal = 1.0 - self.dones[step].float()
        delta = self.rewards[step] + next_is_not_terminal * gamma * next_values - self.values[step]
        advantage = delta + next_is_not_terminal * gamma * lam * advantage
        self.returns[step] = advantage + self.values[step]

    # Compute and normalize the advantages
    self.advantages = self.returns - self.values
    self.advantages = (self.advantages - self.advantages.mean()) / (self.advantages.std() + 1e-8)

4: Update actor network

The actor is updated in the update() function by maximizing the clipped surrogate objective over several epochs of mini-batches drawn from the rollout storage.

5: Update value (critic) network

The critic is updated in the same update() call by minimizing the mean-squared error between the predicted values and the computed returns. A sketch of a single mini-batch update follows.
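
Both updates happen together; a rough sketch of one mini-batch update inside PPO.update() (simplified from rsl_rl, with the usual PPO hyperparameter names) looks like this:

# probability ratio between the new and the old policy
ratio = torch.exp(actions_log_prob_batch - old_actions_log_prob_batch)

# clipped surrogate loss (negated, since the optimizer minimizes)
surrogate = -advantages_batch * ratio
surrogate_clipped = -advantages_batch * torch.clamp(ratio, 1.0 - clip_param, 1.0 + clip_param)
surrogate_loss = torch.max(surrogate, surrogate_clipped).mean()

# value function loss against the computed returns
value_loss = (returns_batch - value_batch).pow(2).mean()

# combined loss with an entropy bonus to encourage exploration
loss = surrogate_loss + value_loss_coef * value_loss - entropy_coef * entropy_batch.mean()

optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(actor_critic.parameters(), max_grad_norm)
optimizer.step()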

Reward design

LeggedGym provides a set of reward terms by default. Each term is implemented as a _reward_<term>() method on LeggedRobot and weighted by the corresponding coefficient in the config's reward scales, as shown in the example below.
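
For example, the lin_vel_z term is implemented roughly as follows in legged_robot.py, and its weight is set in the robot's config (the scale value shown is illustrative):

# in LeggedRobot (legged_robot.py)
def _reward_lin_vel_z(self):
    # penalize vertical base velocity
    return torch.square(self.base_lin_vel[:, 2])

# in the robot-specific config class (illustrative value)
class rewards(LeggedRobotCfg.rewards):
    class scales(LeggedRobotCfg.rewards.scales):
        lin_vel_z = -2.0   # negative coefficient -> penalty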

lin_vel_z

$$reward = v_z^2$$

Penalize the z axis base linear velocity. Prevents the robot from shaking up and down.

The reward coefficient of this term should be negative.

ang_vel_xy

$$reward = \omega_x^2 + \omega_y^2$$

Penalize the xy-axis base angular velocities. Prevents the robot from vibrating and rotating sideways.

The reward coefficient of this term should be negative.

base_orientation

$$reward = g_x^2 + g_y^2$$

Penalize non-flat base orientation.

The reward coefficient of this term should be negative.

base_height

$$reward = (pos_z - pos_{z,target})^2$$

Penalize the base height tracking error.

The reward coefficient of this term should be negative.

torques

$$reward = \sum \tau^2$$

Penalize large torque and energy consumption.

The reward coefficient of this term should be negative.

dof_vel

$$reward = \sum \omega_i^2$$

Penalize joint velocities.

The reward coefficient of this term should be negative.

dof_acc

$$reward = \sum \left(\frac{\omega_{i,prev} - \omega_i}{dt}\right)^2$$

Penalize joint accelerations.

The reward coefficient of this term should be negative.

action_rate

$$reward = \sum \left(\frac{acs_{i,prev} - acs_i}{dt}\right)^2$$

Penalize change in actions. Prevents glitches.

The reward coefficient of this term should be negative.

collision

$$reward = \sum \|f_i\| \quad \text{where} \quad f_i > 0.1$$

Penalize collisions on selected bodies.

The reward coefficient of this term should be negative.

termination

$$reward = 1 \quad \text{if} \quad \text{termination}$$

Penalize for termination

The reward coefficient of this term should be negative.

dof_pos_limits

$$reward = \sum |q_{i,\, out\text{-}of\text{-}range}|$$

Penalize joint positions that violate the joint limits.

The reward coefficient of this term should be negative.

dof_vel_limits

$$reward = \sum \mathrm{clip}(|\omega_i| - \omega_{lim},\, 0,\, 1)$$

Penalize joint velocities that violate the velocity limit.

The reward coefficient of this term should be negative.

torque_limits

$$reward = \sum \max(|\tau_i| - \tau_{lim},\, 0)$$

Penalize joint torques that violate the torque limit.

The reward coefficient of this term should be negative.

tracking_lin_vel

$$reward = e^{-\frac{(v_x - v_{x,goal})^2 + (v_y - v_{y,goal})^2}{\sigma}}$$

Rewards for tracking the command xy velocity goals.

The reward coefficient of this term should be positive.

tracking_ang_vel

$$reward = e^{-\frac{(\omega_z - \omega_{z,goal})^2}{\sigma}}$$

Rewards for tracking the command yaw angular velocity goals.

The reward coefficient of this term should be positive.

feet_air_time

$$reward = \|cmd_x + cmd_y\| \int t\, dt$$

Reward for how long the robot's feet stay in the air during steps. Encourages longer steps and prevents dragging the feet on the ground. The reward for each leg is only granted upon contact with the ground after the airtime.

The reward coefficient of this term should be positive.

stumble

$$reward = 1 \quad \text{if} \quad \text{contact with vertical surface}$$

Penalizes feet hitting vertical surfaces.

The reward coefficient of this term should be negative.

stand_still

$$reward = \left(\sum |q_i - q_{i,init}|\right) \quad \text{if} \quad \|cmd_x + cmd_y\| < 0.1$$

Penalize motion at zero commands

The reward coefficient of this term should be negative.

feet_contact_forces

$$reward = \max\left(\sum \|f_{contact}\| - f_{max},\, 0\right)$$

Penalizes high contact forces

The reward coefficient of this term should be negative.

Step 5: Inference

To perform inference, the ppo_runner.get_inference_policy(device) function returns the inference forward method of the actor network as a callable policy, as sketched below.
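
A condensed sketch of how legged_gym/scripts/play.py uses this (simplified; details differ between versions):

# create the environment and runner as in training, with resume enabled to load a checkpoint
env, env_cfg = task_registry.make_env(name=args.task, args=args)
ppo_runner, train_cfg = task_registry.make_alg_runner(env=env, name=args.task, args=args)
policy = ppo_runner.get_inference_policy(device=env.device)

obs = env.get_observations()
for _ in range(1000):
    actions = policy(obs.detach())              # actor forward pass only
    obs, _, rews, dones, infos = env.step(actions.detach())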
