diff --git a/docs/docs/docs/features/rl_training.md b/docs/docs/docs/features/rl_training.md
new file mode 100644
index 00000000..a254e901
--- /dev/null
+++ b/docs/docs/docs/features/rl_training.md
@@ -0,0 +1,212 @@
+# RL Training
+
+The RL training module enables training manipulation policies with reinforcement learning using [rsl_rl](https://github.com/leggedrobotics/rsl_rl) (PPO). It runs fully in simulation with parallel environments and no human teleoperation required.
+
+:::note
+End-to-end RL for manipulation is challenging — reward design, exploration, and sim-to-real transfer all require significant task-specific tuning. Currently only the **LiftCube** task is supported. Support for additional tasks will be added in future updates.
+:::
+
+## Training
+
+```shell
+python scripts/datagen/rl/train.py \
+ --task LeIsaac-SO101-LiftCube-RL-v0 \
+ --num_envs 512 \
+ --max_iterations 1500 \
+ --headless
+```
+
+
+Parameter descriptions for train.py
+
+- `--task`: Gym task ID to train. Required.
+
+- `--num_envs`: Number of parallel simulation environments. More environments = faster data collection. Default: value from task config.
+
+- `--max_iterations`: Number of PPO update iterations. Default: value from agent config.
+
+- `--seed`: Random seed for reproducibility. Default: value from agent config.
+
+- `--headless`: Run without rendering window for faster training.
+
+- `--device`: Computation device, such as `cpu` or `cuda`.
+
+
+
+::::tip
+Training logs (tensorboard) are written to `logs/rsl_rl///`. Monitor progress with:
+
+```shell
+tensorboard --logdir logs/rsl_rl
+```
+
+Key metrics to watch: `Train/mean_reward` (total episode reward) and individual reward terms such as `Episode/rew_cube_height`.
+::::
+
+## Evaluation & Recording
+
+Evaluate a checkpoint visually (no recording):
+
+```shell
+python scripts/datagen/rl/play.py \
+ --task LeIsaac-SO101-LiftCube-RL-v0 \
+ --checkpoint logs/rsl_rl/lift_cube_rl//model_.pt \
+ --num_envs 1
+```
+
+Save all episodes to HDF5 (both success and failure) by adding `--record`:
+
+```shell
+python scripts/datagen/rl/play.py \
+ --task LeIsaac-SO101-LiftCube-RL-v0 \
+ --checkpoint logs/rsl_rl/lift_cube_rl//model_.pt \
+ --num_envs 1 \
+ --num_episodes 100 \
+ --record --dataset_file ./datasets/rl_eval.hdf5
+```
+
+
+Parameter descriptions for play.py
+
+- `--task`: Gym task ID. Required.
+
+- `--checkpoint`: Path to a saved model checkpoint (`.pt`). Required.
+
+- `--num_envs`: Number of parallel environments. Default: value from task config.
+
+- `--num_episodes`: Total episodes to run across all envs. `0` = run indefinitely. Default: `0`.
+
+- `--seed`: Random seed. Default: value from agent config.
+
+- `--record`: Enable HDF5 recording. Both successful and failed episodes are saved.
+
+- `--resume_recording`: Append to an existing dataset file instead of creating a new one.
+
+- `--dataset_file`: Output HDF5 file path. Default: `./datasets/rl_eval.hdf5`.
+
+- `--real-time`: Slow down simulation to real-time speed.
+
+
+
+## Reward Design
+
+The LiftCube RL task uses three reward terms:
+
+| Term | Weight | Description |
+|------|--------|-------------|
+| `cube_success` | 200.0 | One-time bonus when cube height ≥ 20 cm above robot base. Episode ends immediately after (early termination). |
+| `ee_to_cube` | 1.5 | `1 - tanh(5 × dist(TCP, cube))` — guides TCP to cube center. Range [0, 1]. |
+| `cube_height` | 10.0 | `tanh(3 × max(h - 4.6 cm, 0))` — zero below 4.6 cm, monotonically increasing above. Range [0, 1]. |
+
+**TCP (Tool Center Point)** is computed as the midpoint between the two fingertip contact surfaces, derived from `body_pos_w` and `body_quat_w` with calibrated local offsets from the USD collision mesh:
+
+- Jaw tip offset (jaw body local frame): `(0.0, -0.05, 0.02)`
+- Gripper tip offset (gripper body local frame): `(-0.012, 0.0, -0.08)`
+
+**Termination**: episode ends on timeout (15 s) or when cube height ≥ 20 cm (success).
+
+## Action Space
+
+RL training uses the `rl_so101leader` device mode — delta end-effector control with a binary gripper:
+
+| Component | Dims | Description |
+|-----------|------|-------------|
+| `arm_action` | 6 | Delta EE pose (dx, dy, dz, droll, dpitch, dyaw), scale=(0.02, 0.02, 0.02, 0.5, 0.5, 0.5) → ±2 cm / ±0.5 rad per step |
+| `gripper_action` | 1 | Binary: action > 0 → open (1.0 rad), action < 0 → close (0.2 rad) |
+| **Total** | **7** | |
+
+## Observation Space
+
+26D flat vector (concatenated):
+
+| Term | Dims |
+|------|------|
+| `joint_pos` | 6 |
+| `joint_vel` | 6 |
+| `ee_frame_state` (pos + quat, robot frame) | 7 |
+| `cube_pos_relative_to_ee` | 3 |
+| `cube_quat` (orientation in world frame) | 4 |
+| **Total** | **26** |
+
+## Adding a New RL Task
+
+1. Create `/mdp/rewards.py` with reward functions.
+2. Create `/_rl_env_cfg.py` with the RL env config class:
+
+```python
+@configclass
+class MyTaskRLEnvCfg(MyTaskEnvCfg):
+ observations: MyTaskRLObsCfg = MyTaskRLObsCfg()
+ rewards: MyTaskRLRewardsCfg = MyTaskRLRewardsCfg()
+ terminations: MyTaskRLTerminationsCfg = MyTaskRLTerminationsCfg()
+
+ def __post_init__(self):
+ super().__post_init__()
+ self.use_teleop_device("rl_so101leader") # or "bi_rl_so101leader" for bi-arm
+ self.scene.front = None # disable camera for faster training
+ self.episode_length_s = 15.0
+```
+
+3. Create `/rl_agents/rsl_rl_ppo_cfg.py` with the PPO runner config:
+
+```python
+from isaaclab.utils import configclass
+from isaaclab_rl.rsl_rl import RslRlOnPolicyRunnerCfg, RslRlPpoActorCriticCfg, RslRlPpoAlgorithmCfg
+
+@configclass
+class MyTaskRLPPORunnerCfg(RslRlOnPolicyRunnerCfg):
+ num_steps_per_env = 100
+ max_iterations = 1500
+ save_interval = 50
+ experiment_name = "my_task_rl"
+ obs_groups = {"actor": ["policy"], "critic": ["policy"]}
+
+ policy = RslRlPpoActorCriticCfg(
+ init_noise_std=0.3,
+ actor_obs_normalization=True,
+ critic_obs_normalization=True,
+ actor_hidden_dims=[256, 128, 64],
+ critic_hidden_dims=[256, 128, 64],
+ activation="elu",
+ )
+
+ algorithm = RslRlPpoAlgorithmCfg(
+ value_loss_coef=1.0,
+ use_clipped_value_loss=True,
+ clip_param=0.2,
+ entropy_coef=0.005,
+ num_learning_epochs=5,
+ num_mini_batches=4,
+ learning_rate=1.0e-3,
+ schedule="adaptive",
+ gamma=0.99,
+ lam=0.95,
+ desired_kl=0.01,
+ max_grad_norm=1.0,
+ )
+```
+
+4. Register the gym environment in `/__init__.py`:
+
+```python
+from . import rl_agents
+
+gym.register(
+ id="LeIsaac-SO101-MyTask-RL-v0",
+ entry_point="isaaclab.envs:ManagerBasedRLEnv",
+ disable_env_checker=True,
+ kwargs={
+ "env_cfg_entry_point": f"{__name__}._rl_env_cfg:MyTaskRLEnvCfg",
+ "rsl_rl_cfg_entry_point": f"{rl_agents.__name__}.rsl_rl_ppo_cfg:MyTaskRLPPORunnerCfg",
+ },
+)
+```
+
+5. Train:
+
+```bash
+python scripts/datagen/rl/train.py \
+ --task LeIsaac-SO101-MyTask-RL-v0 \
+ --num_envs 512 \
+ --headless
+```
diff --git a/docs/docs/docs/features/state_machine.md b/docs/docs/docs/features/state_machine.md
index 081b9def..994e9d14 100644
--- a/docs/docs/docs/features/state_machine.md
+++ b/docs/docs/docs/features/state_machine.md
@@ -60,7 +60,7 @@ After recording, you can replay the collected demonstrations in simulation:
python scripts/datagen/state_machine/replay.py \
--task LeIsaac-SO101-PickOrange-v0 \
--dataset_file ./datasets/pick_orange.hdf5 \
- --task_type so101_state_machine \
+ --task_type ik_so101leader \
--select_episodes 0 \
--device cuda \
--enable_cameras \
@@ -82,7 +82,7 @@ python scripts/datagen/state_machine/replay.py \
- `--replay_mode`: Replay mode — `action` replays IK pose targets, `state` replays joint positions.
-- `--task_type`: State machine device type used during recording, e.g., `so101_state_machine` or `bi_so101_state_machine`. Inferred from task name if not set.
+- `--task_type`: State machine device type used during recording, e.g., `ik_so101leader` or `bi_ik_so101leader`. Inferred from task name if not set.
- `--select_episodes`: List of episode indices to replay. Leave empty to replay all episodes.
@@ -97,7 +97,7 @@ python scripts/datagen/state_machine/replay.py \
```python
TASK_REGISTRY = {
- "LeIsaac-SO101-PickOrange-v0": (PickOrangeStateMachine, "so101_state_machine"),
- "LeIsaac-MY-NewTask-v0": (MyNewStateMachine, "so101_state_machine"),
+ "LeIsaac-SO101-PickOrange-v0": (PickOrangeStateMachine, "ik_so101leader"),
+ "LeIsaac-MY-NewTask-v0": (MyNewStateMachine, "ik_so101leader"),
}
```
diff --git a/docs/sidebars.js b/docs/sidebars.js
index 7f6402b3..2b037604 100644
--- a/docs/sidebars.js
+++ b/docs/sidebars.js
@@ -123,6 +123,12 @@ const sidebars = {
link: { type: 'doc', id: 'docs/features/state_machine' },
items: [],
},
+ {
+ type: 'category',
+ label: 'RL Training',
+ link: { type: 'doc', id: 'docs/features/rl_training' },
+ items: [],
+ },
],
},
'docs/trouble_shooting',
diff --git a/scripts/datagen/rl/play.py b/scripts/datagen/rl/play.py
new file mode 100644
index 00000000..c17637a6
--- /dev/null
+++ b/scripts/datagen/rl/play.py
@@ -0,0 +1,179 @@
+"""Play or record a trained leisaac RL policy.
+
+Use --record to save episodes to HDF5. Without --record, runs in visualization mode only.
+"""
+
+import os
+import sys
+
+# Make cli_args importable (local module in IsaacLab's rsl_rl script directory)
+_ISAACLAB_RSL_RL_DIR = os.path.join(
+ os.path.dirname(__file__),
+ "..",
+ "..",
+ "..",
+ "dependencies",
+ "IsaacLab",
+ "scripts",
+ "reinforcement_learning",
+ "rsl_rl",
+)
+sys.path.insert(0, os.path.abspath(_ISAACLAB_RSL_RL_DIR))
+
+import argparse
+import signal
+
+from isaaclab.app import AppLauncher
+
+import cli_args # isort: skip
+
+parser = argparse.ArgumentParser(description="Play or record a trained leisaac RL policy.")
+parser.add_argument("--task", type=str, default=None, help="Name of the task.")
+parser.add_argument(
+ "--agent", type=str, default="rsl_rl_cfg_entry_point", help="Name of the RL agent configuration entry point."
+)
+parser.add_argument("--num_envs", type=int, default=None, help="Number of environments to simulate.")
+parser.add_argument("--seed", type=int, default=None, help="Seed used for the environment.")
+parser.add_argument("--num_episodes", type=int, default=0, help="Total episodes to run (0 = infinite).")
+parser.add_argument("--real-time", action="store_true", default=False, help="Run in real-time, if possible.")
+# HDF5 recording
+parser.add_argument("--record", action="store_true", help="Save episodes to HDF5 with success/failure tags.")
+parser.add_argument(
+ "--resume_recording", action="store_true", help="Append to an existing dataset file instead of creating a new one."
+)
+parser.add_argument("--dataset_file", type=str, default="./datasets/rl_eval.hdf5", help="Output HDF5 file path.")
+cli_args.add_rsl_rl_args(parser)
+AppLauncher.add_app_launcher_args(parser)
+args_cli, hydra_args = parser.parse_known_args()
+
+sys.argv = [sys.argv[0]] + hydra_args
+
+app_launcher = AppLauncher(args_cli)
+simulation_app = app_launcher.app
+
+import time
+
+import gymnasium as gym
+import isaaclab_tasks # noqa: F401
+import leisaac.tasks # noqa: F401
+import torch
+from isaaclab.envs import ManagerBasedRLEnvCfg
+from isaaclab.managers import DatasetExportMode
+from isaaclab.utils.assets import retrieve_file_path
+from isaaclab_rl.rsl_rl import RslRlBaseRunnerCfg, RslRlVecEnvWrapper
+from isaaclab_tasks.utils import get_checkpoint_path
+from isaaclab_tasks.utils.hydra import hydra_task_config
+from leisaac.enhance.managers import EnhanceDatasetExportMode, StreamingRecorderManager
+from rsl_rl.runners import OnPolicyRunner
+
+
+@hydra_task_config(args_cli.task, args_cli.agent)
+def main(env_cfg: ManagerBasedRLEnvCfg, agent_cfg: RslRlBaseRunnerCfg):
+ """Play or record with RSL-RL agent."""
+ agent_cfg: RslRlBaseRunnerCfg = cli_args.update_rsl_rl_cfg(agent_cfg, args_cli)
+ env_cfg.scene.num_envs = args_cli.num_envs if args_cli.num_envs is not None else env_cfg.scene.num_envs
+ env_cfg.seed = agent_cfg.seed
+ env_cfg.sim.device = args_cli.device if args_cli.device is not None else env_cfg.sim.device
+
+ # resolve checkpoint path
+ log_root_path = os.path.abspath(os.path.join("logs", "rsl_rl", agent_cfg.experiment_name))
+ print(f"[INFO] Loading experiment from directory: {log_root_path}")
+ if args_cli.checkpoint:
+ resume_path = retrieve_file_path(args_cli.checkpoint)
+ else:
+ resume_path = get_checkpoint_path(log_root_path, agent_cfg.load_run, agent_cfg.load_checkpoint)
+
+ log_dir = os.path.dirname(resume_path)
+ env_cfg.log_dir = log_dir
+
+ # configure HDF5 recording
+ if args_cli.record:
+ output_dir = os.path.dirname(os.path.abspath(args_cli.dataset_file))
+ output_filename = os.path.splitext(os.path.basename(args_cli.dataset_file))[0]
+ if args_cli.resume_recording:
+ env_cfg.recorders.dataset_export_mode = EnhanceDatasetExportMode.EXPORT_ALL_RESUME
+ assert os.path.exists(
+ args_cli.dataset_file
+ ), "Dataset file does not exist. Remove --resume_recording to create a new one."
+ else:
+ env_cfg.recorders.dataset_export_mode = DatasetExportMode.EXPORT_ALL
+ assert not os.path.exists(
+ args_cli.dataset_file
+ ), "Dataset file already exists. Use --resume_recording to append, or choose a different path."
+ env_cfg.recorders.dataset_export_dir_path = output_dir
+ env_cfg.recorders.dataset_filename = output_filename
+ else:
+ env_cfg.recorders = None
+
+ env = gym.make(args_cli.task, cfg=env_cfg)
+
+ # replace recorder manager with streaming version for HDF5 recording
+ if args_cli.record:
+ unwrapped = env.unwrapped
+ del unwrapped.recorder_manager
+ unwrapped.recorder_manager = StreamingRecorderManager(env_cfg.recorders, unwrapped)
+ unwrapped.recorder_manager.flush_steps = 100
+ unwrapped.recorder_manager.compression = "lzf"
+
+ env = RslRlVecEnvWrapper(env, clip_actions=agent_cfg.clip_actions)
+
+ print(f"[INFO]: Loading model checkpoint from: {resume_path}")
+ runner = OnPolicyRunner(env, agent_cfg.to_dict(), log_dir=None, device=agent_cfg.device)
+ runner.load(resume_path)
+
+ policy = runner.get_inference_policy(device=env.unwrapped.device)
+
+ dt = env.unwrapped.step_dt
+ episode_count = 0
+ success_count = 0
+ interrupted = False
+
+ def signal_handler(signum, frame):
+ nonlocal interrupted
+ interrupted = True
+ print("\n[INFO] Interrupted. Cleaning up...")
+
+ original_sigint_handler = signal.signal(signal.SIGINT, signal_handler)
+
+ try:
+ obs = env.get_observations()
+ while (
+ simulation_app.is_running()
+ and not interrupted
+ and (args_cli.num_episodes <= 0 or episode_count < args_cli.num_episodes)
+ ):
+ start_time = time.time()
+ with torch.inference_mode():
+ actions = policy(obs)
+ obs, _, dones, extras = env.step(actions)
+
+ time_outs = extras.get("time_outs", torch.zeros_like(dones))
+ finished = dones.bool()
+ if finished.any():
+ successes = finished & ~time_outs.bool()
+ for i in range(finished.shape[0]):
+ if finished[i]:
+ print("Episode success!" if successes[i] else "Episode failed!")
+ episode_count += int(finished.sum().item())
+ success_count += int(successes.sum().item())
+ print(f"Success rate: {success_count / episode_count:.1%} ({success_count}/{episode_count})")
+
+ sleep_time = dt - (time.time() - start_time)
+ if args_cli.real_time and sleep_time > 0:
+ time.sleep(sleep_time)
+
+ except Exception as e:
+ import traceback
+
+ print(f"\n[ERROR] {e}\n")
+ traceback.print_exc()
+ finally:
+ signal.signal(signal.SIGINT, original_sigint_handler)
+ if args_cli.record and hasattr(env.unwrapped.recorder_manager, "finalize"):
+ env.unwrapped.recorder_manager.finalize()
+ env.close()
+
+
+if __name__ == "__main__":
+ main()
+ simulation_app.close()
diff --git a/scripts/datagen/rl/replay.py b/scripts/datagen/rl/replay.py
new file mode 100644
index 00000000..e5a5d115
--- /dev/null
+++ b/scripts/datagen/rl/replay.py
@@ -0,0 +1,143 @@
+"""Script to replay recorded RL episodes from HDF5 dataset."""
+
+import multiprocessing
+
+if multiprocessing.get_start_method() != "spawn":
+ multiprocessing.set_start_method("spawn", force=True)
+
+import argparse
+
+from isaaclab.app import AppLauncher
+
+parser = argparse.ArgumentParser(description="Replay recorded RL episodes.")
+parser.add_argument("--task", type=str, default=None, help="Name of the task.")
+parser.add_argument("--num_envs", type=int, default=1, help="Number of environments to simulate.")
+parser.add_argument(
+ "--dataset_file", type=str, default="./datasets/rl_eval.hdf5", help="File path to load recorded episodes."
+)
+parser.add_argument(
+ "--replay_mode",
+ type=str,
+ default="action",
+ choices=["action", "state"],
+ help="Replay mode: action replays actions, state replays joint states.",
+)
+parser.add_argument(
+ "--select_episodes",
+ type=int,
+ nargs="+",
+ default=[],
+ help="List of episode indices to replay. Empty = replay all.",
+)
+AppLauncher.add_app_launcher_args(parser)
+args_cli = parser.parse_args()
+
+app_launcher = AppLauncher(vars(args_cli))
+simulation_app = app_launcher.app
+
+import contextlib
+import os
+
+import gymnasium as gym
+import leisaac.tasks # noqa: F401
+import torch
+from isaaclab.envs import ManagerBasedRLEnv
+from isaaclab.utils.datasets import EpisodeData, HDF5DatasetFileHandler
+from isaaclab_tasks.utils import parse_env_cfg
+
+
+def get_next_action(episode_data: EpisodeData, return_state: bool = False) -> torch.Tensor | None:
+ if return_state:
+ next_state = episode_data.get_next_state()
+ if next_state is None:
+ return None
+ return next_state["articulation"]["robot"]["joint_position"]
+ else:
+ return episode_data.get_next_action()
+
+
+def main():
+ if not os.path.exists(args_cli.dataset_file):
+ raise FileNotFoundError(f"Dataset file not found: {args_cli.dataset_file}")
+
+ dataset_file_handler = HDF5DatasetFileHandler()
+ dataset_file_handler.open(args_cli.dataset_file)
+ episode_count = dataset_file_handler.get_num_episodes()
+
+ if episode_count == 0:
+ print("No episodes found in the dataset.")
+ return
+
+ episode_indices_to_replay = args_cli.select_episodes or list(range(episode_count))
+ num_envs = args_cli.num_envs
+
+ env_cfg = parse_env_cfg(args_cli.task, device=args_cli.device, num_envs=num_envs)
+ env_cfg.recorders = {}
+ env_cfg.terminations = {}
+
+ env: ManagerBasedRLEnv = gym.make(args_cli.task, cfg=env_cfg).unwrapped
+
+ idle_action = torch.zeros(env.action_space.shape)
+
+ if hasattr(env, "initialize"):
+ env.initialize()
+ env.reset()
+
+ episode_names = list(dataset_file_handler.get_episode_names())
+ replayed_episode_count = 0
+
+ with contextlib.suppress(KeyboardInterrupt) and torch.inference_mode():
+ while simulation_app.is_running() and not simulation_app.is_exiting():
+ env_episode_data_map = {i: EpisodeData() for i in range(num_envs)}
+ has_next_action = True
+ while has_next_action:
+ actions = idle_action.clone()
+ has_next_action = False
+ for env_id in range(num_envs):
+ env_next_action = get_next_action(
+ env_episode_data_map[env_id],
+ return_state=args_cli.replay_mode == "state",
+ )
+ if env_next_action is None:
+ next_episode_index = None
+ while episode_indices_to_replay:
+ next_episode_index = episode_indices_to_replay.pop(0)
+ if next_episode_index < episode_count:
+ break
+ next_episode_index = None
+
+ if next_episode_index is not None:
+ replayed_episode_count += 1
+ print(f"{replayed_episode_count:4}: Loading #{next_episode_index} episode to env_{env_id}")
+ episode_data = dataset_file_handler.load_episode(
+ episode_names[next_episode_index], env.device
+ )
+ env_episode_data_map[env_id] = episode_data
+ initial_state = episode_data.get_initial_state()
+ env.reset_to(
+ initial_state,
+ torch.tensor([env_id], device=env.device),
+ seed=int(episode_data.seed) if episode_data.seed is not None else None,
+ is_relative=True,
+ )
+ env_next_action = get_next_action(
+ env_episode_data_map[env_id],
+ return_state=args_cli.replay_mode == "state",
+ )
+ has_next_action = True
+ else:
+ continue
+ else:
+ has_next_action = True
+ actions[env_id] = env_next_action
+
+ env.step(actions)
+ break
+
+ print(f"Finished replaying {replayed_episode_count} episode{'s' if replayed_episode_count != 1 else ''}.")
+ env.close()
+
+
+if __name__ == "__main__":
+ main()
+ simulation_app.close()
diff --git a/scripts/datagen/rl/train.py b/scripts/datagen/rl/train.py
new file mode 100644
index 00000000..9db31b75
--- /dev/null
+++ b/scripts/datagen/rl/train.py
@@ -0,0 +1,120 @@
+"""Train a leisaac RL task with RSL-RL (PPO) via IsaacLab's training script."""
+
+import os
+import sys
+
+# Make cli_args importable (local module in IsaacLab's rsl_rl script directory)
+_ISAACLAB_RSL_RL_DIR = os.path.join(
+ os.path.dirname(__file__),
+ "..",
+ "..",
+ "..",
+ "dependencies",
+ "IsaacLab",
+ "scripts",
+ "reinforcement_learning",
+ "rsl_rl",
+)
+sys.path.insert(0, os.path.abspath(_ISAACLAB_RSL_RL_DIR))
+
+import argparse
+
+from isaaclab.app import AppLauncher
+
+import cli_args # isort: skip
+
+parser = argparse.ArgumentParser(description="Train a leisaac RL task with RSL-RL.")
+parser.add_argument("--num_envs", type=int, default=None, help="Number of environments to simulate.")
+parser.add_argument("--task", type=str, default=None, help="Name of the task.")
+parser.add_argument(
+ "--agent", type=str, default="rsl_rl_cfg_entry_point", help="Name of the RL agent configuration entry point."
+)
+parser.add_argument("--seed", type=int, default=None, help="Seed used for the environment.")
+parser.add_argument("--max_iterations", type=int, default=None, help="RL policy training iterations.")
+parser.add_argument(
+ "--distributed", action="store_true", default=False, help="Run training with multiple GPUs or nodes."
+)
+cli_args.add_rsl_rl_args(parser)
+AppLauncher.add_app_launcher_args(parser)
+args_cli, hydra_args = parser.parse_known_args()
+
+sys.argv = [sys.argv[0]] + hydra_args
+
+app_launcher = AppLauncher(args_cli)
+simulation_app = app_launcher.app
+
+from datetime import datetime
+
+import gymnasium as gym
+import isaaclab_tasks # noqa: F401
+import leisaac.tasks # noqa: F401
+import torch
+from isaaclab.envs import ManagerBasedRLEnvCfg
+from isaaclab.utils.io import dump_yaml
+from isaaclab_rl.rsl_rl import RslRlBaseRunnerCfg, RslRlVecEnvWrapper
+from isaaclab_tasks.utils import get_checkpoint_path
+from isaaclab_tasks.utils.hydra import hydra_task_config
+from rsl_rl.runners import OnPolicyRunner
+
+torch.backends.cuda.matmul.allow_tf32 = True
+torch.backends.cudnn.allow_tf32 = True
+torch.backends.cudnn.deterministic = False
+torch.backends.cudnn.benchmark = False
+
+
+@hydra_task_config(args_cli.task, args_cli.agent)
+def main(env_cfg: ManagerBasedRLEnvCfg, agent_cfg: RslRlBaseRunnerCfg):
+ """Train with RSL-RL agent."""
+ agent_cfg = cli_args.update_rsl_rl_cfg(agent_cfg, args_cli)
+ env_cfg.scene.num_envs = args_cli.num_envs if args_cli.num_envs is not None else env_cfg.scene.num_envs
+ agent_cfg.max_iterations = (
+ args_cli.max_iterations if args_cli.max_iterations is not None else agent_cfg.max_iterations
+ )
+
+ env_cfg.seed = agent_cfg.seed
+ env_cfg.sim.device = args_cli.device if args_cli.device is not None else env_cfg.sim.device
+
+ if args_cli.distributed and args_cli.device is not None and "cpu" in args_cli.device:
+ raise ValueError("Distributed training is not supported when using CPU device.")
+
+ if args_cli.distributed:
+ env_cfg.sim.device = f"cuda:{app_launcher.local_rank}"
+ agent_cfg.device = f"cuda:{app_launcher.local_rank}"
+ seed = agent_cfg.seed + app_launcher.local_rank
+ env_cfg.seed = seed
+ agent_cfg.seed = seed
+
+ log_root_path = os.path.abspath(os.path.join("logs", "rsl_rl", agent_cfg.experiment_name))
+ print(f"[INFO] Logging experiment in directory: {log_root_path}")
+ log_dir = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
+ if agent_cfg.run_name:
+ log_dir += f"_{agent_cfg.run_name}"
+ log_dir = os.path.join(log_root_path, log_dir)
+
+ env_cfg.log_dir = log_dir
+
+ env = gym.make(args_cli.task, cfg=env_cfg)
+
+ if agent_cfg.resume:
+ resume_path = get_checkpoint_path(log_root_path, agent_cfg.load_run, agent_cfg.load_checkpoint)
+
+ env = RslRlVecEnvWrapper(env, clip_actions=agent_cfg.clip_actions)
+
+ runner = OnPolicyRunner(env, agent_cfg.to_dict(), log_dir=log_dir, device=agent_cfg.device)
+ runner.add_git_repo_to_log(__file__)
+
+ if agent_cfg.resume:
+ print(f"[INFO]: Loading model checkpoint from: {resume_path}")
+ runner.load(resume_path)
+
+ dump_yaml(os.path.join(log_dir, "params", "env.yaml"), env_cfg)
+ dump_yaml(os.path.join(log_dir, "params", "agent.yaml"), agent_cfg)
+
+ runner.learn(num_learning_iterations=agent_cfg.max_iterations, init_at_random_ep_len=True)
+
+ env.close()
+
+
+if __name__ == "__main__":
+ main()
+ simulation_app.close()
diff --git a/scripts/datagen/state_machine/generate.py b/scripts/datagen/state_machine/generate.py
index 6d73407e..cb402451 100644
--- a/scripts/datagen/state_machine/generate.py
+++ b/scripts/datagen/state_machine/generate.py
@@ -58,7 +58,7 @@
# Maps gym task id → (StateMachineClass, device_type)
TASK_REGISTRY = {
- "LeIsaac-SO101-PickOrange-v0": (PickOrangeStateMachine, "so101_state_machine"),
+ "LeIsaac-SO101-PickOrange-v0": (PickOrangeStateMachine, "ik_so101leader"),
}
diff --git a/scripts/datagen/state_machine/replay.py b/scripts/datagen/state_machine/replay.py
index e8744afc..7a2224aa 100644
--- a/scripts/datagen/state_machine/replay.py
+++ b/scripts/datagen/state_machine/replay.py
@@ -35,8 +35,8 @@
type=str,
default=None,
help=(
- "State machine device type used during recording, e.g. 'so101_state_machine' or "
- "'bi_so101_state_machine'. If not set, inferred from the task name."
+ "State machine device type used during recording, e.g. 'ik_so101leader' or "
+ "'bi_ik_so101leader'. If not set, inferred from the task name."
),
)
@@ -83,7 +83,7 @@ def get_next_action(episode_data: EpisodeData, return_state: bool = False, task_
next_state = episode_data.get_next_state()
if next_state is None:
return None
- if task_type == "bi_so101_state_machine":
+ if task_type == "bi_ik_so101leader":
left_joint_pos = next_state["articulation"]["left_arm"]["joint_position"]
right_joint_pos = next_state["articulation"]["right_arm"]["joint_position"]
return torch.cat([left_joint_pos, right_joint_pos], dim=0)
@@ -95,9 +95,9 @@ def get_next_action(episode_data: EpisodeData, return_state: bool = False, task_
def apply_damping(env, task_type: str):
"""Apply joint damping each step to match state-machine recording behavior."""
- if task_type == "so101_state_machine":
+ if task_type == "ik_so101leader":
env.scene["robot"].write_joint_damping_to_sim(damping=10.0)
- elif task_type == "bi_so101_state_machine":
+ elif task_type == "bi_ik_so101leader":
env.scene["left_arm"].write_joint_damping_to_sim(damping=10.0)
env.scene["right_arm"].write_joint_damping_to_sim(damping=10.0)
diff --git a/source/leisaac/leisaac/devices/action_process.py b/source/leisaac/leisaac/devices/action_process.py
index 6abcc5dd..665446c9 100644
--- a/source/leisaac/leisaac/devices/action_process.py
+++ b/source/leisaac/leisaac/devices/action_process.py
@@ -76,7 +76,7 @@ def init_action_cfg(action_cfg, device):
joint_names=["gripper"],
scale=1.0,
)
- elif device in ["so101_state_machine"]: # IK-based: action = EE pose (7D) + binary gripper, not raw joint angles
+ elif device in ["ik_so101leader"]: # IK-based: action = EE pose (7D) + binary gripper, not raw joint angles
action_cfg.arm_action = mdp.DifferentialInverseKinematicsActionCfg(
asset_name="robot",
joint_names=["shoulder_pan", "shoulder_lift", "elbow_flex", "wrist_flex", "wrist_roll"],
@@ -91,7 +91,25 @@ def init_action_cfg(action_cfg, device):
open_command_expr={"gripper": 1.0},
close_command_expr={"gripper": 0.4},
)
- elif device in ["bi_so101_state_machine"]: # IK-based: action = EE pose (7D) + binary gripper, not raw joint angles
+ elif device in ["rl_so101leader"]: # RL: delta EE pose (6D) + binary gripper (1D) = 7D total
+ # use_relative_mode=True: action = (dx, dy, dz, droll, dpitch, dyaw)
+ # translation scale=0.02 → ±2cm/step; rotation scale=0.5 → ±0.5rad/step (~28.6°) at max action
+ action_cfg.arm_action = mdp.DifferentialInverseKinematicsActionCfg(
+ asset_name="robot",
+ joint_names=["shoulder_pan", "shoulder_lift", "elbow_flex", "wrist_flex", "wrist_roll"],
+ body_name="gripper",
+ scale=(0.02, 0.02, 0.02, 0.5, 0.5, 0.5),
+ controller=mdp.DifferentialIKControllerCfg(
+ command_type="pose", ik_method="dls", use_relative_mode=True, ik_params={"lambda_val": 0.04}
+ ),
+ )
+ action_cfg.gripper_action = mdp.BinaryJointPositionActionCfg(
+ asset_name="robot",
+ joint_names=["gripper"],
+ open_command_expr={"gripper": 1.0},
+ close_command_expr={"gripper": 0.2},
+ )
+ elif device in ["bi_ik_so101leader"]: # IK-based: action = EE pose (7D) + binary gripper, not raw joint angles
action_cfg.left_arm_action = mdp.DifferentialInverseKinematicsActionCfg(
asset_name="left_arm",
joint_names=["shoulder_pan", "shoulder_lift", "elbow_flex", "wrist_flex", "wrist_roll"],
@@ -120,6 +138,38 @@ def init_action_cfg(action_cfg, device):
open_command_expr={"gripper": 1.0},
close_command_expr={"gripper": 0.01},
)
+ elif device in ["bi_rl_so101leader"]: # RL: delta EE control for both arms, 6D×2 + binary gripper×2 = 14D total
+ # use_relative_mode=True: action = (dx, dy, dz, droll, dpitch, dyaw) per arm, scale=0.05 → ±5cm/step
+ action_cfg.left_arm_action = mdp.DifferentialInverseKinematicsActionCfg(
+ asset_name="left_arm",
+ joint_names=["shoulder_pan", "shoulder_lift", "elbow_flex", "wrist_flex", "wrist_roll"],
+ body_name="gripper",
+ scale=0.05,
+ controller=mdp.DifferentialIKControllerCfg(
+ command_type="pose", ik_method="dls", use_relative_mode=True, ik_params={"lambda_val": 0.04}
+ ),
+ )
+ action_cfg.left_gripper_action = mdp.BinaryJointPositionActionCfg(
+ asset_name="left_arm",
+ joint_names=["gripper"],
+ open_command_expr={"gripper": 1.0},
+ close_command_expr={"gripper": 0.01},
+ )
+ action_cfg.right_arm_action = mdp.DifferentialInverseKinematicsActionCfg(
+ asset_name="right_arm",
+ joint_names=["shoulder_pan", "shoulder_lift", "elbow_flex", "wrist_flex", "wrist_roll"],
+ body_name="gripper",
+ scale=0.05,
+ controller=mdp.DifferentialIKControllerCfg(
+ command_type="pose", ik_method="dls", use_relative_mode=True, ik_params={"lambda_val": 0.04}
+ ),
+ )
+ action_cfg.right_gripper_action = mdp.BinaryJointPositionActionCfg(
+ asset_name="right_arm",
+ joint_names=["gripper"],
+ open_command_expr={"gripper": 1.0},
+ close_command_expr={"gripper": 0.01},
+ )
"""LeKiwi action configuration"""
if device in ["lekiwi-leader", "lekiwi-keyboard", "lekiwi-gamepad"]:
action_cfg.wheel_action = mdp.JointVelocityActionCfg(
diff --git a/source/leisaac/leisaac/tasks/lift_cube/__init__.py b/source/leisaac/leisaac/tasks/lift_cube/__init__.py
index 54e2d807..71f13b21 100644
--- a/source/leisaac/leisaac/tasks/lift_cube/__init__.py
+++ b/source/leisaac/leisaac/tasks/lift_cube/__init__.py
@@ -1,5 +1,7 @@
import gymnasium as gym
+from . import rl_agents
+
gym.register(
id="LeIsaac-SO101-LiftCube-v0",
entry_point="isaaclab.envs:ManagerBasedRLEnv",
@@ -9,6 +11,16 @@
},
)
+gym.register(
+ id="LeIsaac-SO101-LiftCube-RL-v0",
+ entry_point="isaaclab.envs:ManagerBasedRLEnv",
+ disable_env_checker=True,
+ kwargs={
+ "env_cfg_entry_point": f"{__name__}.lift_cube_rl_env_cfg:LiftCubeRLEnvCfg",
+ "rsl_rl_cfg_entry_point": f"{rl_agents.__name__}.rsl_rl_ppo_cfg:LiftCubeRLPPORunnerCfg",
+ },
+)
+
gym.register(
id="LeIsaac-SO101-LiftCube-DigitalTwin-v0",
entry_point="leisaac.enhance.envs:ManagerBasedRLDigitalTwinEnv",
@@ -20,7 +32,7 @@
gym.register(
id="LeIsaac-SO101-LiftCube-Mimic-v0",
- entry_point=f"leisaac.enhance.envs:ManagerBasedRLLeIsaacMimicEnv",
+ entry_point="leisaac.enhance.envs:ManagerBasedRLLeIsaacMimicEnv",
disable_env_checker=True,
kwargs={
"env_cfg_entry_point": f"{__name__}.lift_cube_mimic_env_cfg:LiftCubeMimicEnvCfg",
diff --git a/source/leisaac/leisaac/tasks/lift_cube/lift_cube_rl_env_cfg.py b/source/leisaac/leisaac/tasks/lift_cube/lift_cube_rl_env_cfg.py
new file mode 100644
index 00000000..cad82d03
--- /dev/null
+++ b/source/leisaac/leisaac/tasks/lift_cube/lift_cube_rl_env_cfg.py
@@ -0,0 +1,122 @@
+import math
+
+from isaaclab.managers import ObservationGroupCfg as ObsGroup
+from isaaclab.managers import ObservationTermCfg as ObsTerm
+from isaaclab.managers import RewardTermCfg as RewTerm
+from isaaclab.managers import SceneEntityCfg
+from isaaclab.managers import TerminationTermCfg as DoneTerm
+from isaaclab.utils import configclass
+
+from . import mdp
+from .lift_cube_env_cfg import LiftCubeEnvCfg, LiftCubeSceneCfg
+
+_CUBE_CFG = SceneEntityCfg("cube")
+_ROBOT_CFG = SceneEntityCfg("robot")
+
+
+@configclass
+class LiftCubeRLObservationsCfg:
+ """Flat vector observations for RL (26D total).
+
+ joint_pos(6) + joint_vel(6) + ee_frame_state(7) + cube_rel_ee(3) + cube_quat(4) = 26D
+ """
+
+ @configclass
+ class PolicyCfg(ObsGroup):
+ joint_pos = ObsTerm(func=mdp.joint_pos)
+ joint_vel = ObsTerm(func=mdp.joint_vel)
+ ee_frame_state = ObsTerm(
+ func=mdp.ee_frame_state,
+ params={"ee_frame_cfg": SceneEntityCfg("ee_frame"), "robot_cfg": _ROBOT_CFG},
+ )
+ cube_pos_relative_to_ee = ObsTerm(
+ func=mdp.cube_pos_relative_to_ee,
+ params={"cube_cfg": _CUBE_CFG, "ee_frame_cfg": SceneEntityCfg("ee_frame")},
+ )
+ cube_quat = ObsTerm(
+ func=mdp.cube_quat,
+ params={"cube_cfg": _CUBE_CFG},
+ )
+
+ def __post_init__(self):
+ self.enable_corruption = False
+ self.concatenate_terms = True
+
+ policy: PolicyCfg = PolicyCfg()
+
+
+@configclass
+class LiftCubeRLRewardsCfg:
+ """Reward terms for lift-cube RL training."""
+
+ cube_success = RewTerm(
+ func=mdp.cube_success_bonus,
+ weight=200.0,
+ params={"cube_cfg": _CUBE_CFG, "robot_cfg": _ROBOT_CFG, "height_threshold": 0.20},
+ )
+ ee_to_cube = RewTerm(
+ func=mdp.ee_to_cube_reward,
+ weight=1.5,
+ params={"cube_cfg": _CUBE_CFG, "robot_cfg": _ROBOT_CFG},
+ )
+ cube_height = RewTerm(
+ func=mdp.cube_height_reward,
+ weight=10.0,
+ params={"cube_cfg": _CUBE_CFG, "robot_cfg": _ROBOT_CFG, "min_height": 0.046, "k": 3.0},
+ )
+
+
+@configclass
+class LiftCubeRLTerminationsCfg:
+ """Terminations: timeout + early success."""
+
+ time_out = DoneTerm(func=mdp.time_out, time_out=True)
+ success = DoneTerm(
+ func=mdp.cube_height_above_base,
+ params={"cube_cfg": _CUBE_CFG, "robot_cfg": _ROBOT_CFG, "height_threshold": 0.20},
+ )
+
+
+@configclass
+class LiftCubeRLEnvCfg(LiftCubeEnvCfg):
+ """RL-specific configuration for the lift-cube environment.
+
+ Observations: 26D flat vector.
+ Action space: 7D (rl_so101leader: 6D delta EE pose + 1D binary gripper).
+ Cameras disabled for faster training.
+ """
+
+ scene: LiftCubeSceneCfg = LiftCubeSceneCfg(env_spacing=8.0)
+ observations: LiftCubeRLObservationsCfg = LiftCubeRLObservationsCfg()
+ rewards: LiftCubeRLRewardsCfg = LiftCubeRLRewardsCfg()
+ terminations: LiftCubeRLTerminationsCfg = LiftCubeRLTerminationsCfg()
+
+ def __post_init__(self) -> None:
+ super().__post_init__()
+
+ self.use_teleop_device("rl_so101leader")
+
+ self.scene.front = None
+
+ for attr_name in list(vars(self.events).keys()):
+ term = getattr(self.events, attr_name, None)
+ if (
+ hasattr(term, "params")
+ and isinstance(term.params.get("asset_cfg"), SceneEntityCfg)
+ and term.params["asset_cfg"].name == "front"
+ ):
+ delattr(self.events, attr_name)
+
+ self.episode_length_s = 15.0
+
+ for actuator_cfg in self.scene.robot.actuators.values():
+ actuator_cfg.damping = 8.0
+
+ self.scene.robot.init_state.joint_pos = {
+ "shoulder_pan": 0.0,
+ "shoulder_lift": math.radians(-100.0),
+ "elbow_flex": math.radians(90.0),
+ "wrist_flex": math.radians(50.0),
+ "wrist_roll": 0.0,
+ "gripper": 1.0,
+ }
diff --git a/source/leisaac/leisaac/tasks/lift_cube/mdp/__init__.py b/source/leisaac/leisaac/tasks/lift_cube/mdp/__init__.py
index 38ebdb6c..1d602cf2 100644
--- a/source/leisaac/leisaac/tasks/lift_cube/mdp/__init__.py
+++ b/source/leisaac/leisaac/tasks/lift_cube/mdp/__init__.py
@@ -2,4 +2,5 @@
from leisaac.enhance.envs.mdp import *
from .observations import *
+from .rewards import *
from .terminations import *
diff --git a/source/leisaac/leisaac/tasks/lift_cube/mdp/observations.py b/source/leisaac/leisaac/tasks/lift_cube/mdp/observations.py
index eb808e5b..985a4af1 100644
--- a/source/leisaac/leisaac/tasks/lift_cube/mdp/observations.py
+++ b/source/leisaac/leisaac/tasks/lift_cube/mdp/observations.py
@@ -25,3 +25,24 @@ def object_grasped(
grasped = torch.logical_and(pos_diff < diff_threshold, robot.data.joint_pos[:, -1] < grasp_threshold)
return grasped
+
+
+def cube_quat(
+ env: ManagerBasedRLEnv,
+ cube_cfg: SceneEntityCfg = SceneEntityCfg("cube"),
+) -> torch.Tensor:
+ """Returns cube orientation as quaternion (w, x, y, z) in world frame. (num_envs, 4)"""
+ cube: RigidObject = env.scene[cube_cfg.name]
+ return cube.data.root_quat_w
+
+
+def cube_pos_relative_to_ee(
+ env: ManagerBasedRLEnv,
+ cube_cfg: SceneEntityCfg = SceneEntityCfg("cube"),
+ ee_frame_cfg: SceneEntityCfg = SceneEntityCfg("ee_frame"),
+) -> torch.Tensor:
+ """Returns cube position relative to the EE jaw. (num_envs, 3)"""
+ ee_frame: FrameTransformer = env.scene[ee_frame_cfg.name]
+ ee_pos = ee_frame.data.target_pos_w[:, 1, :] # jaw position
+ cube: RigidObject = env.scene[cube_cfg.name]
+ return cube.data.root_pos_w - ee_pos
diff --git a/source/leisaac/leisaac/tasks/lift_cube/mdp/rewards.py b/source/leisaac/leisaac/tasks/lift_cube/mdp/rewards.py
new file mode 100644
index 00000000..c164e275
--- /dev/null
+++ b/source/leisaac/leisaac/tasks/lift_cube/mdp/rewards.py
@@ -0,0 +1,113 @@
+"""Reward functions for the LiftCube RL task.
+
+Reward structure
+----------------
+The task uses three complementary reward terms to guide the policy through a
+natural lifting curriculum:
+
+1. ee_to_cube (weight 1.5, shaped)
+ Guides the end-effector toward the cube before and during grasping.
+ Formula: 1 - tanh(k * dist(TCP, cube))
+ - Peaks at 1.0 when TCP is exactly at the cube center, decays smoothly with distance.
+ - k = 5.0 gives a reach radius of ~10 cm before the reward drops below 0.5.
+
+2. cube_height (weight 10.0, shaped)
+ Encourages lifting the cube above the table surface.
+ Formula: tanh(k * max(h - min_height, 0))
+ - Zero below min_height (4.6 cm above robot base) — ignores ground-level noise.
+ - Monotonically increasing above min_height. Range [0, 1].
+
+3. cube_success (weight 200.0, sparse)
+ One-time terminal bonus when h >= 20 cm. Also triggers early episode
+ termination so the policy does not receive further shaped reward after success.
+
+TCP computation
+---------------
+The Tool Center Point (TCP) is the midpoint between the two fingertip contact
+surfaces, computed from body_pos_w / body_quat_w plus calibrated local offsets
+derived from the USD collision mesh geometry:
+ - jaw body local frame : (0.0, -0.05, 0.02)
+ - gripper body local frame : (-0.012, 0.0, -0.08)
+
+Height reference
+----------------
+All heights are measured relative to the robot base link ("base") in the world
+frame, making the reward invariant to the absolute table / mount height.
+"""
+
+import torch
+from isaaclab.assets import Articulation, RigidObject
+from isaaclab.envs import ManagerBasedRLEnv
+from isaaclab.managers import SceneEntityCfg
+from isaaclab.utils.math import quat_apply
+
+# Fingertip offsets in each body's local frame, calibrated from USD collision mesh geometry
+_JAW_TIP_OFFSET = torch.tensor([0.0, -0.05, 0.02])
+_GRIPPER_TIP_OFFSET = torch.tensor([-0.012, 0.0, -0.08])
+
+
+def _tcp_pos(env: ManagerBasedRLEnv, robot_cfg: SceneEntityCfg) -> torch.Tensor:
+ """TCP position: midpoint between jaw tip and gripper tip in world frame. (num_envs, 3)"""
+ robot: Articulation = env.scene[robot_cfg.name]
+ jaw_idx = robot.data.body_names.index("jaw")
+ gripper_idx = robot.data.body_names.index("gripper")
+
+ jaw_pos = robot.data.body_pos_w[:, jaw_idx, :]
+ jaw_quat = robot.data.body_quat_w[:, jaw_idx, :]
+ gripper_pos = robot.data.body_pos_w[:, gripper_idx, :]
+ gripper_quat = robot.data.body_quat_w[:, gripper_idx, :]
+
+ device = jaw_pos.device
+ N = jaw_pos.shape[0]
+ jaw_offset = _JAW_TIP_OFFSET.to(device).unsqueeze(0).expand(N, -1)
+ gripper_offset = _GRIPPER_TIP_OFFSET.to(device).unsqueeze(0).expand(N, -1)
+
+ jaw_tip = jaw_pos + quat_apply(jaw_quat, jaw_offset)
+ gripper_tip = gripper_pos + quat_apply(gripper_quat, gripper_offset)
+ return (jaw_tip + gripper_tip) / 2.0
+
+
+def _cube_pos(env: ManagerBasedRLEnv, cube_cfg: SceneEntityCfg) -> torch.Tensor:
+ cube: RigidObject = env.scene[cube_cfg.name]
+ return cube.data.root_pos_w # (num_envs, 3)
+
+
+def _robot_base_height(env: ManagerBasedRLEnv, robot_cfg: SceneEntityCfg, base_name: str = "base") -> torch.Tensor:
+ robot: Articulation = env.scene[robot_cfg.name]
+ base_index = robot.data.body_names.index(base_name)
+ return robot.data.body_pos_w[:, base_index, 2] # (num_envs,)
+
+
+def ee_to_cube_reward(
+ env: ManagerBasedRLEnv,
+ cube_cfg: SceneEntityCfg = SceneEntityCfg("cube"),
+ robot_cfg: SceneEntityCfg = SceneEntityCfg("robot"),
+ k: float = 5.0,
+) -> torch.Tensor:
+ """Reaching reward: 1 - tanh(k * dist) from TCP (grasp midpoint) to cube center. Range [0, 1]."""
+ dist = torch.linalg.vector_norm(_cube_pos(env, cube_cfg) - _tcp_pos(env, robot_cfg), dim=1)
+ return 1.0 - torch.tanh(k * dist)
+
+
+def cube_height_reward(
+ env: ManagerBasedRLEnv,
+ cube_cfg: SceneEntityCfg = SceneEntityCfg("cube"),
+ robot_cfg: SceneEntityCfg = SceneEntityCfg("robot"),
+ min_height: float = 0.046,
+ k: float = 5.0,
+) -> torch.Tensor:
+ """Height reward: tanh(k * max(h - min_height, 0)). Zero below min_height, monotonically increasing above. Range [0, 1]."""
+ height_above_base = _cube_pos(env, cube_cfg)[:, 2] - _robot_base_height(env, robot_cfg)
+ h = torch.clamp(height_above_base - min_height, min=0.0)
+ return torch.tanh(k * h)
+
+
+def cube_success_bonus(
+ env: ManagerBasedRLEnv,
+ cube_cfg: SceneEntityCfg = SceneEntityCfg("cube"),
+ robot_cfg: SceneEntityCfg = SceneEntityCfg("robot"),
+ height_threshold: float = 0.20,
+) -> torch.Tensor:
+ """One-time large bonus when cube reaches the success height threshold."""
+ height_above_base = _cube_pos(env, cube_cfg)[:, 2] - _robot_base_height(env, robot_cfg)
+ return (height_above_base >= height_threshold).float()
diff --git a/source/leisaac/leisaac/tasks/lift_cube/rl_agents/__init__.py b/source/leisaac/leisaac/tasks/lift_cube/rl_agents/__init__.py
new file mode 100644
index 00000000..e69de29b
diff --git a/source/leisaac/leisaac/tasks/lift_cube/rl_agents/rsl_rl_ppo_cfg.py b/source/leisaac/leisaac/tasks/lift_cube/rl_agents/rsl_rl_ppo_cfg.py
new file mode 100644
index 00000000..c8d68ad3
--- /dev/null
+++ b/source/leisaac/leisaac/tasks/lift_cube/rl_agents/rsl_rl_ppo_cfg.py
@@ -0,0 +1,39 @@
+from isaaclab.utils import configclass
+from isaaclab_rl.rsl_rl import (
+ RslRlOnPolicyRunnerCfg,
+ RslRlPpoActorCriticCfg,
+ RslRlPpoAlgorithmCfg,
+)
+
+
+@configclass
+class LiftCubeRLPPORunnerCfg(RslRlOnPolicyRunnerCfg):
+ num_steps_per_env = 100
+ max_iterations = 1500
+ save_interval = 50
+ experiment_name = "lift_cube_rl"
+ obs_groups = {"actor": ["policy"], "critic": ["policy"]}
+
+ policy = RslRlPpoActorCriticCfg(
+ init_noise_std=0.3,
+ actor_obs_normalization=True,
+ critic_obs_normalization=True,
+ actor_hidden_dims=[256, 128, 64],
+ critic_hidden_dims=[256, 128, 64],
+ activation="elu",
+ )
+
+ algorithm = RslRlPpoAlgorithmCfg(
+ value_loss_coef=1.0,
+ use_clipped_value_loss=True,
+ clip_param=0.1,
+ entropy_coef=0.005,
+ num_learning_epochs=5,
+ num_mini_batches=4,
+ learning_rate=1.0e-4,
+ schedule="adaptive",
+ gamma=0.95,
+ lam=0.95,
+ desired_kl=0.01,
+ max_grad_norm=0.5,
+ )
diff --git a/source/leisaac/leisaac/tasks/pick_orange/agents/__init__.py b/source/leisaac/leisaac/tasks/pick_orange/agents/__init__.py
new file mode 100644
index 00000000..e69de29b
diff --git a/source/leisaac/leisaac/tasks/template/bi_arm_env_cfg.py b/source/leisaac/leisaac/tasks/template/bi_arm_env_cfg.py
index 6a07c6ff..ea2d2af9 100644
--- a/source/leisaac/leisaac/tasks/template/bi_arm_env_cfg.py
+++ b/source/leisaac/leisaac/tasks/template/bi_arm_env_cfg.py
@@ -211,7 +211,7 @@ def __post_init__(self) -> None:
def use_teleop_device(self, teleop_device) -> None:
self.task_type = teleop_device
self.actions = init_action_cfg(self.actions, device=teleop_device)
- if teleop_device in ["bi_so101_state_machine"]:
+ if teleop_device in ["bi_ik_so101leader"]:
self.scene.left_arm.spawn.rigid_props.disable_gravity = True
self.scene.right_arm.spawn.rigid_props.disable_gravity = True
diff --git a/source/leisaac/leisaac/tasks/template/direct/bi_arm_env.py b/source/leisaac/leisaac/tasks/template/direct/bi_arm_env.py
index 614cdd5f..f0a92960 100644
--- a/source/leisaac/leisaac/tasks/template/direct/bi_arm_env.py
+++ b/source/leisaac/leisaac/tasks/template/direct/bi_arm_env.py
@@ -92,7 +92,7 @@ def __post_init__(self) -> None:
def use_teleop_device(self, teleop_device) -> None:
self.task_type = teleop_device
# self.actions = init_action_cfg(self.actions, device=teleop_device)
- if teleop_device in ["bi_so101_state_machine"]:
+ if teleop_device in ["bi_ik_so101leader"]:
self.scene.left_arm.spawn.rigid_props.disable_gravity = True
self.scene.right_arm.spawn.rigid_props.disable_gravity = True
diff --git a/source/leisaac/leisaac/tasks/template/direct/single_arm_env.py b/source/leisaac/leisaac/tasks/template/direct/single_arm_env.py
index aa7ea7a6..147cd370 100644
--- a/source/leisaac/leisaac/tasks/template/direct/single_arm_env.py
+++ b/source/leisaac/leisaac/tasks/template/direct/single_arm_env.py
@@ -83,7 +83,7 @@ def __post_init__(self) -> None:
def use_teleop_device(self, teleop_device) -> None:
self.task_type = teleop_device
- if teleop_device in ["keyboard", "gamepad", "so101_state_machine"]:
+ if teleop_device in ["keyboard", "gamepad", "ik_so101leader"]:
self.scene.robot.spawn.rigid_props.disable_gravity = True
def preprocess_device_action(self, action: dict[str, Any], teleop_device) -> torch.Tensor:
diff --git a/source/leisaac/leisaac/tasks/template/single_arm_env_cfg.py b/source/leisaac/leisaac/tasks/template/single_arm_env_cfg.py
index b11664a3..a2acdeb1 100644
--- a/source/leisaac/leisaac/tasks/template/single_arm_env_cfg.py
+++ b/source/leisaac/leisaac/tasks/template/single_arm_env_cfg.py
@@ -194,7 +194,7 @@ def __post_init__(self) -> None:
def use_teleop_device(self, teleop_device) -> None:
self.task_type = teleop_device
self.actions = init_action_cfg(self.actions, device=teleop_device)
- if teleop_device in ["keyboard", "gamepad", "so101_state_machine"]:
+ if teleop_device in ["keyboard", "gamepad", "ik_so101leader", "rl_so101leader"]:
self.scene.robot.spawn.rigid_props.disable_gravity = True
def preprocess_device_action(self, action: dict[str, Any], teleop_device) -> torch.Tensor: