71 commits
f870f10
Add data auto-generate module.
Papaercold Jan 27, 2026
e52df2e
Add data auto-generate module.
Papaercold Jan 27, 2026
ae15ddc
Add data auto-generate module.
Papaercold Jan 27, 2026
4fea004
Merge branch 'LightwheelAI:main' into auto-data-generation
Papaercold Jan 27, 2026
a245712
Add data auto-generate module.
Papaercold Jan 27, 2026
38df5fc
Add data auto-generate module.
Papaercold Jan 27, 2026
83b9e0b
Add auto_terminate.
Papaercold Jan 28, 2026
3b78d34
Add auto_terminate.
Papaercold Jan 28, 2026
47d8504
Add auto_terminate.
Papaercold Jan 28, 2026
2e448cf
Add Description.
Papaercold Jan 28, 2026
0e5f008
State Machinecode refactoring.
Papaercold Jan 28, 2026
9750b3d
State Machinecode refactoring.
Papaercold Jan 28, 2026
34be002
State Machinecode refactoring.
Papaercold Jan 28, 2026
d590d27
State Machinecode refactoring.
Papaercold Jan 28, 2026
b8c97a9
State Machinecode refactoring.
Papaercold Jan 28, 2026
4567af7
State Machinecode refactoring.
Papaercold Jan 28, 2026
6ad6606
Add State Machine code.
Papaercold Jan 28, 2026
e694585
Apply pre-commit fixes (black/isort/pyupgrade) for several files
Papaercold Feb 18, 2026
5074699
Apply pre-commit fixes (black/isort/pyupgrade) for pick_orange.py
Papaercold Feb 18, 2026
18da4fa
Merge branch 'LightwheelAI:main' into auto-data-generation
Papaercold Feb 18, 2026
9e73433
Change structure..
Papaercold Feb 18, 2026
310aee6
Create StateMacchine Class.
Papaercold Feb 18, 2026
fb8304d
Refactor code.
Papaercold Feb 18, 2026
9328529
Fix bugs.
Papaercold Feb 19, 2026
a5338d6
Delete redundant files
Papaercold Feb 19, 2026
e1c8729
Delete redundant files.
Papaercold Feb 19, 2026
c4f8111
Change PickOrangeStateMachine
Papaercold Feb 19, 2026
e42fc85
Change PickOrangeStateMachine
Papaercold Feb 19, 2026
002133f
Change PickOrangeStateMachine
Papaercold Feb 19, 2026
2460e14
Change PickOrangeStateMachine
Papaercold Feb 19, 2026
a748872
Change PickOrangeStateMachine
Papaercold Feb 19, 2026
c07cd7f
Change PickOrangeStateMachine
Papaercold Feb 19, 2026
9f0bdcd
Add state_machine/fold_cloth.py
Papaercold Feb 21, 2026
656fa77
Fix bugs
Papaercold Feb 21, 2026
30f8061
Add state_machine/replay.py
Papaercold Feb 21, 2026
2c8a938
Add readme
Papaercold Feb 21, 2026
f81ba70
fix bugs
Papaercold Feb 21, 2026
d1c3cef
fix bugs
Papaercold Feb 21, 2026
e81f2dd
fix bugs
Papaercold Feb 21, 2026
246a48f
fix bugs
Papaercold Feb 21, 2026
733de37
fix bugs
Papaercold Feb 21, 2026
1bde9b0
Change documents
Papaercold Feb 21, 2026
1342c18
Change bi_arm_cfg
Papaercold Feb 22, 2026
c7d8cca
Add RL module - 1st version.
Papaercold Mar 2, 2026
5dc5b8d
Change documents.
Papaercold Mar 2, 2026
8d1c810
Change bash.
Papaercold Mar 2, 2026
f34cb6e
Delete RL part.
Papaercold Mar 3, 2026
db99a6c
Refactor
Papaercold Mar 6, 2026
f26debf
Change format
Papaercold Mar 6, 2026
f4d40b7
Refactor
Papaercold Mar 6, 2026
88d5c41
Change Isaaclab version==2.3.2
Papaercold Mar 6, 2026
87ced23
Change Isaaclab version==2.3.0
Papaercold Mar 6, 2026
65484c5
Change documents.
Papaercold Mar 6, 2026
5155f6a
Fix bugs.
Papaercold Mar 6, 2026
1d3a912
Change format.
Papaercold Mar 6, 2026
4faee0c
Change format.
Papaercold Mar 6, 2026
360484d
Merge pull request #1 from Papaercold/auto-data-generation
Papaercold Mar 10, 2026
b3796ab
Add RL module
Papaercold Mar 8, 2026
c7013c5
Change docstring format
Papaercold Mar 10, 2026
c8f3202
Move the RL directory into the datagen directory
Papaercold Mar 10, 2026
ecf0153
Update state machine documentation to align with the latest changes
Papaercold Mar 10, 2026
4978c3f
Hyperparameter tuning
Papaercold Mar 11, 2026
4a808fe
Modify hyperparameters and rewards
Papaercold Mar 12, 2026
d232a40
Update documents
Papaercold Mar 12, 2026
e01549c
Refactor the validation module into a record module and add signal ha…
Papaercold Mar 12, 2026
3301e5a
Change docstrings
Papaercold Mar 12, 2026
fb90080
Change reward weights
Papaercold Mar 12, 2026
723238e
Standardize dataset recording behavior
Papaercold Mar 12, 2026
3f2a940
Modify the reward function, replacing the linear function with tanh t…
Papaercold Mar 12, 2026
d670f1b
Modify the reward function, replacing the linear function with tanh t…
Papaercold Mar 12, 2026
1080309
Reconstruct the RL module to keep it consistent with Isaac Lab.
Papaercold Mar 28, 2026
212 changes: 212 additions & 0 deletions docs/docs/docs/features/rl_training.md
@@ -0,0 +1,212 @@
# RL Training

The RL training module trains manipulation policies with reinforcement learning using [rsl_rl](https://github.com/leggedrobotics/rsl_rl) (PPO). It runs entirely in simulation with parallel environments and requires no human teleoperation.

:::note
End-to-end RL for manipulation is challenging — reward design, exploration, and sim-to-real transfer all require significant task-specific tuning. Currently only the **LiftCube** task is supported. Support for additional tasks will be added in future updates.
:::

## Training

```shell
python scripts/datagen/rl/train.py \
--task LeIsaac-SO101-LiftCube-RL-v0 \
--num_envs 512 \
--max_iterations 1500 \
--headless
```

<details>
<summary><strong>Parameter descriptions for train.py</strong></summary>

- `--task`: Gym task ID to train. Required.

- `--num_envs`: Number of parallel simulation environments. More environments speed up data collection. Default: value from task config.

- `--max_iterations`: Number of PPO update iterations. Default: value from agent config.

- `--seed`: Random seed for reproducibility. Default: value from agent config.

- `--headless`: Run without a rendering window for faster training.

- `--device`: Computation device, such as `cpu` or `cuda`.

</details>

::::tip
Training logs (TensorBoard) are written to `logs/rsl_rl/<experiment_name>/<timestamp>/`. Monitor progress with:

```shell
tensorboard --logdir logs/rsl_rl
```

Key metrics to watch: `Train/mean_reward` (total episode reward) and individual reward terms such as `Episode/rew_cube_height`.
::::

## Evaluation & Recording

Evaluate a checkpoint visually (no recording):

```shell
python scripts/datagen/rl/play.py \
--task LeIsaac-SO101-LiftCube-RL-v0 \
--checkpoint logs/rsl_rl/lift_cube_rl/<run>/model_<iter>.pt \
--num_envs 1
```

Save all episodes to HDF5 (both success and failure) by adding `--record`:

```shell
python scripts/datagen/rl/play.py \
--task LeIsaac-SO101-LiftCube-RL-v0 \
--checkpoint logs/rsl_rl/lift_cube_rl/<run>/model_<iter>.pt \
--num_envs 1 \
--num_episodes 100 \
--record --dataset_file ./datasets/rl_eval.hdf5
```

<details>
<summary><strong>Parameter descriptions for play.py</strong></summary>

- `--task`: Gym task ID. Required.

- `--checkpoint`: Path to a saved model checkpoint (`.pt`). Required.

- `--num_envs`: Number of parallel environments. Default: value from task config.

- `--num_episodes`: Total number of episodes to run across all environments. `0` runs indefinitely. Default: `0`.

- `--seed`: Random seed. Default: value from agent config.

- `--record`: Enable HDF5 recording. Both successful and failed episodes are saved.

- `--resume_recording`: Append to an existing dataset file instead of creating a new one.

- `--dataset_file`: Output HDF5 file path. Default: `./datasets/rl_eval.hdf5`.

- `--real-time`: Slow down simulation to real-time speed.

</details>
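
The `--checkpoint` paths above follow the pattern `logs/rsl_rl/<experiment_name>/<run>/model_<iter>.pt`. The following is a minimal sketch, not part of the repository, for picking the most recent checkpoint from a list of such paths. It assumes that run directories are timestamp-named (so they sort chronologically as strings) and that the iteration number is the integer in the file name; `pick_latest_checkpoint` is a hypothetical helper name.

```python
import re
from pathlib import PurePosixPath
from typing import Iterable, Optional

# Assumed file naming, taken from the commands above: model_<iter>.pt
_CKPT_RE = re.compile(r"model_(\d+)\.pt$")


def pick_latest_checkpoint(paths: Iterable[str]) -> Optional[str]:
    """Return the path with the newest run directory and the highest iteration.

    Hypothetical helper: run directory names are compared as strings, which is
    chronological only if they are timestamps in a sortable format.
    """
    best_key, best_path = None, None
    for path in paths:
        p = PurePosixPath(path)
        match = _CKPT_RE.search(p.name)
        if match is None:
            continue  # skip files that are not checkpoints
        key = (p.parent.name, int(match.group(1)))
        if best_key is None or key > best_key:
            best_key, best_path = key, path
    return best_path
```

With real files, the input could come from `Path("logs/rsl_rl/lift_cube_rl").glob("*/model_*.pt")`, converting each entry with `.as_posix()` before passing it in.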

## Reward Design

The LiftCube RL task uses three reward terms:

| Term | Weight | Description |
|------|--------|-------------|
| `cube_success` | 200.0 | One-time bonus when cube height ≥ 20 cm above robot base. Episode ends immediately after (early termination). |
| `ee_to_cube` | 1.5 | `1 - tanh(5 × dist(TCP, cube))` — guides TCP to cube center. Range [0, 1]. |
| `cube_height` | 10.0 | `tanh(3 × max(h - 4.6 cm, 0))` — zero below 4.6 cm, monotonically increasing above. Range [0, 1]. |
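
The formulas in the table can be sketched in plain Python as below. This is an illustration only, not the repository's `rewards.py`: it assumes distances and heights are in metres (so 4.6 cm is `0.046` and 20 cm is `0.20`), and it ignores any per-step time scaling the framework may apply to the weights. The function and constant names are hypothetical.

```python
import math

# Weights and thresholds from the table above; metres are an assumption.
W_SUCCESS, W_EE_TO_CUBE, W_CUBE_HEIGHT = 200.0, 1.5, 10.0
SUCCESS_HEIGHT_M = 0.20  # 20 cm above the robot base
REST_HEIGHT_M = 0.046    # cube_height is zero at or below 4.6 cm


def ee_to_cube(dist_m: float) -> float:
    """1 - tanh(5 * dist): equals 1 at zero distance and decays toward 0."""
    return 1.0 - math.tanh(5.0 * dist_m)


def cube_height(h_m: float) -> float:
    """tanh(3 * max(h - 4.6 cm, 0)): zero at rest height, increasing above it."""
    return math.tanh(3.0 * max(h_m - REST_HEIGHT_M, 0.0))


def cube_success(h_m: float) -> float:
    """Indicator for the success bonus; the episode ends once it fires."""
    return 1.0 if h_m >= SUCCESS_HEIGHT_M else 0.0


def weighted_reward(dist_m: float, h_m: float) -> float:
    """Weighted sum of the three terms for a single step."""
    return (
        W_SUCCESS * cube_success(h_m)
        + W_EE_TO_CUBE * ee_to_cube(dist_m)
        + W_CUBE_HEIGHT * cube_height(h_m)
    )
```

Because both shaping terms saturate through `tanh`, the success bonus of 200 dominates any single step of shaping reward, which is bounded by 1.5 + 10.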

**TCP (Tool Center Point)** is computed as the midpoint between the two fingertip contact surfaces, derived from `body_pos_w` and `body_quat_w` with calibrated local offsets from the USD collision mesh:

- Jaw tip offset (jaw body local frame): `(0.0, -0.05, 0.02)`
- Gripper tip offset (gripper body local frame): `(-0.012, 0.0, -0.08)`
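
A minimal sketch of the TCP computation described above, in dependency-free Python. It assumes quaternions are ordered `(w, x, y, z)` and are unit length, and that each fingertip point is the body's world position plus the local offset rotated into the world frame. The helper names are hypothetical and the actual implementation in the repository may differ.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]

# Local offsets listed above, expressed in each body's own frame.
JAW_TIP_OFFSET: Vec3 = (0.0, -0.05, 0.02)
GRIPPER_TIP_OFFSET: Vec3 = (-0.012, 0.0, -0.08)


def _cross(a: Sequence[float], b: Sequence[float]) -> Vec3:
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def rotate(quat_wxyz: Sequence[float], v: Sequence[float]) -> Vec3:
    """Rotate v by a unit quaternion given as (w, x, y, z) (assumed ordering)."""
    w, u = quat_wxyz[0], quat_wxyz[1:4]
    t = tuple(2.0 * c for c in _cross(u, v))
    u_cross_t = _cross(u, t)
    return tuple(v[i] + w * t[i] + u_cross_t[i] for i in range(3))


def tip_position(body_pos_w, body_quat_w, local_offset) -> Vec3:
    """World-frame fingertip point: body position plus the rotated local offset."""
    offset_w = rotate(body_quat_w, local_offset)
    return tuple(body_pos_w[i] + offset_w[i] for i in range(3))


def tcp_position(jaw_pos_w, jaw_quat_w, gripper_pos_w, gripper_quat_w) -> Vec3:
    """TCP as the midpoint between the jaw tip and the gripper tip."""
    jaw_tip = tip_position(jaw_pos_w, jaw_quat_w, JAW_TIP_OFFSET)
    gripper_tip = tip_position(gripper_pos_w, gripper_quat_w, GRIPPER_TIP_OFFSET)
    return tuple(0.5 * (jaw_tip[i] + gripper_tip[i]) for i in range(3))
```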

**Termination**: episode ends on timeout (15 s) or when cube height ≥ 20 cm (success).

## Action Space

RL training uses the `rl_so101leader` device mode — delta end-effector control with a binary gripper:

| Component | Dims | Description |
|-----------|------|-------------|
| `arm_action` | 6 | Delta EE pose (dx, dy, dz, droll, dpitch, dyaw), scale=(0.02, 0.02, 0.02, 0.5, 0.5, 0.5) → ±2 cm / ±0.5 rad per step |
| `gripper_action` | 1 | Binary: action > 0 → open (1.0 rad), action < 0 → close (0.2 rad) |
| **Total** | **7** | |
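
To make the table concrete, here is a hedged sketch of how a 7D policy output could map to a delta end-effector command and a gripper target. It assumes the arm components are clipped to [-1, 1] before scaling (which is what the ±2 cm / ±0.5 rad bound implies) and treats an action of exactly 0 as "close", a case the table does not specify. `decode_action` is a hypothetical name, not a function from the repository.

```python
from typing import Sequence, Tuple

# Per-component scale from the table: metres for translation, radians for rotation.
ARM_SCALE = (0.02, 0.02, 0.02, 0.5, 0.5, 0.5)
GRIPPER_OPEN_RAD = 1.0
GRIPPER_CLOSE_RAD = 0.2


def decode_action(action: Sequence[float]) -> Tuple[Tuple[float, ...], float]:
    """Split a 7D action into (delta EE pose, gripper joint target)."""
    if len(action) != 7:
        raise ValueError(f"expected 7 action dims, got {len(action)}")
    # Assumption: raw arm actions are clipped to [-1, 1] before scaling.
    clipped = [max(-1.0, min(1.0, a)) for a in action[:6]]
    delta_pose = tuple(a * s for a, s in zip(clipped, ARM_SCALE))
    # Assumption: an action of exactly 0 falls into the "close" branch.
    gripper_target = GRIPPER_OPEN_RAD if action[6] > 0 else GRIPPER_CLOSE_RAD
    return delta_pose, gripper_target
```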

## Observation Space

26D flat vector (concatenated):

| Term | Dims |
|------|------|
| `joint_pos` | 6 |
| `joint_vel` | 6 |
| `ee_frame_state` (pos + quat, robot frame) | 7 |
| `cube_pos_relative_to_ee` | 3 |
| `cube_quat` (orientation in world frame) | 4 |
| **Total** | **26** |
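
The layout above can be checked with a small sketch that concatenates the terms and reports which indices each term occupies. It assumes the terms are concatenated in the order they appear in the table, which this document does not state explicitly; `OBS_LAYOUT`, `obs_slices`, and `flatten_obs` are hypothetical names.

```python
from typing import Dict, List, Sequence

# (term name, dimension), in the order assumed for concatenation.
OBS_LAYOUT = (
    ("joint_pos", 6),
    ("joint_vel", 6),
    ("ee_frame_state", 7),
    ("cube_pos_relative_to_ee", 3),
    ("cube_quat", 4),
)


def obs_slices() -> Dict[str, slice]:
    """Map each term name to its index range in the flat vector."""
    slices, start = {}, 0
    for name, dim in OBS_LAYOUT:
        slices[name] = slice(start, start + dim)
        start += dim
    return slices


def flatten_obs(terms: Dict[str, Sequence[float]]) -> List[float]:
    """Concatenate the terms into one flat list, validating each dimension."""
    flat: List[float] = []
    for name, dim in OBS_LAYOUT:
        values = list(terms[name])
        if len(values) != dim:
            raise ValueError(f"{name}: expected {dim} values, got {len(values)}")
        flat.extend(values)
    return flat
```

The slices are handy when debugging a trained policy, for example to read the cube position relative to the end effector out of a logged observation.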

## Adding a New RL Task

1. Create `<task>/mdp/rewards.py` with reward functions.
2. Create `<task>/<task>_rl_env_cfg.py` with the RL env config class:

```python
from isaaclab.utils import configclass

# MyTaskEnvCfg, MyTaskRLObsCfg, MyTaskRLRewardsCfg, and MyTaskRLTerminationsCfg
# are placeholders for your task's own config classes.


@configclass
class MyTaskRLEnvCfg(MyTaskEnvCfg):
    observations: MyTaskRLObsCfg = MyTaskRLObsCfg()
    rewards: MyTaskRLRewardsCfg = MyTaskRLRewardsCfg()
    terminations: MyTaskRLTerminationsCfg = MyTaskRLTerminationsCfg()

    def __post_init__(self):
        super().__post_init__()
        self.use_teleop_device("rl_so101leader")  # or "bi_rl_so101leader" for bi-arm
        self.scene.front = None  # disable camera for faster training
        self.episode_length_s = 15.0
```

3. Create `<task>/rl_agents/rsl_rl_ppo_cfg.py` with the PPO runner config:

```python
from isaaclab.utils import configclass
from isaaclab_rl.rsl_rl import RslRlOnPolicyRunnerCfg, RslRlPpoActorCriticCfg, RslRlPpoAlgorithmCfg


@configclass
class MyTaskRLPPORunnerCfg(RslRlOnPolicyRunnerCfg):
    num_steps_per_env = 100
    max_iterations = 1500
    save_interval = 50
    experiment_name = "my_task_rl"
    obs_groups = {"actor": ["policy"], "critic": ["policy"]}

    policy = RslRlPpoActorCriticCfg(
        init_noise_std=0.3,
        actor_obs_normalization=True,
        critic_obs_normalization=True,
        actor_hidden_dims=[256, 128, 64],
        critic_hidden_dims=[256, 128, 64],
        activation="elu",
    )

    algorithm = RslRlPpoAlgorithmCfg(
        value_loss_coef=1.0,
        use_clipped_value_loss=True,
        clip_param=0.2,
        entropy_coef=0.005,
        num_learning_epochs=5,
        num_mini_batches=4,
        learning_rate=1.0e-3,
        schedule="adaptive",
        gamma=0.99,
        lam=0.95,
        desired_kl=0.01,
        max_grad_norm=1.0,
    )
```

4. Register the gym environment in `<task>/__init__.py`:

```python
import gymnasium as gym

from . import rl_agents

gym.register(
    id="LeIsaac-SO101-MyTask-RL-v0",
    entry_point="isaaclab.envs:ManagerBasedRLEnv",
    disable_env_checker=True,
    kwargs={
        # Replace <task> with your task's module name.
        "env_cfg_entry_point": f"{__name__}.<task>_rl_env_cfg:MyTaskRLEnvCfg",
        "rsl_rl_cfg_entry_point": f"{rl_agents.__name__}.rsl_rl_ppo_cfg:MyTaskRLPPORunnerCfg",
    },
)
```

5. Train:

```bash
python scripts/datagen/rl/train.py \
--task LeIsaac-SO101-MyTask-RL-v0 \
--num_envs 512 \
--headless
```
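
As a rough guide to what this command costs, the sketch below works out the data budget implied by `--num_envs 512` and the example runner config from step 3 (`num_steps_per_env = 100`, `num_mini_batches = 4`, `num_learning_epochs = 5`, `max_iterations = 1500`). It assumes each PPO iteration collects `num_envs × num_steps_per_env` transitions and splits them evenly into mini-batches; `rollout_budget` is a hypothetical helper, not part of rsl_rl.

```python
from typing import Dict


def rollout_budget(
    num_envs: int,
    num_steps_per_env: int,
    num_mini_batches: int,
    num_learning_epochs: int,
    max_iterations: int,
) -> Dict[str, int]:
    """Back-of-the-envelope PPO data budget under the assumptions stated above."""
    transitions = num_envs * num_steps_per_env
    return {
        "transitions_per_iteration": transitions,
        "mini_batch_size": transitions // num_mini_batches,
        "gradient_steps_per_iteration": num_learning_epochs * num_mini_batches,
        "total_transitions": transitions * max_iterations,
    }


if __name__ == "__main__":
    for key, value in rollout_budget(512, 100, 4, 5, 1500).items():
        print(f"{key}: {value:,}")
```

With these values, each iteration collects 51,200 transitions, trains on mini-batches of 12,800, and a full run sees 76,800,000 transitions. Lowering `--num_envs` shrinks all three numbers proportionally.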
8 changes: 4 additions & 4 deletions docs/docs/docs/features/state_machine.md
@@ -60,7 +60,7 @@ After recording, you can replay the collected demonstrations in simulation:
python scripts/datagen/state_machine/replay.py \
--task LeIsaac-SO101-PickOrange-v0 \
--dataset_file ./datasets/pick_orange.hdf5 \
- --task_type so101_state_machine \
+ --task_type ik_so101leader \
--select_episodes 0 \
--device cuda \
--enable_cameras \
@@ -82,7 +82,7 @@ python scripts/datagen/state_machine/replay.py \

- `--replay_mode`: Replay mode — `action` replays IK pose targets, `state` replays joint positions.

- - `--task_type`: State machine device type used during recording, e.g., `so101_state_machine` or `bi_so101_state_machine`. Inferred from task name if not set.
+ - `--task_type`: State machine device type used during recording, e.g., `ik_so101leader` or `bi_ik_so101leader`. Inferred from task name if not set.

- `--select_episodes`: List of episode indices to replay. Leave empty to replay all episodes.

@@ -97,7 +97,7 @@ python scripts/datagen/state_machine/replay.py \

```python
TASK_REGISTRY = {
- "LeIsaac-SO101-PickOrange-v0": (PickOrangeStateMachine, "so101_state_machine"),
- "LeIsaac-MY-NewTask-v0": (MyNewStateMachine, "so101_state_machine"),
+ "LeIsaac-SO101-PickOrange-v0": (PickOrangeStateMachine, "ik_so101leader"),
+ "LeIsaac-MY-NewTask-v0": (MyNewStateMachine, "ik_so101leader"),
}
```
6 changes: 6 additions & 0 deletions docs/sidebars.js
@@ -123,6 +123,12 @@ const sidebars = {
link: { type: 'doc', id: 'docs/features/state_machine' },
items: [],
},
{
type: 'category',
label: 'RL Training',
link: { type: 'doc', id: 'docs/features/rl_training' },
items: [],
},
],
},
'docs/trouble_shooting',