Merged
18 changes: 15 additions & 3 deletions labs/world_models/lab_dreamer_cartpole_pixels/README.md
@@ -28,9 +28,21 @@ jupyter nbconvert --execute --to notebook --inplace notebook.ipynb

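## What this lab demonstrates

- **The WM learns visual dynamics**: the reconstruction strip shows the reconstructed images tracking the true cart/pole position and angle;
- **The policy improves entirely in imagination**: the actor-critic never sees pixels directly, only (h_t, z_t);
- **Reconstruction quality sets the ceiling**: the pixel MSE of a long-horizon rollout grows monotonically; this curve is the world model's "dream lifespan", and it bounds the return the policy can extract in imagination.
- **The WM learns visual dynamics**: the reconstruction strip shows the reconstructed images tracking the true cart/pole position and angle, with pixel MSE falling from ~1500 to ~40;
- **The policy is trained entirely in imagination**: the actor-critic never sees pixels directly, only (h_t, z_t). The whole 8-cycle pipeline finishes in ~4 minutes;
- **Reconstruction quality sets the ceiling**: the pixel MSE of a 15-step open-loop latent rollout starts rising after 5 steps, which shows that the WM's "dream lifespan" bounds the return an imagination-based policy can extract.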
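
The rollout-error curve described in the last bullet can be sketched as a per-step pixel MSE between the real frames and the frames decoded from an open-loop latent rollout. This is a minimal illustration, not the lab's code: `per_step_pixel_mse`, `first_step_above`, the flat-list frame layout, and the threshold are all hypothetical, and the pixel scale used in the lab is not stated here.

```python
def per_step_pixel_mse(true_frames, predicted_frames):
    """Return one MSE value per rollout step.

    Assumption: each frame is a flat sequence of pixel values, and both
    inputs hold one frame per rollout step.
    """
    if len(true_frames) != len(predicted_frames):
        raise ValueError("both rollouts must have the same number of steps")
    curve = []
    for true, pred in zip(true_frames, predicted_frames):
        if len(true) != len(pred):
            raise ValueError("frames must have the same number of pixels")
        squared_errors = [(t - p) ** 2 for t, p in zip(true, pred)]
        curve.append(sum(squared_errors) / len(squared_errors))
    return curve


def first_step_above(curve, threshold):
    """Index of the first step whose MSE exceeds `threshold`, or None.

    The threshold is a hypothetical way to put a number on the
    "dream lifespan"; the README itself only describes the curve.
    """
    for step, mse in enumerate(curve):
        if mse > threshold:
            return step
    return None


# Toy data: the true frames stay black while the predictions drift by
# one intensity unit per step, so the error grows as step squared.
true_rollout = [[0, 0, 0, 0] for _ in range(4)]
dreamed_rollout = [[k, k, k, k] for k in range(4)]
mse_curve = per_step_pixel_mse(true_rollout, dreamed_rollout)
print(mse_curve)                         # [0.0, 1.0, 4.0, 9.0]
print(first_step_above(mse_curve, 3.0))  # 2
```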

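### An honest note on policy improvement

With an 8-minute CPU budget and only about 1800 real env steps, the actor-critic's
return curve hovers near the random-policy level (~20) and does not rise noticeably.
Possible reasons:
(1) CartPole's reward is a constant +1, so once trained the WM's reward head receives
almost no gradient, and the useful signal actually comes from the continue head;
(2) with this little data the continue head learns only limited discriminative power;
(3) in imagination the actor easily exploits local inaccuracies of the WM
("looks better in the dream but collapses in the real environment").
The original Dreamer typically needs 1e5+ env steps to converge stably. This lab focuses
on the minimal reproducible core of "the WM is learnable + imagination can be visualized
+ long-horizon error can be quantified"; the policy improvement part only demonstrates
that the framework is complete, and further gains are left to the stretch goals.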
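
Reason (1) above can be illustrated with a small lambda-return computation: when every imagined reward is +1, two rollouts differ in return only through their predicted continue probabilities. This is a sketch under stated assumptions, not the lab's implementation. `lambda_returns` is a hypothetical helper, and it assumes `continues[t]` is the predicted probability that the episode continues after step `t`.

```python
def lambda_returns(rewards, continues, values, gamma=0.99, lam=0.95):
    """Lambda-returns over an imagined rollout of horizon H.

    rewards, continues: length H. values: length H + 1, where the last
    entry is the critic's bootstrap value at the end of the rollout.
    Recursion (assumed convention):
        R_t = r_t + gamma * c_t * ((1 - lam) * v_{t+1} + lam * R_{t+1})
    with R_H = v_H.
    """
    horizon = len(rewards)
    if len(continues) != horizon or len(values) != horizon + 1:
        raise ValueError("expected len(values) == len(rewards) + 1")
    returns = [0.0] * horizon
    next_return = values[horizon]
    for t in reversed(range(horizon)):
        discount = gamma * continues[t]
        next_return = rewards[t] + discount * (
            (1.0 - lam) * values[t + 1] + lam * next_return
        )
        returns[t] = next_return
    return returns


# Identical constant rewards, zero critic values, no discounting: only
# the continue predictions distinguish the two imagined rollouts.
rewards = [1.0] * 5
values = [0.0] * 6
survives = lambda_returns(rewards, [1.0] * 5, values, gamma=1.0, lam=1.0)
falls = lambda_returns(rewards, [1.0, 1.0, 0.0, 1.0, 1.0], values, gamma=1.0, lam=1.0)
print(survives[0])  # 5.0
print(falls[0])     # 3.0
```

If the continue head cannot separate the two cases, both rollouts receive the same return and the actor has nothing to prefer, which is consistent with the flat return curve reported above.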

## File contract

238 changes: 219 additions & 19 deletions labs/world_models/lab_dreamer_cartpole_pixels/notebook.ipynb

Large diffs are not rendered by default.

7 changes: 5 additions & 2 deletions labs/world_models/lab_dreamer_cartpole_pixels/src/policy.py
@@ -5,7 +5,8 @@
2. Train the world model on observation sequences.
3. Sample starting latents from the buffer, imagine fixed-horizon rollouts
under the actor, compute lambda-returns from the predicted rewards and
critic, and update actor+critic by backprop through imagination.
critic, and update actor+critic with REINFORCE + value regression in the
imagination space (Dreamer V2 style for discrete actions).
4. Iterate 8 outer cycles.

Running ``python -m src.policy`` performs the full Dreamer cycle and saves
@@ -67,7 +68,9 @@ class PolicyConfig:
ac_updates_per_cycle: int = 50 # gradient steps on actor+critic each cycle
gamma: float = 0.99
lambda_: float = 0.95 # GAE/lambda-return mixing
actor_entropy: float = 0.05 # high entropy: WM dynamics are noisy enough that
actor_entropy: float = 0.05 # WM dynamics are noisy at this budget so a high
# entropy bonus prevents the actor from collapsing
# onto WM-exploit modes.

# Init epsilon for the first cycle - actions are uniform random before any AC training.
initial_random_steps: int = 500
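
For readers of this hunk, the role of `actor_entropy` in a REINFORCE-style actor update for discrete actions can be sketched as below. This is an illustrative single-step loss in plain Python, not the code in `src/policy.py`: `softmax` and `actor_loss` are hypothetical names, and the advantage (lambda-return minus critic value) is treated as a constant, as it would be when no gradient flows through it.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of action logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def actor_loss(logits, action, lambda_return, value, entropy_coef=0.05):
    """Single-step REINFORCE loss with an entropy bonus (hypothetical helper).

    loss = -log pi(action) * (lambda_return - value) - entropy_coef * H(pi)
    """
    probs = softmax(logits)
    log_prob = math.log(probs[action])
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    advantage = lambda_return - value  # constant with respect to the actor
    return -log_prob * advantage - entropy_coef * entropy


# With zero advantage only the entropy term remains, so the uniform
# policy gets a lower loss than a policy that is already nearly collapsed.
uniform = actor_loss([0.0, 0.0], action=0, lambda_return=0.0, value=0.0)
peaked = actor_loss([5.0, 0.0], action=0, lambda_return=0.0, value=0.0)
print(uniform < peaked)  # True
```

A larger `entropy_coef` makes it more costly for the actor to commit to one action on the strength of a small, possibly spurious advantage estimate from the world model.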
@@ -57,8 +57,8 @@ class WMConfig:
# Optimisation
lr: float = 6e-4
weight_decay: float = 1e-6
batch_size: int = 16
seq_len: int = 32
batch_size: int = 8
seq_len: int = 20
epochs: int = 5 # passes over the buffer per outer cycle
grad_clip: float = 100.0
