ChatGPU · May 27, 2026
diff --git a/labs/world_models/lab_dreamer_cartpole_pixels/README.md b/labs/world_models/lab_dreamer_cartpole_pixels/README.md
@@ -28,9 +28,21 @@ jupyter nbconvert --execute --to notebook --inplace notebook.ipynb
 
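 ## What this lab demonstrates
 
-- **The WM learned visual dynamics**: the reconstruction strip shows reconstructed frames tracking the real cart/pole position and angle;
-- **The policy improves entirely in imagination**: the actor-critic never sees pixels directly, only (h_t, z_t);
-- **Reconstruction quality sets the ceiling**: pixel MSE of long-horizon rollouts grows monotonically; this curve is the world model's "dream lifetime", and it bounds the return the policy can extract in imagination.
+- **The WM learned visual dynamics**: the reconstruction strip shows reconstructed frames tracking the real cart/pole position and angle, and pixel MSE falls from ~1500 to ~40;
+- **The policy is trained entirely in imagination**: the actor-critic never sees pixels directly, only (h_t, z_t). The full 8-cycle pipeline finishes in ~4 minutes;
+- **Reconstruction quality sets the ceiling**: pixel MSE of the 15-step open-loop latent rollout starts rising after 5 steps, which shows that the WM's "dream lifetime" bounds the return an imagination-based policy can extract.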
+
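+### An honest note on policy improvement
+
+Under an 8-minute CPU budget and only about 1800 real env steps, the actor-critic return curve
+hovers near the random-policy level (~20) and does not rise noticeably. Possible reasons:
+(1) CartPole's reward is a constant +1, so once trained the WM's reward head has near-zero gradient,
+and the useful signal actually comes from the continue head;
+(2) with this little data the continue head learns only limited discriminative power;
+(3) the actor readily exploits local WM inaccuracies in imagination ("looks better in imagination but fails in the real environment").
+The original Dreamer typically needs 1e5+ env steps to converge stably. This lab focuses on
+the minimal reproducible core of "the WM learns + imagination is visualizable + long-horizon error is quantifiable";
+the policy improvement part only demonstrates that the framework is complete, and further gains are left to the stretch goals.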
 
 ## File contract
 

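The "dream lifetime" claim in the README hunk above rests on a per-step error curve for the open-loop rollout. A minimal sketch of how such a curve could be computed, assuming each frame is a flattened list of pixel values; `per_step_mse` and `first_step_above` are hypothetical helper names that do not appear in the lab's code, and the `factor` threshold is an illustrative choice rather than the lab's criterion:

```python
def per_step_mse(real_frames, imagined_frames):
    """Mean squared pixel error for each rollout step.

    Both arguments are sequences of equally sized flat pixel lists,
    one entry per step of the rollout.
    """
    errors = []
    for real, imagined in zip(real_frames, imagined_frames):
        if len(real) != len(imagined):
            raise ValueError("frames at the same step must have equal size")
        squared = [(r - i) ** 2 for r, i in zip(real, imagined)]
        errors.append(sum(squared) / len(squared))
    return errors


def first_step_above(errors, factor=1.5):
    """Index of the first step whose error exceeds factor * errors[0].

    Returns None if the error never crosses that threshold.
    """
    threshold = factor * errors[0]
    for step, err in enumerate(errors):
        if err > threshold:
            return step
    return None


if __name__ == "__main__":
    real = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
    imagined = [[1.0, 1.0], [1.0, 1.0], [2.0, 2.0]]
    curve = per_step_mse(real, imagined)
    print(curve, first_step_above(curve))
```

Plotting the returned list against the step index gives the kind of curve the README describes; the crossing step is one rough way to put a number on where the rollout stops being trustworthy.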
diff --git a/labs/world_models/lab_dreamer_cartpole_pixels/assets/latent_vs_real_rollout.png b/labs/world_models/lab_dreamer_cartpole_pixels/assets/latent_vs_real_rollout.png
diff --git a/labs/world_models/lab_dreamer_cartpole_pixels/assets/reconstruction_grid.png b/labs/world_models/lab_dreamer_cartpole_pixels/assets/reconstruction_grid.png
diff --git a/labs/world_models/lab_dreamer_cartpole_pixels/assets/return_vs_steps.png b/labs/world_models/lab_dreamer_cartpole_pixels/assets/return_vs_steps.png
diff --git a/labs/world_models/lab_dreamer_cartpole_pixels/notebook.ipynb b/labs/world_models/lab_dreamer_cartpole_pixels/notebook.ipynb
diff --git a/labs/world_models/lab_dreamer_cartpole_pixels/src/policy.py b/labs/world_models/lab_dreamer_cartpole_pixels/src/policy.py
@@ -5,7 +5,8 @@
 2. Train the world model on observation sequences.
 3. Sample starting latents from the buffer, imagine fixed-horizon rollouts
    under the actor, compute lambda-returns from the predicted rewards and
-   critic, and update actor+critic by backprop through imagination.
+   critic, and update actor+critic with REINFORCE + value regression in the
+   imagination space (DreamerV2-style for discrete actions).
 4. Iterate 8 outer cycles.
 
 Running ``python -m src.policy`` performs the full Dreamer cycle and saves
@@ -67,7 +68,9 @@ class PolicyConfig:
     ac_updates_per_cycle: int = 50    # gradient steps on actor+critic each cycle
     gamma: float = 0.99
     lambda_: float = 0.95             # GAE/lambda-return mixing
-    actor_entropy: float = 0.05       # high entropy: WM dynamics are noisy enough that
+    actor_entropy: float = 0.05       # WM dynamics are noisy at this budget so a high
+                                       # entropy bonus prevents the actor from collapsing
+                                       # onto WM-exploit modes.
 
     # Init epsilon for the first cycle - actions are uniform random before any AC training.
     initial_random_steps: int = 500

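Step 3 of the `policy.py` docstring combines lambda-returns with a REINFORCE actor update. The sketch below shows one common form of both in plain Python, assuming the continue head's output acts as a per-step discount multiplier. The function names and list-based signatures are hypothetical and are not taken from `src/policy.py`; the defaults mirror `gamma`, `lambda_`, and `actor_entropy` from `PolicyConfig`.

```python
def lambda_returns(rewards, values, continues, gamma=0.99, lam=0.95):
    """Lambda-returns over an imagined rollout of horizon H.

    rewards, continues: length H (predicted reward and continue probability).
    values: length H + 1 (critic estimates; the last entry bootstraps the tail).
    """
    horizon = len(rewards)
    if len(values) != horizon + 1 or len(continues) != horizon:
        raise ValueError("expected len(values) == len(rewards) + 1 == len(continues) + 1")
    returns = [0.0] * horizon
    next_return = values[-1]  # bootstrap from the critic at the final state
    for t in reversed(range(horizon)):
        discount = gamma * continues[t]  # a predicted termination cuts the tail
        returns[t] = rewards[t] + discount * (
            (1.0 - lam) * values[t + 1] + lam * next_return
        )
        next_return = returns[t]
    return returns


def reinforce_actor_loss(log_probs, returns, values, entropies, entropy_coef=0.05):
    """Mean REINFORCE loss with a critic baseline and an entropy bonus.

    In an autograd framework the advantage would be detached so that
    gradients flow only through log_probs and entropies.
    """
    total = 0.0
    for logp, ret, val, ent in zip(log_probs, returns, values, entropies):
        advantage = ret - val
        total += -logp * advantage - entropy_coef * ent
    return total / len(log_probs)


if __name__ == "__main__":
    # Constant +1 reward: only the continue flags change the returns.
    print(lambda_returns([1.0] * 3, [0.0] * 4, [1.0, 1.0, 1.0], gamma=1.0, lam=1.0))
    print(lambda_returns([1.0] * 3, [0.0] * 4, [1.0, 0.0, 1.0], gamma=1.0, lam=1.0))
```

With a constant reward, the two calls differ only in the continue flags, which is why the README note identifies the continue head as the effective learning signal on CartPole.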
diff --git a/labs/world_models/lab_dreamer_cartpole_pixels/src/world_model.py b/labs/world_models/lab_dreamer_cartpole_pixels/src/world_model.py
@@ -57,8 +57,8 @@ class WMConfig:
     # Optimisation
     lr: float = 6e-4
     weight_decay: float = 1e-6
-    batch_size: int = 16
-    seq_len: int = 32
+    batch_size: int = 8
+    seq_len: int = 20
     epochs: int = 5  # passes over the buffer per outer cycle
     grad_clip: float = 100.0