- 🎓 Undergraduate student at Beijing University of Posts and Telecommunications (BUPT), School of Computer Science
- 🔬 Research interests: RLVR · RLHF · Optimization Algorithms
- 🌱 Currently exploring the intersection of reinforcement learning and large language model alignment
- 📍 Beijing, China
| Area | Description |
|---|---|
| RLVR | Reinforcement Learning from Verifiable Rewards — scalable reward signals beyond human feedback |
| RLHF | Reinforcement Learning from Human Feedback — aligning LLMs with human preferences |
| Optimizer | Adaptive optimization methods (AdamW, Muon, Shampoo, etc.) for deep learning |
APO_OFFICAL — [ICML 2026] The official repository for Anchored Policy Optimization: Mitigating Exploration Collapse via Support-Constrained Rectification
⭐ 14 🍴 1
SPPO — [ACL 2026 Oral] The official repository for SPPO: Sequence-Level PPO for Long-Horizon Reasoning Tasks
⭐ 3 🍴 3
No recent public activity.
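- 2026-06-19 · Chapter 4: Vector Processors
- 2026-04-27 · Uncertain Estimate
- 2026-04-27 · Notes: An Analysis of the Intrinsic Isomorphism Between DPO and GRPO
- 2026-04-27 · Chapter 11: File System Implementation
- 2026-04-27 · Chapter 8: Memory Management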
"The pursuit of intelligence — from theory to practice." · Last updated: auto-refreshed every 3 hours