Exploring Multi-Temperature Strategies for Token- and Rollout-Level Control in RLVR
Published in arXiv, 2025
Recommended citation: Haomin Zhuang, Yujun Zhou, Taicheng Guo, Yue Huang, Fangxu Liu, Kai Song, Xiangliang Zhang. (2025). "Exploring Multi-Temperature Strategies for Token- and Rollout-Level Control in RLVR." arXiv:2510.08892. https://arxiv.org/pdf/2510.08892
Reinforcement learning from verifiable rewards (RLVR) improves reasoning in large language models (LLMs), but its effectiveness depends on how tokens are sampled during training. We distinguish between reasoning tokens and knowledge tokens, and propose sampling reasoning tokens at a higher temperature to encourage exploration while sampling knowledge tokens at a lower temperature to preserve factual correctness. We evaluate several temperature scheduling approaches and demonstrate consistent improvements on reasoning benchmarks.
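To make the token-level idea concrete, here is a minimal sketch of temperature-dependent sampling for a single decoding step. It is an illustration under stated assumptions, not the paper's implementation: the function names, the `is_reasoning` flag (supplied by the caller, since this summary does not say how tokens are classified), and the temperature values 1.2 and 0.8 are all hypothetical.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Return softmax(logits / temperature) as a list of probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, is_reasoning, t_reasoning=1.2, t_knowledge=0.8, rng=random):
    """Sample one token index, choosing the temperature by token type.

    is_reasoning: hypothetical flag provided by the caller; how a token
    is classified as reasoning or knowledge is outside this sketch.
    t_reasoning / t_knowledge: illustrative values, not taken from the paper.
    """
    temperature = t_reasoning if is_reasoning else t_knowledge
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]
```

With the same logits, the higher temperature produces a flatter distribution, so lower-ranked tokens are sampled more often at reasoning positions, while the lower temperature concentrates probability on the top-ranked token at knowledge positions.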
