Exploring Multi-Temperature Strategies for Token- and Rollout-Level Control in RLVR

Published in arXiv, 2025

Recommended citation: Haomin Zhuang, Yujun Zhou, Taicheng Guo, Yue Huang, Fangxu Liu, Kai Song, Xiangliang Zhang. (2025). "Exploring Multi-Temperature Strategies for Token- and Rollout-Level Control in RLVR." arXiv:2510.08892. https://arxiv.org/pdf/2510.08892

Reinforcement learning from verifiable rewards (RLVR) improves reasoning in LLMs, but the gains depend on the sampling strategy. We distinguish between reasoning tokens and knowledge tokens and propose sampling reasoning tokens at higher temperatures to encourage exploration, while keeping lower temperatures for knowledge tokens to preserve factual correctness. We evaluate several temperature-scheduling approaches and demonstrate consistent improvements on reasoning benchmarks.
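To make the token-level idea concrete, here is a minimal sketch of choosing a sampling temperature per token. It is not the paper's implementation. The entropy-threshold rule for telling reasoning tokens from knowledge tokens, the function names, and the specific values (`entropy_threshold=1.0`, `t_reasoning=1.2`, `t_knowledge=0.7`) are all assumptions made for illustration.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def pick_temperature(logits, entropy_threshold=1.0,
                     t_reasoning=1.2, t_knowledge=0.7):
    """Hypothetical rule (an assumption, not taken from the paper):
    treat a position as a reasoning token when the next-token
    distribution at temperature 1 has high entropy, and as a
    knowledge token otherwise."""
    h = entropy(softmax(logits, 1.0))
    return t_reasoning if h > entropy_threshold else t_knowledge


def sample_token(logits, rng, **kwargs):
    """Sample one token id with the temperature chosen for this position."""
    temperature = pick_temperature(logits, **kwargs)
    probs = softmax(logits, temperature)
    token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return token, temperature


rng = random.Random(0)
uncertain_logits = [0.0, 0.0, 0.0, 0.0]   # flat distribution: high entropy
confident_logits = [10.0, 0.0, 0.0, 0.0]  # peaked distribution: low entropy
_, t_uncertain = sample_token(uncertain_logits, rng)
_, t_confident = sample_token(confident_logits, rng)
```

In this sketch the flat distribution is sampled at the higher temperature and the peaked one at the lower temperature. A rollout-level variant could instead assign a single temperature to each sampled response; that is likewise an illustrative reading rather than a detail stated here.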

