[Preprint] Exploring Multi-Temperature Strategies for Token- and Rollout-Level Control in RLVR

Published:

Reinforcement learning from verifiable rewards (RLVR) improves LLM reasoning, but its effectiveness depends critically on the sampling temperature strategy. We distinguish between reasoning tokens and knowledge tokens, and propose sampling reasoning tokens at higher temperatures to encourage exploration while keeping knowledge tokens at lower temperatures to preserve factual correctness. We evaluate multiple temperature scheduling approaches at both the token level and the rollout level, and report consistent improvements on reasoning benchmarks.
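
As a rough illustration of the idea described above, the following minimal sketch shows what token-level and rollout-level temperature control could look like during sampling. It is not the paper's implementation: the function names, the temperature values (1.2 and 0.6), the linear rollout schedule, and the assumption that a boolean flag marking each token as "reasoning" or "knowledge" is already available are all hypothetical choices made for illustration.

```python
import numpy as np


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing by a temperature."""
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    scaled -= scaled.max()  # subtract the max for numerical stability
    probs = np.exp(scaled)
    return probs / probs.sum()


def pick_temperature(is_reasoning_token, t_reasoning=1.2, t_knowledge=0.6):
    """Token-level control: hotter for reasoning tokens, cooler for knowledge tokens.

    The default values are illustrative assumptions, not values from the paper.
    """
    return t_reasoning if is_reasoning_token else t_knowledge


def sample_token(logits, is_reasoning_token, rng, t_reasoning=1.2, t_knowledge=0.6):
    """Sample one token id using the temperature chosen for its token type."""
    temperature = pick_temperature(is_reasoning_token, t_reasoning, t_knowledge)
    probs = softmax_with_temperature(logits, temperature)
    return int(rng.choice(len(probs), p=probs))


def rollout_temperatures(num_rollouts, low=0.6, high=1.2):
    """Rollout-level control: spread temperatures evenly across a group of rollouts.

    A linear spread is one hypothetical schedule among many possible ones.
    """
    return np.linspace(low, high, num_rollouts)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    logits = [2.0, 1.0, 0.1]
    print("reasoning-token probs:", softmax_with_temperature(logits, 1.2))
    print("knowledge-token probs:", softmax_with_temperature(logits, 0.6))
    print("sampled token id:", sample_token(logits, True, rng))
    print("per-rollout temperatures:", rollout_temperatures(4))
```

In this sketch the lower temperature concentrates probability mass on the top-scoring token, while the higher temperature flattens the distribution and makes alternative continuations more likely to be sampled.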

Preprint available on [arXiv].