April 23, 2025
TTRL: Test-Time Reinforcement Learning
(Yuxin Zuo, Kaiyan Zhang, Shang Qu, Li Sheng, Xuekai Zhu, Biqing Qi, Youbang Sun, Ganqu Cui, Ning Ding, Bowen Zhou)
This paper investigates Reinforcement Learning (RL) on data without explicit labels for reasoning tasks in Large Language Models (LLMs). The core challenge is reward estimation during inference without access to ground-truth information. While this setting appears elusive, we find that common practices in Test-Time Scaling (TTS), such as majority voting, yield surprisingly effective rewards suitable for driving RL training. In this work, we introduce Test-Time Reinforcement Learning (TTRL), a novel method for training LLMs using RL on unlabeled data. TTRL enables self-evolution of LLMs by utilizing the priors in the pre-trained models. Our experiments demonstrate that TTRL consistently improves performance across a variety of tasks and models. Notably, TTRL boosts the pass@1 performance of Qwen-2.5-Math-7B by approximately 159% on AIME 2024 with only unlabeled test data. Furthermore, although TTRL is supervised only by the Maj@N metric, it consistently surpasses the performance upper limit of the initial model and approaches the performance of models trained directly on test data with ground-truth labels. Our experimental findings validate the general effectiveness of TTRL across various tasks and highlight TTRL's potential for broader tasks and domains. GitHub: https://github.com/PRIME-RL/TTRL
RL on LLMs that samples multiple answers per question and treats the majority-voted answer as the ground truth for the reward. Similar approaches are emerging (https://arxiv.org/abs/2504.05812). However, this likely amounts to something closer to reweighting the probabilities of samples the base model can already produce (https://arxiv.org/abs/2504.13837).
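A minimal sketch of the reward construction this implies (my reading of the paper, assuming a rule-based exact-match check; the function and variable names are hypothetical, not the authors' released code): sample N answers per prompt, take the majority-voted answer as a pseudo-label, and give each rollout a binary reward for matching it.

```python
from collections import Counter

def majority_vote_reward(sampled_answers):
    """Hypothetical sketch of a TTRL-style majority-voting reward.

    Given N answers extracted from N rollouts for the same prompt,
    the majority-voted answer acts as a pseudo-label and each rollout
    gets a binary reward for agreeing with it.
    """
    # Pseudo-label = most common extracted answer across the N samples.
    pseudo_label, _ = Counter(sampled_answers).most_common(1)[0]
    # Rule-based reward: 1 if a rollout's answer matches the pseudo-label, else 0.
    return [1.0 if ans == pseudo_label else 0.0 for ans in sampled_answers]

# Example: 5 of 8 rollouts agree on "42", so "42" becomes the pseudo-label.
rewards = majority_vote_reward(["42", "42", "17", "42", "42", "3.5", "42", "17"])
print(rewards)  # [1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0]
```

These per-rollout rewards would then feed a standard policy-gradient RL update over the unlabeled test prompts; no ground-truth labels are used anywhere in the loop.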
#rl #reasoning