April 1, 2025
Expanding RL with Verifiable Rewards Across Diverse Domains
(Yi Su, Dian Yu, Linfeng Song, Juntao Li, Haitao Mi, Zhaopeng Tu, Min Zhang, Dong Yu)
Reinforcement learning (RL) with verifiable rewards (RLVR) has shown promising results in mathematical reasoning and coding tasks where well-structured reference answers are available. However, its applicability to broader domains remains underexplored. In this work, we study the extension of RLVR to more diverse domains such as medicine, chemistry, psychology, and economics. We observe high agreement in binary judgments across different large language models (LLMs) when objective reference answers exist, which challenges the necessity of large-scale annotation for training domain-specific reward models. To address the limitations of binary rewards when handling unstructured reference answers, we further incorporate model-based soft scoring into RLVR to improve its flexibility. Our experiments show that a distilled generative reward model can serve as an effective cross-domain verifier, providing reliable reward signals for RL without requiring domain-specific annotations. By fine-tuning a base 7B model using various RL algorithms against our reward model, we obtain policies that outperform state-of-the-art open-source aligned LLMs such as Qwen2.5-72B-Instruct and DeepSeek-R1-Distill-Qwen-32B by a large margin, across domains in free-form answer settings. This also strengthens RLVR's robustness and scalability, highlighting its potential for real-world applications with noisy or weak labels.
Scoring unstructured responses across diverse domains with a reward model that compares them against reference answers. DeepSeek-R1 mentions a similar approach. A natural extension would be to have the reward model reason before scoring as well.
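A minimal sketch of what model-based soft scoring for RLVR could look like, assuming a generative judge LLM exposed through a hypothetical `judge(prompt) -> str` callable. The prompt template, function names, and parsing logic here are illustrative, not taken from the paper.

```python
# Sketch: use a generative judge model to produce a soft reward in [0, 1]
# for a free-form answer compared against a reference answer.
import re
from typing import Callable

# Hypothetical judge prompt; the paper's actual prompt/reward model differs.
JUDGE_TEMPLATE = """You are grading a free-form answer against a reference answer.
Question: {question}
Reference answer: {reference}
Model answer: {response}
Rate how well the model answer matches the reference on a scale from 0 to 1.
Reply with a single number."""

def soft_reward(question: str, reference: str, response: str,
                judge: Callable[[str], str]) -> float:
    """Return a soft reward produced by a generative judge model."""
    prompt = JUDGE_TEMPLATE.format(question=question, reference=reference,
                                   response=response)
    verdict = judge(prompt)
    match = re.search(r"\d*\.?\d+", verdict)          # parse the numeric score
    score = float(match.group()) if match else 0.0
    return min(max(score, 0.0), 1.0)                  # clamp to a valid reward range

if __name__ == "__main__":
    # Stand-in judge; a real setup would call the distilled reward model.
    dummy_judge = lambda prompt: "0.8"
    print(soft_reward("What is 2+2?", "4", "The answer is four.", dummy_judge))
```

In an RL loop, this score would replace the binary exact-match reward wherever the reference answer is unstructured.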
#reasoning #rl #reward-model
TransMamba: Flexibly Switching between Transformer and Mamba
(Yixing Li, Ruobing Xie, Zhen Yang, Xingwu Sun, Shuaipeng Li, Weidong Han, Zhanhui Kang, Yu Cheng, Chengzhong Xu, Di Wang, Jie Jiang)
Transformers are the cornerstone of modern large language models, but their quadratic computational complexity limits efficiency in long-sequence processing. Recent advancements in Mamba, a state space model (SSM) with linear complexity, offer promising efficiency gains but suffer from unstable contextual learning and multitask generalization. This paper proposes TransMamba, a novel framework that unifies Transformer and Mamba through shared parameter matrices (e.g., QKV and CBx), and thus could dynamically switch between attention and SSM mechanisms at different token lengths and layers. We design the Memory converter to bridge Transformer and Mamba by converting attention outputs into SSM-compatible states, ensuring seamless information flow at TransPoints where the transformation happens. The TransPoint scheduling is also thoroughly explored for further improvements. We conducted extensive experiments demonstrating that TransMamba achieves superior training efficiency and performance compared to baselines, and validated the deeper consistency between Transformer and Mamba paradigms, offering a scalable solution for next-generation sequence modeling.
Following inter-layer and intra-layer combinations of attention and SSM (https://arxiv.org/abs/2411.13676), intra-sequence combination has now appeared: switching from attention to Mamba within a single sequence.
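A toy sketch of the core idea, switching from attention to a recurrent SSM-style update mid-sequence with shared projection weights. The "memory converter" here is approximated by folding the prefix keys/values into a decayed outer-product state (a linear-attention-style stand-in); the paper's actual conversion of attention outputs into SSM states and its TransPoint scheduling are more involved. All names and constants below are illustrative assumptions.

```python
# Sketch: attention over the prefix, then a recurrent state-space-style
# update for the suffix, reusing the same Q/K/V projections.
import torch

torch.manual_seed(0)
d, T, transpoint = 16, 12, 6               # model dim, sequence length, switch position
W_q = torch.randn(d, d) / d**0.5           # shared projections (Q/K/V doubling as C/B/x)
W_k = torch.randn(d, d) / d**0.5
W_v = torch.randn(d, d) / d**0.5
decay = 0.9                                # scalar decay, a stand-in for a learned A

x = torch.randn(T, d)
q, k, v = x @ W_q, x @ W_k, x @ W_v

# 1) Causal single-head attention over tokens before the TransPoint.
mask = torch.tril(torch.ones(transpoint, transpoint, dtype=torch.bool))
scores = (q[:transpoint] @ k[:transpoint].T) / d**0.5
scores = scores.masked_fill(~mask, float("-inf"))
prefix_out = torch.softmax(scores, dim=-1) @ v[:transpoint]

# 2) "Memory converter" stand-in: summarize prefix keys/values into a recurrent state.
state = torch.zeros(d, d)
for t in range(transpoint):
    state = decay * state + torch.outer(k[t], v[t])

# 3) SSM-style recurrence for the remaining tokens, reusing the same projections.
suffix_out = []
for t in range(transpoint, T):
    state = decay * state + torch.outer(k[t], v[t])  # B_t x_t analogue
    suffix_out.append(q[t] @ state)                  # C_t h_t analogue
suffix_out = torch.stack(suffix_out)

y = torch.cat([prefix_out, suffix_out], dim=0)
print(y.shape)  # (12, 16): outputs before and after the TransPoint
```

The point of the sketch is only that the quadratic prefix computation and the linear-time suffix recurrence can share one set of projection matrices, which is what makes switching at arbitrary TransPoints cheap.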
#state-space-model