June 4, 2025
Co-Evolving LLM Coder and Unit Tester via Reinforcement Learning
(Yinjie Wang, Ling Yang, Ye Tian, Ke Shen, Mengdi Wang)
We propose CURE, a novel reinforcement learning framework with a dedicated reward design that co-evolves coding and unit test generation capabilities based on their interaction outcomes, without any ground-truth code as supervision. This approach enables flexible and scalable training and allows the unit tester to learn directly from the coder's mistakes. Our derived ReasonFlux-Coder-7B and 14B models improve code generation accuracy by 5.3% and Best-of-N accuracy by 9.0% after optimization on Qwen2.5-Instruct models, outperforming similarly sized Qwen-Coder, DeepSeek-Coder, and Seed-Coder. They naturally extend to downstream tasks such as test-time scaling and agentic coding, achieving an 8.1% improvement over the base model. For the long-CoT model, our ReasonFlux-Coder-4B consistently outperforms Qwen3-4B while achieving 64.8% inference efficiency in unit test generation. Notably, we also find that our model can serve as an effective reward model for reinforcement learning on base models. Project: https://github.com/Gen-Verse/CURE
Training unit test generation alongside code generation. Much like jointly training verification (https://arxiv.org/abs/2505.13445, https://arxiv.org/abs/2506.01369), there doesn't seem to be any particular reason not to do this.
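To make the reward design concrete, here is a minimal sketch of what an interaction-based reward could look like (my own reading in Python, not the paper's exact formulation): candidate programs and generated unit tests score each other through an execution matrix, and code_correct is a hypothetical stand-in for whatever correctness signal the pipeline provides.

```python
import numpy as np

def interaction_rewards(pass_matrix: np.ndarray, code_correct: np.ndarray):
    """Toy coder/tester rewards from execution outcomes (hypothetical helper).

    pass_matrix:  [n_codes, m_tests] bool, True if candidate i passes generated test j
    code_correct: [n_codes] bool, stand-in correctness labels for the candidates
    """
    P = pass_matrix.astype(float)
    y = code_correct.astype(float)
    # Coder side: reward each candidate by the fraction of generated tests it passes.
    coder_reward = P.mean(axis=1)
    # Tester side: reward a test for separating correct from incorrect candidates,
    # i.e. correct codes should pass it and incorrect codes should fail it.
    agreement = y[:, None] * P + (1.0 - y)[:, None] * (1.0 - P)
    tester_reward = agreement.mean(axis=0)
    return coder_reward, tester_reward
```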
#rl #reasoning
StreamBP: Memory-Efficient Exact Backpropagation for Long Sequence Training of LLMs
(Qijun Luo, Mengqi Li, Lei Zhao, Xiao Li)
Training language models on long sequence data is a demanding requirement for enhancing the model's capability on complex tasks, e.g., long-chain reasoning. However, as the sequence length scales up, the memory cost for storing activation values becomes huge during the Backpropagation (BP) process, even with the application of the gradient checkpointing technique. To tackle this challenge, we propose a memory-efficient and exact BP method called StreamBP, which performs a linear decomposition of the chain rule along the sequence dimension in a layer-wise manner, significantly reducing the memory cost of activation values and logits. The proposed method is applicable to common objectives such as SFT, GRPO, and DPO. From an implementation perspective, StreamBP achieves fewer computational FLOPs and faster BP speed by leveraging the causal structure of the language model. Compared to gradient checkpointing, StreamBP scales up the maximum sequence length of BP by 2.8-5.5x, while using comparable or even less BP time. Note that StreamBP's sequence length scaling ability can be directly transferred to batch size scaling for accelerating training. We further develop a communication-efficient distributed StreamBP to effectively support multi-GPU training and broaden its applicability. Our code can be easily integrated into the training pipeline of any transformer models and is available at https://github.com/Ledzy/StreamBP.
A method that splits the input along the sequence dimension and performs backprop chunk by chunk. It reminds me of the earlier Blockwise Transformer (https://arxiv.org/abs/2305.19370).
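As a concrete illustration of the memory saving on the logits side, here is a minimal sketch of sequence-chunked backpropagation (my simplification in PyTorch; StreamBP itself decomposes the chain rule layer-wise across the whole model): the full [T, V] logits tensor is never materialized, yet the backbone still receives an exact gradient.

```python
import torch
import torch.nn.functional as F

def chunked_lm_loss_backward(hidden, lm_head, labels, chunk_size=1024):
    """Exact LM loss + backward without materializing all logits at once.

    hidden:  [T, d] final hidden states from the backbone (requires grad)
    labels:  [T] target token ids, already shifted to align with hidden
    """
    hidden_detached = hidden.detach().requires_grad_(True)
    n_tokens = labels.numel()
    total_loss = 0.0
    for start in range(0, hidden.size(0), chunk_size):
        end = min(start + chunk_size, hidden.size(0))
        logits = lm_head(hidden_detached[start:end])  # [chunk, V]
        loss = F.cross_entropy(logits, labels[start:end], reduction="sum") / n_tokens
        loss.backward()  # frees this chunk's logits before the next chunk
        total_loss += loss.item()
    # One exact backward pass through the backbone with the accumulated gradient.
    hidden.backward(hidden_detached.grad)
    return total_loss
```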
#efficiency #long-context #transformer
The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning
(Xinyu Zhu, Mengzhou Xia, Zhepei Wei, Wei-Lin Chen, Danqi Chen, Yu Meng)
Reinforcement learning with verifiable rewards (RLVR) is a promising approach for training language models (LMs) on reasoning tasks that elicit emergent long chains of thought (CoTs). Unlike supervised learning, it updates the model using both correct and incorrect samples via policy gradients. To better understand its mechanism, we decompose the learning signal into reinforcing correct responses and penalizing incorrect ones, referred to as Positive and Negative Sample Reinforcement (PSR and NSR), respectively. We train Qwen2.5-Math-7B and Qwen3-4B on a mathematical reasoning dataset and uncover a surprising result: training with only negative samples -- without reinforcing correct responses -- can be highly effective: it consistently improves performance over the base model across the entire Pass@k spectrum (k up to 256), often matching or surpassing PPO and GRPO. In contrast, reinforcing only correct responses improves Pass@1 but degrades performance at higher k, due to reduced diversity. These inference-scaling trends highlight that solely penalizing incorrect responses may contribute more to performance than previously recognized. Through gradient analysis, we show that NSR works by suppressing incorrect generations and redistributing probability mass toward other plausible candidates, guided by the model's prior beliefs. It refines the model's existing knowledge rather than introducing entirely new behaviors. Building on this insight, we propose a simple variant of the RL objective that upweights NSR, and show that it consistently improves overall Pass@k performance on MATH, AIME 2025, and AMC23. Our code is available at https://github.com/TianHongZXY/RLVR-Decomposed.
A study showing that reasoning RL can work solely by penalizing incorrect answers. Of course, these days, claiming that reasoning RL works with X alone really calls for results on models other than Qwen.
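For reference, the PSR/NSR decomposition is easy to write down; below is a simplified REINFORCE-style sketch (not the authors' exact setup), where setting the PSR weight to zero gives pure negative-sample reinforcement, and shifting the balance between the two terms corresponds to the upweighted-NSR variant the paper proposes (exact weights are theirs, not reproduced here).

```python
import torch

def decomposed_rl_loss(logprobs: torch.Tensor, rewards: torch.Tensor,
                       lam_nsr: float = 1.0, lam_psr: float = 0.0) -> torch.Tensor:
    """Simplified PSR/NSR policy-gradient loss (sketch, not the paper's code).

    logprobs: [B] summed token log-probs of each sampled response
    rewards:  [B] verifiable reward, 1.0 if correct else 0.0
    """
    pos = (rewards > 0).float()
    neg = 1.0 - pos
    # NSR term lowers the log-prob of incorrect samples; PSR term raises the
    # log-prob of correct ones. lam_psr=0.0 reproduces the NSR-only setting.
    return (lam_nsr * neg * logprobs - lam_psr * pos * logprobs).mean()
```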
#rl #reasoning