Jun 26, 2025

OctoThinker: Mid-training Incentivizes Reinforcement Learning Scaling

(Zengzhi Wang, Fan Zhou, Xuefeng Li, Pengfei Liu)

Different base language model families, such as Llama and Qwen, exhibit divergent behaviors during post-training with reinforcement learning (RL), especially on reasoning-intensive tasks. What makes a base language model suitable for reinforcement learning? Gaining deeper insight into this question is essential for developing RL-scalable foundation models of the next generation. In this work, we investigate how mid-training strategies shape RL dynamics, focusing on two representative model families: Qwen and Llama. Our study reveals that (1) high-quality mathematical corpora, such as MegaMath-Web-Pro, significantly improve both base model and RL performance, while existing alternatives (e.g., FineMath-4plus) fail to do so; (2) further adding QA-style data, particularly long chain-of-thought (CoT) reasoning examples, enhances RL outcomes, and instruction data further unlocks this effect; (3) while long-CoT improves reasoning depth, it can also induce verbosity of model responses and instability of RL training, underscoring the importance of data formatting; (4) scaling mid-training consistently leads to stronger downstream RL performance. Building on these insights, we introduce a two-stage mid-training strategy, Stable-then-Decay, in which base models are first trained on 200B tokens with a constant learning rate, followed by 20B tokens across three CoT-focused branches with learning rate decay. This yields OctoThinker, a family of models demonstrating strong RL compatibility and closing the performance gap with more RL-friendly model families, i.e., Qwen. We hope our work will help shape pre-training strategies for foundation models in the RL era. To support further research, we release our open-source models along with a curated math reasoning-intensive corpus of over 70 billion tokens (i.e., MegaMath-Web-Pro-Max).

This study improves reasoning RL performance by adding mathematical and instruction data during mid-training. Mid-training is a crucial stage, but it is also a convenient place to hack benchmarks, so some balance is needed.
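To make the Stable-then-Decay idea from the abstract concrete, here is a minimal sketch of a learning rate schedule with a constant phase followed by a decay phase. Only the token budgets (200B stable, 20B decay) come from the abstract. The function name, the `peak_lr` and `min_lr` values, and the cosine decay shape are assumptions for illustration, and the sketch does not model how the decay stage is split across the three CoT-focused branches.

```python
import math


def stable_then_decay_lr(tokens_seen, peak_lr=3e-4, min_lr=3e-5,
                         stable_tokens=200e9, decay_tokens=20e9):
    """Return the learning rate after `tokens_seen` training tokens.

    The token budgets follow the abstract; peak_lr, min_lr, and the
    cosine decay shape are illustrative assumptions, not the paper's values.
    """
    # Stage 1 (stable): constant learning rate.
    if tokens_seen <= stable_tokens:
        return peak_lr
    # Stage 2 (decay): anneal from peak_lr to min_lr over decay_tokens.
    progress = min((tokens_seen - stable_tokens) / decay_tokens, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine


if __name__ == "__main__":
    for tokens in (0, 100e9, 200e9, 210e9, 220e9):
        print(f"{tokens / 1e9:>5.0f}B tokens -> lr = {stable_then_decay_lr(tokens):.2e}")
```

In a real training loop this function would be called once per optimizer step with the running token count, and each branch would presumably restart the decay stage from the same stable checkpoint.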

#rl #reasoning #mid-training

DiffuCoder: Understanding and Improving Masked Diffusion Models for Code Generation

(Shansan Gong, Ruixiang Zhang, Huangjie Zheng, Jiatao Gu, Navdeep Jaitly, Lingpeng Kong, Yizhe Zhang)

Diffusion large language models (dLLMs) are compelling alternatives to autoregressive (AR) models because their denoising models operate over the entire sequence. The global planning and iterative refinement features of dLLMs are particularly useful for code generation. However, current training and inference mechanisms for dLLMs in coding are still under-explored. To demystify the decoding behavior of dLLMs and unlock their potential for coding, we systematically investigate their denoising processes and reinforcement learning (RL) methods. We train a 7B dLLM, DiffuCoder, on 130B tokens of code. Using this model as a testbed, we analyze its decoding behavior, revealing how it differs from that of AR models: (1) dLLMs can decide how causal their generation should be without relying on semi-AR decoding, and (2) increasing the sampling temperature diversifies not only token choices but also their generation order. This diversity creates a rich search space for RL rollouts. For RL training, to reduce the variance of token log-likelihood estimates and maintain training efficiency, we propose coupled-GRPO, a novel sampling scheme that constructs complementary mask noise for completions used in training. In our experiments, coupled-GRPO significantly improves DiffuCoder's performance on code generation benchmarks (+4.4% on EvalPlus) and reduces reliance on AR-style causal ordering during decoding. Our work provides deeper insight into the machinery of dLLM generation and offers an effective, diffusion-native RL training framework. https://github.com/apple/ml-diffucoder.

Apple has also built a diffusion LLM. The paper includes an analysis of how diffusion LLMs, despite being diffusion-based, still decode in a largely autoregressive-like order.
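The "complementary mask noise" mentioned in the abstract can be illustrated with a small sketch: draw one random mask over a completion and pair it with its complement, so that every token is masked in exactly one of the two passes. This covers only the mask construction under my reading of the abstract, not the GRPO objective or the paper's actual noise sampling; the function name and the `mask_ratio` and `seed` parameters are hypothetical.

```python
import random


def complementary_masks(length, mask_ratio=0.5, seed=None):
    """Build two boolean masks over `length` completion tokens.

    True means "this position is masked". The second mask is the exact
    complement of the first, so each position is masked in exactly one
    of the two. mask_ratio and seed are illustrative assumptions.
    """
    rng = random.Random(seed)
    positions = list(range(length))
    rng.shuffle(positions)
    # Choose which positions the first mask covers.
    n_masked = round(length * mask_ratio)
    masked_first = set(positions[:n_masked])
    mask_a = [i in masked_first for i in range(length)]
    # The coupled mask covers everything the first one left visible.
    mask_b = [not m for m in mask_a]
    return mask_a, mask_b


if __name__ == "__main__":
    a, b = complementary_masks(8, mask_ratio=0.25, seed=0)
    print("mask A:", a)
    print("mask B:", b)
```

Because the two masks together cover every position once, log-likelihood estimates from the pair use every completion token, which is one plausible way to see why such coupling could reduce variance compared with a single random mask.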

#diffusion #llm #code
