February 19, 2025
MoBA: Mixture of Block Attention for Long-Context LLMs
(Enzhe Lu, Zhejun Jiang, Jingyuan Liu, Yulun Du, Tao Jiang, Chao Hong, Shaowei Liu, Weiran He, Enming Yuan, Yuzhi Wang, Zhiqi Huang, Huan Yuan, Suting Xu, Xinran Xu, Guokun Lai, Yanru Chen, Huabin Zheng, Junjie Yan, Jianlin Su, Yuxin Wu, Neo Y. Zhang, Zhilin Yang, Xinyu Zhou, Mingxing Zhang, Jiezhong Qiu)
Scaling the effective context length is essential for advancing large language models (LLMs) toward artificial general intelligence (AGI). However, the quadratic increase in computational complexity inherent in traditional attention mechanisms presents a prohibitive overhead. Existing approaches either impose strongly biased structures, such as sink or window attention which are task-specific, or radically modify the attention mechanism into linear approximations, whose performance in complex reasoning tasks remains inadequately explored. In this work, we propose a solution that adheres to the “less structure” principle, allowing the model to determine where to attend autonomously, rather than introducing predefined biases. We introduce Mixture of Block Attention (MoBA), an innovative approach that applies the principles of Mixture of Experts (MoE) to the attention mechanism. This novel architecture demonstrates superior performance on long-context tasks while offering a key advantage: the ability to seamlessly transition between full and sparse attention, enhancing efficiency without the risk of compromising performance. MoBA has already been deployed to support Kimi’s long-context requests and demonstrates significant advancements in efficient attention computation for LLMs. Our code is available at https://github.com/MoonshotAI/moba.
As with R1, Moonshot AI has once again released a paper similar to DeepSeek's at around the same time, this time on sparse attention. Much like NSA (https://arxiv.org/abs/2502.11089), it splits the sequence into blocks, builds a representative token for each block via pooling, and then selects the top-K blocks to attend to.
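To make the block-selection idea concrete, here is a minimal single-head sketch of MoBA-style attention as I read the description above: mean-pool each key block into a representative, score the blocks per query, and restrict attention to the top-K selected blocks. The block size, top-k, pooling choice, and the omission of a causal mask are simplifications for illustration, not the authors' implementation.

```python
# Minimal sketch of MoBA-style block attention (a reading of the description
# above, not the official implementation). Block size, top-k, and mean pooling
# for the block representatives are illustrative assumptions.
import torch
import torch.nn.functional as F

def moba_attention(q, k, v, block_size=64, top_k=3):
    """q, k, v: [seq_len, d]. Single head, no causal mask, for clarity."""
    seq_len, d = k.shape
    n_blocks = (seq_len + block_size - 1) // block_size

    # Pad so the sequence splits evenly into blocks.
    pad = n_blocks * block_size - seq_len
    k_pad = F.pad(k, (0, 0, 0, pad))
    # Representative token per block via mean pooling.
    block_repr = k_pad.view(n_blocks, block_size, d).mean(dim=1)  # [n_blocks, d]

    # Each query scores the blocks and keeps only the top-k of them.
    block_scores = q @ block_repr.T / d**0.5                      # [seq_len, n_blocks]
    top_blocks = block_scores.topk(min(top_k, n_blocks), dim=-1).indices

    # Token-level mask: a key token is visible iff its block was selected.
    token_block = torch.arange(seq_len) // block_size             # [seq_len]
    visible = (token_block[None, None, :] == top_blocks[:, :, None]).any(dim=1)

    scores = (q @ k.T) / d**0.5
    scores = scores.masked_fill(~visible, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

# Example: 256 tokens, 32-dim head; each query attends to at most 3 blocks.
q, k, v = (torch.randn(256, 32) for _ in range(3))
print(moba_attention(q, k, v).shape)  # torch.Size([256, 32])
```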
#sparsity #efficiency #efficient-training
NaturalReasoning: Reasoning in the Wild with 2.8M Challenging Questions
(Weizhe Yuan, Jane Yu, Song Jiang, Karthik Padthe, Yang Li, Dong Wang, Ilia Kulikov, Kyunghyun Cho, Yuandong Tian, Jason E Weston, Xian Li)
Scaling reasoning capabilities beyond traditional domains such as math and coding is hindered by the lack of diverse and high-quality questions. To overcome this limitation, we introduce a scalable approach for generating diverse and challenging reasoning questions, accompanied by reference answers. We present NaturalReasoning, a comprehensive dataset comprising 2.8 million questions that span multiple domains, including STEM fields (e.g., Physics, Computer Science), Economics, Social Sciences, and more. We demonstrate the utility of the questions in NaturalReasoning through knowledge distillation experiments which show that NaturalReasoning can effectively elicit and transfer reasoning capabilities from a strong teacher model. Furthermore, we demonstrate that NaturalReasoning is also effective for unsupervised self-training using external reward models or self-rewarding.
A method for extracting questions and answers from web corpora. This has long been an important direction, but it has become even more so now that training reasoning abilities against reference answers is a central research focus.
#synthetic-data
Learning to Reason at the Frontier of Learnability
(Thomas Foster, Jakob Foerster)
Reinforcement learning is now widely adopted as the final stage of large language model training, especially for reasoning-style tasks such as maths problems. Typically, models attempt each question many times during a single training step and attempt to learn from their successes and failures. However, we demonstrate that throughout training with two popular algorithms (PPO and VinePPO) on two widely used datasets, many questions are either solved by all attempts - meaning they are already learned - or by none - providing no meaningful training signal. To address this, we adapt a method from the reinforcement learning literature - sampling for learnability - and apply it to the reinforcement learning stage of LLM training. Our curriculum prioritises questions with high variance of success, i.e. those where the agent sometimes succeeds, but not always. Our findings demonstrate that this curriculum consistently boosts training performance across multiple algorithms and datasets, paving the way for more efficient and effective reinforcement learning in LLMs.
Reasoning RL driven by learnability p(1 - p), where p is the model's success rate on a question. Methods for building this kind of curriculum are emerging naturally. (https://arxiv.org/abs/2502.11886)
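A minimal sketch of what learnability-based sampling might look like, assuming p is estimated from a handful of rollouts per question; `model.solve`, the rollout count, and the sampling scheme are hypothetical stand-ins for illustration, not the paper's procedure.

```python
# Learnability-weighted question sampling (illustrative sketch).
import random

def success_rate(model, question, n_rollouts=8):
    """Fraction of rollouts that reach the correct answer (p).
    `model.solve` is a hypothetical helper returning 1 (correct) or 0."""
    return sum(model.solve(question) for _ in range(n_rollouts)) / n_rollouts

def sample_batch(model, questions, batch_size=32):
    # Learnability p(1 - p) peaks at p = 0.5: questions the model sometimes,
    # but not always, solves. Always-solved (p = 1) and never-solved (p = 0)
    # questions get zero weight and are effectively skipped.
    p = [success_rate(model, q) for q in questions]
    weights = [pi * (1 - pi) for pi in p]
    if sum(weights) == 0:
        return random.sample(questions, batch_size)  # fall back to uniform
    return random.choices(questions, weights=weights, k=batch_size)
```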
#rl #reasoning
S^2R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning
(Ruotian Ma, Peisong Wang, Cheng Liu, Xingyan Liu, Jiaqi Chen, Bang Zhang, Xin Zhou, Nan Du, Jia Li)
Recent studies have demonstrated the effectiveness of LLM test-time scaling. However, existing approaches to incentivize LLMs' deep thinking abilities generally require large-scale data or significant training efforts. Meanwhile, it remains unclear how to improve the thinking abilities of less powerful base models. In this work, we introduce S^2R, an efficient framework that enhances LLM reasoning by teaching models to self-verify and self-correct during inference. Specifically, we first initialize LLMs with iterative self-verification and self-correction behaviors through supervised fine-tuning on carefully curated data. The self-verification and self-correction skills are then further strengthened by both outcome-level and process-level reinforcement learning, with minimized resource requirements, enabling the model to adaptively refine its reasoning process during inference. Our results demonstrate that, with only 3.1k self-verifying and self-correcting behavior initialization samples, Qwen2.5-math-7B achieves an accuracy improvement from 51.0% to 81.6%, outperforming models trained on an equivalent amount of long-CoT distilled data. Extensive experiments and analysis based on three base models across both in-domain and out-of-domain benchmarks validate the effectiveness of S^2R. Our code and data are available at https://github.com/NineAbyss/S2R.
The now-standard pipeline of injecting a behavioral pattern via SFT and then applying RL. The most interesting part of the method is how the SFT data for teaching revision behavior is constructed.
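As a rough illustration of what such behavior-initialization data could look like (the tags and field names below are hypothetical, not taken from S^2R):

```python
# Hypothetical self-verify/self-correct SFT trajectory; format is illustrative.
trajectory = {
    "question": "Compute 17 * 23.",
    "response": (
        "Attempt: 17 * 23 = 17 * 20 + 17 * 3 = 340 + 51 = 391.\n"
        "<verify> Check: 391 / 17 = 23, so the product is consistent. </verify>\n"
        "<answer> 391 </answer>"
    ),
}
# Stage 1: supervised fine-tuning on a few thousand such trajectories injects
# the verify-then-correct behavior; stage 2 strengthens it with outcome-level
# and process-level RL rewards, per the abstract.
```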
#rl #reasoning
Revisiting the Test-Time Scaling of o1-like Models: Do they Truly Possess Test-Time Scaling Capabilities?
(Zhiyuan Zeng, Qinyuan Cheng, Zhangyue Yin, Yunhua Zhou, Xipeng Qiu)
The advent of test-time scaling in large language models (LLMs), exemplified by OpenAI's o1 series, has advanced reasoning capabilities by scaling computational resource allocation during inference. While successors like QwQ, Deepseek-R1 (R1) and LIMO replicate these advancements, whether these models truly possess test-time scaling capabilities remains underexplored. This study found that longer CoTs of these o1-like models do not consistently enhance accuracy; in fact, correct solutions are often shorter than incorrect ones for the same questions. Further investigation shows this phenomenon is closely related to models' self-revision capabilities - longer CoTs contain more self-revisions, which often lead to performance degradation. We then compare sequential and parallel scaling strategies on QwQ, R1 and LIMO, finding that parallel scaling achieves better coverage and scalability. Based on these insights, we propose Shortest Majority Vote, a method that combines parallel scaling strategies with CoT length characteristics, significantly improving models' test-time scalability compared to conventional majority voting approaches.
An analysis of the relationship between reasoning length and accuracy. For the same question, correct solutions tend to have shorter reasoning traces than incorrect ones, and the main driver of longer traces is revision attempts that are not particularly effective.
Revision really is one of the harder capabilities to train. Perhaps this traces back to a basic property of autoregression: the model's own generated output becomes evidence for what it generates next.
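The abstract does not spell out how Shortest Majority Vote works; below is one plausible reading, majority voting that prefers answers supported by shorter CoTs, written as an assumption rather than the paper's definition.

```python
# One plausible interpretation of "Shortest Majority Vote", based only on the
# abstract: count votes across parallel samples, then break ties toward the
# answer backed by shorter CoTs. Not necessarily the paper's exact rule.
from collections import defaultdict

def shortest_majority_vote(samples):
    """samples: list of (final_answer, cot_length) from parallel generations."""
    votes = defaultdict(list)
    for answer, length in samples:
        votes[answer].append(length)
    # Rank by vote count first, then by shorter average CoT length.
    return max(votes.items(),
               key=lambda kv: (len(kv[1]), -sum(kv[1]) / len(kv[1])))[0]

print(shortest_majority_vote([("42", 300), ("42", 900), ("41", 250)]))  # "42"
```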
#reasoning #inference-time-scaling