October 11, 2024
Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning
(Amrith Setlur, Chirag Nagpal, Adam Fisch, Xinyang Geng, Jacob Eisenstein, Rishabh Agarwal, Alekh Agarwal, Jonathan Berant, Aviral Kumar)
A promising approach for improving reasoning in large language models is to use process reward models (PRMs). PRMs provide feedback at each step of a multi-step reasoning trace, potentially improving credit assignment over outcome reward models (ORMs) that only provide feedback at the final step. However, collecting dense, per-step human labels is not scalable, and training PRMs from automatically-labeled data has thus far led to limited gains. To improve a base policy by running search against a PRM or using it as dense rewards for reinforcement learning (RL), we ask: "How should we design process rewards?". Our key insight is that, to be effective, the process reward for a step should measure progress: a change in the likelihood of producing a correct response in the future, before and after taking the step, corresponding to the notion of step-level advantages in RL. Crucially, this progress should be measured under a prover policy distinct from the base policy. We theoretically characterize the set of good provers and our results show that optimizing process rewards from such provers improves exploration during test-time search and online RL. In fact, our characterization shows that weak prover policies can substantially improve a stronger base policy, which we also observe empirically. We validate our claims by training process advantage verifiers (PAVs) to predict progress under such provers, and show that compared to ORMs, test-time search against PAVs is >8% more accurate, and 1.5-5× more compute-efficient. Online RL with dense rewards from PAVs enables one of the first results with 5-6× gain in sample efficiency, and >6% gain in accuracy, over ORMs.
This seems to be an extremely important result. What should a scalable process reward model look like? The paper argues it should be based on advantage rather than Q, and that the prover policy used to compute this advantage should be different from the base policy while not deviating too far from it. The solution the authors settle on is to use the base policy with best-of-K sampling as the prover policy.
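A minimal sketch of the progress idea, assuming Monte Carlo rollouts: the process reward for a step is the change in the prover's probability of eventually reaching a correct answer, before vs. after the step. Here `prover_policy`, `is_correct`, and the rollout count are hypothetical placeholders, and the actual PAV is a trained verifier that predicts these advantages rather than estimating them with rollouts at search time.

```python
def estimate_value(prover_policy, prefix, is_correct, n_rollouts=8):
    """Probability (under the prover policy) of reaching a correct final answer
    when continuing from `prefix`, estimated by Monte Carlo rollouts."""
    wins = sum(is_correct(prefix + prover_policy(prefix)) for _ in range(n_rollouts))
    return wins / n_rollouts

def process_advantages(problem, steps, prover_policy, is_correct, n_rollouts=8):
    """Process reward as progress: the change in the prover's probability of
    eventually solving the problem before vs. after each reasoning step."""
    advantages, prefix = [], problem
    prev_value = estimate_value(prover_policy, prefix, is_correct, n_rollouts)
    for step in steps:
        prefix += step
        value = estimate_value(prover_policy, prefix, is_correct, n_rollouts)
        advantages.append(value - prev_value)
        prev_value = value
    return advantages
```

In this sketch, `prover_policy` would correspond to something like best-of-K sampling from the base policy, i.e., a prover that is different from, but not too far from, the policy being improved.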
#rl #reasoning
MoE++: Accelerating Mixture-of-Experts Methods with Zero-Computation Experts
(Peng Jin, Bo Zhu, Li Yuan, Shuicheng Yan)
In this work, we aim to simultaneously enhance the effectiveness and efficiency of Mixture-of-Experts (MoE) methods. To achieve this, we propose MoE++, a general and heterogeneous MoE framework that integrates both Feed-Forward Network (FFN) and zero-computation experts. Specifically, we introduce three types of zero-computation experts: the zero expert, copy expert, and constant expert, which correspond to discard, skip, and replace operations, respectively. This design offers three key advantages: (i) Low Computing Overhead: Unlike the uniform mixing mechanism for all tokens within vanilla MoE, MoE++ allows each token to engage with a dynamic number of FFNs, be adjusted by constant vectors, or even skip the MoE layer entirely. (ii) High Performance: By enabling simple tokens to utilize fewer FFN experts, MoE++ allows more experts to focus on challenging tokens, thereby unlocking greater performance potential than vanilla MoE. (iii) Deployment Friendly: Given that zero-computation experts have negligible parameters, we can deploy all zero-computation experts on each GPU, eliminating the significant communication overhead and expert load imbalance associated with FFN experts distributed across different GPUs. Moreover, we leverage gating residuals, enabling each token to consider the pathway taken in the previous layer when selecting the appropriate experts. Extensive experimental results demonstrate that MoE++ achieves better performance while delivering 1.1-2.1x expert forward throughput compared to a vanilla MoE model of the same size, which lays a solid foundation for developing advanced and efficient MoE-related models.
A method that strengthens expressive power while reducing overall compute by offering computationally cheap operations, such as the identity, as expert choices. Interesting idea.
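A toy illustration, assuming top-1 routing and PyTorch, of how zero-computation experts could sit alongside FFN experts; the constant expert and the gating residual are simplified relative to the paper's formulation, and all dimensions are made up.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEPlusPlusLayer(nn.Module):
    """Toy MoE layer with top-1 routing over FFN experts plus three
    zero-computation experts: zero (discard), copy (skip), constant (replace)."""

    def __init__(self, d_model=64, n_ffn_experts=4, d_hidden=256):
        super().__init__()
        self.ffn_experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_ffn_experts)
        ])
        self.const_vector = nn.Parameter(torch.zeros(d_model))  # constant expert
        self.n_experts = n_ffn_experts + 3
        self.router = nn.Linear(d_model, self.n_experts)

    def forward(self, x, prev_gate_logits=None):
        # gating residual: let the previous layer's routing scores bias this layer's
        logits = self.router(x)
        if prev_gate_logits is not None:
            logits = logits + prev_gate_logits
        top1 = F.softmax(logits, dim=-1).argmax(dim=-1)   # [batch, seq]
        out = torch.zeros_like(x)
        n_ffn = len(self.ffn_experts)
        for i, expert in enumerate(self.ffn_experts):
            mask = top1 == i
            if mask.any():
                out[mask] = expert(x[mask])
        out[top1 == n_ffn] = 0.0                          # zero expert: discard token
        out[top1 == n_ffn + 1] = x[top1 == n_ffn + 1]     # copy expert: skip the layer
        out[top1 == n_ffn + 2] = self.const_vector        # constant expert: replace
        return out, logits

# toy usage
layer = MoEPlusPlusLayer()
out, gate_logits = layer(torch.randn(2, 10, 64))
```

Since the zero, copy, and constant experts involve (almost) no computation, tokens routed to them cost essentially nothing, which is where the throughput gain comes from.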
#moe
MathCoder2: Better Math Reasoning from Continued Pretraining on Model-translated Mathematical Code
(Zimu Lu, Aojun Zhou, Ke Wang, Houxing Ren, Weikang Shi, Junting Pan, Mingjie Zhan, Hongsheng Li)
Code has been shown to be effective in enhancing the mathematical reasoning abilities of large language models due to its precision and accuracy. Previous works involving continued mathematical pretraining often include code that utilizes math-related packages, which are primarily designed for fields such as engineering, machine learning, signal processing, or module testing, rather than being directly focused on mathematical reasoning. In this paper, we introduce a novel method for generating mathematical code accompanied with corresponding reasoning steps for continued pretraining. Our approach begins with the construction of a high-quality mathematical continued pretraining dataset by incorporating math-related web data, code using mathematical packages, math textbooks, and synthetic data. Next, we construct reasoning steps by extracting LaTeX expressions, the conditions needed for the expressions, and the results of the expressions from the previously collected dataset. Based on this extracted information, we generate corresponding code to accurately capture the mathematical reasoning process. Appending the generated code to each reasoning step results in data consisting of paired natural language reasoning steps and their corresponding code. Combining this data with the original dataset results in a 19.2B-token high-performing mathematical pretraining corpus, which we name MathCode-Pile. Training several popular base models with this corpus significantly improves their mathematical abilities, leading to the creation of the MathCoder2 family of models. All of our data processing and training code is open-sourced, ensuring full transparency and easy reproducibility of the entire data collection and training pipeline. The code is released at https://github.com/mathllm/MathCoder2 .
A method for constructing a dataset that mixes natural-language reasoning with code. The pipeline collects math-related data from Common Crawl, has an LLM generate corresponding code, and verifies the code by checking its output against the answer.
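A rough sketch of what the verification step could look like: generated code is kept only if executing it reproduces the result extracted from the source text. `generate_code_with_llm` and the convention of reading a `result` variable are hypothetical placeholders, not the paper's actual interface.

```python
def verify_generated_code(code: str, expected_result: str) -> bool:
    """Execute LLM-generated code in a scratch namespace and compare the value
    it leaves in `result` against the result extracted from the source text."""
    namespace = {}
    try:
        exec(code, namespace)  # in practice this should run in a sandbox
    except Exception:
        return False
    return str(namespace.get("result")) == expected_result

def build_paired_example(reasoning_step: str, expected_result: str, generate_code_with_llm):
    """Pair a natural-language reasoning step with verified code, in the spirit
    of the paper's data construction; discard the example if verification fails."""
    code = generate_code_with_llm(reasoning_step, expected_result)
    if verify_generated_code(code, expected_result):
        return reasoning_step + "\n# corresponding code\n" + code
    return None
```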
#corpus #math #code
Scaling Laws For Diffusion Transformers
(Zhengyang Liang, Hao He, Ceyuan Yang, Bo Dai)
Diffusion transformers (DiT) have already achieved appealing synthesis and scaling properties in content recreation, e.g., image and video generation. However, scaling laws of DiT are less explored, which usually offer precise predictions regarding optimal model size and data requirements given a specific compute budget. Therefore, experiments across a broad range of compute budgets, from 1e17 to 6e18 FLOPs are conducted to confirm the existence of scaling laws in DiT for the first time. Concretely, the loss of pretraining DiT also follows a power-law relationship with the involved compute. Based on the scaling law, we can not only determine the optimal model size and required data but also accurately predict the text-to-image generation loss given a model with 1B parameters and a compute budget of 1e21 FLOPs. Additionally, we also demonstrate that the trend of pre-training loss matches the generation performances (e.g., FID), even across various datasets, which complements the mapping from compute to synthesis quality and thus provides a predictable benchmark that assesses model performance and data quality at a reduced cost.
Scaling-law estimation for DiT. It would be interesting to compare the scaling-law exponents of autoregressive and diffusion models (https://arxiv.org/abs/2405.13218), but since the model architectures are still evolving, that comparison won't be straightforward yet.
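As a reminder of the mechanics, a power law of the form loss ≈ a · C^(−b) can be fit by linear regression in log-log space and then extrapolated to larger budgets, which is the kind of prediction the paper makes. The data points below are made up for illustration, not taken from the paper.

```python
import numpy as np

def fit_power_law(compute, loss):
    """Fit loss ≈ a * compute**(-b) via linear regression in log-log space."""
    slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
    return np.exp(intercept), -slope

# toy (compute, loss) points spanning the paper's budget range, values invented
compute = np.array([1e17, 3e17, 1e18, 6e18])
loss = np.array([0.62, 0.55, 0.49, 0.42])
a, b = fit_power_law(compute, loss)
predicted_loss = a * 1e21 ** (-b)  # extrapolate to a larger budget, as the paper does
```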
#diffusion #scaling-law
Upcycling Large Language Models into Mixture of Experts
(Ethan He, Abhinav Khattar, Ryan Prenger, Vijay Korthikanti, Zijie Yan, Tong Liu, Shiqing Fan, Ashwath Aithal, Mohammad Shoeybi, Bryan Catanzaro)
Upcycling pre-trained dense language models into sparse mixture-of-experts (MoE) models is an efficient approach to increase the model capacity of already trained models. However, optimal techniques for upcycling at scale remain unclear. In this work, we conduct an extensive study of upcycling methods and hyperparameters for billion-parameter scale language models. We propose a novel "virtual group" initialization scheme and weight scaling approach to enable upcycling into fine-grained MoE architectures. Through ablations, we find that upcycling outperforms continued dense model training. In addition, we show that softmax-then-topK expert routing improves over topK-then-softmax approach and higher granularity MoEs can help improve accuracy. Finally, we upcycled Nemotron-4 15B on 1T tokens and compared it to a continuously trained version of the same model on the same 1T tokens: the continuous trained model achieved 65.3% MMLU, whereas the upcycled model achieved 67.6%. Our results offer insights and best practices to effectively leverage upcycling for building MoE language models.
A study of recipes for sparse upcycling, exploring how to set the learning rate and batch size, and how to split the FFN into smaller fine-grained experts. The most interesting finding is that splitting the FFN to match the dense model's FLOPs and then doing continual pretraining yields better performance than continuing to train the dense model.
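One of the concrete routing findings is that softmax-then-topK improves over topK-then-softmax. A small PyTorch sketch of the difference between the two orderings, with toy logits:

```python
import torch
import torch.nn.functional as F

def softmax_then_topk(logits, k):
    """Normalize over all experts first, then keep the top-k weights
    (the kept weights no longer sum to 1)."""
    probs = F.softmax(logits, dim=-1)
    return probs.topk(k, dim=-1)

def topk_then_softmax(logits, k):
    """Select the top-k logits first, then renormalize only among the
    selected experts (the common vanilla-MoE ordering)."""
    top_logits, indices = logits.topk(k, dim=-1)
    return F.softmax(top_logits, dim=-1), indices

# toy router logits for a single token over 8 experts
logits = torch.randn(8)
print(softmax_then_topk(logits, 2))
print(topk_then_softmax(logits, 2))
```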
#moe #continual-pretraining