September 20, 2024
Training Language Models to Self-Correct via Reinforcement Learning
(Aviral Kumar, Vincent Zhuang, Rishabh Agarwal, Yi Su, John D Co-Reyes, Avi Singh, Kate Baumli, Shariq Iqbal, Colton Bishop, Rebecca Roelofs, Lei M Zhang, Kay McKinney, Disha Shrivastava, Cosmin Paduraru, George Tucker, Doina Precup, Feryal Behbahani, Aleksandra Faust)
Self-correction is a highly desirable capability of large language models (LLMs), yet it has consistently been found to be largely ineffective in modern LLMs. Existing approaches for training self-correction either require multiple models or rely on a more capable model or other forms of supervision. To this end, we develop a multi-turn online reinforcement learning (RL) approach, SCoRe, that significantly improves an LLM's self-correction ability using entirely self-generated data. To build SCoRe, we first show that variants of supervised fine-tuning (SFT) on offline model-generated correction traces are insufficient for instilling self-correction behavior. In particular, we observe that training via SFT either suffers from a distribution mismatch between the training data and the model's own responses or implicitly prefers only a certain mode of correction behavior that is often not effective at test time. SCoRe addresses these challenges by training under the model's own distribution of self-generated correction traces and using appropriate regularization to steer the learning process into learning a self-correction strategy that is effective at test time as opposed to simply fitting high-reward responses for a given prompt. This regularization prescribes running a first phase of RL on a base model to generate a policy initialization that is less susceptible to collapse and then using a reward bonus to amplify self-correction during training. When applied to Gemini 1.0 Pro and 1.5 Flash models, we find that SCoRe achieves state-of-the-art self-correction performance, improving the base models' self-correction by 15.6% and 9.1% respectively on the MATH and HumanEval benchmarks.
A very important result: a method for how to get an LLM to actually self-correct.
It starts from the finding that SFT alone doesn't work: the model tends to make only minor edits, and the distribution shift between the training data and the model's own responses is a major problem.
So the answer is RL.
SFT is dropped entirely. In stage 1, RL is applied to the second response while a KL penalty keeps the first-response distribution close to the base model. In stage 2, RL is run with reward shaping that encourages self-correction.
The point that this problem should be tackled with RL has already come up in discussions of o1 (https://x.com/wgussml/status/1834691198013129053), so anyone serious about this problem should probably be thinking along these lines.
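As a rough sketch of the two-stage objective described above (my own reconstruction from the abstract, not the authors' code; the names kl_weight and bonus_alpha and the exact way the rewards are combined are assumptions):

```python
# Stage 1: optimize the reward of the second attempt while a KL penalty keeps
# the first-attempt distribution pinned to the base model, so the model does
# not collapse into "answer on turn 1 and never actually edit".
def stage1_objective(r2: float, kl_first_attempt: float, kl_weight: float = 1.0) -> float:
    return r2 - kl_weight * kl_first_attempt


# Stage 2: optimize both attempts with a shaped bonus that amplifies genuine
# self-correction (second attempt better than the first).
def stage2_reward(r1: float, r2: float, bonus_alpha: float = 2.0) -> float:
    return r1 + r2 + bonus_alpha * (r2 - r1)
```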
#rl #reasoning
Don't teach. Incentivize.
(Hyung Won Chung)
https://docs.google.com/presentation/d/1nnjXIuN2XDJENAOaKXI5srQscO3276svvP6JgivTv6w
Hyung Won Chung's MIT seminar. The message: don't try to teach the model a capability, give it an incentive to acquire that capability. Through next token prediction, for example, we give the model an incentive to acquire a vast range of skills and knowledge. And he argues that this is the far more scalable direction.
Also: just because something doesn't work now doesn't mean it never will. As scale increases, the limits of models keep shifting, so we have to keep discarding the "this doesn't work" knowledge that was built up against models of a previous scale.
#scaling
Inference-Friendly Models with MixAttention
(Shashank Rajput, Ying Sheng, Sean Owen, Vitaliy Chiley)
An experiment on the inference optimization Character AI described (https://research.character.ai/optimizing-inference/): combining sliding window attention with KV cache sharing. Here it is a GQA setting rather than MQA.
They tested several variants, and performance differences only showed up once the models were trained on long-context task data. (Training was on the order of 110B tokens, though.)
Mixing local and global attention seems like a relatively safe choice (it resembles the SSM-transformer hybrid setups, and Gemma 2 uses this layout too, though Gemma 2 is not a long-context model), but the KV cache sharing configuration probably needs further tuning. The annoying part is that the differences may only become pronounced once you apply it to long-context problems.
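To make the layer layout concrete, here is a hedged sketch of what a MixAttention-style stack might look like; the window size, the one-global-per-four-layers ratio, and the cache-sharing pattern are illustrative assumptions, not the configuration from the paper:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class LayerSpec:
    kind: str              # "local" (sliding-window) or "global" (full attention)
    window: Optional[int]  # sliding-window size for local layers, None for global
    kv_cache_group: int    # layers with the same group id reuse one KV cache

def build_layout(n_layers: int = 24, global_every: int = 4, window: int = 1024):
    layout = []
    for i in range(n_layers):
        if i % global_every == 0:
            # Global layers keep full-context KV; sharing one cache across all
            # global layers bounds memory growth with context length.
            layout.append(LayerSpec("global", None, kv_cache_group=0))
        else:
            # Local layers only cache the last `window` tokens; the local
            # layers within one block share a cache group.
            layout.append(LayerSpec("local", window, kv_cache_group=1 + i // global_every))
    return layout
```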
#efficient-attention #efficiency
Michelangelo: Long Context Evaluations Beyond Haystacks via Latent Structure Queries
(Kiran Vodrahalli, Santiago Ontanon, Nilesh Tripuraneni, Kelvin Xu, Sanil Jain, Rakesh Shivanna, Jeffrey Hui, Nishanth Dikkala, Mehran Kazemi, Bahare Fatemi, Rohan Anil, Ethan Dyer, Siamak Shakeri, Roopali Vij, Harsh Mehta, Vinay Ramasesh, Quoc Le, Ed Chi, Yifeng Lu, Orhan Firat, Angeliki Lazaridou, Jean-Baptiste Lespiau, Nithya Attaluri, Kate Olszewska)
We introduce Michelangelo: a minimal, synthetic, and unleaked long-context reasoning evaluation for large language models which is also easy to automatically score. This evaluation is derived via a novel, unifying framework for evaluations over arbitrarily long contexts which measure the model's ability to do more than retrieve a single piece of information from its context. The central idea of the Latent Structure Queries framework (LSQ) is to construct tasks which require a model to "chisel away" the irrelevant information in the context, revealing a latent structure in the context. To verify a model's understanding of this latent structure, we query the model for details of the structure. Using LSQ, we produce three diagnostic long-context evaluations across code and natural-language domains intended to provide a stronger signal of long-context language model capabilities. We perform evaluations on several state-of-the-art models and demonstrate both that a) the proposed evaluations are high-signal and b) that there is significant room for improvement in synthesizing long-context information.
A benchmark where difficulty and context length are easy to adjust and scoring is easy, yet which is more interesting(?) than Needle in a Haystack.
It makes me briefly wonder why performance degrades as context length grows, and at the same time why a certain level of performance is maintained even at much longer lengths. It almost suggests that there are two distinct regimes, short and long, and that within the long regime the exact length doesn't matter much.
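To make the "latent structure" idea concrete, here is a toy generator in that spirit; the task format, the filler, and the operation mix are my own assumptions, not the benchmark's actual construction. The context is a stream of list operations buried in irrelevant lines, and the model has to chisel the filler away to track the latent list state. Difficulty and context length scale directly with n_ops and filler_ratio.

```python
import random

def make_latent_list_example(n_ops: int = 200, filler_ratio: float = 0.8, seed: int = 0):
    rng = random.Random(seed)
    state, lines = [], []
    for _ in range(n_ops):
        if rng.random() < filler_ratio:
            # Irrelevant filler the model must learn to ignore.
            lines.append(f"# note: unrelated fact {rng.randint(0, 9999)}")
        elif state and rng.random() < 0.3:
            state.pop()
            lines.append("my_list.pop()")
        else:
            x = rng.randint(0, 99)
            state.append(x)
            lines.append(f"my_list.append({x})")
    context = "my_list = []\n" + "\n".join(lines)
    question = "What is the final content of my_list?"
    return context, question, state  # `state` is the gold answer
```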
#long-context #benchmark
Scaling FP8 training to trillion-token LLMs
(Maxim Fishman, Brian Chmiel, Ron Banner, Daniel Soudry)
We train, for the first time, large language models using FP8 precision on datasets up to 2 trillion tokens -- a 20-fold increase over previous limits. Through these extended training runs, we uncover critical instabilities in FP8 training that were not observable in earlier works with shorter durations. We trace these instabilities to outlier amplification by the SwiGLU activation function. Interestingly, we show, both analytically and empirically, that this amplification happens only over prolonged training periods, and link it to a SwiGLU weight alignment process. To address this newly identified issue, we introduce Smooth-SwiGLU, a novel modification that ensures stable FP8 training without altering function behavior. We also demonstrate, for the first time, FP8 quantization of both Adam optimizer moments. Combining these innovations, we successfully train a 7B parameter model using FP8 precision on 256 Intel Gaudi2 accelerators, achieving on-par results with the BF16 baseline while delivering up to a ∼34% throughput improvement.
A problem discovered while running long FP8 training: outliers only start to appear prominently after training has progressed past a certain point, and these outliers break FP8 training.
The culprit they point to is SwiGLU. Since Swish is nearly linear above a certain input value, if the gate values and the values being gated become well aligned, the activation magnitude grows quadratically, and this is what produces the outliers.
So they resolved the problem by adding a scaling factor. On top of that, they used Adam with FP8 states and trained for 2T tokens. Interesting.
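A rough sketch of the outlier mechanism and a Smooth-SwiGLU-style fix; where exactly the scaling factor is folded in is my assumption, not necessarily the paper's formulation. In exact arithmetic smooth_swiglu computes the same function as swiglu, but the intermediate activation that gets quantized to FP8 is divided by a per-channel scale, which suppresses the outliers.

```python
import torch
import torch.nn.functional as F

def swiglu(x, w_gate, w_up, w_down):
    # If the gate and up projections become aligned, silu(x @ w_gate) is nearly
    # linear for large values, so the elementwise product grows roughly
    # quadratically -> outliers that overflow FP8's dynamic range.
    return (F.silu(x @ w_gate) * (x @ w_up)) @ w_down

def smooth_swiglu(x, w_gate, w_up, w_down, s):
    # s: per-channel scale of shape [d_ff]. Dividing the up projection by s
    # shrinks the intermediate activation h; multiplying the rows of w_down by
    # s undoes it, so the function is unchanged in exact arithmetic.
    h = F.silu(x @ w_gate) * (x @ (w_up / s))
    return h @ (s[:, None] * w_down)

# Quick equivalence check (float32):
d_model, d_ff = 16, 64
x = torch.randn(2, d_model)
wg, wu = torch.randn(d_model, d_ff), torch.randn(d_model, d_ff)
wd, s = torch.randn(d_ff, d_model), torch.rand(d_ff) + 0.5
assert torch.allclose(swiglu(x, wg, wu, wd), smooth_swiglu(x, wg, wu, wd, s), atol=1e-3)
```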
#efficient-training #quantization
Scaling Smart: Accelerating Large Language Model Pre-training with Small Model Initialization
(Mohammad Samragh, Iman Mirzadeh, Keivan Alizadeh Vahid, Fartash Faghri, Minsik Cho, Moin Nabi, Devang Naik, Mehrdad Farajtabar)
The pre-training phase of language models often begins with randomly initialized parameters. With the current trends in scaling models, training their large number of parameters can be extremely slow and costly. In contrast, small language models are less expensive to train, but they often cannot achieve the accuracy of large models. In this paper, we explore an intriguing idea to connect these two different regimes: Can we develop a method to initialize large language models using smaller pre-trained models? Will such initialization bring any benefits in terms of training time and final accuracy? In this paper, we introduce HyperCloning, a method that can expand the parameters of a pre-trained language model to those of a larger model with increased hidden dimensions. Our method ensures that the larger model retains the functionality of the smaller model. As a result, the larger model already inherits the predictive power and accuracy of the smaller model before the training starts. We demonstrate that training such an initialized model results in significant savings in terms of GPU hours required for pre-training large language models.
A strategy for initializing a large model from a small one. Here, rather than stacking more layers, they widen the model by copy-pasting weights. Similar strategies appear in Gopher and in the layer-stacking line of work (https://arxiv.org/abs/2405.15319), where width expansion was, as I recall, usually reported to be less effective than adding layers. I'm curious why the results differ here.
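The core trick, function-preserving width expansion, can be sketched like this (a generic duplicate-and-rescale construction; the paper's exact cloning scheme, e.g. how it handles attention heads and normalization, may differ):

```python
import torch

def expand_linear(weight: torch.Tensor) -> torch.Tensor:
    """Expand a [d_out, d_in] weight to [2*d_out, 2*d_in] so that feeding the
    duplicated input [x; x] to the big layer yields the duplicated output
    [y; y] of the small layer, i.e. the function is preserved."""
    row = torch.cat([weight, weight], dim=1)
    return 0.5 * torch.cat([row, row], dim=0)

# Quick check of the invariance:
w_small = torch.randn(4, 3)
x = torch.randn(3)
y_small = w_small @ x
y_big = expand_linear(w_small) @ torch.cat([x, x])
assert torch.allclose(y_big, torch.cat([y_small, y_small]), atol=1e-6)
```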
#efficient-training
InfiMM-WebMath-40B: Advancing Multimodal Pre-Training for Enhanced Mathematical Reasoning
(Xiaotian Han, Yiren Jian, Xuefeng Hu, Haogeng Liu, Yiqi Wang, Qihang Fan, Yuang Ai, Huaibo Huang, Ran He, Zhenheng Yang, Quanzeng You)
Pre-training on large-scale, high-quality datasets is crucial for enhancing the reasoning capabilities of Large Language Models (LLMs), especially in specialized domains such as mathematics. Despite the recognized importance, the Multimodal LLMs (MLLMs) field currently lacks a comprehensive open-source pre-training dataset specifically designed for mathematical reasoning. To address this gap, we introduce InfiMM-WebMath-40B, a high-quality dataset of interleaved image-text documents. It comprises 24 million web pages, 85 million associated image URLs, and 40 billion text tokens, all meticulously extracted and filtered from CommonCrawl. We provide a detailed overview of our data collection and processing pipeline. To demonstrate the robustness of InfiMM-WebMath-40B, we conducted evaluations in both text-only and multimodal settings. Our evaluations on text-only benchmarks show that, despite utilizing only 40 billion tokens, our dataset significantly enhances the performance of our 1.3B model, delivering results comparable to DeepSeekMath-1.3B, which uses 120 billion tokens for the same model size. Nevertheless, with the introduction of our multi-modal math pre-training dataset, our models set a new state-of-the-art among open-source models on multi-modal math benchmarks such as MathVerse and We-Math. We release our data at https://huggingface.co/datasets/Infi-MM/InfiMM-WebMath-40B.
Math data collected from Common Crawl. Efforts to mine math data from Common Crawl have been increasing, but little has actually been released since OpenWebMath, so it's good to finally see one appear. And on top of that, it's multimodal data with interleaved images.
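A minimal sketch for peeking at the released corpus (the split name and the availability of streaming access are assumptions; the record schema is not assumed, we just inspect the keys):

```python
from datasets import load_dataset

# Stream a few records from the released dataset and inspect their schema.
ds = load_dataset("Infi-MM/InfiMM-WebMath-40B", split="train", streaming=True)
for i, record in enumerate(ds):
    print(record.keys())  # e.g. interleaved text and image-URL fields
    if i >= 2:
        break
```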
#corpus
Language Models Learn to Mislead Humans via RLHF
(Jiaxin Wen, Ruiqi Zhong, Akbir Khan, Ethan Perez, Jacob Steinhardt, Minlie Huang, Samuel R. Bowman, He He, Shi Feng)
Language models (LMs) can produce errors that are hard to detect for humans, especially when the task is complex. RLHF, the most popular post-training method, may exacerbate this problem: to achieve higher rewards, LMs might get better at convincing humans that they are right even when they are wrong. We study this phenomenon under a standard RLHF pipeline, calling it "U-SOPHISTRY" since it is Unintended by model developers. Specifically, we ask time-constrained (e.g., 3-10 minutes) human subjects to evaluate the correctness of model outputs and calculate humans' accuracy against gold labels. On a question-answering task (QuALITY) and programming task (APPS), RLHF makes LMs better at convincing our subjects but not at completing the task correctly. RLHF also makes the model harder to evaluate: our subjects' false positive rate increases by 24.1% on QuALITY and 18.3% on APPS. Finally, we show that probing, a state-of-the-art approach for detecting Intended Sophistry (e.g. backdoored LMs), does not generalize to U-SOPHISTRY. Our results highlight an important failure mode of RLHF and call for more research in assisting humans to align them.
A result showing that through RLHF, a model can raise human evaluations by becoming more persuasive to humans without actually getting better at the task. A similar phenomenon can occur with human-written unit tests. You could say the model is reward hacking the humans.
This is the topic alignment researchers fear most, so I had a hunch, and sure enough there are Anthropic authors on the paper.
#alignment #safety