May 7, 2024
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
(DeepSeek-AI)
We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference through significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models at an economical cost through sparse computation. Compared with DeepSeek 67B, DeepSeek-V2 achieves significantly stronger performance, and meanwhile saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times. We pretrain DeepSeek-V2 on a high-quality and multi-source corpus consisting of 8.1T tokens, and further perform Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to fully unlock its potential. Evaluation results show that, even with only 21B activated parameters, DeepSeek-V2 and its chat versions still achieve top-tier performance among open-source models. The model checkpoints are available at https://github.com/deepseek-ai/DeepSeek-V2.
DeepSeek's MoE model with 21B activated / 236B total parameters, trained on 8.1T tokens. Instead of GQA or MQA, they use an approach that down-projects the keys/values and then up-projects them back again. They say this was motivated by performance concerns with GQA/MQA; that part deserves some more thought.
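To make the down-projection/up-projection idea concrete, here is a minimal PyTorch sketch of MLA-style attention in which only a small latent would need to be kept in the KV cache. The dimensions and module names are my own illustration rather than DeepSeek-V2's actual configuration, and the decoupled RoPE key used by the real MLA is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentKVAttention(nn.Module):
    """MLA-style attention sketch: K/V are reconstructed from a low-rank latent."""
    def __init__(self, d_model=1024, n_heads=8, d_latent=128):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.kv_down = nn.Linear(d_model, d_latent)   # down-projection: this latent is what gets cached
        self.k_up = nn.Linear(d_latent, d_model)      # up-projection back to per-head keys
        self.v_up = nn.Linear(d_latent, d_model)      # up-projection back to per-head values
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):
        B, T, _ = x.shape
        latent = self.kv_down(x)  # (B, T, d_latent); at inference only this needs caching, not full K/V
        q = self.q_proj(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_up(latent).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v_up(latent).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        o = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out_proj(o.transpose(1, 2).reshape(B, T, -1))
```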
The MoE setup is similar to DeepSeekMoE: 2 shared experts, with each token routed to 6 out of 160 experts. To limit the communication overhead that comes from routing to many experts, a balancing constraint restricts each token's routing to experts on at most M devices.
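And a rough sketch of the device-limited routing idea as I read it: pick the top-scoring devices for each token first, then do the top-k expert selection only among experts on those devices. The device-scoring rule, gate normalization, and the numbers in the example are assumptions for illustration, not the paper's exact formulation.

```python
import torch

def device_limited_topk(scores, n_devices, top_k, max_devices):
    """scores: (n_tokens, n_experts) router affinities; experts are split evenly across devices."""
    n_tokens, n_experts = scores.shape
    per_dev = n_experts // n_devices
    # Score each device by the best affinity among its experts for each token.
    dev_scores = scores.reshape(n_tokens, n_devices, per_dev).max(dim=-1).values
    keep_dev = dev_scores.topk(max_devices, dim=-1).indices           # (n_tokens, max_devices)
    # Mask out experts that live on devices which were not selected.
    dev_of_expert = torch.arange(n_experts) // per_dev                # (n_experts,)
    allowed = (dev_of_expert[None, :, None] == keep_dev[:, None, :]).any(-1)
    masked = scores.masked_fill(~allowed, float('-inf'))
    topk_scores, topk_experts = masked.topk(top_k, dim=-1)
    weights = torch.softmax(topk_scores, dim=-1)                      # simplified gate normalization
    return topk_experts, weights

# e.g. 160 routed experts spread over 20 devices, 6 experts per token on at most 3 devices
experts, weights = device_limited_topk(torch.randn(4, 160), n_devices=20, top_k=6, max_devices=3)
```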
DeepSeek always makes choices that are slightly different from the usual ones.
On dataset filtering, it's interesting that they discuss not only filtering data out but also the problem of bringing back data that had previously been discarded by mistake and using it for training again.
On the training infrastructure side, it's interesting that they adopted Zero Bubble Pipeline Parallelism (https://arxiv.org/abs/2401.10241). It apparently works well in practice and is effective.
#moe #llm
Lory: Fully Differentiable Mixture-of-Experts for Autoregressive Language Model Pre-training
(Zexuan Zhong, Mengzhou Xia, Danqi Chen, Mike Lewis)
Mixture-of-experts (MoE) models facilitate efficient scaling; however, training the router network introduces the challenge of optimizing a non-differentiable, discrete objective. Recently, a fully-differentiable MoE architecture, SMEAR, was proposed (Muqeeth et al., 2023), which softly merges experts in the parameter space; nevertheless, its effectiveness was only demonstrated in downstream fine-tuning on classification tasks. In this paper, we present Lory, the first approach that scales such architectures to autoregressive language model pre-training. Lory introduces two key techniques: (1) a causal segment routing strategy that achieves high efficiency for expert merging operations while preserving the autoregressive nature of language models; (2) a similarity-based data batching method that encourages expert specialization by grouping similar documents in training instances. We pre-train a series of Lory models on 150B tokens from scratch, with up to 32 experts and 30B (1.5B active) parameters. Experimental results show significant performance gains over parameter-matched dense models on both perplexity (+13.9%) and a variety of downstream tasks (+1.5%-11.1%). Despite segment-level routing, Lory models achieve competitive performance compared to state-of-the-art MoE models with token-level routing. We further demonstrate that the trained experts in Lory capture domain-level specialization without supervision. Our work highlights the potential of fully-differentiable MoE architectures for language model pre-training and advocates future research in this area.
A strategy for building a differentiable Mixture of Experts model by forwarding through an FFN whose parameters are a weighted average of the expert FFNs' parameters. (https://arxiv.org/abs/2306.03745) Since the weighted averaging itself is expensive, however, they cut the sequence into segments and use the routing weights computed on the previous segment to build the FFN used for the next segment, combined with the trick of composing training sequences from similar texts to make learning easier. It also brings to mind the strategy of using Super Tokens. (https://arxiv.org/abs/2311.10768)
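A minimal sketch of the parameter-merging idea plus the causal segment routing, as I understand them: routing weights come from the previous segment's hidden states, the expert FFN weights are averaged with those weights, and a single FFN forward is run. The mean-pooled router input and the shapes are my own simplifications.

```python
import torch
import torch.nn as nn

class MergedExpertFFN(nn.Module):
    """Fully differentiable MoE layer: merge expert weights, then run one FFN forward."""
    def __init__(self, d_model=512, d_ff=2048, n_experts=4):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.w_in = nn.Parameter(torch.randn(n_experts, d_model, d_ff) * 0.02)
        self.w_out = nn.Parameter(torch.randn(n_experts, d_ff, d_model) * 0.02)

    def forward(self, x, router_input):
        # router_input: hidden states of the *previous* segment (causal segment routing);
        # mean-pool them to get one routing decision per sequence.
        gate = torch.softmax(self.router(router_input.mean(dim=1)), dim=-1)   # (B, n_experts)
        w_in = torch.einsum('be,edf->bdf', gate, self.w_in)                   # merged (B, d_model, d_ff)
        w_out = torch.einsum('be,efd->bfd', gate, self.w_out)                 # merged (B, d_ff, d_model)
        h = torch.relu(torch.einsum('btd,bdf->btf', x, w_in))
        return torch.einsum('btf,bfd->btd', h, w_out)

layer = MergedExpertFFN()
prev_seg, cur_seg = torch.randn(2, 128, 512), torch.randn(2, 128, 512)
out = layer(cur_seg, router_input=prev_seg)   # this segment's FFN is built from the previous one
```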
#moe
Is Flash Attention Stable?
(Alicia Golden, Samuel Hsia, Fei Sun, Bilge Acun, Basil Hosmer, Yejin Lee, Zachary DeVito, Jeff Johnson, Gu-Yeon Wei, David Brooks, Carole-Jean Wu)
Training large-scale machine learning models poses distinct system challenges, given both the size and complexity of today's workloads. Recently, many organizations training state-of-the-art Generative AI models have reported cases of instability during training, often taking the form of loss spikes. Numeric deviation has emerged as a potential cause of this training instability, although quantifying this is especially challenging given the costly nature of training runs. In this work, we develop a principled approach to understanding the effects of numeric deviation, and construct proxies to put observations into context when downstream effects are difficult to quantify. As a case study, we apply this framework to analyze the widely-adopted Flash Attention optimization. We find that Flash Attention sees roughly an order of magnitude more numeric deviation as compared to Baseline Attention at BF16 when measured during an isolated forward pass. We then use a data-driven analysis based on the Wasserstein Distance to provide upper bounds on how this numeric deviation impacts model weights during training, finding that the numerical deviation present in Flash Attention is 2-5 times less significant than low-precision training.
An analysis of the numeric error of Flash Attention versus vanilla attention. Flash Attention has a larger error than I would have expected. They argue the impact on training is probably not significant, but numeric errors like this have the potential to cause headaches in all sorts of places.
https://x.com/tri_dao/status/1787767984360153283
Regarding this issue, Tri Dao commented that it comes down to how the implementation was done.
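To get a feel for the kind of measurement involved, here is a small sketch (not the paper's code): a naive attention and a flash-style blockwise online-softmax attention are both run in bfloat16 and compared against a float64 reference. The shapes and block size are arbitrary.

```python
import torch

def naive_attn(q, k, v):
    p = torch.softmax(q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5, dim=-1)
    return p @ v

def blockwise_attn(q, k, v, block=64):
    """Flash-style computation order: online softmax over key/value blocks."""
    scale = q.shape[-1] ** 0.5
    m = torch.full(q.shape[:-1] + (1,), float('-inf'), dtype=q.dtype)  # running row max
    l = torch.zeros_like(m)                                            # running softmax denominator
    o = torch.zeros_like(q)                                            # running (unnormalized) output
    for s in range(0, k.shape[-2], block):
        kb, vb = k[..., s:s + block, :], v[..., s:s + block, :]
        x = q @ kb.transpose(-1, -2) / scale
        m_new = torch.maximum(m, x.max(dim=-1, keepdim=True).values)
        alpha = torch.exp(m - m_new)          # rescale previous accumulators
        p = torch.exp(x - m_new)
        l = l * alpha + p.sum(dim=-1, keepdim=True)
        o = o * alpha + p @ vb
        m = m_new
    return o / l

q, k, v = (torch.randn(4, 512, 64, dtype=torch.float64) for _ in range(3))
ref = naive_attn(q, k, v)
for name, fn in [("naive", naive_attn), ("blockwise", blockwise_attn)]:
    out = fn(q.bfloat16(), k.bfloat16(), v.bfloat16()).double()
    print(name, (out - ref).abs().max().item())
```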
#efficient-training
MAmmoTH2: Scaling Instructions from the Web
(Xiang Yue, Tuney Zheng, Ge Zhang, Wenhu Chen)
Instruction tuning improves the reasoning abilities of large language models (LLMs), with data quality and scalability being the crucial factors. Most instruction tuning data come from human crowd-sourcing or GPT-4 distillation. We propose a paradigm to efficiently harvest 10 million naturally existing instruction data from the pre-training web corpus to enhance LLM reasoning. Our approach involves (1) recalling relevant documents, (2) extracting instruction-response pairs, and (3) refining the extracted pairs using open-source LLMs. Fine-tuning base LLMs on this dataset, we build MAmmoTH2 models, which significantly boost performance on reasoning benchmarks. Notably, MAmmoTH2-7B's (Mistral) performance increases from 11% to 34% on MATH and from 36% to 67% on GSM8K without training on any in-domain data. Further training MAmmoTH2 on public instruction tuning datasets yields MAmmoTH2-Plus, achieving state-of-the-art performance on several reasoning and chatbot benchmarks. Our work demonstrates how to harvest large-scale, high-quality instruction data without costly human annotation or GPT-4 distillation, providing a new paradigm for building better instruction tuning data.
A method that mines QA-style documents from web crawls, then uses an LLM to extract and refine QA pairs for use as instruction data. Here the result is used as instruction tuning data, but I wonder how it would work as large-scale, high-quality data used at pretraining time. (https://arxiv.org/abs/2401.16380)
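The pipeline is simple enough to sketch. The `llm` callable and both prompts below are hypothetical placeholders rather than the paper's prompts; the point is just the extract-then-refine structure over recalled documents.

```python
import json
from typing import Callable, Dict, List

def harvest_instructions(docs: List[str], llm: Callable[[str], str]) -> List[Dict[str, str]]:
    """Extract QA pairs from QA-like web documents, then refine them with the same LLM."""
    pairs = []
    for doc in docs:
        # Step 2: extraction -- ask the model to pull out question/answer pairs as JSON.
        extracted = llm(
            "Extract all question-answer pairs from the following document as a JSON list "
            'of {"question": ..., "answer": ...} objects.\n\n' + doc
        )
        for qa in json.loads(extracted):  # real pipelines need robust parsing of model output
            # Step 3: refinement -- clean up formatting and add missing intermediate steps.
            refined = llm(
                "Rewrite the following answer so it is well-formatted and includes the "
                f"intermediate reasoning steps.\nQuestion: {qa['question']}\nAnswer: {qa['answer']}"
            )
            pairs.append({"question": qa["question"], "answer": refined})
    return pairs
```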
#instruction-tuning #synthetic-data
AlphaMath Almost Zero: process Supervision without process
(Guoxin Chen, Minpeng Liao, Chengxi Li, Kai Fan)
Recent advancements in large language models (LLMs) have substantially enhanced their mathematical reasoning abilities. However, these models still struggle with complex problems that require multiple reasoning steps, frequently leading to logical or numerical errors. While numerical mistakes can largely be addressed by integrating a code interpreter, identifying logical errors within intermediate steps is more challenging. Moreover, manually annotating these steps for training is not only expensive but also demands specialized expertise. In this study, we introduce an innovative approach that eliminates the need for manual annotation by leveraging the Monte Carlo Tree Search (MCTS) framework to generate both the process supervision and evaluation signals automatically. Essentially, when a LLM is well pre-trained, only the mathematical questions and their final answers are required to generate our training data, without requiring the solutions. We proceed to train a step-level value model designed to improve the LLM's inference process in mathematical domains. Our experiments indicate that using automatically generated solutions by LLMs enhanced with MCTS significantly improves the model's proficiency in dealing with intricate mathematical reasoning tasks.
One more paper on process supervision using only the final answer. Lately a lot of work applying MCTS to this problem has been appearing. (https://arxiv.org/abs/2404.12253, https://arxiv.org/pdf/2405.00451) The methods are effectively converging; the real problem may be coming up with a way to secure question-answer pairs at scale.
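A heavily simplified sketch of the core signal: a partial solution is labeled by how often rollouts starting from it reach the known final answer, with no human step annotation. The actual method organizes this as MCTS with a learned step-level value model; here plain Monte Carlo rollouts stand in, and `sample_next_step` and `extract_answer` are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

def estimate_step_values(
    question: str,
    answer: str,
    sample_next_step: Callable[[str, List[str]], str],       # LLM proposes one more solution step
    extract_answer: Callable[[List[str]], Optional[str]],    # reads off the final answer, if any
    max_depth: int = 8,
    n_rollouts: int = 16,
) -> List[Tuple[List[str], float]]:
    """Label partial solutions with the fraction of rollouts that reach the known answer."""
    labeled = []
    for prefix_len in range(1, max_depth + 1):
        # Build one partial solution of the given length...
        prefix: List[str] = []
        for _ in range(prefix_len):
            prefix.append(sample_next_step(question, prefix))
        # ...then estimate its value by completing it many times and checking the final answer.
        hits = 0
        for _ in range(n_rollouts):
            steps = list(prefix)
            while extract_answer(steps) is None and len(steps) < max_depth:
                steps.append(sample_next_step(question, steps))
            hits += int(extract_answer(steps) == answer)
        labeled.append((prefix, hits / n_rollouts))
    return labeled  # (partial solution, estimated value) pairs for training a step-level value model
```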
#search
Learning from Students: Applying t-Distributions to Explore Accurate and Efficient Formats for LLMs
(Jordan Dotzel, Yuzong Chen, Bahaa Kotb, Sushma Prasad, Gang Wu, Sheng Li, Mohamed S. Abdelfattah, Zhiru Zhang)
Large language models (LLMs) have recently achieved state-of-the-art performance across various tasks, yet due to their large computational requirements, they struggle with strict latency and power demands. Deep neural network (DNN) quantization has traditionally addressed these limitations by converting models to low-precision integer formats. Yet recently alternative formats, such as Normal Float (NF4), have been shown to consistently increase model accuracy, albeit at the cost of increased chip area. In this work, we first conduct a large-scale analysis of LLM weights and activations across 30 networks to conclude most distributions follow a Student's t-distribution. We then derive a new theoretically optimal format, Student Float (SF4), with respect to this distribution, that improves over NF4 across modern LLMs, for example increasing the average accuracy on LLaMA2-7B by 0.76% across tasks. Using this format as a high-accuracy reference, we then propose augmenting E2M1 with two variants of supernormal support for higher model accuracy. Finally, we explore the quality and performance frontier across 11 datatypes, including non-traditional formats like Additive-Powers-of-Two (APoT), by evaluating their model accuracy and hardware complexity. We discover a Pareto curve composed of INT4, E2M1, and E2M1 with supernormal support, which offers a continuous tradeoff between model accuracy and chip area. For example, E2M1 with supernormal support increases the accuracy of Phi-2 by up to 2.19% with 1.22% area overhead, enabling more LLM-based applications to be run at four bits.
A quantization format based on the t-distribution. The idea is that since LLM weights are heavy-tailed, a t-distribution fits them better than a normal distribution.
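The construction can be sketched the same way NF4's code points are usually described: evenly spaced quantiles of the assumed weight distribution, normalized to [-1, 1], just with a Student's t swapped in for the normal. The degrees of freedom and quantile spacing below are illustrative, and the paper's actual SF4 definition may differ in details such as how zero and the asymmetric halves are handled.

```python
import numpy as np
from scipy.stats import norm, t

def quantile_code(dist, n_levels=16, offset=0.98):
    """4-bit code points: evenly spaced quantiles of `dist`, normalized to [-1, 1]."""
    # Clip the probability range to [1 - offset, offset] to avoid the infinite 0/1 quantiles.
    probs = np.linspace(1 - offset, offset, n_levels)
    q = dist.ppf(probs)
    return q / np.abs(q).max()

nf4_like = quantile_code(norm())      # normal-based code points (NF4-style)
sf4_like = quantile_code(t(df=3))     # Student's t-based code points (SF4-style), heavier tails
print(np.round(nf4_like, 3))
print(np.round(sf4_like, 3))
```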
#quantization
ImageInWords: Unlocking Hyper-Detailed Image Descriptions
(Roopal Garg, Andrea Burns, Burcu Karagol Ayan, Yonatan Bitton, Ceslee Montgomery, Yasumasa Onoe, Andrew Bunner, Ranjay Krishna, Jason Baldridge, Radu Soricut)
Despite the longstanding adage "an image is worth a thousand words," creating accurate and hyper-detailed image descriptions for training Vision-Language models remains challenging. Current datasets typically have web-scraped descriptions that are short, low-granularity, and often contain details unrelated to the visual content. As a result, models trained on such data generate descriptions replete with missing information, visual inconsistencies, and hallucinations. To address these issues, we introduce ImageInWords (IIW), a carefully designed human-in-the-loop annotation framework for curating hyper-detailed image descriptions and a new dataset resulting from this process. We validate the framework through evaluations focused on the quality of the dataset and its utility for fine-tuning with considerations for readability, comprehensiveness, specificity, hallucinations, and human-likeness. Our dataset significantly improves across these dimensions compared to recently released datasets (+66%) and GPT-4V outputs (+48%). Furthermore, models fine-tuned with IIW data excel by +31% against prior work along the same human evaluation dimensions. Given our fine-tuned models, we also evaluate text-to-image generation and vision-language reasoning. Our model's descriptions can generate images closest to the original, as judged by both automated and human metrics. We also find our model produces more compositionally rich descriptions, outperforming the best baseline by up to 6% on ARO, SVO-Probes, and Winoground datasets.
Building a dataset of highly detailed image captions. Basically, they use a loop in which humans improve VLM-generated outputs, the improved data is evaluated and improved by humans again, and the VLM is then fine-tuned on this data.
In the first-stage task, a list of objects is produced with object detection and then refined; then, based on the detected objects and the captions, the task of writing a detailed caption is carried out.
They built a dataset of about 9K examples this way. It feels like an attempt to push image caption datasets to their limit.
#vision-language #captioning #dataset
Language-Image Models with 3D Understanding
(Jang Hyun Cho, Boris Ivanovic, Yulong Cao, Edward Schmerling, Yue Wang, Xinshuo Weng, Boyi Li, Yurong You, Philipp Krähenbühl, Yan Wang, Marco Pavone)
Multi-modal large language models (MLLMs) have shown incredible capabilities in a variety of 2D vision and language tasks. We extend MLLMs' perceptual capabilities to ground and reason about images in 3-dimensional space. To that end, we first develop a large-scale pre-training dataset for 2D and 3D called LV3D by combining multiple existing 2D and 3D recognition datasets under a common task formulation: as multi-turn question-answering. Next, we introduce a new MLLM named Cube-LLM and pre-train it on LV3D. We show that pure data scaling makes a strong 3D perception capability without 3D specific architectural design or training objective. Cube-LLM exhibits intriguing properties similar to LLMs: (1) Cube-LLM can apply chain-of-thought prompting to improve 3D understanding from 2D context information. (2) Cube-LLM can follow complex and diverse instructions and adapt to versatile input and output formats. (3) Cube-LLM can be visually prompted such as 2D box or a set of candidate 3D boxes from specialists. Our experiments on outdoor benchmarks demonstrate that Cube-LLM significantly outperforms existing baselines by 21.3 points of AP-BEV on the Talk2Car dataset for 3D grounded reasoning and 17.7 points on the DriveLM dataset for complex reasoning about driving scenarios, respectively. Cube-LLM also shows competitive results in general MLLM benchmarks such as refCOCO for 2D grounding with (87.0) average score, as well as visual question answering benchmarks such as VQAv2, GQA, SQA, POPE, etc. for complex reasoning. Our project is available at https://janghyuncho.github.io/Cube-LLM.
A 3D-aware LLM trained with box and point labels. It makes me wonder what it would be like if a single objective could, as with LLMs, extract all the characteristics of an image "on its own", and likewise connect images, text, and every task expressible in text "on its own". Then again, for images, approaching the problem through a large number of tasks might be the better way. (https://arxiv.org/abs/2312.00785) Or video might end up solving all of these problems. I'm curious to see how this plays out.
#vision-language #3d