April 24, 2024
Multi-Head Mixture-of-Experts
(Xun Wu, Shaohan Huang, Wenhui Wang, Furu Wei)
Sparse Mixtures of Experts (SMoE) scales model capacity without significant increases in training and inference costs, but exhibits the following two issues: (1) Low expert activation, where only a small subset of experts are activated for optimization. (2) Lacking fine-grained analytical capabilities for multiple semantic concepts within individual tokens. We propose Multi-Head Mixture-of-Experts (MH-MoE), which employs a multi-head mechanism to split each token into multiple sub-tokens. These sub-tokens are then assigned to and processed by a diverse set of experts in parallel, and seamlessly reintegrated into the original token form. The multi-head mechanism enables the model to collectively attend to information from various representation spaces within different experts, while significantly enhancing expert activation, thus deepening context understanding and alleviating overfitting. Moreover, our MH-MoE is straightforward to implement and decouples from other SMoE optimization methods, making it easy to integrate with other SMoE models for enhanced performance. Extensive experimental results across three tasks (English-focused language modeling, multi-lingual language modeling, and masked multi-modality modeling) demonstrate the effectiveness of MH-MoE.
A design that splits each token embedding in an MoE and routes each split embedding to its own top-K experts. The idea is to increase the diversity of experts used, and in particular the diversity of experts used for each individual token. It seems to fit nicely with the current trend of using more varied combinations of smaller experts.
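Below is a minimal PyTorch sketch of the mechanism as described in the abstract: project a token, split it into sub-tokens, route every sub-token independently, then merge the expert outputs back into a single token. The layer names, top-1 routing, and expert shapes are my simplifications for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MHMoELayer(nn.Module):
    """Sketch: split each token into sub-tokens, route each sub-token to an expert, merge back."""

    def __init__(self, d_model=512, num_heads=4, num_experts=8, d_ff=1024):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_sub = d_model // num_heads
        self.split_proj = nn.Linear(d_model, d_model)   # multi-head projection before splitting
        self.merge_proj = nn.Linear(d_model, d_model)   # merge projection after reassembly
        self.router = nn.Linear(self.d_sub, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(self.d_sub, d_ff), nn.GELU(), nn.Linear(d_ff, self.d_sub))
            for _ in range(num_experts)
        ])

    def forward(self, x):                               # x: [batch, seq, d_model]
        b, t, d = x.shape
        # Each token becomes num_heads sub-tokens of dimension d_sub.
        sub = self.split_proj(x).reshape(b * t * self.num_heads, self.d_sub)
        gates = F.softmax(self.router(sub), dim=-1)     # route every sub-token independently
        weight, expert_idx = gates.max(dim=-1)          # top-1 routing (the paper uses top-k)
        out = torch.zeros_like(sub)
        for e, expert in enumerate(self.experts):
            mask = expert_idx == e
            if mask.any():
                out[mask] = weight[mask, None] * expert(sub[mask])
        return self.merge_proj(out.reshape(b, t, d))    # reassemble sub-tokens into tokens

# Usage: y = MHMoELayer()(torch.randn(2, 16, 512))
```

Because routing happens at the sub-token level, a single token can end up touching several different experts, which is exactly where the claimed boost in expert activation and diversity comes from.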
#moe
OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework
(Sachin Mehta, Mohammad Hossein Sekhavat, Qingqing Cao, Maxwell Horton, Yanzi Jin, Chenfan Sun, Iman Mirzadeh, Mahyar Najibi, Dmitry Belenko, Peter Zatloukal, Mohammad Rastegari)
The reproducibility and transparency of large language models are crucial for advancing open research, ensuring the trustworthiness of results, and enabling investigations into data and model biases, as well as potential risks. To this end, we release OpenELM, a state-of-the-art open language model. OpenELM uses a layer-wise scaling strategy to efficiently allocate parameters within each layer of the transformer model, leading to enhanced accuracy. For example, with a parameter budget of approximately one billion parameters, OpenELM exhibits a 2.36% improvement in accuracy compared to OLMo while requiring 2× fewer pre-training tokens. Diverging from prior practices that only provide model weights and inference code, and pre-train on private datasets, our release includes the complete framework for training and evaluation of the language model on publicly available datasets, including training logs, multiple checkpoints, and pre-training configurations. We also release code to convert models to the MLX library for inference and fine-tuning on Apple devices. This comprehensive release aims to empower and strengthen the open research community, paving the way for future open research endeavors. Our source code along with pre-trained model weights and training recipes is available at https://github.com/apple/corenet. Additionally, OpenELM models can be found on HuggingFace at https://huggingface.co/apple/OpenELM.
An otherwise ordinary(?) Transformer LLM, except that instead of using the same number of heads and the same FFN expansion ratio in every layer, later layers get more heads and larger expansion ratios.
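A rough sketch of what such layer-wise scaling could look like as a per-layer config generator; the head counts and expansion-ratio ranges below are illustrative placeholders, not OpenELM's actual values.

```python
# Layer-wise scaling sketch: interpolate head count and FFN expansion ratio
# from the first layer to the last, so deeper layers get more capacity.

def layerwise_config(num_layers, head_dim=64,
                     min_heads=4, max_heads=12,
                     min_ratio=2.0, max_ratio=4.0):
    configs = []
    for i in range(num_layers):
        t = i / max(num_layers - 1, 1)          # 0 at the first layer, 1 at the last
        n_heads = round(min_heads + t * (max_heads - min_heads))
        ffn_ratio = min_ratio + t * (max_ratio - min_ratio)
        configs.append({
            "layer": i,
            "num_heads": n_heads,
            "attn_dim": n_heads * head_dim,
            "ffn_ratio": round(ffn_ratio, 2),
        })
    return configs

for cfg in layerwise_config(num_layers=8):
    print(cfg)
```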
#transformer #llm
Align Your Steps: Optimizing Sampling Schedules in Diffusion Models
(Amirmojtaba Sabour, Sanja Fidler, Karsten Kreis)
Diffusion models (DMs) have established themselves as the state-of-the-art generative modeling approach in the visual domain and beyond. A crucial drawback of DMs is their slow sampling speed, relying on many sequential function evaluations through large neural networks. Sampling from DMs can be seen as solving a differential equation through a discretized set of noise levels known as the sampling schedule. While past works primarily focused on deriving efficient solvers, little attention has been given to finding optimal sampling schedules, and the entire literature relies on hand-crafted heuristics. In this work, for the first time, we propose a general and principled approach to optimizing the sampling schedules of DMs for high-quality outputs, called Align Your Steps. We leverage methods from stochastic calculus and find optimal schedules specific to different solvers, trained DMs and datasets. We evaluate our novel approach on several image, video as well as 2D toy data synthesis benchmarks, using a variety of different samplers, and observe that our optimized schedules outperform previous hand-crafted schedules in almost all experiments. Our method demonstrates the untapped potential of sampling schedule optimization, especially in the few-step synthesis regime.
Derives sampling schedules that minimize the error arising from the discretization of the SDE solver used during diffusion-model sampling. The level of detail suddenly jumps here.
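For context, the "sampling schedule" here is just the discretized list of noise levels a solver steps through. A common hand-crafted choice is the EDM/Karras rho schedule sketched below (the constants are the EDM defaults, unrelated to this paper); Align Your Steps replaces such heuristic values with noise levels optimized per solver, model, and dataset.

```python
import numpy as np

def karras_schedule(n_steps, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    """Hand-crafted noise-level schedule from Karras et al. (EDM), sigma_max down to sigma_min."""
    ramp = np.linspace(0.0, 1.0, n_steps)
    lo, hi = sigma_min ** (1 / rho), sigma_max ** (1 / rho)
    return (hi + ramp * (lo - hi)) ** rho

print(karras_schedule(10))  # 10 noise levels a few-step sampler would discretize over
```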
#diffusion
NExT: Teaching Large Language Models to Reason about Code Execution
(Ansong Ni, Miltiadis Allamanis, Arman Cohan, Yinlin Deng, Kensen Shi, Charles Sutton, Pengcheng Yin)
A fundamental skill among human developers is the ability to understand and reason about program execution. As an example, a programmer can mentally simulate code execution in natural language to debug and repair code (aka. rubber duck debugging). However, large language models (LLMs) of code are typically trained on the surface textual form of programs, thus may lack a semantic understanding of how programs execute at run-time. To address this issue, we propose NExT, a method to teach LLMs to inspect the execution traces of programs (variable states of executed lines) and reason about their run-time behavior through chain-of-thought (CoT) rationales. Specifically, NExT uses self-training to bootstrap a synthetic training set of execution-aware rationales that lead to correct task solutions (e.g., fixed programs) without laborious manual annotation. Experiments on program repair tasks based on MBPP and HumanEval demonstrate that NExT improves the fix rate of a PaLM 2 model, by 26.1% and 14.3% absolute, respectively, with significantly improved rationale quality as verified by automated metrics and human raters. Our model can also generalize to scenarios where program traces are absent at test-time.
Teaching an LLM how to fix code. The model is given a natural-language prompt, the code, the tests, and a trace of the execution, and asked to generate a rationale plus a repaired program. The repaired code is then checked against unit tests, and only the samples whose rationales lead to correct fixes are kept for training.
People often say that natural-language text leaves out the between-the-lines meaning a model needs for its reasoning; for code, it seems traces like these can fill in those gaps reasonably well.
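A hedged sketch of that bootstrapping loop; `llm.sample`, `get_trace`, and `run_tests` are hypothetical helpers standing in for the actual sampling, tracing, and unit-test harness used in the paper.

```python
# Self-training sketch: sample trace-conditioned rationales plus candidate fixes,
# keep only those whose fixed program passes the unit tests, fine-tune on the survivors.

def build_training_set(llm, problems, num_samples=8):
    kept = []
    for p in problems:
        trace = get_trace(p.buggy_code, p.tests)          # variable states of executed lines
        prompt = (f"{p.description}\n{p.buggy_code}\n{trace}\n"
                  "Explain the bug step by step, then write the fixed program.")
        for _ in range(num_samples):
            rationale, fixed_code = llm.sample(prompt)    # CoT rationale + repaired program
            if run_tests(fixed_code, p.tests):            # filter: keep only verified fixes
                kept.append({"prompt": prompt, "target": rationale + "\n" + fixed_code})
    return kept  # fine-tune the model on `kept`, then optionally iterate
```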
#self-improvement
Achieving >97% on GSM8K: Deeply Understanding the Problems Makes LLMs Perfect Reasoners
(Qihuang Zhong, Kang Wang, Ziyang Xu, Juhua Liu, Liang Ding, Bo Du, Dacheng Tao)
Chain of Thought prompting strategy has enhanced the performance of Large Language Models (LLMs) across various NLP tasks. However, it still has shortcomings when dealing with complex reasoning tasks, following~\citet{cot_wei}, including understanding errors, calculation errors and process errors (e.g. missing-step and hallucinations). Our in-depth analysis of various error types has found that deeply understanding the whole problem is critical in addressing complicated reasoning tasks. In this paper, we propose a novel prompt strategy called Deeply Understanding the Problems (DUP) prompting, inspired by how humans solve complex reasoning problems, designed to enhance the comprehensive understanding of problems by LLMs. It consists of three stages: 1) extract the core question; 2) find out problem-solving information based on the core question; 3) generate and extract answers by LLMs. We evaluate the performance of DUP prompting on ten diverse reasoning datasets. Experimental results suggest that DUP prompting significantly outperforms Zero-Shot CoT~\cite{kojima2022large} across all datasets. Notably, DUP achieves state-of-the-art on SVAMP (90.4% to 94.2%) and GSM8K (94.6% to 97.1%).
Extract the core question, extract the information relevant to that question, then run CoT. A simple and fun prompting strategy.
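A minimal sketch of the three-stage flow; `ask` is a hypothetical wrapper around a chat-completion call, and the prompt wording is paraphrased from the abstract rather than taken from the paper.

```python
# DUP-style prompting sketch: core question -> problem-solving info -> final CoT answer.

def dup_prompt(ask, problem: str) -> str:
    core = ask(f"{problem}\n\nExtract the core question being asked, as one sentence.")
    info = ask(f"{problem}\n\nList the problem-solving information relevant to this "
               f"core question: {core}")
    answer = ask(f"{problem}\n\nHint: {info}\nCore question: {core}\n\n"
                 "Using the hint and the core question, solve the problem step by step "
                 "and state the final answer.")
    return answer
```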
#prompt
SnapKV: LLM Knows What You are Looking for Before Generation
(Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, Deming Chen)
Large Language Models (LLMs) have made remarkable progress in processing extensive contexts, with the Key-Value (KV) cache playing a vital role in enhancing their performance. However, the growth of the KV cache in response to increasing input length poses challenges to memory and time efficiency. To address this problem, this paper introduces SnapKV, an innovative and fine-tuning-free approach that efficiently minimizes KV cache size while still delivering comparable performance in real-world applications. We discover that each attention head in the model consistently focuses on specific prompt attention features during generation. Meanwhile, this robust pattern can be obtained from an 'observation' window located at the end of the prompts. Drawing on this insight, SnapKV automatically compresses KV caches by selecting clustered important KV positions for each attention head. Our approach significantly reduces the growing computational overhead and memory footprint when processing long input sequences. Specifically, SnapKV achieves a consistent decoding speed with a 3.6x increase in generation speed and an 8.2x enhancement in memory efficiency compared to baseline when processing inputs of 16K tokens. At the same time, it maintains comparable performance to baseline models across 16 long sequence datasets. Moreover, SnapKV can process up to 380K context tokens on a single A100-80GB GPU using HuggingFace implementation with minor changes, exhibiting only a negligible accuracy drop in the Needle-in-a-Haystack test. Further comprehensive studies suggest SnapKV's potential for practical applications.
Starts from the observation that, regardless of generation length, the prompt tokens that attention locks onto stay largely the same. It then concludes that attention weights taken from a window at the end of the prompt can be used to select which embeddings to keep in the KV cache, retaining just the embeddings generation actually needs. Still, the fact that pooling is required to pull in neighboring information suggests some degree of compression is needed as well.
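A hedged sketch of the selection step under my reading of the abstract: score prefix positions by the attention the observation window pays to them, pool so that neighbors come along, and keep the top positions per head. Tensor layouts and the default sizes are assumptions, not SnapKV's actual code.

```python
import torch
import torch.nn.functional as F

def snapkv_select(attn, k_cache, v_cache, obs_window=32, keep=1024, pool_kernel=7):
    """attn: [heads, q_len, kv_len] prompt attention; k/v_cache: [heads, kv_len, head_dim]."""
    prefix_len = attn.shape[-1] - obs_window
    # Aggregate the attention the observation window (end of the prompt) pays to the prefix.
    votes = attn[:, -obs_window:, :prefix_len].sum(dim=1)          # [heads, prefix_len]
    # Pooling so selected positions bring their neighbors along (clustered selection).
    votes = F.max_pool1d(votes.unsqueeze(1), kernel_size=pool_kernel,
                         stride=1, padding=pool_kernel // 2).squeeze(1)
    idx = votes.topk(min(keep, prefix_len), dim=-1).indices.sort(dim=-1).values
    idx = idx.unsqueeze(-1).expand(-1, -1, k_cache.shape[-1])      # [heads, keep, head_dim]
    # Keep the selected prefix entries plus the observation window itself.
    k_kept = torch.cat([k_cache[:, :prefix_len].gather(1, idx), k_cache[:, prefix_len:]], dim=1)
    v_kept = torch.cat([v_cache[:, :prefix_len].gather(1, idx), v_cache[:, prefix_len:]], dim=1)
    return k_kept, v_kept
```

The selection is per head, which matches the observation that each attention head has its own stable set of prompt positions it keeps returning to during generation.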
#efficiency