EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty
(Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang Zhang)
Auto-regressive decoding makes the inference of Large Language Models (LLMs) time-consuming. We propose a simple framework, EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency), for lossless acceleration. Unlike traditional speculative sampling methods, EAGLE operates the drafting process auto-regressively at the more regular (second-top-layer) feature level and addresses the sampling uncertainty issues in the next-feature prediction problems by integrating tokens from one time step ahead. The acceleration provided by EAGLE is lossless: it involves no fine-tuning of the target LLM, and the generated text maintains the same distribution as that of vanilla auto-regressive decoding. As of the submission of this paper, EAGLE is the fastest known framework within the speculative sampling family. On MT-bench, EAGLE is 3x faster than vanilla decoding, 2x faster than Lookahead, and 1.6x faster than Medusa. Using gpt-fast, EAGLE attains on average 160 tokens/s with LLaMA2-Chat 13B on a single RTX 3090 GPU, compared to 24 tokens/s of Huggingface's implementations.
An improvement on speculative sampling. The idea is that predicting embeddings (the second-top-layer features) is easier than predicting tokens, so the draft model auto-regressively predicts the next step's feature and then generates tokens through the LM head.
However, the next step's feature depends on which token was actually sampled and fed forward, so the embedding of that sampled token is also given as an input to the feature prediction.
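A minimal sketch of this drafting loop, assuming a toy single-layer draft network (the real EAGLE draft model is a small transformer decoder layer; `DraftHead`, the dimensions, and the sampling loop here are illustrative assumptions, and the LM head and embedding stand in for the frozen target LLM's):

```python
import torch
import torch.nn as nn

class DraftHead(nn.Module):
    """Toy EAGLE-style draft step: predict the next second-top-layer
    feature from (previous feature, embedding of the sampled token)."""
    def __init__(self, hidden: int):
        super().__init__()
        # Fuse the previous feature with the sampled token's embedding.
        self.fuse = nn.Linear(2 * hidden, hidden)
        self.ffn = nn.Sequential(nn.Linear(hidden, hidden), nn.SiLU(),
                                 nn.Linear(hidden, hidden))

    def forward(self, prev_feature, sampled_token_emb):
        x = torch.cat([prev_feature, sampled_token_emb], dim=-1)
        return self.ffn(self.fuse(x))  # predicted next-step feature

hidden, vocab = 64, 1000
draft = DraftHead(hidden)
embed = nn.Embedding(vocab, hidden)   # stands in for the target LLM's token embedding
lm_head = nn.Linear(hidden, vocab)    # stands in for the target LLM's frozen LM head

feature = torch.randn(1, hidden)           # second-top-layer feature at step t
token = torch.randint(0, vocab, (1,))      # token actually sampled at step t
for _ in range(4):                         # draft a short chain of tokens
    feature = draft(feature, embed(token))                            # next feature
    token = lm_head(feature).softmax(-1).multinomial(1).squeeze(-1)   # draft token
```

The drafted chain is then verified in one forward pass of the target model, as in standard speculative sampling, so the output distribution is unchanged.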
#efficiency
SliceGPT: Compress Large Language Models by Deleting Rows and Columns
(Saleh Ashkboos, Maximilian L. Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, James Hensman)
Large language models have become the cornerstone of natural language processing, but their use comes with substantial costs in terms of compute and memory resources. Sparsification provides a solution to alleviate these resource constraints, and recent works have shown that trained models can be sparsified post-hoc. Existing sparsification techniques face challenges as they need additional data structures and offer constrained speedup with current hardware. In this paper we present SliceGPT, a new post-training sparsification scheme which replaces each weight matrix with a smaller (dense) matrix, reducing the embedding dimension of the network. Through extensive experimentation, we show that SliceGPT can remove up to 25% of the model parameters (including embeddings) for LLAMA2-70B, OPT 66B and Phi-2 models while maintaining 99%, 99% and 90% zero-shot task performance of the dense model respectively. Our sliced models run on fewer GPUs and run faster without any additional code optimization: on 24GB consumer GPUs we reduce the total compute for inference on LLAMA2-70B to 64% of that of the dense model; on 40GB A100 GPUs we reduce it to 66%. We offer a new insight, computational invariance in transformer networks, which enables SliceGPT and we hope it will inspire and enable future avenues to reduce memory and computation demands for pre-trained models. Code is available at: https://github.com/microsoft/TransformerCompression
Structured pruning for transformers. It exploits the fact that RMSNorm is invariant under an orthogonal matrix Q, i.e., RMSNorm(X) = RMSNorm(XQ)Q^T. Using a Q obtained from PCA, the transformer is rewritten via a function-preserving transform, and as many rows or columns are deleted from the weight matrices as principal components are dropped.
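A minimal numerical check of the invariance and the slicing idea (numpy sketch under simplifying assumptions; SliceGPT additionally folds the deletion matrices and norm scales into the adjacent weight matrices, which is omitted here):

```python
import numpy as np

def rmsnorm(x):
    # RMSNorm without the learnable scale (SliceGPT folds the scale into adjacent weights).
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True))

d, n = 16, 128
X = np.random.randn(n, d)           # activations entering a norm layer

# Orthogonal Q from PCA of the activations: eigenvectors of X^T X.
_, Q = np.linalg.eigh(X.T @ X)
Q = Q[:, ::-1]                      # reorder so principal components come first

# Computational invariance: RMSNorm(X) == RMSNorm(X Q) Q^T for any orthogonal Q.
assert np.allclose(rmsnorm(X), rmsnorm(X @ Q) @ Q.T)

# Slicing: keep only the top-k principal directions, shrinking the hidden dimension.
k = 12
Q_k = Q[:, :k]                      # d x k deletion matrix
X_sliced = X @ Q_k                  # activations in the smaller k-dimensional space
print(X_sliced.shape)               # (128, 12)
```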
#pruning
Learning Universal Predictors
(Jordi Grau-Moya, Tim Genewein, Marcus Hutter, Laurent Orseau, Grégoire Delétang, Elliot Catt, Anian Ruoss, Li Kevin Wenliang, Christopher Mattern, Matthew Aitchison, Joel Veness)
Meta-learning has emerged as a powerful approach to train neural networks to learn new tasks quickly from limited data. Broad exposure to different tasks leads to versatile representations enabling general problem solving. But, what are the limits of meta-learning? In this work, we explore the potential of amortizing the most powerful universal predictor, namely Solomonoff Induction (SI), into neural networks via leveraging meta-learning to its limits. We use Universal Turing Machines (UTMs) to generate training data used to expose networks to a broad range of patterns. We provide theoretical analysis of the UTM data generation processes and meta-training protocols. We conduct comprehensive experiments with neural architectures (e.g. LSTMs, Transformers) and algorithmic data generators of varying complexity and universality. Our results suggest that UTM data is a valuable resource for meta-learning, and that it can be used to train neural networks capable of learning universal prediction strategies.
A study showing that a model trained on outputs generated by feeding random programs to a Universal Turing Machine can approximate Solomonoff Induction. The first hurdle is understanding what Solomonoff Induction even is; this post (long, but helpful) is a good starting point: https://www.lesswrong.com/posts/Kyc5dFDzBg4WccrbK/an-intuitive-explanation-of-solomonoff-induction. Briefly, it amounts to finding the simplest program that explains the data.
What are the implications? That seems worth thinking through. One clue is the reasoning about the relationship between LLMs and Solomonoff Induction: the emergence of in-context learning in LLMs may indicate that they are approximating Solomonoff Induction, trained not on data generated by a Universal Turing Machine but on human text data (which aligns better with the tasks people actually want solved).
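In the standard textbook form (not this paper's notation), the Solomonoff prior makes "simplest program" precise: M(x) = Σ_{p : U(p) = x*} 2^{-|p|}, summing over all programs p whose output on the universal Turing machine U starts with x, weighted by program length |p| in bits; prediction then follows from M(x_{t+1} | x_{1:t}) = M(x_{1:t} x_{t+1}) / M(x_{1:t}), so the shortest programs consistent with the data dominate the prediction.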
#meta-learning
Sutskever seems to have the intuition that neural network training in general is itself doing something like Solomonoff induction. https://www.youtube.com/live/AKMuA_TVz3A?si=NBgUulVAQEia42XR