February 14, 2024
Tandem Transformers for Inference Efficient LLMs
(Aishwarya P S, Pranav Ajit Nair, Yashas Samaga, Toby Boyd, Sanjiv Kumar, Prateek Jain, Praneeth Netrapalli)
The autoregressive nature of conventional large language models (LLMs) inherently limits inference speed, as tokens are generated sequentially. While speculative and parallel decoding techniques attempt to mitigate this, they face limitations: either relying on less accurate smaller models for generation or failing to fully leverage the base LLM's representations. We introduce a novel architecture, Tandem transformers, to address these issues. This architecture uniquely combines (1) a small autoregressive model and (2) a large model operating in block mode (processing multiple tokens simultaneously). The small model's predictive accuracy is substantially enhanced by granting it attention to the large model's richer representations. On the PaLM2 pretraining dataset, a tandem of PaLM2-Bison and PaLM2-Gecko demonstrates a 3.3% improvement in next-token prediction accuracy over a standalone PaLM2-Gecko, offering a 1.16x speedup compared to a PaLM2-Otter model with comparable downstream performance. We further incorporate the tandem model within the speculative decoding (SPEED) framework where the large model validates tokens from the small model. This ensures that the Tandem of PaLM2-Bison and PaLM2-Gecko achieves substantial speedup (around 1.14x faster than using vanilla PaLM2-Gecko in SPEED) while maintaining identical downstream task accuracy.
A new model-combination attempt from Google. The small LM is wired up so that it can attend to the large LM's embeddings. At inference time this enables speculative decoding: decode with the small model, then feed the decoded tokens into the large model for verification.
Incidentally, it seems Google calls speculative decoding SPEED.
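To make the draft-then-verify mechanism concrete, here is a minimal sketch of one SPEED round under greedy decoding. The function and callable names (`speed_step`, `small_next_token`, `large_logits`, `gamma`) are illustrative assumptions, not the paper's API, and the tandem-specific cross-attention from the small model to the large model's representations is abstracted away inside the drafter.

```python
# A minimal sketch of one speculative decoding (SPEED) round with greedy
# verification. `small_next_token` is a hypothetical autoregressive drafter;
# `large_logits` is a hypothetical verifier that scores a whole block in one
# parallel (block-mode) pass.
from typing import Callable, List

def speed_step(
    prefix: List[int],
    small_next_token: Callable[[List[int]], int],
    large_logits: Callable[[List[int]], List[List[float]]],
    gamma: int = 4,  # number of draft tokens per round
) -> List[int]:
    """Draft `gamma` tokens with the small model, verify them with a single
    pass of the large model, and keep the longest verified prefix."""
    # 1) Draft: the small model generates tokens one by one (cheap).
    draft, ctx = [], list(prefix)
    for _ in range(gamma):
        t = small_next_token(ctx)
        draft.append(t)
        ctx.append(t)

    # 2) Verify: one parallel pass over prefix + draft yields next-token
    #    logits at every position.
    logits = large_logits(prefix + draft)

    # 3) Accept draft tokens while they match the large model's greedy choice.
    accepted = []
    for i, t in enumerate(draft):
        pos = len(prefix) + i - 1  # logits at `pos` predict the token at pos+1
        big = max(range(len(logits[pos])), key=lambda v: logits[pos][v])
        if big != t:
            accepted.append(big)   # replace the first mismatch and stop
            break
        accepted.append(t)
    else:
        # All drafts accepted: the same verifier pass gives one bonus token.
        last = logits[len(prefix) + gamma - 1]
        accepted.append(max(range(len(last)), key=lambda v: last[v]))
    return prefix + accepted
```

The output distribution matches greedy decoding with the large model alone; the speedup comes from the large model running one block pass per round instead of one pass per token.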
#efficiency
Verified Multi-Step Synthesis using Large Language Models and Monte Carlo Tree Search
(David Brandfonbrener, Sibi Raja, Tarun Prasad, Chloe Loughridge, Jianang Yang, Simon Henniger, William E. Byrd, Robert Zinkov, Nada Amin)
We present an approach using Monte Carlo Tree Search (MCTS) to guide Large Language Models (LLMs) to generate verified programs in Dafny, Lean and Coq. Our method, which we call VMCTS, leverages the verifier inside the search algorithm by checking partial programs at each step. In combination with the LLM prior, the verifier feedback raises the synthesis capabilities of open source models. On a set of five verified programming problems, we find that in four problems where the base model cannot solve the question even when re-sampling solutions for one hour, VMCTS can solve the problems within 6 minutes. The base model with VMCTS is even competitive with ChatGPT4 augmented with plugins and multiple re-tries on these problems. Our code and benchmarks are available at https://github.com/namin/llm-verified-with-monte-carlo-tree-search .
An attempt to run tree search during LLM decoding while checking partial results with verification-capable programming languages like Dafny/Coq/Lean. People had been wondering whether something could be built on top of languages like Lean, and this looks like a case in point.
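Below is a sketch of the core idea of putting the verifier inside the search loop: check every partial program so failing branches are pruned before more LLM samples are spent on them. This is a simplified best-first variant under assumed interfaces (`llm_extend`, `verify` returning "done"/"ok"/"error"), not the authors' exact MCTS formulation.

```python
# Verifier-guided tree search over partial programs, in the spirit of VMCTS.
# `llm_extend` (hypothetical) samples the next program chunk from the LLM;
# `verify` (hypothetical) runs Dafny/Lean/Coq on a partial program.
import heapq
import itertools
from typing import Callable, Optional

def verified_search(
    prompt: str,
    llm_extend: Callable[[str], str],
    verify: Callable[[str], str],   # "done" | "ok" | "error"
    max_expansions: int = 200,
    samples_per_node: int = 3,
) -> Optional[str]:
    """Expand partial programs with LLM samples; the verifier checks each
    candidate, so inconsistent branches are pruned immediately. Deeper
    (more complete) programs are expanded first."""
    tie = itertools.count()                 # heap tie-breaker
    frontier = [(0, next(tie), prompt)]     # (-depth, id, partial program)
    while frontier and max_expansions > 0:
        neg_depth, _, program = heapq.heappop(frontier)
        for _ in range(samples_per_node):
            max_expansions -= 1
            candidate = program + llm_extend(program)
            verdict = verify(candidate)
            if verdict == "done":           # fully verified program: success
                return candidate
            if verdict == "ok":             # still consistent: keep exploring
                heapq.heappush(frontier, (neg_depth - 1, next(tie), candidate))
            # verdict == "error": prune this branch, sample a sibling instead
    return None
```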
#decoding #search
World Model on Million-Length Video And Language With RingAttention
(Hao Liu, Wilson Yan, Matei Zaharia, Pieter Abbeel)
Current language models fall short in understanding aspects of the world not easily described in words, and struggle with complex, long-form tasks. Video sequences offer valuable temporal information absent in language and static images, making them attractive for joint modeling with language. Such models could develop an understanding of both human textual knowledge and the physical world, enabling broader AI capabilities for assisting humans. However, learning from millions of tokens of video and language sequences poses challenges due to memory constraints, computational complexity, and limited datasets. To address these challenges, we curate a large dataset of diverse videos and books, utilize the RingAttention technique to scalably train on long sequences, and gradually increase context size from 4K to 1M tokens. This paper makes the following contributions: (a) Largest context size neural network: We train one of the largest context size transformers on long video and language sequences, setting new benchmarks in difficult retrieval tasks and long video understanding. (b) Solutions for overcoming vision-language training challenges, including using masked sequence packing for mixing different sequence lengths, loss weighting to balance language and vision, and model-generated QA dataset for long sequence chat. (c) A highly-optimized implementation with RingAttention, masked sequence packing, and other key features for training on millions-length multimodal sequences. (d) Fully open-sourced a family of 7B parameter models capable of processing long text documents (LWM-Text, LWM-Text-Chat) and videos (LWM, LWM-Chat) of over 1M tokens. This work paves the way for training on massive datasets of long video and language to develop understanding of both human knowledge and the multimodal world, and broader capabilities.
This group keeps pushing Ring Attention. The result suggests that even a 1M context length can be covered by first making training feasible with Ring Attention and then fine-tuning. They show inference not just over text but over videos up to an hour long.
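The piece that makes 1M contexts trainable is blockwise attention with streaming softmax: each device holds one KV block, the blocks rotate around a ring, and no device ever materializes the full attention matrix. Here is a single-host simulation of that accumulation (the ring communication is replaced by a plain loop over blocks; shapes and names are illustrative assumptions):

```python
# Single-host sketch of the blockwise computation RingAttention distributes:
# each loop iteration stands in for one ring hop, processing one KV block and
# updating running softmax statistics (max, denominator, unnormalized output).
import numpy as np

def ring_attention_sim(q, k_blocks, v_blocks):
    """q: (n, d); k_blocks/v_blocks: lists of (b, d) chunks as if held by
    different devices. Returns softmax(q K^T / sqrt(d)) V, computed
    block-by-block with numerically stable streaming accumulation."""
    d = q.shape[-1]
    m = np.full(q.shape[0], -np.inf)      # running row-wise max
    l = np.zeros(q.shape[0])              # running softmax denominator
    o = np.zeros_like(q)                  # running (unnormalized) output
    for k, v in zip(k_blocks, v_blocks):  # one iteration = one ring hop
        s = q @ k.T / np.sqrt(d)          # scores against this KV block
        m_new = np.maximum(m, s.max(axis=-1))
        scale = np.exp(m - m_new)         # rescale previously accumulated stats
        p = np.exp(s - m_new[:, None])
        l = l * scale + p.sum(axis=-1)
        o = o * scale[:, None] + p @ v
        m = m_new
    return o / l[:, None]

# Sanity check against dense attention on random data.
rng = np.random.default_rng(0)
q = rng.normal(size=(8, 16))
K = rng.normal(size=(32, 16))
V = rng.normal(size=(32, 16))
scores = q @ K.T / np.sqrt(16)
weights = np.exp(scores - scores.max(-1, keepdims=True))
dense = (weights / weights.sum(-1, keepdims=True)) @ V
ring = ring_attention_sim(q, np.split(K, 4), np.split(V, 4))
assert np.allclose(dense, ring, atol=1e-6)
```

Per-device memory then scales with the block size rather than the full sequence length, which is what lets context grow from 4K toward 1M by adding devices to the ring.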
#long-context