May 22, 2024
Mapping the Mind of a Large Language Model
(Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy Cunningham, Nicholas L Turner, Callum McDougall, Monte MacDiarmid, Alex Tamkin, Esin Durmus, Tristan Hume, Francesco Mosconi, C. Daniel Freeman, Theodore R. Sumers, Edward Rees, Joshua Batson, Adam Jermyn, Shan Carter, Chris Olah, Tom Henighan)
https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html
Anthropic's report analyzing Claude 3 Sonnet with Sparse Autoencoders. They were able to find features shared across images and text, and across multiple languages.
The code-error feature that shows up partway through is fun. When this feature is amplified, the model hallucinates error messages even for error-free code; when it is suppressed, the model tends to ignore errors even in code that contains them. If generation continues in that state, it goes on to rewrite the code without the errors.
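As an aside, one common way to run this kind of feature-steering experiment is to nudge the residual stream along the feature's SAE decoder direction. The sketch below is purely illustrative (not Anthropic's setup), and `decoder_dir` and `scale` are hypothetical names.

```python
import torch

def steer_residual(resid: torch.Tensor, decoder_dir: torch.Tensor, scale: float) -> torch.Tensor:
    """Illustrative feature steering: add a multiple of an SAE feature's decoder
    direction to the residual stream at some layer. For a code-error feature,
    scale > 0 would amplify it (hallucinated error messages), scale < 0 would
    suppress it (real errors get ignored).

    resid:       (batch, seq, d_model) residual-stream activations
    decoder_dir: (d_model,) decoder vector of the chosen SAE feature
    """
    return resid + scale * decoder_dir
```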
#interpretability
Phi-3
The Phi-3 Small and Medium models, along with a Vision model, have been released. Figuring out what Phi-3 is really like seems like an important question, but the benchmarks don't appear trustworthy at this point, so there's probably no choice but to analyze the models directly.
#llm
Maximizing Training Throughput Using PyTorch FSDP and Torch.compile
(Team PyTorch at IBM, Team PyTorch at Meta)
IBM keeps polishing FSDP. Using torch.compile, they pushed MFU for a 7B model on A100s up to 68%. They report that with 448 A100s, training a 7B model on 4T tokens could be finished within two weeks.
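For reference, the core recipe amounts to wrapping the model in FSDP and then compiling it. The sketch below uses only the standard PyTorch APIs and is not IBM's training code; the distributed setup, wrapping policy, and attention kernel details are omitted.

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import ShardingStrategy

def wrap_and_compile(model: torch.nn.Module) -> torch.nn.Module:
    # use_orig_params=True is required so torch.compile can trace the FSDP-wrapped module.
    model = FSDP(
        model,
        sharding_strategy=ShardingStrategy.FULL_SHARD,
        use_orig_params=True,
    )
    # Compile after wrapping so the compiled graphs cover the sharded module.
    return torch.compile(model)
```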
#efficient-training
Reducing Transformer Key-Value Cache Size with Cross-Layer Attention
(William Brandon, Mayank Mishra, Aniruddha Nrusimha, Rameswar Panda, Jonathan Ragan-Kelley)
Key-value (KV) caching plays an essential role in accelerating decoding for transformer-based autoregressive large language models (LLMs). However, the amount of memory required to store the KV cache can become prohibitive at long sequence lengths and large batch sizes. Since the invention of the transformer, two of the most effective interventions discovered for reducing the size of the KV cache have been Multi-Query Attention (MQA) and its generalization, Grouped-Query Attention (GQA). MQA and GQA both modify the design of the attention block so that multiple query heads can share a single key/value head, reducing the number of distinct key/value heads by a large factor while only minimally degrading accuracy. In this paper, we show that it is possible to take Multi-Query Attention a step further by also sharing key and value heads between adjacent layers, yielding a new attention design we call Cross-Layer Attention (CLA). With CLA, we find that it is possible to reduce the size of the KV cache by another 2x while maintaining nearly the same accuracy as unmodified MQA. In experiments training 1B- and 3B-parameter models from scratch, we demonstrate that CLA provides a Pareto improvement over the memory/accuracy tradeoffs which are possible with traditional MQA, enabling inference with longer sequence lengths and larger batch sizes than would otherwise be possible
A method that reduces KV cache size by sharing a single KV cache across multiple layers. It resembles the recent approach that used cross attention to treat the cache like encoder embeddings (https://arxiv.org/abs/2405.05254).
Quite a few KV cache compression methods are appearing, and some, like DeepSeek V2's MLA (https://arxiv.org/abs/2405.04434), show interesting possibilities in terms of quality as well. As an aside, the higher KV cache compression ratios get, the more the state-size advantage of State Space Models is being squeezed.
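A rough sketch of the idea (illustrative code, not the paper's implementation): pair up adjacent layers so that only one layer in each pair owns the K/V projections, and the other reuses the same K/V tensors, so the pair shares one KV cache. Here the KV side is also MQA-style with a single head.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CLABlock(nn.Module):
    """Sketch of Cross-Layer Attention: a layer with owns_kv=True computes K/V and
    hands them to the next layer, which only computes queries. Names and structure
    are illustrative."""
    def __init__(self, d_model: int, n_heads: int, owns_kv: bool):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.o_proj = nn.Linear(d_model, d_model, bias=False)
        self.owns_kv = owns_kv
        if owns_kv:
            # MQA-style single key/value head, shared with the following layer as well.
            self.k_proj = nn.Linear(d_model, self.d_head, bias=False)
            self.v_proj = nn.Linear(d_model, self.d_head, bias=False)

    def forward(self, x, shared_kv=None):
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        if self.owns_kv:
            k = self.k_proj(x).view(b, t, 1, self.d_head).transpose(1, 2)
            v = self.v_proj(x).view(b, t, 1, self.d_head).transpose(1, 2)
            shared_kv = (k, v)   # cached once, reused by the next layer
        else:
            k, v = shared_kv     # no separate KV cache for this layer
        k = k.expand(-1, self.n_heads, -1, -1)
        v = v.expand(-1, self.n_heads, -1, -1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        out = out.transpose(1, 2).reshape(b, t, -1)
        return self.o_proj(out), shared_kv
```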
#efficiency
SPO: Multi-Dimensional Preference Sequential Alignment With Implicit Reward Modeling
(Xingzhou Lou, Junge Zhang, Jian Xie, Lifeng Liu, Dong Yan, Kaiqi Huang)
Human preference alignment is critical in building powerful and reliable large language models (LLMs). However, current methods either ignore the multi-dimensionality of human preferences (e.g. helpfulness and harmlessness) or struggle with the complexity of managing multiple reward models. To address these issues, we propose Sequential Preference Optimization (SPO), a method that sequentially fine-tunes LLMs to align with multiple dimensions of human preferences. SPO avoids explicit reward modeling, directly optimizing the models to align with nuanced human preferences. We theoretically derive closed-form optimal SPO policy and loss function. Gradient analysis is conducted to show how SPO manages to fine-tune the LLMs while maintaining alignment on previously optimized dimensions. Empirical results on LLMs of different size and multiple evaluation datasets demonstrate that SPO successfully aligns LLMs across multiple dimensions of human preferences and significantly outperforms the baselines.
Optimization for situations where multiple preference dimensions conflict, such as helpfulness vs. harmlessness. In short, the dataset is split by preference dimension, and each subsequent optimization stage adds a KL penalty against the previous stage's model along with a penalty on the reward over the previous dataset.
Personally, I found Safe RLHF (https://arxiv.org/abs/2310.12773) interesting for this problem; a rigorous comparison between the two would be intriguing.
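A very rough sketch of the setup as I read it (not the paper's closed-form loss; `lam`, the batch layout, and the exact retention term are my assumptions): run a DPO-style implicit-reward objective on the current dimension's data, using the previous stage's model as the reference, plus a penalty that keeps the implicit reward on the previous dimension's data from degrading.

```python
import torch
import torch.nn.functional as F

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    # Implicit reward = beta * log-prob ratio of the policy against the previous-stage model.
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -F.logsigmoid(margin).mean()

def spo_stage_loss(cur_batch, prev_batch, lam=0.5, beta=0.1):
    """One stage of sequential alignment: optimize the current preference dimension
    (e.g. harmlessness) while penalizing loss of preference margin on the previous
    dimension's data (e.g. helpfulness). Each batch is a tuple of summed sequence
    log-probs: (policy_chosen, policy_rejected, ref_chosen, ref_rejected)."""
    loss_cur = dpo_loss(*cur_batch, beta=beta)
    loss_prev = dpo_loss(*prev_batch, beta=beta)  # retention penalty on the earlier dimension
    return loss_cur + lam * loss_prev
```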
#alignment #rlhf
SirLLM: Streaming Infinite Retentive LLM
(Yao Yao, Zuchao Li, Hai Zhao)
As Large Language Models (LLMs) become increasingly prevalent in various domains, their ability to process inputs of any length and maintain a degree of memory becomes essential. However, the one-off input of overly long texts is limited, as studies have shown that when input lengths exceed the LLMs' pre-trained text length, there is a dramatic decline in text generation capabilities. Moreover, simply extending the length of pre-training texts is impractical due to the difficulty in obtaining long text data and the substantial memory consumption costs this would entail for LLMs. Recent efforts have employed streaming inputs to alleviate the pressure of excessively long text inputs, but this approach can significantly impair the model's long-term memory capabilities. Motivated by this challenge, we introduce Streaming Infinite Retentive LLM (SirLLM), which allows LLMs to maintain longer memory during infinite-length dialogues without the need for fine-tuning. SirLLM utilizes the Token Entropy metric and a memory decay mechanism to filter key phrases, endowing LLMs with both long-lasting and flexible memory. We designed three distinct tasks and constructed three datasets to measure the effectiveness of SirLLM from various angles: (1) DailyDialog; (2) Grocery Shopping; (3) Rock-Paper-Scissors. Our experimental results robustly demonstrate that SirLLM can achieve stable and significant improvements across different LLMs and tasks, compellingly proving its effectiveness. When having a conversation, "A sir could forget himself," but SirLLM never does! Our code is publicly available at https://github.com/Zoeyyao27/SirLLM
StreamingLLM-style (https://arxiv.org/abs/2309.17453) attention sinks, combined with treating tokens whose probability distributions have high entropy as the important ones and keeping only those tokens' KV cache entries. Come to think of it, I'm curious how the LLMs currently aiming for streaming output plan to handle the context length problem.
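A minimal sketch of that selection rule (illustrative, not the SirLLM code; it ignores the paper's memory decay mechanism, and `keep` / `n_sink` are hypothetical parameters):

```python
import torch
import torch.nn.functional as F

def select_tokens_by_entropy(logits: torch.Tensor, keep: int, n_sink: int = 4) -> torch.Tensor:
    """Entropy-based KV cache filtering: always retain the first few attention-sink
    tokens, then retain the positions whose next-token distributions have the highest
    entropy, evicting the rest from the KV cache.

    logits: (seq_len, vocab_size) logits produced while encoding the sequence.
    Returns the indices of the positions to keep, in original order."""
    log_probs = F.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)  # per-token entropy
    entropy[:n_sink] = float("inf")                       # sinks are always retained
    keep_idx = entropy.topk(min(keep, logits.size(0))).indices
    return keep_idx.sort().values
```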
#efficiency