June 28, 2024
Gemma 2: Improving Open Language Models at a Practical Size
(Gemma Team, Google DeepMind)
In this work, we introduce Gemma 2, a new addition to the Gemma family of lightweight, state-of-the-art open models, ranging in scale from 2 billion to 27 billion parameters. The 9 billion and 27 billion parameter models are available today, with a 2 billion parameter model to be released shortly. In this new version, we provide several technical modifications to our architecture, such as interleaving local-global attentions (Beltagy et al., 2020a) and group-query attention (Ainslie et al., 2023). We also train the 2B and 9B models with knowledge distillation (Hinton et al., 2015) instead of next token prediction. The resulting models deliver the best performance for their size, and even offer competitive alternatives to models that are 2-3× bigger. We release all our models to the community.
Gemma 2 is out: 2.6B (2T tokens), 9B (8T), and 27B (13T) models. Gemma's unusually large FFN has been slimmed down in exchange for a deeper model. On top of that, they adopted a combination of 4096-token local attention and 8192-token global attention, tanh-based attention logit capping as introduced in Grok-1, and GQA with 2 groups. The claim of using both pre-norm and post-norm shows up again, and this time Gemma 2 really does seem to use post-norm.
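For reference, the soft-capping trick replaces hard clipping with a smooth squash of the logits. A minimal sketch in PyTorch; the cap value of 50.0 is the attention-logit cap reported for Gemma 2, but treat the exact constant and shapes as illustrative:

```python
import torch

def soft_cap(logits: torch.Tensor, cap: float = 50.0) -> torch.Tensor:
    # Smoothly squash logits into (-cap, cap); unlike hard clipping,
    # gradients stay non-zero everywhere.
    return cap * torch.tanh(logits / cap)

scores = 100.0 * torch.randn(2, 8, 128, 128)  # (batch, heads, q_len, k_len)
capped = soft_cap(scores)
assert capped.abs().max() < 50.0
```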
The benchmark scores look good: Gemma 2 9B beats Llama 3 8B, and Gemma 2 27B comes close to Llama 3 70B. On Chatbot Arena it also sits above Claude 3 Sonnet and Llama 3 70B. Though they did train on Chatbot Arena prompts.
Setting those aside, the most important point is the aggressive use of knowledge distillation, which also came up with Gemini 1.5 Flash. What is more striking is that they frame distillation over a large number of tokens as a way to simulate training well beyond the number of tokens actually available. It's worth thinking about what that implies. On distillation, three papers cited in connection with Gemini 1.5 Flash stand out (https://arxiv.org/abs/1804.03235, https://arxiv.org/abs/2106.05237, https://arxiv.org/abs/2306.13649) and are worth consulting.
In the SFT stage, too, they trained on teacher-generated responses to the prompts while also applying distillation there.
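As a rough illustration of what that objective looks like, here is a minimal sketch of token-level distillation, where the student is trained toward the teacher's full next-token distribution rather than the one-hot next token (names, shapes, and the temperature are illustrative, not the paper's exact setup):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    # KL(teacher || student) over the vocabulary, summed over positions
    # and averaged over the batch.
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature**2

student_logits = torch.randn(4, 128, 32000, requires_grad=True)  # (batch, seq, vocab)
teacher_logits = torch.randn(4, 128, 32000)  # teacher runs without gradients
distillation_loss(student_logits, teacher_logits).backward()
```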
Overall, the constant tuning of these hyperparameters suggests they matter, and combining local and global attention looks like an important trick, given that Character.AI also described it recently (https://research.character.ai/optimizing-inference/); see the mask sketch below. Knowledge distillation also seems worthwhile, though it does presuppose having properly trained a large model first.
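The local/global combination is easy to picture in terms of attention masks: alternating layers see either a sliding window or the full causal context. A minimal sketch, using Gemma 2's reported 4096/8192 spans but an assumed layer layout:

```python
import torch

def local_global_masks(seq_len: int, window: int, n_layers: int):
    # Boolean masks, True where attention is allowed.
    i = torch.arange(seq_len)[:, None]  # query positions
    j = torch.arange(seq_len)[None, :]  # key positions
    global_mask = j <= i                         # full causal attention
    local_mask = global_mask & (i - j < window)  # sliding-window attention
    # Alternate local and global layers; the exact layout is an assumption.
    return [local_mask if layer % 2 == 0 else global_mask
            for layer in range(n_layers)]

masks = local_global_masks(seq_len=8192, window=4096, n_layers=4)
```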
It's up on AI Studio, and its Korean is quite good. Worth a test run.
#llm #distillation
LLM Critics Help Catch LLM Bugs
(Nat McAleese, Rai (Michael Pokorny), Juan Felipe Cerón Uribe, Evgenia Nitishinskaya, Maja Trębacz, Jan Leike)
Reinforcement learning from human feedback (RLHF) is fundamentally limited by the capacity of humans to correctly evaluate model output. To improve human evaluation ability and overcome that limitation this work trains “critic” models that help humans to more accurately evaluate model-written code. These critics are themselves LLMs trained with RLHF to write natural language feedback highlighting problems in code from real-world assistant tasks. On code containing naturally occurring LLM errors model-written critiques are preferred over human critiques in 63% of cases, and human evaluation finds that models catch more bugs than human contractors paid for code review. We further confirm that our fine-tuned LLM critics can successfully identify hundreds of errors in ChatGPT training data rated as “flawless”, even though the majority of those tasks are non-code tasks and thus out-of-distribution for the critic model. Critics can have limitations of their own, including hallucinated bugs that could mislead humans into making mistakes they might have otherwise avoided, but human-machine teams of critics and contractors catch similar numbers of bugs to LLM critics while hallucinating less than LLMs alone.
An LLM that catches bugs in LLM-generated code. The pipeline is a typical RLHF setup, but with two key differences: LLM-generated responses were deliberately tampered with to insert bugs, and the annotation was done adversarially, aiming for bugs the LLM fails to catch.
#rlhf #feedback
Meta Large Language Model Compiler: Foundation Models of Compiler Optimization
(Chris Cummins, Volker Seeker, Dejan Grubisic, Baptiste Rozière, Jonas Gehring, Gabriel Synnaeve, Hugh Leather)
Large Language Models (LLMs) have demonstrated remarkable capabilities across a variety of software engineering and coding tasks. However, their application in the domain of code and compiler optimization remains underexplored. Training LLMs is resource-intensive, requiring substantial GPU hours and extensive data collection, which can be prohibitive. To address this gap, we introduce Meta Large Language Model Compiler (LLM Compiler), a suite of robust, openly available, pre-trained models specifically designed for code optimization tasks. Built on the foundation of Code Llama, LLM Compiler enhances the understanding of compiler intermediate representations (IRs), assembly language, and optimization techniques. The model has been trained on a vast corpus of 546 billion tokens of LLVM-IR and assembly code and has undergone instruction fine-tuning to interpret compiler behavior. LLM Compiler is released under a bespoke commercial license to allow wide reuse and is available in two sizes: 7 billion and 13 billion parameters. We also present fine-tuned versions of the model, demonstrating its enhanced capabilities in optimizing code size and disassembling from x86_64 and ARM assembly back into LLVM-IR. These achieve 77% of the optimising potential of an autotuning search, and 45% disassembly round trip (14% exact match). This release aims to provide a scalable, cost-effective foundation for further research and development in compiler optimization by both academic researchers and industry practitioners.
A model trained on assembly and IR, then instruction-tuned on tasks such as predicting compiler optimization results, choosing optimization flags for a given piece of code, and disassembly.
Using IR was also tried in The Stack v2 (https://arxiv.org/abs/2402.19173). I'm curious what this means for other tasks. My general guess is that higher-abstraction languages are more helpful for reasoning ability.
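For a sense of the raw material, here is a minimal sketch of producing an (unoptimized IR, optimized IR) pair of the kind such a model trains on, assuming clang and opt are on PATH; this mirrors the data described, not the paper's actual pipeline:

```python
import pathlib
import subprocess
import tempfile

src = "int square(int x) { return x * x; }"

with tempfile.TemporaryDirectory() as d:
    c_file = pathlib.Path(d) / "square.c"
    c_file.write_text(src)
    # Emit unoptimized LLVM-IR from C source.
    unopt_ir = subprocess.run(
        ["clang", "-S", "-emit-llvm", "-O0", "-o", "-", str(c_file)],
        capture_output=True, text=True, check=True,
    ).stdout
    # Run a standard optimization pipeline over the IR.
    opt_ir = subprocess.run(
        ["opt", "-S", "-O2", "-"],
        input=unopt_ir, capture_output=True, text=True, check=True,
    ).stdout

pair = {"input_ir": unopt_ir, "flags": "-O2", "target_ir": opt_ir}
```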
#llm #code
Resolving Discrepancies in Compute-Optimal Scaling of Language Models
(Tomer Porian, Mitchell Wortsman, Jenia Jitsev, Ludwig Schmidt, Yair Carmon)
Kaplan et al. and Hoffmann et al. developed influential scaling laws for the optimal model size as a function of the compute budget, but these laws yield substantially different predictions. We explain the discrepancy by reproducing the Kaplan scaling law on two datasets (OpenWebText2 and RefinedWeb) and identifying three factors causing the difference: last layer computational cost, warmup duration, and scale-dependent optimizer tuning. With these factors corrected, we obtain excellent agreement with the Hoffmann et al. (i.e., "Chinchilla") scaling law. Counter to a hypothesis of Hoffmann et al., we find that careful learning rate decay is not essential for the validity of their scaling law. As a secondary result, we derive scaling laws for the optimal learning rate and batch size, finding that tuning the AdamW β2 parameter is essential at lower batch sizes.
A second attempt at reconciling the Kaplan and Chinchilla scaling laws. The causes identified here:
The Kaplan scaling law dropped the embedding parameters, underestimating the LM head's FLOPs contribution. This lines up with an earlier paper (https://arxiv.org/abs/2406.12907); see the sketch after this list.
Warmup was too long for the small models.
Batch sizes that were too large. Following DeepSeek LLM's approach (https://arxiv.org/abs/2401.02954), they fit scaling laws for batch size and learning rate, and find that at small batch sizes Adam's β2 needs adjusting.
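To make the first point concrete, a rough sketch of FLOPs per token with and without the LM head, using the common ~6 × parameters approximation (the example shapes are illustrative):

```python
def flops_per_token(n_layers, d_model, vocab_size, include_head=True):
    # ~6 FLOPs per parameter per token (forward + backward).
    body = 12 * n_layers * d_model**2                    # attention + MLP weights
    head = vocab_size * d_model if include_head else 0   # untied LM head
    return 6 * (body + head)

for n_layers, d_model in [(6, 512), (12, 768), (48, 6144)]:
    full = flops_per_token(n_layers, d_model, vocab_size=50257)
    body_only = flops_per_token(n_layers, d_model, vocab_size=50257,
                                include_head=False)
    print(f"L={n_layers:2d}, d={d_model:4d}: head adds {full / body_only - 1:+.0%}")
```

At small scales the head can more than double the FLOPs estimate, while at large scales it is negligible, which is exactly the kind of scale-dependent distortion that shifts a fitted scaling law.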
Contrary to the Chinchilla authors' conjecture, LR decay turned out to have little effect.
Compute-optimal scaling arguably isn't all that important right now, but as a guide to the pitfalls of fitting scaling laws this seems like a valuable reference.
#scaling-law
LiveBench: A Challenging, Contamination-Free LLM Benchmark
(Colin White, Samuel Dooley, Manley Roberts, Arka Pal, Ben Feuer, Siddhartha Jain, Ravid Shwartz-Ziv, Neel Jain, Khalid Saifullah, Siddartha Naidu, Chinmay Hegde, Yann LeCun, Tom Goldstein, Willie Neiswanger, Micah Goldblum)
Test set contamination, wherein test data from a benchmark ends up in a newer model's training set, is a well-documented obstacle for fair LLM evaluation and can quickly render benchmarks obsolete. To mitigate this, many recent benchmarks crowdsource new prompts and evaluations from human or LLM judges; however, these can introduce significant biases, and break down when scoring hard questions. In this work, we introduce a new benchmark for LLMs designed to be immune to both test set contamination and the pitfalls of LLM judging and human crowdsourcing. We release LiveBench, the first benchmark that (1) contains frequently-updated questions from recent information sources, (2) scores answers automatically according to objective ground-truth values, and (3) contains a wide variety of challenging tasks, spanning math, coding, reasoning, language, instruction following, and data analysis. To achieve this, LiveBench contains questions that are based on recently-released math competitions, arXiv papers, news articles, and datasets, and it contains harder, contamination-free versions of tasks from previous benchmarks such as Big-Bench Hard, AMPS, and IFEval. We evaluate many prominent closed-source models, as well as dozens of open-source models ranging from 0.5B to 110B in size. LiveBench is difficult, with top models achieving below 65% accuracy. We release all questions, code, and model answers. Questions will be added and updated on a monthly basis, and we will release new tasks and harder versions of tasks over time so that LiveBench can distinguish between the capabilities of LLMs as they improve in the future. We welcome community engagement and collaboration for expanding the benchmark tasks and models.
An automated LLM benchmark that needs no human or LLM judging while keeping contamination in check. That said, Claude 3.5 Sonnet's coding score is strikingly dominant; I suspect something may be getting missed there.
That aside, the rankings come out looking quite clean.
#benchmark
From Artificial Needles to Real Haystacks: Improving Retrieval Capabilities in LLMs by Finetuning on Synthetic Data
(Zheyang Xiong, Vasilis Papageorgiou, Kangwook Lee, Dimitris Papailiopoulos)
Recent studies have shown that Large Language Models (LLMs) struggle to accurately retrieve information and maintain reasoning capabilities when processing long-context inputs. To address these limitations, we propose a finetuning approach utilizing a carefully designed synthetic dataset comprising numerical key-value retrieval tasks. Our experiments on models like GPT-3.5 Turbo and Mistral 7B demonstrate that finetuning LLMs on this dataset significantly improves LLMs' information retrieval and reasoning capabilities in longer-context settings. We present an analysis of the finetuned models, illustrating the transfer of skills from synthetic to real task evaluations (e.g., 10.5% improvement on 20 documents MDQA at position 10 for GPT-3.5 Turbo). We also find that finetuned LLMs' performance on general benchmarks remains almost constant while LLMs finetuned on other baseline long-context augmentation data can encourage hallucination (e.g., on TriviaQA, Mistral 7B finetuned on our synthetic data cause no performance drop while other baseline data can cause a drop that ranges from 2.33% to 6.19%). Our study highlights the potential of finetuning on synthetic data for improving the performance of LLMs on longer-context tasks.
A result showing that a purely synthetic task, like key-value retrieval over dictionaries, can improve performance on tasks like long-context QA. Together with the recent case of long-context ability generalizing from text to images and from images to video (https://arxiv.org/abs/2406.16852), long-context capability seems to generalize remarkably well.
It feels like a clean approach: specifying the answer format in the prompt keeps the model from overfitting to one particular format, and it may suppress hallucination better than training on artificially constructed long-context QA.
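A minimal sketch of generating one such synthetic key-value retrieval example; the numeric keys and the exact prompt wording are illustrative, not the paper's template:

```python
import random

def make_example(n_pairs: int = 50, seed: int = 0):
    rng = random.Random(seed)
    kv = {rng.randrange(10**7, 10**8): rng.randrange(10**7, 10**8)
          for _ in range(n_pairs)}
    query = rng.choice(list(kv))
    prompt = (
        "Below is a dictionary of key-value pairs.\n"
        f"{kv}\n"
        f"What is the value associated with key {query}? "
        "Answer with the number only."  # pin the answer format in the prompt
    )
    return prompt, str(kv[query])

prompt, answer = make_example()
```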
#long-context