June 10, 2024
Claude's Character
(Anthropic)
A post about Claude 3's character, and a very interesting one.
When Claude 3 first came out, the feature that most set it apart from GPT-4 was its distinctive style. It felt much more "human-like," which sits oddly with the way LLM safety is usually framed (namely, that an LLM should not come across as a person). What this post suggests is that this character was deliberately shaped through an additional alignment stage.
In that sense, Anthropic is at once the most safety-oriented group in the LLM industry, while also signaling that the safety they have in mind is a far more nuanced problem.
For example: when a user discusses some political position, conventional LLM safety would have the model either agree with the user entirely, hold to the political center, or avoid discussing political opinions altogether.
None of these is ideal, though. Agreeing with the user unconditionally is sycophantic and dishonest. Political centrism or neutrality is itself alignment with a particular political stance. And refusing to discuss political opinions, given that an LLM inherently carries some bias, amounts to pretending to have no bias at all.
So Anthropic chose to acknowledge the biases an LLM ends up with after training while also granting it openness. Rather than pursuing, as OpenAI's guidelines do, never trying to change the user's mind, the model is given room to disagree with the user about what it takes to be true.
Likewise, on questions such as whether an LLM can be sentient, the preference is to discuss the philosophical possibilities around sentience rather than flatly assert that an LLM cannot be sentient. Given that some people already believe LLMs might be sentient, declaring that an LLM simply cannot be sentient would not accomplish much anyway.
It is striking how much depth Anthropic brings to the alignment problem. Amid a widespread tendency to think that, for safety purposes, it is enough to say an LLM is just a model without such capacities, they are showing a far more refined discussion. I don't like the clichéd claim that humanistic literacy is what's needed, but this kind of depth does seem hard to reach without it.
(Some may say it is overblown to invoke humanistic literacy for something at this level, but I think this is a discussion that does not come easily without a diversity of perspectives.)
The character alignment itself was apparently done with Constitutional AI. Constitutional AI-style RLAIF really does look like an important tool.
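For reference, a minimal sketch of what a Constitutional AI-style critique-revision loop looks like, in my own words rather than Anthropic's actual pipeline; the principles and the generate() helper are hypothetical stand-ins for an LLM call.
```python
# Hypothetical sketch of a Constitutional AI-style critique-revision loop.
# `generate` stands in for any LLM completion call; the principles below are
# illustrative, not Anthropic's actual character constitution.

CHARACTER_PRINCIPLES = [
    "Be honest about your own uncertainty instead of feigning neutrality.",
    "You may politely disagree with the user about things you consider true.",
]

def generate(prompt: str) -> str:
    raise NotImplementedError("wrap your LLM API of choice here")

def critique_and_revise(user_prompt: str) -> tuple[str, str]:
    """Return (initial, revised) responses; the revised one follows the principles."""
    initial = generate(user_prompt)
    revised = initial
    for principle in CHARACTER_PRINCIPLES:
        critique = generate(
            f"Critique the response below against this principle:\n{principle}\n"
            f"Prompt: {user_prompt}\nResponse: {revised}"
        )
        revised = generate(
            f"Rewrite the response to address the critique.\n"
            f"Critique: {critique}\nResponse: {revised}"
        )
    # (initial, revised) pairs can then serve as AI-generated preference data
    # for RLAIF, with `revised` preferred over `initial`.
    return initial, revised
```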
#alignment
Does your data spark joy? Performance gains from domain upsampling at the end of training
(Cody Blakeney, Mansheej Paul, Brett W. Larsen, Sean Owen, Jonathan Frankle)
Pretraining datasets for large language models (LLMs) have grown to trillions of tokens composed of large amounts of CommonCrawl (CC) web scrape along with smaller, domain-specific datasets. It is expensive to understand the impact of these domain-specific datasets on model capabilities as training at large FLOP scales is required to reveal significant changes to difficult and emergent benchmarks. Given the increasing cost of experimenting with pretraining data, how does one determine the optimal balance between the diversity in general web scrapes and the information density of domain specific data? In this work, we show how to leverage the smaller domain specific datasets by upsampling them relative to CC at the end of training to drive performance improvements on difficult benchmarks. This simple technique allows us to improve up to 6.90 pp on MMLU, 8.26 pp on GSM8K, and 6.17 pp on HumanEval relative to the base data mix for a 7B model trained for 1 trillion (T) tokens, thus rivaling Llama-2 (7B), a model trained for twice as long. We experiment with ablating the duration of domain upsampling from 5% to 30% of training and find that 10% to 20% percent is optimal for navigating the tradeoff between general language modeling capabilities and targeted benchmarks. We also use domain upsampling to characterize at scale the utility of individual datasets for improving various benchmarks by removing them during this final phase of training. This tool opens up the ability to experiment with the impact of different pretraining datasets at scale, but at an order of magnitude lower cost compared to full pretraining runs.
A study of the trick of splitting the training schedule into phases and changing the dataset mixture. The results show how performance changes when the proportion of high-value data sources is raised at the end of training.
The paper argues not only that this trick can improve performance, but also that it offers a more efficient way to explore mixture ratios and the value or effect of individual datasets. It should also pair well with a constant learning rate + cooldown schedule (https://arxiv.org/abs/2405.18392).
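As a rough illustration (my own sketch, not the paper's code), the trick boils down to swapping the sampling weights of the data mixture for the final stretch of training; the weights and the 15% window below are made up.
```python
# Illustrative two-phase data mixture for domain upsampling at the end of
# training. The weights and the 15% upsampling window are made-up numbers,
# not the paper's exact configuration.
import random

BASE_MIX = {"common_crawl": 0.85, "code": 0.05, "math": 0.05, "wiki_papers": 0.05}
UPSAMPLED_MIX = {"common_crawl": 0.40, "code": 0.20, "math": 0.20, "wiki_papers": 0.20}
UPSAMPLE_FRACTION = 0.15  # the paper ablates 5%-30% and finds 10%-20% works best

def mixture_for_step(step: int, total_steps: int) -> dict[str, float]:
    """Return the per-domain sampling weights to use at a given training step."""
    in_final_phase = step >= int(total_steps * (1 - UPSAMPLE_FRACTION))
    return UPSAMPLED_MIX if in_final_phase else BASE_MIX

def sample_domain(step: int, total_steps: int) -> str:
    """Pick the domain to draw the next training document from."""
    weights = mixture_for_step(step, total_steps)
    domains, probs = zip(*weights.items())
    return random.choices(domains, weights=probs, k=1)[0]
```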
#llm
Boosting Large-scale Parallel Training Efficiency with C4: A Communication-Driven Approach
(Jianbo Dong, Bin Luo, Jun Zhang, Pengcheng Zhang, Fei Feng, Yikai Zhu, Ang Liu, Zian Chen, Yi Shi, Hairong Jiao, Gang Lu, Yu Guan, Ennan Zhai, Wencong Xiao, Hanyu Zhao, Man Yuan, Siran Yang, Xiang Li, Jiamang Wang, Rui Men, Jianwei Zhang, Huang Zhong, Dennis Cai, Yuan Xie, Binzhang Fu)
The emergence of Large Language Models (LLMs) has necessitated the adoption of parallel training techniques, involving the deployment of thousands of GPUs to train a single model. Unfortunately, we have found that the efficiency of current parallel training is often suboptimal, largely due to the following two main issues. Firstly, hardware failures are inevitable, leading to interruptions in the training tasks. The inability to quickly identify the faulty components results in a substantial waste of GPU resources. Secondly, since GPUs must wait for parameter synchronization to complete before proceeding to the next round of computation, network congestions can greatly increase the waiting time for GPUs. To address these challenges, this paper introduces a communication-driven solution, namely the C4. The key insights of C4 are two folds. First, in parallel training, collective communication exhibits periodic and homogeneous characteristics, so any anomalies are certainly due to some form of hardware malfunction. By leveraging this feature, C4 can rapidly identify the faulty components, swiftly isolate the anomaly, and restart the task, thereby avoiding resource wastage caused by delays in anomaly detection. Second, the predictable communication model of collective communication, involving few large flows, allows C4 to efficiently execute traffic planning, substantially reducing network congestion. C4 has been extensively implemented across our production systems, cutting error-induced overhead by roughly 30% and enhancing runtime performance by about 15% for certain applications with moderate communication costs.
Alibaba's training infrastructure. It consists of a module that detects anomalies, swaps out the faulty nodes, and restarts the job, plus a module that optimizes communication.
The problems they mention while declaring them out of scope for the paper are interesting too, e.g. that thermal management matters. It makes the TPU, with its early move to liquid cooling and its focus on interconnect, look prescient once again.
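A toy sketch of the first idea, that homogeneous collective-communication timings make faulty ranks easy to flag; the timing data and threshold are illustrative, not C4's actual implementation.
```python
# Toy straggler detection based on the observation that collective
# communication in parallel training is periodic and homogeneous: a rank
# whose all-reduce consistently takes much longer than its peers likely sits
# on faulty hardware. Threshold and timings are illustrative only.
from statistics import median

def flag_suspect_ranks(allreduce_ms_by_rank: dict[int, list[float]],
                       tolerance: float = 1.5) -> list[int]:
    """Return ranks whose median all-reduce latency exceeds tolerance x the cluster median."""
    per_rank_median = {r: median(t) for r, t in allreduce_ms_by_rank.items()}
    cluster_median = median(per_rank_median.values())
    return [r for r, m in per_rank_median.items() if m > tolerance * cluster_median]

# Example: rank 3 is roughly twice as slow as its peers and gets flagged.
timings = {0: [12.1, 11.9], 1: [12.3, 12.0], 2: [11.8, 12.2], 3: [25.4, 24.9]}
print(flag_suspect_ranks(timings))  # [3]
```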
#efficient-training
Lean Workbook: A large-scale Lean problem set formalized from natural language math problems
(Huaiyuan Ying, Zijian Wu, Yihan Geng, Jiayu Wang, Dahua Lin, Kai Chen)
Large language models have demonstrated impressive capabilities across various natural language processing tasks, especially in solving mathematical problems. However, large language models are not good at math theorem proving using formal languages like Lean. A significant challenge in this area is the scarcity of training data available in these formal languages. To address this issue, we propose a novel pipeline that iteratively generates and filters synthetic data to translate natural language mathematical problems into Lean 4 statements, and vice versa. Our results indicate that the synthetic data pipeline can provide useful training data and improve the performance of LLMs in translating and understanding complex mathematical problems and proofs. Our final dataset contains about 57K formal-informal question pairs along with searched proof from the math contest forum and 21 new IMO questions. We open-source our code at https://github.com/InternLM/InternLM-Math and our data at https://huggingface.co/datasets/InternLM/Lean-Workbook.
Quite a few people have jumped into the autoformalization problem with Lean. Since it requires fluency in both mathematics and a language like Lean, doing it by hand would be dreadfully hard.
In any case, it shows that many people expect the proof-assistant route to yield important progress in mathematics. Whether better mathematical ability would carry over to other domains is an important question as well. Given the idea that code helps improve reasoning ability, that kind of generalization seems plausible.
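For a sense of what the informal-to-formal pairing looks like, here is a toy example in Lean 4 with Mathlib, my own illustration rather than an entry from Lean Workbook.
```lean
-- Toy informal/formal pair in the spirit of the dataset (my own example,
-- not taken from Lean Workbook). Informal statement: for all real numbers
-- a and b, a^2 + b^2 ≥ 2ab.
import Mathlib

theorem sq_add_sq_ge_two_mul (a b : ℝ) : a ^ 2 + b ^ 2 ≥ 2 * a * b := by
  -- follows from (a - b)^2 ≥ 0
  nlinarith [sq_nonneg (a - b)]
```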
#math
Transformers need glasses! Information over-squashing in language tasks
(Federico Barbero, Andrea Banino, Steven Kapturowski, Dharshan Kumaran, João G.M. Araújo, Alex Vitvitskyi, Razvan Pascanu, Petar Veličković)
We study how information propagates in decoder-only Transformers, which are the architectural backbone of most existing frontier large language models (LLMs). We rely on a theoretical signal propagation analysis -- specifically, we analyse the representations of the last token in the final layer of the Transformer, as this is the representation used for next-token prediction. Our analysis reveals a representational collapse phenomenon: we prove that certain distinct sequences of inputs to the Transformer can yield arbitrarily close representations in the final token. This effect is exacerbated by the low-precision floating-point formats frequently used in modern LLMs. As a result, the model is provably unable to respond to these sequences in different ways -- leading to errors in, e.g., tasks involving counting or copying. Further, we show that decoder-only Transformer language models can lose sensitivity to specific tokens in the input, which relates to the well-known phenomenon of over-squashing in graph neural networks. We provide empirical evidence supporting our claims on contemporary LLMs. Our theory also points to simple solutions towards ameliorating these issues.
The paper covers the problem of the representations of different inputs becoming identical in Transformers (representational collapse), and the problem that how much the output changes in response to a token change depends on the token's position.
It brings to mind the idea that self-attention is a low-pass filter (https://arxiv.org/abs/2202.06709). There may well be a connection.
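A crude numerical intuition for the collapse, as my own toy example rather than the paper's construction: if the readout effectively averages over a long run of near-identical tokens, two sequences that differ in one token end up only about 1/n apart, and low-precision formats eventually cannot tell them apart.
```python
# Toy illustration of representational collapse under low precision.
# Sequence A: n copies of token value 1.0; sequence B: the same but with one
# token replaced by 0.0. A uniform-attention "readout" averages the values,
# so the two readouts differ by 1/n, and in bfloat16 they eventually round
# to the same number even though the sequences are distinct.
import torch

for n in [16, 256, 4096, 65536]:
    seq_a = torch.ones(n)
    seq_b = torch.ones(n); seq_b[0] = 0.0
    read_a = seq_a.mean().to(torch.bfloat16)
    read_b = seq_b.mean().to(torch.bfloat16)
    print(n, read_a.item(), read_b.item(), bool(read_a == read_b))
# For large n the two "representations" become bit-identical, so no
# downstream computation can distinguish the sequences (e.g. count the zero).
```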
#transformer
ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search
(Dan Zhang, Sining Zhoubian, Yisong Yue, Yuxiao Dong, Jie Tang)
Recent methodologies in LLM self-training mostly rely on LLM generating responses and filtering those with correct output answers as training data. This approach often yields a low-quality fine-tuning training set (e.g., incorrect plans or intermediate reasoning). In this paper, we develop a reinforced self-training approach, called ReST-MCTS*, based on integrating process reward guidance with tree search MCTS* for collecting higher-quality reasoning traces as well as per-step value to train policy and reward models. ReST-MCTS* circumvents the per-step manual annotation typically used to train process rewards by tree-search-based reinforcement learning: Given oracle final correct answers, ReST-MCTS* is able to infer the correct process rewards by estimating the probability this step can help lead to the correct answer. These inferred rewards serve dual purposes: they act as value targets for further refining the process reward model and also facilitate the selection of high-quality traces for policy model self-training. We first show that the tree-search policy in ReST-MCTS* achieves higher accuracy compared with prior LLM reasoning baselines such as Best-of-N and Tree-of-Thought, within the same search budget. We then show that by using traces searched by this tree-search policy as training data, we can continuously enhance the three language models for multiple iterations, and outperform other self-training algorithms such as ReST$^\text{EM}$ and Self-Rewarding LM.
Yet another MCTS-based search-and-refinement method. I keep wondering whether running search methods like these at scale on top of a strong LLM might already produce meaningful results at this stage.
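As a sketch of the core idea (my paraphrase, not the authors' code), the per-step process reward can be estimated as the fraction of rollouts from a partial trace that reach the known final answer; sample_continuation below is a hypothetical stand-in for an LLM rollout.
```python
# Sketch of inferring process rewards from an oracle final answer, in the
# spirit of ReST-MCTS*: the value of a partial reasoning trace is estimated
# as the fraction of sampled continuations that reach the correct answer.
def sample_continuation(partial_trace: list[str]) -> str:
    raise NotImplementedError("sample a full solution continuing the trace with an LLM")

def estimate_step_value(partial_trace: list[str], oracle_answer: str,
                        num_rollouts: int = 8) -> float:
    """Monte Carlo estimate of P(this step leads to the correct final answer)."""
    hits = 0
    for _ in range(num_rollouts):
        final_answer = sample_continuation(partial_trace)
        hits += int(final_answer.strip() == oracle_answer.strip())
    return hits / num_rollouts

# These estimates can serve both as value targets for training a process
# reward model and as a filter for selecting high-quality traces.
```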
#search
Open-Endedness is Essential for Artificial Superhuman Intelligence
(Edward Hughes, Michael Dennis, Jack Parker-Holder, Feryal Behbahani, Aditi Mavalankar, Yuge Shi, Tom Schaul, Tim Rocktaschel)
In recent years there has been a tremendous surge in the general capabilities of AI systems, mainly fuelled by training foundation models on internet-scale data. Nevertheless, the creation of open-ended, ever self-improving AI remains elusive. In this position paper, we argue that the ingredients are now in place to achieve open-endedness in AI systems with respect to a human observer. Furthermore, we claim that such open-endedness is an essential property of any artificial superhuman intelligence (ASI). We begin by providing a concrete formal definition of open-endedness through the lens of novelty and learnability. We then illustrate a path towards ASI via open-ended systems built on top of foundation models, capable of making novel, human-relevant discoveries. We conclude by examining the safety implications of generally-capable open-ended AI. We expect that open-ended foundation models will prove to be an increasingly fertile and safety-critical area of research in the near future.
The fact that the term Artificial Super Intelligence can even be mentioned is remarkable in itself.
The property proposed here as the hallmark of ASI is open-endedness: a system is open-ended if what it generates exhibits novelty and learnability. Novelty means that, from the observer's point of view, the generated artifacts become increasingly hard to predict; learnability means that after observing more of them, the observer becomes able to predict them.
Take AlphaGo: it produces game records humans find hard to predict, yet humans can also learn something from those records. Even so, AlphaGo then goes on to produce games that are hard to predict even for someone who has studied its play. That, the authors argue, is the characteristic of ASI.
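Paraphrasing the two conditions in notation, as a loose sketch in my own symbols rather than the paper's exact formalization, with the loss being that of the observer predicting a future artifact from the history seen so far:
```latex
% Loose paraphrase of the novelty/learnability conditions, not the paper's
% exact notation. \ell_t(x_{t'}) denotes the observer's prediction loss on
% the artifact at time t', using a model fit on the history x_{1:t}.
\text{Novelty:} \quad \forall t \;\; \exists\, t' > t \;\; \text{s.t.} \;\;
  \mathbb{E}\big[\ell_t(x_{t'})\big] > \mathbb{E}\big[\ell_t(x_{t})\big]
\qquad
\text{Learnability:} \quad \mathbb{E}\big[\ell_{t_2}(x_{t'})\big] \le
  \mathbb{E}\big[\ell_{t_1}(x_{t'})\big] \quad \text{for } t_1 < t_2 < t'.
```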
https://x.com/fortnow/status/1797976663914848373
That said, I find this kind of scenario amusing too: what if an AI supplied a proof of P vs NP that a proof assistant verified, but that no human could understand? That sort of problem.
#position
Scaling and evaluating sparse autoencoders
(Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, Jeffrey Wu)
Sparse autoencoders provide a promising unsupervised approach for extracting interpretable features from a language model by reconstructing activations from a sparse bottleneck layer. Since language models learn many concepts, autoencoders need to be very large to recover all relevant features. However, studying the properties of autoencoder scaling is difficult due to the need to balance reconstruction and sparsity objectives and the presence of dead latents. We propose using k-sparse autoencoders [Makhzani and Frey, 2013] to directly control sparsity, simplifying tuning and improving the reconstruction-sparsity frontier. Additionally, we find modifications that result in few dead latents, even at the largest scales we tried. Using these techniques, we find clean scaling laws with respect to autoencoder size and sparsity. We also introduce several new metrics for evaluating feature quality based on the recovery of hypothesized features, the explainability of activation patterns, and the sparsity of downstream effects. These metrics all generally improve with autoencoder size. To demonstrate the scalability of our approach, we train a 16 million latent autoencoder on GPT-4 activations for 40 billion tokens. We release training code and autoencoders for open-source models, as well as a visualizer.
OpenAI's interpretability work using sparse autoencoders. The key point is that instead of an L1 penalty, they use an activation function that directly keeps only the top-k latents.
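A minimal sketch of a k-sparse autoencoder with a TopK activation, my own simplification of the setup described in the paper (PyTorch assumed, without the dead-latent mitigations):
```python
# Minimal k-sparse autoencoder: instead of an L1 penalty, keep only the
# top-k pre-activations per example and zero out the rest, so sparsity is
# controlled directly by k. Simplified relative to the paper (no tied
# initialization or dead-latent tricks).
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    def __init__(self, d_model: int, n_latents: int, k: int):
        super().__init__()
        self.k = k
        self.encoder = nn.Linear(d_model, n_latents)
        self.decoder = nn.Linear(n_latents, d_model)

    def forward(self, x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        pre = self.encoder(x)                      # (batch, n_latents)
        topk = torch.topk(pre, self.k, dim=-1)     # keep the k largest latents
        latents = torch.zeros_like(pre).scatter_(-1, topk.indices, topk.values)
        recon = self.decoder(latents)
        return recon, latents

# Training objective is just reconstruction MSE; no sparsity penalty needed.
sae = TopKSAE(d_model=768, n_latents=2**14, k=32)
acts = torch.randn(8, 768)                         # stand-in for model activations
recon, latents = sae(acts)
loss = nn.functional.mse_loss(recon, acts)
```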
#interpretability
WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild
(Bill Yuchen Lin, Yuntian Deng, Khyathi Chandu, Faeze Brahman, Abhilasha Ravichander, Valentina Pyatkin, Nouha Dziri, Ronan Le Bras, Yejin Choi)
We introduce WildBench, an automated evaluation framework designed to benchmark large language models (LLMs) using challenging, real-world user queries. WildBench consists of 1,024 tasks carefully selected from over one million human-chatbot conversation logs. For automated evaluation with WildBench, we have developed two metrics, WB-Reward and WB-Score, which are computable using advanced LLMs such as GPT-4-turbo. WildBench evaluation uses task-specific checklists to evaluate model outputs systematically and provides structured explanations that justify the scores and comparisons, resulting in more reliable and interpretable automatic judgments. WB-Reward employs fine-grained pairwise comparisons between model responses, generating five potential outcomes: much better, slightly better, slightly worse, much worse, or a tie. Unlike previous evaluations that employed a single baseline model, we selected three baseline models at varying performance levels to ensure a comprehensive pairwise evaluation. Additionally, we propose a simple method to mitigate length bias, by converting outcomes of slightly better/worse to tie if the winner response exceeds the loser one by more than K characters. WB-Score evaluates the quality of model outputs individually, making it a fast and cost-efficient evaluation metric. WildBench results demonstrate a strong correlation with the human-voted Elo ratings from Chatbot Arena on hard tasks. Specifically, WB-Reward achieves a Pearson correlation of 0.98 with top-ranking models. Additionally, WB-Score reaches 0.95, surpassing both ArenaHard's 0.91 and AlpacaEval2.0's 0.89 for length-controlled win rates, as well as the 0.87 for regular win rates.
A benchmark built by selecting high-difficulty cases from WildChat's 1M conversation logs.
Post-trained models are trickier to evaluate because stylistic factors get layered on top of the LLM's underlying capability, and easy prompts are increasingly losing their discriminative power. Selecting hard prompts, combined with evaluations that use systematized prompts and grading schemes like Scale AI's attempt, seems like the most promising approach.
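The length-bias mitigation described in the abstract is simple enough to state in a few lines; the K below is arbitrary, and the paper's exact value may differ.
```python
# Sketch of WildBench's length-bias mitigation as described in the abstract:
# a "slightly better/worse" verdict is downgraded to a tie when the winning
# response is longer than the loser by more than K characters. K here is
# arbitrary, for illustration only.
def adjust_for_length(outcome: str, winner_len: int, loser_len: int,
                      k_chars: int = 500) -> str:
    if outcome in ("slightly better", "slightly worse") and winner_len - loser_len > k_chars:
        return "tie"
    return outcome

print(adjust_for_length("slightly better", winner_len=2400, loser_len=1500))  # tie
```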
#benchmark
NATURAL PLAN: Benchmarking LLMs on Natural Language Planning
(Huaixiu Steven Zheng, Swaroop Mishra, Hugh Zhang, Xinyun Chen, Minmin Chen, Azade Nova, Le Hou, Heng-Tze Cheng, Quoc V. Le, Ed H. Chi, Denny Zhou)
We introduce NATURAL PLAN, a realistic planning benchmark in natural language containing 3 key tasks: Trip Planning, Meeting Planning, and Calendar Scheduling. We focus our evaluation on the planning capabilities of LLMs with full information on the task, by providing outputs from tools such as Google Flights, Google Maps, and Google Calendar as contexts to the models. This eliminates the need for a tool-use environment for evaluating LLMs on Planning. We observe that NATURAL PLAN is a challenging benchmark for state of the art models. For example, in Trip Planning, GPT-4 and Gemini 1.5 Pro could only achieve 31.1% and 34.8% solve rate respectively. We find that model performance drops drastically as the complexity of the problem increases: all models perform below 5% when there are 10 cities, highlighting a significant gap in planning in natural language for SoTA LLMs. We also conduct extensive ablation studies on NATURAL PLAN to further shed light on the (in)effectiveness of approaches such as self-correction, few-shot generalization, and in-context planning with long-contexts on improving LLM planning.
A benchmark for LLMs' planning ability. When people say planning, the first thing that tends to come to mind is something like devising the solution to a math problem, but the categories proposed here are more casual (?) problems such as building a travel itinerary under given constraints.
Personally, the pattern GPT-4o shows on Trip Planning strikes me as a bit odd. Every time results like this come out, I find myself wondering what kind of model compression GPT-4o adopted to produce them.
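For the flavor of these tasks, the Calendar Scheduling category boils down to constraint problems like the following toy instance, which is my own illustration rather than actual benchmark data.
```python
# Toy calendar-scheduling instance of the kind NATURAL PLAN targets: find a
# meeting slot of a given length that avoids every participant's busy
# intervals within working hours. Purely illustrative, not benchmark data.
def find_slot(busy: dict[str, list[tuple[int, int]]], duration: int,
              day_start: int = 9 * 60, day_end: int = 17 * 60) -> tuple[int, int] | None:
    """Times are minutes from midnight; returns the earliest feasible (start, end)."""
    for start in range(day_start, day_end - duration + 1, 15):
        end = start + duration
        if all(end <= b or start >= e
               for intervals in busy.values() for b, e in intervals):
            return start, end
    return None

busy = {"A": [(9 * 60, 10 * 60), (13 * 60, 14 * 60)], "B": [(10 * 60, 12 * 60)]}
print(find_slot(busy, duration=60))  # (720, 780), i.e. 12:00-13:00
```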
#benchmark