June 25, 2024
WARP: On the Benefits of Weight Averaged Rewarded Policies
(Alexandre Ramé, Johan Ferret, Nino Vieillard, Robert Dadashi, Léonard Hussenot, Pierre-Louis Cedoz, Pier Giuseppe Sessa, Sertan Girgin, Arthur Douillard, Olivier Bachem)
Reinforcement learning from human feedback (RLHF) aligns large language models (LLMs) by encouraging their generations to have high rewards, using a reward model trained on human preferences. To prevent the forgetting of pre-trained knowledge, RLHF usually incorporates a KL regularization; this forces the policy to remain close to its supervised fine-tuned initialization, though it hinders the reward optimization. To tackle the trade-off between KL and reward, in this paper we introduce a novel alignment strategy named Weight Averaged Rewarded Policies (WARP). WARP merges policies in the weight space at three distinct stages. First, it uses the exponential moving average of the policy as a dynamic anchor in the KL regularization. Second, it applies spherical interpolation to merge independently fine-tuned policies into a new enhanced one. Third, it linearly interpolates between this merged model and the initialization, to recover features from pre-training. This procedure is then applied iteratively, with each iteration's final model used as an advanced initialization for the next, progressively refining the KL-reward Pareto front, achieving superior rewards at fixed KL. Experiments with GEMMA policies validate that WARP improves their quality and alignment, outperforming other open-source LLMs.
An alignment algorithm with the following shape: train the policy under a KL penalty against its own exponential moving average (EMA), merge several independently trained policies via spherical interpolation, form a new initial policy by linearly interpolating between the merged model and the original initialization, then repeat. Overall it feels like exploring faster in several directions at once, consolidating the exploration results, pulling them back so they do not drift too far from the initial model, and then exploring again.
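Based only on the abstract, here is a minimal sketch of the three weight-space operations on flat parameter vectors (numpy arrays standing in for full model state dicts; the hyperparameters mu, t, and eta are illustrative, not the paper's):

```python
import numpy as np

def ema_update(anchor, policy, mu=0.01):
    """Stage 1: the KL anchor trails the trained policy as an exponential moving average."""
    return (1 - mu) * anchor + mu * policy

def slerp(init, theta_a, theta_b, t=0.5):
    """Stage 2: spherical interpolation of two task vectors (deltas from the shared init)."""
    da, db = theta_a - init, theta_b - init
    cos = np.dot(da, db) / (np.linalg.norm(da) * np.linalg.norm(db))
    omega = np.arccos(np.clip(cos, -1.0, 1.0))
    if omega < 1e-8:  # near-parallel task vectors: fall back to linear interpolation
        return init + (1 - t) * da + t * db
    return init + (np.sin((1 - t) * omega) * da + np.sin(t * omega) * db) / np.sin(omega)

def liti(init, merged, eta=0.3):
    """Stage 3: interpolate back toward the init to recover pretrained features."""
    return init + eta * (merged - init)
```

Iterating the procedure then means feeding liti's output back in as the initialization (and first KL anchor) of the next RL round.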
#alignment #rlhf
Long Context Transfer from Language to Vision
(Peiyuan Zhang, Kaichen Zhang, Bo Li, Guangtao Zeng, Jingkang Yang, Yuanhan Zhang, Ziyue Wang, Haoran Tan, Chunyuan Li, Ziwei Liu)
Video sequences offer valuable temporal information, but existing large multimodal models (LMMs) fall short in understanding extremely long videos. Many works address this by reducing the number of visual tokens using visual resamplers. Alternatively, in this paper, we approach this problem from the perspective of the language model. By simply extrapolating the context length of the language backbone, we enable LMMs to comprehend orders of magnitude more visual tokens without any video training. We call this phenomenon long context transfer and carefully ablate its properties. To effectively measure LMMs' ability to generalize to long contexts in the vision modality, we develop V-NIAH (Visual Needle-In-A-Haystack), a purely synthetic long vision benchmark inspired by the language model's NIAH test. Our proposed Long Video Assistant (LongVA) can process 2000 frames or over 200K visual tokens without additional complexities. With its extended context length, LongVA achieves state-of-the-art performance on Video-MME among 7B-scale models by densely sampling more input frames. Our work is open-sourced at https://github.com/EvolvingLMMs-Lab/LongVA.
They align a long-context LLM on image-text pairs only, then feed it video instead of images at test time. The result suggests that long-context ability acquired on text generalizes to images, i.e., it can carry across modalities.
My overall hunch is that long-context ability comes down to the positional encoding and to which attention distributions the model has experienced.
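On the positional-encoding point: a common way to extrapolate the language backbone's context is to enlarge the RoPE base frequency. A minimal sketch of that idea (the base values and lengths below are illustrative assumptions, not necessarily LongVA's exact recipe):

```python
import numpy as np

def rope_angles(positions, dim=128, base=10_000.0):
    """Rotary angles per position: a larger base stretches every wavelength,
    so far-away positions map to angles the model has effectively seen before."""
    inv_freq = 1.0 / base ** (np.arange(0, dim, 2) / dim)
    return np.outer(positions, inv_freq)  # (num_positions, dim // 2)

# Train at, say, 4K positions with the stock base, then use a much larger
# base to cover on the order of 200K visual tokens (numbers are illustrative):
short_ctx = rope_angles(np.arange(4_096))
long_ctx = rope_angles(np.arange(200_000), base=1_000_000.0)
```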
#video-language #long-context
Reconciling Kaplan and Chinchilla Scaling Laws
(Tim Pearce, Jinyeop Song)
Kaplan et al. [2020] ('Kaplan') and Hoffmann et al. [2022] ('Chinchilla') studied the scaling behavior of transformers trained on next-token language prediction. These studies produced different estimates for how the number of parameters (N) and training tokens (D) should be set to achieve the lowest possible loss for a given compute budget (C). Kaplan: N_optimal ∝ C^0.73, Chinchilla: N_optimal ∝ C^0.50. This note finds that much of this discrepancy can be attributed to Kaplan counting non-embedding rather than total parameters, combined with their analysis being performed at small scale. Simulating the Chinchilla study under these conditions produces biased scaling coefficients close to Kaplan's. Hence, this note reaffirms Chinchilla's scaling coefficients, by explaining the cause of Kaplan's original overestimation.
Kaplan's model-scaling exponent of C^0.73 differs substantially from Chinchilla's C^0.5. I used to attribute the gap to differences in optimization strategy or datasets, but the explanation here is that Kaplan excluded embeddings from the parameter count, and that choice, combined with experimenting at small scale, produced the discrepancy.
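A toy simulation of the claimed mechanism (my own illustrative numbers and model-shape assumptions, not the authors' setup): suppose the compute-optimal total parameter count truly follows N ∝ C^0.5, then refit the law on non-embedding parameters in a small-compute regime where embeddings are a large fraction of the model.

```python
import numpy as np

VOCAB = 32_000   # assumed tied-embedding vocabulary size
ASPECT = 128     # assumed width/depth ratio: depth L = d / 128

def n_total(d):
    """Total parameters at width d: 12*L*d^2 transformer body plus vocab*d embedding."""
    return 12 * (d / ASPECT) * d**2 + VOCAB * d

def width_for(n):
    """Invert n_total (monotone in d) by bisection."""
    lo, hi = 1.0, 1e6
    for _ in range(100):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if n_total(mid) < n else (lo, mid)
    return lo

# True optimum by construction: D = 20 N, C = 6 N D  =>  N_total = sqrt(C / 120).
compute = np.logspace(17, 19, 20)               # a small-scale compute regime (FLOPs)
ntot = np.sqrt(compute / 120)
widths = np.array([width_for(n) for n in ntot])
nne = ntot - VOCAB * widths                     # Kaplan-style non-embedding count

slope_total = np.polyfit(np.log(compute), np.log(ntot), 1)[0]
slope_ne = np.polyfit(np.log(compute), np.log(nne), 1)[0]
print(f"total params:         N ∝ C^{slope_total:.2f}")  # 0.50 by construction
print(f"non-embedding params: N ∝ C^{slope_ne:.2f}")     # noticeably steeper
```

Under these assumptions the non-embedding fit lands around C^0.65, biased upward toward Kaplan's 0.73 purely by the choice of parameter count at small scale.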
If you go looking for differences between the two studies you can find as many as you like, so it is striking that in the end they converge to similar scaling laws.
#scaling-law
Probing the Decision Boundaries of In-context Learning in Large Language Models
(Siyan Zhao, Tung Nguyen, Aditya Grover)
In-context learning is a key paradigm in large language models (LLMs) that enables them to generalize to new tasks and domains by simply prompting these models with a few exemplars without explicit parameter updates. Many attempts have been made to understand in-context learning in LLMs as a function of model scale, pretraining data, and other factors. In this work, we propose a new mechanism to probe and understand in-context learning from the lens of decision boundaries for in-context binary classification. Decision boundaries are straightforward to visualize and provide important information about the qualitative behavior of the inductive biases of standard classifiers. To our surprise, we find that the decision boundaries learned by current LLMs in simple binary classification tasks are often irregular and non-smooth, regardless of linear separability in the underlying task. This paper investigates the factors influencing these decision boundaries and explores methods to enhance their generalizability. We assess various approaches, including training-free and fine-tuning methods for LLMs, the impact of model architecture, and the effectiveness of active prompting techniques for smoothing decision boundaries in a data-efficient manner. Our findings provide a deeper understanding of in-context learning dynamics and offer practical improvements for enhancing robustness and generalizability of in-context learning.
An experiment that runs classification on toy tasks via in-context learning and estimates the resulting decision boundaries. By default, the decision boundaries are messy. However, fine-tuning on classification tasks yields cleaner decision boundaries even on other tasks. Fun result.
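A rough sketch of how such probing can be set up (the prompt format and grid are my assumptions, not necessarily the paper's exact protocol): serialize a few labeled 2D points as in-context exemplars, query the model at every grid point, and contour the predictions to see how smooth the induced boundary is.

```python
import numpy as np

def build_prompt(train_xy, train_labels, query_xy):
    """Serialize labeled 2D points as in-context exemplars plus one query point."""
    lines = ["Classify each point as 0 or 1."]
    for (x, y), c in zip(train_xy, train_labels):
        lines.append(f"Input: ({x:.2f}, {y:.2f}) Label: {c}")
    lines.append(f"Input: ({query_xy[0]:.2f}, {query_xy[1]:.2f}) Label:")
    return "\n".join(lines)

def decision_map(llm_label, train_xy, train_labels, grid_size=50):
    """llm_label: any callable prompt -> '0' or '1' (the LLM under test).
    Returns a grid of predictions; contouring it reveals the decision boundary."""
    xs = np.linspace(-3, 3, grid_size)
    grid = np.zeros((grid_size, grid_size), dtype=int)
    for i, x in enumerate(xs):          # grid_size**2 LLM calls, so keep it small
        for j, y in enumerate(xs):
            grid[j, i] = int(llm_label(build_prompt(train_xy, train_labels, (x, y))))
    return xs, grid
```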
#in-context-learning
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
(Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai Charitha Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, Austin Wang, Rob Fergus, Yann LeCun, Saining Xie)
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach. While stronger language models can enhance multimodal capabilities, the design choices for vision components are often insufficiently explored and disconnected from visual representation learning research. This gap hinders accurate sensory grounding in real-world scenarios. Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations, offering new insights into different models and architectures -- self-supervised, strongly supervised, or combinations thereof -- based on experiments with over 20 vision encoders. We critically examine existing MLLM benchmarks, addressing the difficulties involved in consolidating and interpreting results from various tasks, and introduce a new vision-centric benchmark, CV-Bench. To further improve visual grounding, we propose the Spatial Vision Aggregator (SVA), a dynamic and spatially-aware connector that integrates high-resolution vision features with LLMs while reducing the number of tokens. Additionally, we discuss the curation of high-quality visual instruction-tuning data from publicly available sources, emphasizing the importance of data source balancing and distribution ratio. Collectively, Cambrian-1 not only achieves state-of-the-art performance but also serves as a comprehensive, open cookbook for instruction-tuned MLLMs. We provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes. We hope our release will inspire and accelerate advancements in multimodal systems and visual representation learning.
A comparative analysis of the components of multimodal models: image encoders, connectors, and so on. The wonderfully old-school image on page 5 is a welcome sight. And it seems that, at NYU, even Professor LeCun ends up listed as an author on LLM papers.
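For intuition on the connector side, a hedged sketch in the spirit of the paper's SVA (the dimensions, window size, and single-encoder simplification are my assumptions; the actual SVA aggregates multiple vision encoders and interacts with several LLM layers): a coarse grid of learnable queries, each cross-attending only to its spatially corresponding window of high-resolution vision features, which cuts the token count while keeping spatial grounding.

```python
import torch
import torch.nn as nn

class SpatialAggregator(nn.Module):
    def __init__(self, vis_dim=1024, llm_dim=4096, out_grid=12, in_grid=24):
        super().__init__()
        assert in_grid % out_grid == 0
        self.window = in_grid // out_grid        # 2x2 feature window per query here
        self.out_grid, self.in_grid = out_grid, in_grid
        self.queries = nn.Parameter(torch.randn(out_grid * out_grid, llm_dim))
        self.attn = nn.MultiheadAttention(llm_dim, num_heads=8, batch_first=True,
                                          kdim=vis_dim, vdim=vis_dim)

    def forward(self, feats):                    # feats: (B, in_grid*in_grid, vis_dim)
        B = feats.size(0)
        g, w = self.in_grid, self.window
        # regroup the token sequence into out_grid**2 local windows of w*w tokens
        feats = feats.view(B, g // w, w, g // w, w, -1)
        feats = feats.permute(0, 1, 3, 2, 4, 5).reshape(B * self.out_grid**2, w * w, -1)
        q = self.queries.unsqueeze(1).repeat(B, 1, 1)      # one query per window
        out, _ = self.attn(q, feats, feats)
        return out.reshape(B, self.out_grid**2, -1)        # (B, 144, llm_dim) for the LLM
```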
Personally, I think multimodal research from scratch, or close to it, would also be an interesting direction. Given the stark gap in compute requirements, though, it will not be easy.
#vision-language #multimodal