October 25, 2024
A Little Help Goes a Long Way: Efficient LLM Training by Leveraging Small LMs
(Ankit Singh Rawat, Veeranjaneyulu Sadhanala, Afshin Rostamizadeh, Ayan Chakrabarti, Wittawat Jitkrittum, Vladimir Feinberg, Seungyeon Kim, Hrayr Harutyunyan, Nikunj Saunshi, Zachary Nado, Rakesh Shivanna, Sashank J. Reddi, Aditya Krishna Menon, Rohan Anil, Sanjiv Kumar)
A primary challenge in large language model (LLM) development is their onerous pre-training cost. Typically, such pre-training involves optimizing a self-supervised objective (such as next-token prediction) over a large corpus. This paper explores a promising paradigm to improve LLM pre-training efficiency and quality by suitably leveraging a small language model (SLM). In particular, this paradigm relies on an SLM to both (1) provide soft labels as additional training supervision, and (2) select a small subset of valuable ("informative" and "hard") training examples. Put together, this enables an effective transfer of the SLM's predictive distribution to the LLM, while prioritizing specific regions of the training data distribution. Empirically, this leads to reduced LLM training time compared to standard training, while improving the overall quality. Theoretically, we develop a statistical framework to systematically study the utility of SLMs in enabling efficient training of high-quality LLMs. In particular, our framework characterizes how the SLM's seemingly low-quality supervision can enhance the training of a much more capable LLM. Furthermore, it also highlights the need for an adaptive utilization of such supervision, by striking a balance between the bias and variance introduced by the SLM-provided soft labels. We corroborate our theoretical framework by improving the pre-training of an LLM with 2.8B parameters by utilizing a smaller LM with 1.5B parameters on the Pile dataset.
This method begins by distilling from a smaller model early in training before transitioning to standard training. The idea is that distillation from the small model prioritizes what is easy to learn, so the harder tokens can be deferred to later stages of training. They also propose a filtering scheme that, among the samples where the small model's loss is high, keeps the tokens that are still predictable. Overall, this can be thought of in relation to curriculum learning and learnability.
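As a rough illustration of the soft-label part of this recipe (the SLM-based data selection is omitted), here is a minimal sketch assuming `llm` and `slm` map token ids to per-token logits of shape (B, L, V); the function names and hyperparameters (`distill_steps`, `alpha`, `temperature`) are illustrative, not the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def pretrain_step(llm, slm, input_ids, labels, step,
                  distill_steps=10_000, alpha=0.5, temperature=1.0):
    """One pre-training step: blend in soft labels from the small LM for the first
    `distill_steps` steps, then fall back to plain next-token prediction.
    Illustrative sketch, not the paper's exact recipe."""
    logits = llm(input_ids)                                   # (B, L, V)
    ce = F.cross_entropy(logits.flatten(0, 1), labels.flatten(), ignore_index=-100)

    if step >= distill_steps:
        return ce                                             # standard training phase

    with torch.no_grad():                                     # small LM provides soft labels
        teacher_logits = slm(input_ids)
    kd = F.kl_div(
        F.log_softmax(logits.flatten(0, 1) / temperature, dim=-1),
        F.softmax(teacher_logits.flatten(0, 1) / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return alpha * kd + (1 - alpha) * ce                      # early distillation phase
```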
#distillation
Stable Consistency Tuning: Understanding and Improving Consistency Models
(Fu-Yun Wang, Zhengyang Geng, Hongsheng Li)
Diffusion models achieve superior generation quality but suffer from slow generation speed due to the iterative nature of denoising. In contrast, consistency models, a new generative family, achieve competitive performance with significantly faster sampling. These models are trained either through consistency distillation, which leverages pretrained diffusion models, or consistency training/tuning directly from raw data. In this work, we propose a novel framework for understanding consistency models by modeling the denoising process of the diffusion model as a Markov Decision Process (MDP) and framing consistency model training as the value estimation through Temporal Difference (TD) Learning. More importantly, this framework allows us to analyze the limitations of current consistency training/tuning strategies. Built upon Easy Consistency Tuning (ECT), we propose Stable Consistency Tuning (SCT), which incorporates variance-reduced learning using the score identity. SCT leads to significant performance improvements on benchmarks such as CIFAR-10 and ImageNet-64. On ImageNet-64, SCT achieves 1-step FID 2.42 and 2-step FID 1.55, a new SoTA for consistency models.
A technique for stabilizing consistency model training. Rather than a single method, it's a combination of several improvements, similar to phased training (https://arxiv.org/abs/2405.18407). Results on improving the stability of consistency models have been coming out one after another lately.
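For context, the baseline objective SCT builds on can be written as a TD-style bootstrapping step. The sketch below shows generic consistency training/tuning, not SCT itself (the variance-reduced score-identity target is omitted); `f` and `f_ema` are assumed callables mapping (noisy sample, noise level) to a predicted clean sample, and `sigmas` is an increasing tensor of noise levels.

```python
import torch
import torch.nn.functional as F

def consistency_training_step(f, f_ema, x0, sigmas, i):
    """Generic consistency training step (not SCT): push the model's output at the
    noisier point toward a stop-gradient target built from the adjacent, less noisy
    point on the same trajectory, i.e. value estimation via TD-style bootstrapping."""
    eps = torch.randn_like(x0)
    x_hi = x0 + sigmas[i + 1] * eps              # noisier point on the trajectory
    x_lo = x0 + sigmas[i] * eps                  # adjacent, less noisy point
    pred = f(x_hi, sigmas[i + 1])
    with torch.no_grad():
        target = f_ema(x_lo, sigmas[i])          # bootstrapped (EMA) target
    return F.mse_loss(pred, target)
```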
#diffusion
Why Does the Effective Context Length of LLMs Fall Short?
(Chenxin An, Jun Zhang, Ming Zhong, Lei Li, Shansan Gong, Yao Luo, Jingjing Xu, Lingpeng Kong)
Advancements in distributed training and efficient attention mechanisms have significantly expanded the context window sizes of large language models (LLMs). However, recent work reveals that the effective context lengths of open-source LLMs often fall short, typically not exceeding half of their training lengths. In this work, we attribute this limitation to the left-skewed frequency distribution of relative positions formed in LLMs pretraining and post-training stages, which impedes their ability to effectively gather distant information. To address this challenge, we introduce ShifTed Rotary position embeddING (STRING). STRING shifts well-trained positions to overwrite the original ineffective positions during inference, enhancing performance within their existing training lengths. Experimental results show that without additional training, STRING dramatically improves the performance of the latest large-scale models, such as Llama3.1 70B and Qwen2 72B, by over 10 points on popular long-context benchmarks RULER and InfiniteBench, establishing new state-of-the-art results for open-source LLMs. Compared to commercial models, Llama 3.1 70B with STRING even achieves better performance than GPT-4-128K and clearly surpasses Claude 2 and Kimi-chat.
The conjecture is that, in terms of relative distances, large offsets appear far less frequently during training than small ones, which makes long-range relationships hard to learn. The authors address this by modeling nearby distances with sliding window attention and reassigning the positions of distant tokens. This reminds me of ReRoPE (https://github.com/bojone/rerope). It makes me wonder whether there is a good position encoding setup that could be used from the pretraining stage onward.
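A rough sketch of the general idea of reassigning distant positions, in the spirit of ReRoPE/STRING rather than the paper's exact rule: keep true relative distances inside a local window, and remap larger (rarely seen) offsets into a narrow, well-trained band. The `window` and `k` values here are placeholders.

```python
import torch

def remapped_rel_positions(seq_len, window=1024, k=8):
    """Illustrative position remapping (not the exact STRING formula): relative
    distances within `window` are kept as-is, while larger distances are compressed
    so attention reuses offsets the model saw frequently during training."""
    pos = torch.arange(seq_len)
    rel = (pos[:, None] - pos[None, :]).clamp(min=0)   # causal relative distances
    compressed = window + (rel - window) // k          # squeeze rare large offsets
    return torch.where(rel > window, compressed, rel)
```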
#long-context
Weak-to-Strong Preference Optimization: Stealing Reward from Weak Aligned Model
(Wenhong Zhu, Zhiwei He, Xiaofeng Wang, Pengfei Liu, Rui Wang)
Aligning language models (LMs) with human preferences has become a key area of research, enabling these models to meet diverse user needs better. Inspired by weak-to-strong generalization, where a strong LM fine-tuned on labels generated by a weaker model can consistently outperform its weak supervisor, we extend this idea to model alignment. In this work, we observe that the alignment behavior in weaker models can be effectively transferred to stronger models and even exhibit an amplification effect. Based on this insight, we propose a method called Weak-to-Strong Preference Optimization (WSPO), which achieves strong model alignment by learning the distribution differences before and after the alignment of the weak model. Experiments demonstrate that WSPO delivers outstanding performance, improving the win rate of Qwen2-7B-Instruct on Arena-Hard from 39.70 to 49.60, achieving a remarkable 47.04 length-controlled win rate on AlpacaEval 2, and scoring 7.33 on MT-bench. Our results suggest that using the weak model to elicit a strong model with a high alignment ability is feasible.
This method aligns a model by matching the likelihood ratio between the model being aligned and its reference model to the likelihood ratio of the small model before and after alignment. It can be viewed as the idea that, even for the large model, the likelihood ratio shouldn't deviate too far from what alignment did to the small model.
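A hedged sketch of that core idea, operating on per-sequence log-likelihoods of a response y given a prompt x (the exact WSPO objective and its per-token weighting differ; `beta` and the squared-error form are illustrative).

```python
import torch
import torch.nn.functional as F

def wspo_style_loss(logp_strong, logp_strong_ref,
                    logp_weak_aligned, logp_weak_ref, beta=1.0):
    """Illustrative weak-to-strong ratio matching: make the strong model's
    log-likelihood ratio against its reference track how much alignment moved
    the weak model. Inputs are log p(y|x) under each model."""
    strong_ratio = logp_strong - logp_strong_ref       # shift of the strong model
    weak_ratio = logp_weak_aligned - logp_weak_ref     # shift alignment gave the weak model
    return F.mse_loss(beta * strong_ratio, weak_ratio.detach())
```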
#alignment
Unbounded: A Generative Infinite Game of Character Life Simulation
(Jialu Li, Yuanzhen Li, Neal Wadhwa, Yael Pritch, David E. Jacobs, Michael Rubinstein, Mohit Bansal, Nataniel Ruiz)
We introduce the concept of a generative infinite game, a video game that transcends the traditional boundaries of finite, hard-coded systems by using generative models. Inspired by James P. Carse's distinction between finite and infinite games, we leverage recent advances in generative AI to create Unbounded: a game of character life simulation that is fully encapsulated in generative models. Specifically, Unbounded draws inspiration from sandbox life simulations and allows you to interact with your autonomous virtual character in a virtual world by feeding, playing with and guiding it - with open-ended mechanics generated by an LLM, some of which can be emergent. In order to develop Unbounded, we propose technical innovations in both the LLM and visual generation domains. Specifically, we present: (1) a specialized, distilled large language model (LLM) that dynamically generates game mechanics, narratives, and character interactions in real-time, and (2) a new dynamic regional image prompt Adapter (IP-Adapter) for vision models that ensures consistent yet flexible visual generation of a character across multiple environments. We evaluate our system through both qualitative and quantitative analysis, showing significant improvements in character life simulation, user instruction following, narrative coherence, and visual consistency for both characters and the environments compared to traditional related approaches.
Building a game that allows free-form interaction through an LLM. Notably, they put a lot of effort into generating scene images that match each situation.
This reminds me of AI Dungeon, which was one of the most impressive early applications of LLMs.
#llm