January 2, 2025
InfAlign: Inference-aware language model alignment
(Ananth Balashankar, Ziteng Sun, Jonathan Berant, Jacob Eisenstein, Michael Collins, Adrian Hutter, Jong Lee, Chirag Nagpal, Flavien Prost, Aradhana Sinha, Ananda Theertha Suresh, Ahmad Beirami)
Language model alignment has become a critical step in training modern generative language models. The goal of alignment is to finetune a reference model such that the win rate of a sample from the aligned model over a sample from the reference model is high, subject to a KL divergence constraint. Today, we are increasingly using inference-time algorithms (e.g., Best-of-N, controlled decoding, tree search) to decode from language models rather than standard sampling. However, the alignment objective does not capture such inference-time decoding procedures. We show that the existing alignment framework is sub-optimal in view of such inference-time methods. We then modify the alignment objective and propose a framework for inference-aware alignment (IAPO). We prove that for any inference-time decoding algorithm, the optimal solution that optimizes the inference-time win rate of the aligned policy against the reference policy is the solution to the typical RLHF problem with a transformation of the reward. This motivates us to provide the KL-regularized calibrate-and-transform RL (CTRL) algorithm to solve this problem, which involves a reward calibration step and a KL-regularized reward maximization step with a transformation of the calibrated reward. We particularize our study to two important inference-time strategies: best-of-N sampling and best-of-N jailbreaking, where N responses are sampled from the model and the one with the highest or lowest reward is selected. We propose specific transformations for these strategies and demonstrate that our framework offers significant improvements over existing state-of-the-art methods for language model alignment. Empirically, we outperform baselines that are designed without taking inference-time decoding into consideration by 8-12% and 4-9% on inference-time win rates over the Anthropic helpfulness and harmlessness dialog benchmark datasets.
An RL method that maximizes the win rate under inference-time procedures such as Best-of-N. It consists of a reward calibration step followed by an exponential transformation of the calibrated reward.
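A minimal sketch of the calibrate-and-transform idea for the Best-of-N case, assuming empirical-CDF calibration against reference-policy samples and a simple exponential transform with temperature `t`; the helper names and the exact form of the transform are illustrative, not taken from the paper.

```python
import numpy as np

def calibrated_reward(reward, reference_rewards):
    """Calibrate a raw reward to its quantile (empirical CDF) among rewards of
    samples drawn from the reference policy for the same prompt."""
    reference_rewards = np.asarray(reference_rewards)
    return float(np.mean(reference_rewards <= reward))

def bon_transformed_reward(reward, reference_rewards, t=4.0):
    """Exponentially transform the calibrated reward so that KL-regularized RL
    targets the Best-of-N win rate rather than the plain win rate.
    The exponential form and the temperature `t` are illustrative choices."""
    c = calibrated_reward(reward, reference_rewards)
    return float(np.exp(t * c))

# Usage: per prompt, sample a few responses from the reference policy to
# estimate the calibration CDF, then plug the transformed reward into a
# standard KL-regularized RLHF objective.
ref_samples = [0.1, 0.4, 0.2, 0.8, 0.5]   # hypothetical reference-policy rewards
print(bon_transformed_reward(0.6, ref_samples, t=4.0))
```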
#alignment #test-time-compute #rl
Bootstrap Your Own Context Length
(Liang Wang, Nan Yang, Xingxing Zhang, Xiaolong Huang, Furu Wei)
We introduce a bootstrapping approach to train long-context language models by exploiting their short-context capabilities only. Our method utilizes a simple agent workflow to synthesize diverse long-context instruction tuning data, thereby eliminating the necessity for manual data collection and annotation. The proposed data synthesis workflow requires only a short-context language model, a text retriever, and a document collection, all of which are readily accessible within the open-source ecosystem. Subsequently, language models are fine-tuned using the synthesized data to extend their context lengths. In this manner, we effectively transfer the short-context capabilities of language models to long-context scenarios through a bootstrapping process. We conduct experiments with the open-source Llama-3 family of models and demonstrate that our method can successfully extend the context length to up to 1M tokens, achieving superior performance across various benchmarks.
Long-context instruction tuning. The training data is synthesized in a sequence of steps: prompt expansion, retrieval of relevant documents, summarization of the retrieved content with respect to the prompt, and generation of the final response from the prompt and the summaries.
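A rough sketch of that synthesis workflow, where `short_lm` and `retriever` are hypothetical stand-ins for a short-context LLM call and a text retriever, and the prompt templates are made up for illustration:

```python
def synthesize_long_context_example(seed_prompt, short_lm, retriever, k=20):
    # 1. Expand the seed prompt into several related sub-queries.
    queries = short_lm(f"Rewrite this request as 5 diverse search queries:\n{seed_prompt}")

    # 2. Retrieve documents relevant to each sub-query from the corpus.
    documents = []
    for q in queries.splitlines():
        documents.extend(retriever.search(q, top_k=k))

    # 3. Summarize each retrieved document with respect to the prompt,
    #    staying within the short model's context window.
    summaries = [
        short_lm(f"Summarize the parts of this document relevant to "
                 f"'{seed_prompt}':\n{doc}")
        for doc in documents
    ]

    # 4. Generate the target response from the prompt plus the summaries.
    answer = short_lm(f"Question: {seed_prompt}\nNotes:\n" + "\n".join(summaries))

    # The training example pairs the full concatenated documents (long input)
    # with the synthesized answer (target), so fine-tuning on it extends the
    # usable context length of the model.
    long_input = seed_prompt + "\n\n" + "\n\n".join(documents)
    return long_input, answer
```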
#long-context #instruction-tuning
On the Expressiveness and Length Generalization of Selective State-Space Models on Regular Languages
(Aleksandar Terzić, Michael Hersche, Giacomo Camposampiero, Thomas Hofmann, Abu Sebastian, Abbas Rahimi)
Selective state-space models (SSMs) are an emerging alternative to the Transformer, offering the unique advantage of parallel training and sequential inference. Although these models have shown promising performance on a variety of tasks, their formal expressiveness and length generalization properties remain underexplored. In this work, we provide insight into the workings of selective SSMs by analyzing their expressiveness and length generalization performance on regular language tasks, i.e., finite-state automaton (FSA) emulation. We address certain limitations of modern SSM-based architectures by introducing the Selective Dense State-Space Model (SD-SSM), the first selective SSM that exhibits perfect length generalization on a set of various regular language tasks using a single layer. It utilizes a dictionary of dense transition matrices, a softmax selection mechanism that creates a convex combination of dictionary matrices at each time step, and a readout consisting of layer normalization followed by a linear map. We then proceed to evaluate variants of diagonal selective SSMs by considering their empirical performance on commutative and non-commutative automata. We explain the experimental results with theoretical considerations. Our code is available at https://github.com/IBM/selective-dense-state-space-model.
A study of the expressive power of state-space models (https://arxiv.org/abs/2404.08819, https://arxiv.org/abs/2405.17394). The approach here is to make the transition matrix input-dependent, so it differs at every step.
RWKV-7 has recently been discussing a similar problem, though it takes the test-time-training route instead (https://github.com/BlinkDL/RWKV-LM).
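A minimal single-layer sketch of the SD-SSM recurrence as described in the abstract (a dictionary of dense transition matrices, per-step softmax mixing into a convex combination, and a LayerNorm + linear readout); hyperparameters and initialization are illustrative assumptions, not the paper's code:

```python
import torch
import torch.nn as nn

class SDSSM(nn.Module):
    """Sketch of a Selective Dense State-Space Model layer: a dictionary of
    dense transition matrices is mixed by a per-step softmax, and the final
    state is read out through LayerNorm followed by a linear map."""

    def __init__(self, d_in, d_state, n_dict, d_out):
        super().__init__()
        self.A_dict = nn.Parameter(torch.randn(n_dict, d_state, d_state) / d_state**0.5)
        self.select = nn.Linear(d_in, n_dict)   # logits for softmax selection
        self.inp = nn.Linear(d_in, d_state)     # input-to-state map
        self.norm = nn.LayerNorm(d_state)
        self.readout = nn.Linear(d_state, d_out)

    def forward(self, x):                       # x: (batch, time, d_in)
        h = torch.zeros(x.size(0), self.A_dict.size(1), device=x.device)
        for t in range(x.size(1)):
            w = torch.softmax(self.select(x[:, t]), dim=-1)        # (B, n_dict)
            A_t = torch.einsum('bn,nij->bij', w, self.A_dict)      # convex combination
            # Recurrence with a dense, input-dependent transition matrix:
            # h_t = A_t h_{t-1} + B x_t
            h = torch.einsum('bij,bj->bi', A_t, h) + self.inp(x[:, t])
        return self.readout(self.norm(h))

# Usage: SDSSM(d_in=8, d_state=16, n_dict=4, d_out=2)(torch.randn(2, 10, 8))  # -> (2, 2)
```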
#state-space-model
Multi-matrix Factorization Attention
(Jingcheng Hu, Houyi Li, Yinmin Zhang, Zili Wang, Shuigeng Zhou, Xiangyu Zhang, Heung-Yeung Shum)
We propose novel attention architectures, Multi-matrix Factorization Attention (MFA) and MFA-Key-Reuse (MFA-KR). Existing variants for standard Multi-Head Attention (MHA), including SOTA methods like MLA, fail to maintain as strong performance under stringent Key-Value cache (KV cache) constraints. MFA enhances model capacity by efficiently scaling up both the number and dimension of attention heads through low-rank matrix factorization in the Query-Key (QK) circuit. Extending MFA, MFA-KR further reduces memory requirements by repurposing the key cache as value through value projection re-parameterization. MFA's design enables strong model capacity when working under tight KV cache budget, while MFA-KR is suitable for even harsher KV cache limits with minor performance trade-off. Notably, in our extensive and large-scale experiments, the proposed architecture outperforms MLA and performs comparably to MHA, while reducing KV cache usage by up to 56% and 93.7%, respectively.
Thanks to DeepSeek-V3, MLA seems to be drawing a bit more attention. This method looks like MLA with the key and value up-projection matrices merged into one.
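A hedged sketch of how MFA might look based on the abstract: queries are built from a shared low-rank down-projection followed by per-head up-projections, while a single shared key/value head (MQA-style) keeps the KV cache small; MFA-KR would additionally re-parameterize the value projection to reuse the key cache. The sharing pattern and shapes here are my assumptions, not the paper's code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MFAttention(nn.Module):
    """Sketch of Multi-matrix Factorization Attention: low-rank factorized
    queries with many heads, plus a single shared key/value head to cache."""

    def __init__(self, d_model, n_heads, d_head, r):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_head
        self.q_down = nn.Linear(d_model, r, bias=False)          # shared low-rank factor
        self.q_up = nn.Linear(r, n_heads * d_head, bias=False)   # per-head factors
        self.k = nn.Linear(d_model, d_head, bias=False)          # single shared key head
        self.v = nn.Linear(d_model, d_head, bias=False)          # single shared value head
        self.out = nn.Linear(n_heads * d_head, d_model, bias=False)

    def forward(self, x):                                        # x: (B, T, d_model)
        B, T, _ = x.shape
        q = self.q_up(self.q_down(x)).view(B, T, self.n_heads, self.d_head)
        k, v = self.k(x), self.v(x)                              # only k, v need caching
        att = torch.einsum('bthd,bsd->bhts', q, k) / self.d_head**0.5
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), 1)
        att = F.softmax(att.masked_fill(causal, float('-inf')), dim=-1)
        o = torch.einsum('bhts,bsd->bthd', att, v).reshape(B, T, -1)
        return self.out(o)
```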
#attention #efficiency