May 30, 2024
Self-Exploring Language Models: Active Preference Elicitation for Online Alignment
(Shenao Zhang, Donghan Yu, Hiteshi Sharma, Ziyi Yang, Shuohang Wang, Hany Hassan, Zhaoran Wang)
Preference optimization, particularly through Reinforcement Learning from Human Feedback (RLHF), has achieved significant success in aligning Large Language Models (LLMs) to adhere to human intentions. Unlike offline alignment with a fixed dataset, online feedback collection from humans or AI on model generations typically leads to more capable reward models and better-aligned LLMs through an iterative process. However, achieving a globally accurate reward model requires systematic exploration to generate diverse responses that span the vast space of natural language. Random sampling from standard reward-maximizing LLMs alone is insufficient to fulfill this requirement. To address this issue, we propose a bilevel objective optimistically biased towards potentially high-reward responses to actively explore out-of-distribution regions. By solving the inner-level problem with the reparameterized reward function, the resulting algorithm, named Self-Exploring Language Models (SELM), eliminates the need for a separate RM and iteratively updates the LLM with a straightforward objective. Compared to Direct Preference Optimization (DPO), the SELM objective reduces indiscriminate favor of unseen extrapolations and enhances exploration efficiency. Our experimental results demonstrate that when finetuned on Zephyr-7B-SFT and Llama-3-8B-Instruct models, SELM significantly boosts the performance on instruction-following benchmarks such as MT-Bench and AlpacaEval 2.0, as well as various standard academic benchmarks in different settings. Our code and models are available at https://github.com/shenao-zhang/SELM.
The idea: in an iterative online RLHF scenario, the reward model assigns low rewards in regions where it is uncertain, so exploration of those regions is suppressed. The proposal is to instead assign high rewards in those regions (optimism) to encourage exploration.
Interesting to see optimism show up in a field where pessimism has usually been the default.
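To make this concrete, below is a minimal sketch of what a DPO loss with an optimism bonus could look like. The `alpha` term that raises the implicit reward of the chosen response is my reading of the optimism bias, and the function name and signature are illustrative; see the paper and the linked repository for the exact objective.

```python
import torch.nn.functional as F

def selm_loss(logp_chosen, logp_rejected,
              ref_logp_chosen, ref_logp_rejected,
              beta=0.1, alpha=1e-3):
    """DPO loss plus an optimism bonus (sketch, not the official objective).

    Inputs are summed token log-probs of each response under the policy
    and the frozen reference model.
    """
    # Implicit rewards under the DPO reparameterization:
    # r(x, y) = beta * log(pi_theta(y|x) / pi_ref(y|x)).
    r_chosen = beta * (logp_chosen - ref_logp_chosen)
    r_rejected = beta * (logp_rejected - ref_logp_rejected)
    # Standard DPO term: prefer the chosen over the rejected response.
    dpo = -F.logsigmoid(r_chosen - r_rejected)
    # Optimism bonus (assumption): also push up the implicit reward of the
    # chosen response, so the next round of sampling is biased toward
    # potentially high-reward, under-explored responses.
    return (dpo - alpha * r_chosen).mean()
```

Each online iteration then samples fresh responses from the updated policy, collects preference feedback on them, and repeats.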
#rlhf #alignment
MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series
(Ge Zhang, Scott Qu, Jiaheng Liu, Chenchen Zhang, Chenghua Lin, Chou Leuang Yu, Danny Pan, Esther Cheng, Jie Liu, Qunshu Lin, Raven Yuan, Tuney Zheng, Wei Pang, Xinrun Du, Yiming Liang, Yinghao Ma, Yizhi Li, Ziyang Ma, Bill Lin, Emmanouil Benetos, Huan Yang, Junting Zhou, Kaijing Ma, Minghao Liu, Morry Niu, Noah Wang, Quehry Que, Ruibo Liu, Sine Liu, Shawn Guo, Soren Gao, Wangchunshu Zhou, Xinyue Zhang, Yizhi Zhou, Yubo Wang, Yuelin Bai, Yuhan Zhang, Yuxiang Zhang, Zenith Wang, Zhenzhu Yang, Zijian Zhao, Jiajun Zhang, Wanli Ouyang, Wenhao Huang, Wenhu Chen)
Large Language Models (LLMs) have made great strides in recent years to achieve unprecedented performance across different tasks. However, due to commercial interest, the most competitive models like GPT, Gemini, and Claude have been gated behind proprietary interfaces without disclosing the training details. Recently, many institutions have open-sourced several strong LLMs like LLaMA-3, comparable to existing closed-source LLMs. However, only the model's weights are provided with most details (e.g., intermediate checkpoints, pre-training corpus, and training code, etc.) being undisclosed. To improve the transparency of LLMs, the research community has formed to open-source truly open LLMs (e.g., Pythia, Amber, OLMo), where more details (e.g., pre-training corpus and training code) are being provided. These models have greatly advanced the scientific study of these large models including their strengths, weaknesses, biases and risks. However, we observe that the existing truly open LLMs on reasoning, knowledge, and coding tasks are still inferior to existing state-of-the-art LLMs with similar model sizes. To this end, we open-source MAP-Neo, a highly capable and transparent bilingual language model with 7B parameters trained from scratch on 4.5T high-quality tokens. Our MAP-Neo is the first fully open-sourced bilingual LLM with comparable performance compared to existing state-of-the-art LLMs. Moreover, we open-source all details to reproduce our MAP-Neo, where the cleaned pre-training corpus, data cleaning pipeline, checkpoints, and well-optimized training/evaluation framework are provided. Finally, we hope our MAP-Neo will enhance and strengthen the open research community and inspire more innovations and creativities to facilitate the further improvements of LLMs.
A technical report on MAP-Neo, which I covered a while ago. Since the dataset has already been released, the description of how it was built is detailed. Given that people from 01.AI participated, it may be related to the data processing pipeline behind the Yi models, but there is no way to tell from the report.
#llm #dataset
Contextual Position Encoding: Learning to Count What's Important
(Olga Golovneva, Tianlu Wang, Jason Weston, Sainbayar Sukhbaatar)
The attention mechanism is a critical component of Large Language Models (LLMs) that allows tokens in a sequence to interact with each other, but is order-invariant. Incorporating position encoding (PE) makes it possible to address by position, such as attending to the i-th token. However, current PE methods use token counts to derive position, and thus cannot generalize to higher levels of abstraction, such as attending to the i-th sentence. In this paper, we propose a new position encoding method, Contextual Position Encoding (CoPE), that allows positions to be conditioned on context by incrementing position only on certain tokens determined by the model. This allows more general position addressing such as attending to the i-th particular word, noun, or sentence. We show that CoPE can solve the selective copy, counting and Flip-Flop tasks where popular position embeddings fail, and improves perplexity on language modeling and coding tasks.
A positional encoding that applies a sigmoid to the QK matrix, takes a cumulative sum, and adds the result as an attention bias. Deriving the positional encoding from the QK matrix is also similar to the recent CAPE (https://arxiv.org/abs/2405.14722). A sketch of the mechanism follows below.
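Below is a minimal, readability-first sketch of single-head CoPE attention based on my reading of the paper. Note that the official formulation computes the scalar products q_i·e[p] once per integer position p and interpolates those scalars, which is much cheaper than materializing the interpolated embeddings as done here; all names are illustrative.

```python
import torch

def cope_attention(q, k, v, pos_emb):
    """Contextual Position Encoding, naive version (sketch).

    q, k, v: (B, T, D); pos_emb: (P, D) learned position embeddings.
    """
    B, T, D = q.shape
    P = pos_emb.size(0)
    causal = torch.tril(torch.ones(T, T, dtype=torch.bool, device=q.device))

    scores = (q @ k.transpose(-1, -2)) / D ** 0.5        # (B, T, T) QK logits
    # Gate each (query i, key j) pair into [0, 1]; future keys contribute 0.
    gates = torch.sigmoid(scores) * causal
    # Contextual position p_ij = sum of gates over keys j..i,
    # i.e. a reversed cumulative sum along the key axis.
    pos = gates.flip(-1).cumsum(-1).flip(-1).clamp(max=P - 1)
    # Positions are fractional, so interpolate neighboring embeddings.
    lo = pos.floor().long()
    hi = (lo + 1).clamp(max=P - 1)
    w = (pos - lo.float()).unsqueeze(-1)                 # (B, T, T, 1)
    e = (1 - w) * pos_emb[lo] + w * pos_emb[hi]          # (B, T, T, D)
    # The bias q_i . e[p_ij] is added to the ordinary attention logits.
    bias = torch.einsum('bid,bijd->bij', q, e)
    attn = (scores + bias).masked_fill(~causal, float('-inf')).softmax(-1)
    return attn @ v
```

Because the gates depend on the content of q and k, the model can learn to increment positions only on, say, sentence boundaries, which is what enables addressing "the i-th sentence".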
Many of the transformer's problems trace back to positional encoding (https://arxiv.org/abs/2405.17399). The catch is that alternative positional encodings tend to be either specialized to particular problems or hard to compute efficiently.
#positional-encoding