April 29, 2024

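Snowflake Arctic Cookbook Series

Snowflake has published a post on pretraining, which is probably the most interesting topic these days. In short, they applied KenLM-based filtering to C4 and RefinedWeb, and tuned the filtering criteria for Common Crawl data through repeated experiments.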
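For reference, a minimal sketch of what perplexity-based filtering looks like. This uses a toy add-one-smoothed unigram model as a stand-in for KenLM so the example runs on its own. The `UnigramLM` class, the reference texts, and the `max_ppl` threshold are all hypothetical and not taken from the post.

```python
import math
from collections import Counter
from typing import Callable, Iterable, List


class UnigramLM:
    """Toy add-one-smoothed unigram model (a stand-in for a real n-gram LM)."""

    def __init__(self, reference_texts: Iterable[str]) -> None:
        self.counts = Counter(
            word for text in reference_texts for word in text.lower().split()
        )
        self.total = sum(self.counts.values())
        self.vocab = len(self.counts) + 1  # one extra slot for unseen words

    def perplexity(self, text: str) -> float:
        words = text.lower().split()
        if not words:
            return float("inf")  # treat empty documents as the worst case
        log_prob = sum(
            math.log((self.counts[w] + 1) / (self.total + self.vocab))
            for w in words
        )
        return math.exp(-log_prob / len(words))


def filter_by_perplexity(
    docs: Iterable[str],
    perplexity_fn: Callable[[str], float],
    max_ppl: float,
) -> List[str]:
    """Keep documents whose perplexity is at or below max_ppl."""
    return [doc for doc in docs if perplexity_fn(doc) <= max_ppl]


# Hypothetical "clean" reference corpus and candidate documents.
reference = [
    "the model is trained on clean text",
    "the data is filtered by the model",
]
lm = UnigramLM(reference)
docs = [
    "the model is trained on the data",
    "zxqv click here!!! buy buy qqq",
]
kept = filter_by_perplexity(docs, lm.perplexity, max_ppl=15.0)
print(kept)
```

In a real pipeline the `perplexity_fn` would come from a language model trained on a trusted corpus, and the threshold would be set empirically, which is presumably what the repeated experiments were for.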

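On top of that, they collected data from GitHub and PyPI, then applied deduplication and a dependency-based topological sort. They also mined additional programming-related web documents and added OpenWebMath, Cosmopedia, OpenWebText, and so on. In the end, though, Common Crawl-based data is still the main component.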
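A rough sketch of what a dependency-based topological sort over a repository's files could look like, using Python's standard `graphlib`. The `repo_deps` mapping is a hypothetical example; how the dependencies are actually extracted (import parsing, package metadata, etc.) is not something I know from the post.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def order_files_by_dependency(deps: Dict[str, Set[str]]) -> List[str]:
    """Return files ordered so each file comes after the files it depends on.

    `deps` maps a file name to the set of files it depends on.
    Raises graphlib.CycleError if the dependency graph has a cycle.
    """
    return list(TopologicalSorter(deps).static_order())


# Hypothetical repository: main.py imports model.py and utils.py,
# and model.py imports utils.py.
repo_deps = {
    "main.py": {"model.py", "utils.py"},
    "model.py": {"utils.py"},
    "utils.py": set(),
}
ordered = order_files_by_dependency(repo_deps)
print(ordered)
```

The idea is that when the files are concatenated in this order into one training document, definitions appear in the context before the code that uses them.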

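As with DeepSeek Math, it seems that focusing solely on mining high-value data from Common Crawl can already get you very far. (https://arxiv.org/abs/2402.03300) That said, the post makes an interesting point: because more sites have recently started blocking Common Crawl's CCBot, older dumps seem to be more valuable than recent ones. Apparently some sources that are high-value have recently blocked CCBot. This could become a real problem for using web crawl data. (In that sense, Google probably has an advantage in web crawling as well, since hardly anyone would want to block Googlebot.)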
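To make the CCBot blocking concrete, here is a small sketch that checks a robots.txt against different crawler user agents with the standard `urllib.robotparser`. The robots.txt content and the example.com URL are made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def is_crawler_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check whether `user_agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt that blocks CCBot but allows everyone else.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

url = "https://example.com/article"
ccbot_ok = is_crawler_allowed(ROBOTS_TXT, "CCBot", url)
googlebot_ok = is_crawler_allowed(ROBOTS_TXT, "Googlebot", url)
print(ccbot_ok, googlebot_ok)
```

A site with a rule like this disappears from new Common Crawl dumps while staying fully visible to search engine crawlers, which is the asymmetry mentioned above.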

Separately, I am curious how Reka managed to keep the share of web crawl data at around 25% while building a dataset of more than 5T.

#pretraining #dataset

REBEL: Reinforcement Learning via Regressing Relative Rewards

(Zhaolin Gao, Jonathan D. Chang, Wenhao Zhan, Owen Oertell, Gokul Swamy, Kianté Brantley, Thorsten Joachims, J. Andrew Bagnell, Jason D. Lee, Wen Sun)

While originally developed for continuous control problems, Proximal Policy Optimization (PPO) has emerged as the work-horse of a variety of reinforcement learning (RL) applications including the fine-tuning of generative models. Unfortunately, PPO requires multiple heuristics to enable stable convergence (e.g. value networks, clipping) and is notorious for its sensitivity to the precise implementation of these components. In response, we take a step back and ask what a minimalist RL algorithm for the era of generative models would look like. We propose REBEL, an algorithm that cleanly reduces the problem of policy optimization to regressing the relative rewards via a direct policy parameterization between two completions to a prompt, enabling strikingly lightweight implementation. In theory, we prove that fundamental RL algorithms like Natural Policy Gradient can be seen as variants of REBEL, which allows us to match the strongest known theoretical guarantees in terms of convergence and sample complexity in the RL literature. REBEL can also cleanly incorporate offline data and handle the intransitive preferences we frequently see in practice. Empirically, we find that REBEL provides a unified approach to language modeling and image generation with stronger or similar performance as PPO and DPO, all while being simpler to implement and more computationally tractable than PPO.

The loss looks like a DPO loss that is regressed onto the reward difference between samples. The results do seem to be about on par with PPO.
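As I understand it, the per-pair objective can be sketched roughly as below: the difference of policy log-ratios between two completions, scaled by 1/eta, is regressed onto the difference of their rewards with a squared error. The function name, argument names, and the numbers are my own for illustration, and this is a scalar sketch rather than the authors' implementation.

```python
def rebel_pair_loss(
    logp_new_a: float,
    logp_old_a: float,
    logp_new_b: float,
    logp_old_b: float,
    reward_a: float,
    reward_b: float,
    eta: float,
) -> float:
    """Squared error between the scaled log-ratio difference and the reward difference.

    logp_new_* are log-probabilities of completions a and b under the policy
    being optimized; logp_old_* are the same under the previous policy.
    """
    log_ratio_diff = (logp_new_a - logp_old_a) - (logp_new_b - logp_old_b)
    reward_diff = reward_a - reward_b
    return (log_ratio_diff / eta - reward_diff) ** 2


# Made-up numbers: completion a gained probability, b lost some.
loss = rebel_pair_loss(
    logp_new_a=-1.0, logp_old_a=-2.0,
    logp_new_b=-3.0, logp_old_b=-2.5,
    reward_a=1.0, reward_b=0.0,
    eta=1.0,
)
print(loss)
```

Compared with DPO, the target here is an explicit reward difference rather than a binary preference label, so the reward model's margin is used directly.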

#rlhf
