February 23, 2024
Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs
(Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Ahmet Üstün, Sara Hooker)
AI alignment in the shape of Reinforcement Learning from Human Feedback (RLHF) is increasingly treated as a crucial ingredient for high-performance large language models. Proximal Policy Optimization (PPO) has been positioned by recent literature as the canonical method for the RL part of RLHF. However, it involves both high computational cost and sensitive hyperparameter tuning. We posit that most of the motivational principles that led to the development of PPO are less of a practical concern in RLHF and advocate for a less computationally expensive method that preserves and even increases performance. We revisit the formulation of alignment from human preferences in the context of RL. Keeping simplicity as a guiding principle, we show that many components of PPO are unnecessary in an RLHF context and that far simpler REINFORCE-style optimization variants outperform both PPO and newly proposed "RL-free" methods such as DPO and RAFT. Our work suggests that careful adaptation to LLM alignment characteristics enables benefiting from online RL optimization at low cost.
Following Google, here is another case of moving away from PPO. They try RLOO, which computes a baseline for REINFORCE from N samples per prompt. Approaches that use N samples have been appearing a lot lately.
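A minimal sketch of the leave-one-out baseline behind RLOO, assuming k sampled completions per prompt with scalar rewards; the function names and tensor layout are illustrative, not the paper's code:

```python
import torch

def rloo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """For rewards of shape (batch, k), use the mean reward of the
    other k-1 samples from the same prompt as each sample's baseline."""
    _, k = rewards.shape
    total = rewards.sum(dim=1, keepdim=True)        # (batch, 1)
    loo_baseline = (total - rewards) / (k - 1)      # leave-one-out mean
    return rewards - loo_baseline

def reinforce_loss(logprobs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """logprobs: summed log-probability of each completion, shape (batch, k)."""
    advantages = rloo_advantages(rewards).detach()  # baseline carries no gradient
    return -(advantages * logprobs).mean()
```

Because the baseline for sample i never includes its own reward, the gradient estimate stays unbiased while the extra samples reduce variance, which is what makes the N-sample trick attractive.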
#rlhf
Cleaner Pretraining Corpus Curation with Neural Web Scraping
(Zhipeng Xu, Zhenghao Liu, Yukun Yan, Zhiyuan Liu, Chenyan Xiong, Ge Yu)
The web contains large-scale, diverse, and abundant information to satisfy the information-seeking needs of humans. Through meticulous data collection, preprocessing, and curation, webpages can be used as a fundamental data resource for language model pretraining. However, as webpages grow increasingly complex and varied in structure, rule-based/feature-based web scrapers are becoming inadequate. This paper presents a simple, fast, and effective Neural web Scraper (NeuScraper) to help extract primary and clean text contents from webpages. Experimental results show that NeuScraper surpasses baseline scrapers with more than a 20% improvement, demonstrating its potential for extracting higher-quality data to facilitate language model pretraining. All of the code is available at https://github.com/OpenMatch/NeuScraper.
A model-based HTML extractor. They compare against extractors like Trafilatura; at the level of the overall corpus, I'm curious how it would differ from approaches that use WET files, as in CCNet.
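Conceptually, a neural scraper swaps hand-written extraction rules for a learned per-block classifier. A hypothetical sketch of that idea (not NeuScraper's actual pipeline; score_block stands in for a trained content/boilerplate classifier):

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def extract_text_blocks(html: str) -> list[str]:
    """Split a page into candidate text blocks, dropping obvious non-content tags."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "nav", "header", "footer"]):
        tag.decompose()
    return list(soup.stripped_strings)

def scrape(html: str, score_block, threshold: float = 0.5) -> str:
    """Keep only the blocks the classifier scores as primary content."""
    kept = [b for b in extract_text_blocks(html) if score_block(b) >= threshold]
    return "\n".join(kept)
```

The contrast with a CCNet-style pipeline is that WET files are Common Crawl's own generic text extraction, whereas a learned extractor operates on the raw HTML and can make finer keep/drop decisions per block.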
#dataset #corpus