June 27, 2025
Bridging Offline and Online Reinforcement Learning for LLMs
(Jack Lanchantin, Angelica Chen, Janice Lan, Xian Li, Swarnadeep Saha, Tianlu Wang, Jing Xu, Ping Yu, Weizhe Yuan, Jason E Weston, Sainbayar Sukhbaatar, Ilia Kulikov)
We investigate the effectiveness of reinforcement learning methods for finetuning large language models when transitioning from offline to semi-online to fully online regimes for both verifiable and non-verifiable tasks. Our experiments cover training on verifiable math as well as non-verifiable instruction following with a set of benchmark evaluations for both. Across these settings, we extensively compare online and semi-online Direct Preference Optimization and Group Relative Policy Optimization (GRPO) objectives, and surprisingly find similar performance and convergence between these variants, which all strongly outperform offline methods. We provide a detailed analysis of the training dynamics and hyperparameter selection strategies to achieve optimal results. Finally, we show that multi-tasking with verifiable and non-verifiable rewards jointly yields improved performance across both task types.
Comparison of online vs. offline RL for LLMs. The conclusion that online RL is better comes as no surprise; people have been advising against offline RL for quite some time now.
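To make the offline/semi-online/online spectrum the abstract describes concrete, here is a minimal runnable sketch, not the paper's implementation: a toy categorical "policy" stands in for an LLM, a stale generation copy produces preference pairs scored by a hypothetical reward function, and how often that copy is synced to the trained policy sets the regime (sync every step = fully online; never = offline DPO). `ToyPolicy`, `SYNC_EVERY`, and `reward` are all illustrative assumptions.

```python
import copy
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy stand-in for an LLM: a categorical policy over V candidate responses,
# just so the training loop is executable end to end.
class ToyPolicy(torch.nn.Module):
    def __init__(self, vocab=8):
        super().__init__()
        self.logits = torch.nn.Parameter(torch.zeros(vocab))
    def log_prob(self, response):  # log pi(response)
        return F.log_softmax(self.logits, dim=-1)[response]
    def sample(self):
        return torch.distributions.Categorical(logits=self.logits).sample()

def dpo_loss(policy, ref, chosen, rejected, beta=0.1):
    """Standard DPO objective on one preference pair, against a frozen reference."""
    pi_lr = policy.log_prob(chosen) - policy.log_prob(rejected)
    with torch.no_grad():
        ref_lr = ref.log_prob(chosen) - ref.log_prob(rejected)
    return -F.logsigmoid(beta * (pi_lr - ref_lr))

reward = lambda r: float(r)        # hypothetical verifier: higher id = better

policy = ToyPolicy()
ref = copy.deepcopy(policy)        # frozen reference model
generator = copy.deepcopy(policy)  # possibly-stale generation policy
opt = torch.optim.SGD(policy.parameters(), lr=0.5)

SYNC_EVERY = 16                    # 1 = fully online; very large = near-offline
for step in range(256):
    if step % SYNC_EVERY == 0:     # semi-online: periodically refresh generator
        generator.load_state_dict(policy.state_dict())
    a, b = generator.sample(), generator.sample()
    chosen, rejected = (a, b) if reward(a) >= reward(b) else (b, a)
    if chosen == rejected:
        continue                   # no preference signal from a tied pair
    loss = dpo_loss(policy, ref, chosen, rejected)
    opt.zero_grad(); loss.backward(); opt.step()

print(F.softmax(policy.logits, dim=-1))  # mass should shift toward high ids
```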
#rl
Data Efficacy for Language Model Training
(Yalun Dai, Yangyu Huang, Xin Zhang, Wenshan Wu, Chong Li, Wenhui Lu, Shijie Cao, Li Dong, Scarlett Li)
Data is fundamental to the training of language models (LM). Recent research has been dedicated to data efficiency, which aims to maximize performance by selecting a minimal or optimal subset of training data. Techniques such as data filtering, sampling, and selection play a crucial role in this area. To complement it, we define Data Efficacy, which focuses on maximizing performance by optimizing the organization of training data and remains relatively underexplored. This work introduces a general paradigm, DELT, for considering data efficacy in LM training, which highlights the significance of training data organization. DELT comprises three components: Data Scoring, Data Selection, and Data Ordering. Among these components, we design Learnability-Quality Scoring (LQS), as a new instance of Data Scoring, which considers both the learnability and quality of each data sample from the gradient consistency perspective. We also devise Folding Ordering (FO), as a novel instance of Data Ordering, which addresses issues such as model forgetting and data distribution bias. Comprehensive experiments validate the data efficacy in LM training, which demonstrates the following: Firstly, various instances of the proposed DELT enhance LM performance to varying degrees without increasing the data scale and model size. Secondly, among these instances, the combination of our proposed LQS for data scoring and Folding for data ordering achieves the most significant improvement. Lastly, data efficacy can be achieved together with data efficiency by applying data selection. Therefore, we believe that data efficacy is a promising foundational area in LM training.
Data efficacy for LLM training. Unlike data efficiency, the idea here is to improve performance by reorganizing the same data rather than selecting less of it. In this work they try "folding", a method that repeats the curriculum multiple times; a sketch of the idea follows below.
Methods for extracting more learning from the same data will only become more important going forward.
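As a hedged illustration of that idea, here is one plausible way to build such an order, assuming a per-sample difficulty score: sort once by score, deal samples round-robin into folds so each fold spans the full easy-to-hard range, then concatenate the folds, so one epoch repeats an easy-to-hard curriculum several times. The paper's actual Folding Ordering (FO) and LQS scoring may differ in detail; `folding_order` and `num_folds` are names invented here.

```python
def folding_order(samples, scores, num_folds=4):
    """Hypothetical 'folding' order: one sort, round-robin split into folds,
    folds concatenated. The result repeats an easy-to-hard curriculum
    num_folds times within a single pass over the data."""
    order = sorted(range(len(samples)), key=lambda i: scores[i])
    folds = [order[f::num_folds] for f in range(num_folds)]  # round-robin deal
    return [samples[i] for fold in folds for i in fold]

# Example with scores as difficulty (lower = easier):
data = ["a", "b", "c", "d", "e", "f", "g", "h"]
scores = [0.1, 0.9, 0.3, 0.7, 0.2, 0.8, 0.4, 0.6]
print(folding_order(data, scores, num_folds=2))
# -> ['a', 'c', 'h', 'f', 'e', 'g', 'd', 'b']: two easy-to-hard passes
```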
#pretraining #llm
Where to find Grokking in LLM Pretraining? Monitor Memorization-to-Generalization without Test
(Ziyue Li, Chenrui Fan, Tianyi Zhou)
Grokking, i.e., test performance keeps improving long after training loss converged, has been recently witnessed in neural network training, making the mechanism of generalization and other emerging capabilities such as reasoning mysterious. While prior studies usually train small models on a few toy or highly-specific tasks for thousands of epochs, we conduct the first study of grokking on checkpoints during one-pass pretraining of a 7B large language model (LLM), i.e., OLMoE. We compute the training loss and evaluate generalization on diverse benchmark tasks, including math reasoning, code generation, and commonsense/domain-specific knowledge retrieval tasks. Our study, for the first time, verifies that grokking still happens in the pretraining of large-scale foundation models, though different data may enter grokking stages asynchronously. We further demystify grokking's "emergence of generalization" by investigating LLM internal dynamics. Specifically, we find that training samples' pathways (i.e., expert choices across layers) evolve from random, instance-specific to more structured and shareable between samples during grokking. Also, the complexity of a sample's pathway reduces despite the converged loss. These indicate a memorization-to-generalization conversion, providing a mechanistic explanation of delayed generalization. In the study, we develop two novel metrics to quantify pathway distance and the complexity of a single pathway. We show their ability to predict the generalization improvement on diverse downstream tasks. They are efficient, simple to compute and solely dependent on training data. Hence, they have practical value for pretraining, enabling us to monitor the generalization performance without finetuning and test. Theoretically, we show that more structured pathways reduce model complexity and improve the generalization bound.
A study of whether grokking occurs in LLM pretraining. The authors do find the phenomenon, and observe that as generalization emerges, the model converges to patterns shared across samples rather than fitting each sample individually.
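To make the pathway idea concrete, here is a short sketch of two metrics in the same spirit as the paper's: a sample's "pathway" is the set of experts it is routed to at each MoE layer, distance is the mean per-layer Jaccard distance between two samples' pathways, and complexity is the entropy of expert usage along one pathway. These are plausible formalizations for illustration, not the paper's exact definitions; `Pathway`, `pathway_distance`, and `pathway_complexity` are assumed names.

```python
import math

Pathway = list[set[int]]  # per-layer sets of selected expert ids (top-k routing)

def pathway_distance(p: Pathway, q: Pathway) -> float:
    """Mean per-layer Jaccard distance between two samples' expert choices:
    0.0 = identical routing, values near 1.0 = disjoint routing."""
    dists = [1.0 - len(a & b) / len(a | b) for a, b in zip(p, q)]
    return sum(dists) / len(dists)

def pathway_complexity(p: Pathway, num_experts: int) -> float:
    """Entropy of the empirical expert-usage distribution along one pathway;
    lower entropy = a more structured, less instance-specific route."""
    counts = [0] * num_experts
    for layer in p:
        for e in layer:
            counts[e] += 1
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# During grokking, the paper reports pathways becoming more shareable between
# samples (distance shrinks) and simpler (complexity drops), i.e. a
# memorization-to-generalization conversion visible without any test set.
a = [{1, 4}, {2, 7}, {3, 5}]
b = [{1, 4}, {2, 6}, {3, 5}]
print(pathway_distance(a, b), pathway_complexity(a, num_experts=8))
```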
#generalization #llm