2024년 11월 26일
Self-Generated Critiques Boost Reward Modeling for Language Models
(Yue Yu, Zhengxing Chen, Aston Zhang, Liang Tan, Chenguang Zhu, Richard Yuanzhe Pang, Yundi Qian, Xuewei Wang, Suchin Gururangan, Chao Zhang, Melanie Kambadur, Dhruv Mahajan, Rui Hou)
Reward modeling is crucial for aligning large language models (LLMs) with human preferences, especially in reinforcement learning from human feedback (RLHF). However, current reward models mainly produce scalar scores and struggle to incorporate critiques in a natural language format. We hypothesize that predicting both critiques and the scalar reward would improve reward modeling ability. Motivated by this, we propose Critic-RM, a framework that improves reward models using self-generated critiques without extra supervision. Critic-RM employs a two-stage process: generating and filtering high-quality critiques, followed by joint fine-tuning on reward prediction and critique generation. Experiments across benchmarks show that Critic-RM improves reward modeling accuracy by 3.7%-7.3% compared to standard reward models and LLM judges, demonstrating strong performance and data efficiency. Additional studies further validate the effectiveness of generated critiques in rectifying flawed reasoning steps with 2.5%-3.2% gains in improving reasoning accuracy.
A reward model that generates critiques. The model is trained to predict reward scores and generate critiques, using preference labels to filter critiques sampled from the model and an LLM to refine them; the filtered and refined critiques then serve as the training targets for critique generation.
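A minimal sketch of what the joint fine-tuning objective might look like, assuming a language-modeling loss on the filtered critiques combined with a Bradley-Terry preference loss on scalar rewards from a value head; the function signature, tensor shapes, and loss weighting are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical joint critique-generation + reward-prediction loss, roughly in
# the spirit of Critic-RM; names and the exact loss combination are assumptions.
import torch
import torch.nn.functional as F

def joint_loss(lm_logits, critique_labels, reward_chosen, reward_rejected, alpha=0.5):
    """lm_logits: (B, T, V) logits over critique tokens;
    critique_labels: (B, T) token ids of the filtered critique (-100 = ignore);
    reward_chosen / reward_rejected: (B,) scalar rewards from a value head."""
    # Next-token loss on the self-generated, preference-filtered critique.
    critique_loss = F.cross_entropy(
        lm_logits.reshape(-1, lm_logits.size(-1)),
        critique_labels.reshape(-1),
        ignore_index=-100,
    )
    # Bradley-Terry style preference loss on the scalar rewards.
    reward_loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()
    # Weighted combination; alpha trades off the two objectives.
    return alpha * critique_loss + (1 - alpha) * reward_loss
```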
#reward-model
Predicting Emergent Capabilities by Finetuning
(Charlie Snell, Eric Wallace, Dan Klein, Sergey Levine)
A fundamental open challenge in modern LLM scaling is the lack of understanding around emergent capabilities. In particular, language model pretraining loss is known to be highly predictable as a function of compute. However, downstream capabilities are far less predictable -- sometimes even exhibiting emergent jumps -- which makes it challenging to anticipate the capabilities of future models. In this work, we first pose the task of emergence prediction: given access to current LLMs that have random few-shot accuracy on a task, can we predict whether future models (GPT-N+1) will have non-trivial accuracy on that task? We then discover a simple insight for this problem: finetuning LLMs on a given task can shift the point in scaling at which emergence occurs towards less capable models. To operationalize this insight, we can finetune LLMs with varying amounts of data and fit a parametric function that predicts when emergence will occur (i.e., "emergence laws"). We validate this approach using four standard NLP benchmarks where large-scale open-source LLMs already demonstrate emergence (MMLU, GSM8K, CommonsenseQA, and CoLA). Using only small-scale LLMs, we find that, in some cases, we can accurately predict whether models trained with up to 4x more compute have emerged. Finally, we present a case study of two realistic uses for emergence prediction.
The problem of predicting when emergence will occur, given only current models that show random accuracy on a task. The authors model this based on the observation that finetuning shifts the point of emergence toward less capable models, and that the size of this shift grows with the amount of finetuning data. Given a formula for the emergence point as a function of finetuning data size, the emergence point for few-shot prompting is the value obtained by setting the finetuning data size to 0.
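A minimal sketch of fitting such an "emergence law" with scipy, assuming downstream accuracy follows a sigmoid in log-compute whose midpoint shifts left with the amount of finetuning data; the functional form, parameter names, and data values are illustrative assumptions, not the paper's exact parameterization.

```python
# Fit an illustrative emergence law and extrapolate to zero finetuning data.
import numpy as np
from scipy.optimize import curve_fit

def emergence_law(X, acc_max, k, c0, shift):
    """X = (log_compute, log_finetune_tokens). The sigmoid midpoint c0 is
    shifted left in proportion to the log amount of finetuning data."""
    log_c, log_d = X
    midpoint = c0 - shift * log_d
    return acc_max / (1.0 + np.exp(-k * (log_c - midpoint)))

# Accuracies of small models finetuned on varying amounts of data
# (placeholder values for illustration only).
log_compute = np.array([20.0, 20.0, 21.0, 21.0, 22.0, 22.0])
log_ft_tokens = np.array([14.0, 16.0, 14.0, 16.0, 14.0, 16.0])
accuracy = np.array([0.13, 0.21, 0.28, 0.41, 0.49, 0.62])

params, _ = curve_fit(
    emergence_law, (log_compute, log_ft_tokens), accuracy,
    p0=[0.9, 1.0, 25.0, 0.3], maxfev=10_000,
)
acc_max, k, c0, shift = params
# Setting the finetuning data term to zero recovers the predicted few-shot
# emergence point (the midpoint without any finetuning-induced shift).
print(f"predicted few-shot emergence at log-compute ~ {c0:.2f}")
```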
#scaling-law