Aug 19, 2025

Reinforcement Learning with Rubric Anchors

(Zenan Huang, Yihong Zhuang, Guoshan Lu, Zeyu Qin, Haokai Xu, Tianyu Zhao, Ru Peng, Jiaqi Hu, Zhanming Shen, Xiaomeng Hu, Xijun Gu, Peiyi Tu, Jiaxin Liu, Wenyu Chen, Yuzhuo Fu, Zhiting Fan, Yanmei Gu, Yuanyuan Wang, Zhengkai Yang, Jianguo Li, Junbo Zhao)

Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as a powerful paradigm for enhancing Large Language Models (LLMs), exemplified by the success of OpenAI's o-series. In RLVR, rewards are derived from verifiable signals, such as passing unit tests in code generation or matching correct answers in mathematical reasoning. While effective, this requirement largely confines RLVR to domains with automatically checkable outcomes. To overcome this, we extend the RLVR paradigm to open-ended tasks by integrating rubric-based rewards, where carefully designed rubrics serve as structured, model-interpretable criteria for automatic scoring of subjective outputs. We construct, to our knowledge, the largest rubric reward system to date, with over 10,000 rubrics from humans, LLMs, or a hybrid human-LLM collaboration. Implementing rubric-based RL is challenging; we tackle these issues with a clear framework and present an open-sourced Qwen-30B-A3B model with notable gains: 1) With only 5K+ samples, our system improves by +5.2% on open-ended benchmarks (especially humanities), outperforming a 671B DeepSeek-V3 model by +2.4%, while preserving general and reasoning abilities. 2) Our method provides fine-grained stylistic control, using rubrics as anchors to mitigate the "AI-like" tone and produce more human-like, expressive responses. We share key lessons in rubric construction, data selection, and training, and discuss limitations and future releases.

Yet another rubric-based reward model study. The paper gives few details on how the rubrics were constructed, how data matching the rubrics was built, or how the rubrics are designed and updated. Still, it is enough to get a rough sense of the approach.
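To make the idea of a rubric-based reward concrete, here is a minimal sketch of how per-criterion scores could be folded into one scalar reward. Everything in it is an assumption for illustration: the `Rubric` class, the weights, the weighted-average aggregation, and the keyword-based judges (which stand in for an LLM grader) are hypothetical and are not taken from the paper, which does not spell out these details in the abstract.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rubric:
    """One scoring criterion (hypothetical structure, not the paper's)."""
    description: str
    weight: float
    judge: Callable[[str], float]  # returns a score, expected in [0, 1]


def rubric_reward(response: str, rubrics: List[Rubric]) -> float:
    """Weighted average of per-rubric scores, clipped to [0, 1].

    The weighted average is an assumed aggregation rule, chosen only
    because it is the simplest one to reason about.
    """
    total_weight = sum(r.weight for r in rubrics)
    if total_weight <= 0:
        raise ValueError("rubric weights must sum to a positive value")
    weighted = sum(
        r.weight * min(1.0, max(0.0, r.judge(response))) for r in rubrics
    )
    return weighted / total_weight


# Toy judges: simple string checks standing in for an LLM grader.
RUBRICS = [
    Rubric(
        description="gives a concrete example",
        weight=2.0,
        judge=lambda s: 1.0 if "for example" in s.lower() else 0.0,
    ),
    Rubric(
        description="avoids a boilerplate 'As an AI' opener",
        weight=1.0,
        judge=lambda s: 0.0 if s.startswith("As an AI") else 1.0,
    ),
]

if __name__ == "__main__":
    print(rubric_reward("For example, consider a short poem.", RUBRICS))
    print(rubric_reward("As an AI, I cannot say.", RUBRICS))
```

In an RL setup the returned scalar would play the role that a unit-test pass or answer match plays in RLVR; how the real system weights, gates, or combines rubrics is not something this sketch claims to reproduce.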

#reward-model

Signal and Noise: A Framework for Reducing Uncertainty in Language Model Evaluation

(David Heineman, Valentin Hofmann, Ian Magnusson, Yuling Gu, Noah A. Smith, Hannaneh Hajishirzi, Kyle Lo, Jesse Dodge)

Developing large language models is expensive and involves making decisions with small experiments, typically by evaluating on large, multi-task evaluation suites. In this work, we analyze specific properties which make a benchmark more reliable for such decisions, and interventions to design higher-quality evaluation benchmarks. We introduce two key metrics that show differences in current benchmarks: signal, a benchmark's ability to separate better models from worse models, and noise, a benchmark's sensitivity to random variability between training steps. We demonstrate that benchmarks with a better signal-to-noise ratio are more reliable when making decisions at small scale, and those with less noise have lower scaling law prediction error. These results suggest that improving signal or noise will lead to more useful benchmarks, so we introduce three interventions designed to directly affect signal or noise. For example, we propose that switching to a metric that has better signal and noise (e.g., perplexity rather than accuracy) leads to better reliability and improved scaling law error. We also find that filtering noisy subtasks, to improve an aggregate signal-to-noise ratio, leads to more reliable multi-task evaluations. We also find that averaging the output of a model's intermediate checkpoints to reduce noise leads to consistent improvements. We conclude by recommending that those creating new benchmarks, or selecting which existing benchmarks to use, aim for high signal and low noise. We use 30 benchmarks for these experiments, and 375 open-weight language models from 60M to 32B parameters, resulting in a new, publicly available dataset of 900K evaluation benchmark results, totaling 200M instances.

To carry small-scale experimental results over to large-scale training, the benchmark behind the decision should ideally have a high signal-to-noise ratio, where signal is the difference in scores between different models and noise is the variance in scores over the course of training. Options for raising the ratio include averaging scores (for example, over intermediate checkpoints) or switching the metric to perplexity.
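A rough sketch of the signal-to-noise idea, using one plausible way to turn the two notions into numbers. The specific formulas here are assumptions for illustration, not necessarily the paper's exact definitions: signal is taken as the spread of final scores across models relative to their mean, and noise as the standard deviation of one model's scores over its last few checkpoints relative to their mean. The function names and the example scores are made up.

```python
from statistics import mean, pstdev
from typing import Sequence


def signal(final_scores: Sequence[float]) -> float:
    """Relative spread of final benchmark scores across different models."""
    return (max(final_scores) - min(final_scores)) / mean(final_scores)


def noise(checkpoint_scores: Sequence[float]) -> float:
    """Relative std of one model's scores over its last few checkpoints."""
    return pstdev(checkpoint_scores) / mean(checkpoint_scores)


def snr(final_scores: Sequence[float], checkpoint_scores: Sequence[float]) -> float:
    """Signal-to-noise ratio under the assumed definitions above."""
    return signal(final_scores) / noise(checkpoint_scores)


def smoothed_score(checkpoint_scores: Sequence[float], k: int = 3) -> float:
    """Average the last k checkpoint scores to damp step-to-step noise."""
    return mean(checkpoint_scores[-k:])


if __name__ == "__main__":
    # Made-up numbers: three models' final scores on one benchmark,
    # and one model's scores at its last four training checkpoints.
    finals = [0.40, 0.50, 0.60]
    last_checkpoints = [0.48, 0.52, 0.50, 0.50]
    print("signal:", signal(finals))
    print("noise:", noise(last_checkpoints))
    print("snr:", snr(finals, last_checkpoints))
    print("smoothed:", smoothed_score(last_checkpoints))
```

A benchmark where models differ a lot but each model's score barely moves between neighboring checkpoints gets a high ratio under this sketch, which matches the intuition in the commentary above.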

#benchmark #scaling-law
