July 11, 2025
Why is Your Language Model a Poor Implicit Reward Model?
(Noam Razin, Yong Lin, Jiarui Yao, Sanjeev Arora)
Reward models are key to language model post-training and inference pipelines. Conveniently, recent work showed that every language model defines an implicit reward model (IM-RM), without requiring any architectural changes. However, such IM-RMs tend to generalize worse, especially out-of-distribution, compared to explicit reward models (EX-RMs) that apply a dedicated linear head over the hidden representations of a language model. The existence of a generalization gap is puzzling, as EX-RMs and IM-RMs are nearly identical. They can be trained using the same data, loss function, and language model, and differ only in how the reward is computed. Towards a fundamental understanding of the implicit biases underlying different reward model types, we investigate the root cause of this gap. Our main finding, backed by theory and experiments, is that IM-RMs rely more heavily on superficial token-level cues. Consequently, they often generalize worse than EX-RMs under token-level distribution shifts, as well as in-distribution. Furthermore, we provide evidence against alternative hypotheses for the generalization gap. Most notably, we challenge the intuitive claim that IM-RMs struggle in tasks where generation is harder than verification because they can operate both as a verifier and a generator. Taken together, our results highlight that seemingly minor design choices can substantially impact the generalization behavior of reward models.
An analysis of why parameterizing the reward implicitly generalizes worse than attaching an explicit reward head. The finding is that implicit reward models tend to rely more heavily on token-level cues.
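To make the difference concrete, here is a minimal sketch (not the authors' code) of how the two parameterizations score a response; the tensor shapes, the β scale, and the DPO-style log-ratio form of the implicit reward are assumptions on my part.

```python
import torch.nn.functional as F

def explicit_reward(hidden_states, reward_head):
    # EX-RM: a dedicated linear head scores the hidden state of the final token.
    # hidden_states: (batch, seq_len, d_model); reward_head: e.g. nn.Linear(d_model, 1)
    last_hidden = hidden_states[:, -1, :]            # (batch, d_model)
    return reward_head(last_hidden).squeeze(-1)      # (batch,)

def implicit_reward(policy_logits, ref_logits, response_ids, beta=0.1):
    # IM-RM (DPO-style): the reward is the scaled log-probability ratio of the
    # response tokens under the trained LM versus a frozen reference LM.
    # *_logits: (batch, seq_len, vocab); response_ids: (batch, seq_len)
    logp = F.log_softmax(policy_logits, dim=-1)
    ref_logp = F.log_softmax(ref_logits, dim=-1)
    token_logp = logp.gather(-1, response_ids.unsqueeze(-1)).squeeze(-1)
    token_ref = ref_logp.gather(-1, response_ids.unsqueeze(-1)).squeeze(-1)
    return beta * (token_logp - token_ref).sum(dim=-1)   # (batch,)
```

Both can be trained on the same preference data with the same loss; per the paper, only this reward computation differs, which is what makes the generalization gap notable.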
#reward-model #generalization
MGVQ: Could VQ-VAE Beat VAE? A Generalizable Tokenizer with Multi-group Quantization
(Mingkai Jia, Wei Yin, Xiaotao Hu, Jiaxin Guo, Xiaoyang Guo, Qian Zhang, Xiao-Xiao Long, Ping Tan)
Vector Quantized Variational Autoencoders (VQ-VAEs) are fundamental models that compress continuous visual data into discrete tokens. Existing methods have tried to improve the quantization strategy for better reconstruction quality, however, there still exists a large gap between VQ-VAEs and VAEs. To narrow this gap, we propose MGVQ, a novel method to augment the representation capability of discrete codebooks, facilitating easier optimization for codebooks and minimizing information loss, thereby enhancing reconstruction quality. Specifically, we propose to retain the latent dimension to preserve encoded features and incorporate a set of sub-codebooks for quantization. Furthermore, we construct comprehensive zero-shot benchmarks featuring resolutions of 512p and 2k to evaluate the reconstruction performance of existing methods rigorously. MGVQ achieves state-of-the-art performance on both ImageNet and 8 zero-shot benchmarks across all VQ-VAEs. Notably, compared with SD-VAE, we outperform it on ImageNet significantly, with rFID 0.49 vs. 0.91, and achieve superior PSNR on all zero-shot benchmarks. These results highlight the superiority of MGVQ in reconstruction and pave the way for preserving fidelity in HD image processing tasks. Code will be publicly available at https://github.com/MKJia/MGVQ.
A high-performance VQ tokenizer. It keeps a wide latent channel dimension and uses multiple sub-codebooks per patch, extracting several codes from each patch.
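As a rough illustration of the multi-group idea (my reading of the abstract, not the released code), the sketch below splits the latent channels into groups and quantizes each group against its own sub-codebook; the group count, codebook size, and straight-through trick are standard VQ-VAE assumptions.

```python
import torch
import torch.nn as nn

class MultiGroupQuantizer(nn.Module):
    """Toy multi-group quantizer: the latent channel dimension is split into
    `num_groups` slices, each matched against its own sub-codebook, so every
    patch is represented by several codes instead of one."""

    def __init__(self, dim=256, num_groups=4, codebook_size=1024):
        super().__init__()
        assert dim % num_groups == 0
        self.num_groups = num_groups
        self.codebooks = nn.ModuleList(
            nn.Embedding(codebook_size, dim // num_groups) for _ in range(num_groups)
        )

    def forward(self, z):
        # z: (batch, num_patches, dim) continuous encoder output
        quantized, indices = [], []
        for chunk, codebook in zip(z.chunk(self.num_groups, dim=-1), self.codebooks):
            # nearest codebook entry per patch for this channel group
            dists = (chunk.unsqueeze(-2) - codebook.weight).pow(2).sum(-1)  # (B, P, K)
            idx = dists.argmin(dim=-1)                                      # (B, P)
            q = codebook(idx)                                               # (B, P, dim/G)
            # straight-through estimator so gradients still reach the encoder
            quantized.append(chunk + (q - chunk).detach())
            indices.append(idx)
        return torch.cat(quantized, dim=-1), torch.stack(indices, dim=-1)
```

With G sub-codebooks of K entries each, a patch can in principle take K^G distinct code combinations, which is presumably where the added representation capacity comes from.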
#vq
The Synergy Dilemma of Long-CoT SFT and RL: Investigating Post-Training Techniques for Reasoning VLMs
(Jierun Chen, Tiezheng Yu, Haoli Bai, Lewei Yao, Jiannan Wu, Kaican Li, Fei Mi, Chaofan Tao, Lei Zhu, Manyi Zhang, Xiaohui Li, Lu Hou, Lifeng Shang, Qun Liu)
Large vision-language models (VLMs) increasingly adopt post-training techniques such as long chain-of-thought (CoT) supervised fine-tuning (SFT) and reinforcement learning (RL) to elicit sophisticated reasoning. While these methods exhibit synergy in language-only models, their joint effectiveness in VLMs remains uncertain. We present a systematic investigation into the distinct roles and interplay of long-CoT SFT and RL across multiple multimodal reasoning benchmarks. We find that SFT improves performance on difficult questions by in-depth, structured reasoning, but introduces verbosity and degrades performance on simpler ones. In contrast, RL promotes generalization and brevity, yielding consistent improvements across all difficulty levels, though the improvements on the hardest questions are less prominent compared to SFT. Surprisingly, combining them through two-staged, interleaved, or progressive training strategies, as well as data mixing and model merging, all fails to produce additive benefits, instead leading to trade-offs in accuracy, reasoning style, and response length. This "synergy dilemma" highlights the need for more seamless and adaptive approaches to unlock the full potential of combined post-training techniques for reasoning VLMs.
A comparison of reasoning SFT and RL. SFT brings larger gains on high-difficulty problems, while RL lifts performance more evenly across difficulty levels. Combining the two does not yield complementary gains but lands somewhere in between. Perhaps figuring out how to raise performance on hard problems with RL is the more promising direction.
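Of the combination strategies the paper tests, model merging is the easiest to picture; below is a toy sketch under the usual assumption that merging means linear interpolation of the SFT-tuned and RL-tuned weights of the same base model (the coefficient alpha is a free choice, and floating-point parameters are assumed).

```python
import torch

def merge_models(sft_state, rl_state, alpha=0.5):
    # Linearly interpolate two state dicts with matching keys and shapes.
    # The paper reports that such combinations trade off accuracy, reasoning
    # style, and response length rather than adding up.
    return {k: alpha * sft_state[k] + (1 - alpha) * rl_state[k] for k in sft_state}

# usage: model.load_state_dict(merge_models(sft_model.state_dict(), rl_model.state_dict()))
```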
#rl #reasoning
SAS: Simulated Attention Score
(Chuanyang Zheng, Jiankai Sun, Yihang Gao, Yuehao Wang, Peihao Wang, Jing Xiong, Liliang Ren, Hao Cheng, Janardhan Kulkarni, Yelong Shen, Atlas Wang, Mac Schwager, Anderson Schneider, Xiaodong Liu, Jianfeng Gao)
The attention mechanism is a core component of the Transformer architecture. Various methods have been developed to compute attention scores, including multi-head attention (MHA), multi-query attention, group-query attention and so on. We further analyze the MHA and observe that its performance improves as the number of attention heads increases, provided the hidden size per head remains sufficiently large. Therefore, increasing both the head count and hidden size per head with minimal parameter overhead can lead to significant performance gains at a low cost. Motivated by this insight, we introduce Simulated Attention Score (SAS), which maintains a compact model size while simulating a larger number of attention heads and hidden feature dimension per head. This is achieved by projecting a low-dimensional head representation into a higher-dimensional space, effectively increasing attention capacity without increasing parameter count. Beyond the head representations, we further extend the simulation approach to feature dimension of the key and query embeddings, enhancing expressiveness by mimicking the behavior of a larger model while preserving the original model size. To control the parameter cost, we also propose Parameter-Efficient Attention Aggregation (PEAA). Comprehensive experiments on a variety of datasets and tasks demonstrate the effectiveness of the proposed SAS method, achieving significant improvements over different attention variants.
The idea of increasing the number of attention heads without a large increase in parameters. I do wonder whether the gains simply come from the added computation.
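My rough reading of the mechanism, sketched below: keep the base QKV projections compact, expand each head's queries and keys with small shared projections so scores are computed with more and wider simulated heads, then aggregate the score maps back per real head. The expansion factors and the simple averaging used for aggregation are assumptions on my part, not the paper's PEAA module.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimulatedHeadAttention(nn.Module):
    """Sketch of simulated attention heads: per-head queries/keys are expanded
    into `head_mult` wider simulated heads for score computation only, and the
    resulting attention maps are averaged back to the real heads."""

    def __init__(self, d_model=512, n_heads=8, head_mult=2, dim_mult=2):
        super().__init__()
        assert d_model % n_heads == 0
        self.h, self.m = n_heads, head_mult
        self.hd = d_model // n_heads
        self.sim_dim = dim_mult * self.hd
        self.qkv = nn.Linear(d_model, 3 * d_model)
        # small expanders shared across heads: hd -> head_mult * sim_dim
        self.expand_q = nn.Linear(self.hd, head_mult * self.sim_dim)
        self.expand_k = nn.Linear(self.hd, head_mult * self.sim_dim)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):
        B, T, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def split(t):
            return t.view(B, T, self.h, self.hd).transpose(1, 2)   # (B, H, T, hd)

        def simulate(t, proj):
            t = proj(t).view(B, self.h, T, self.m, self.sim_dim)
            return t.permute(0, 1, 3, 2, 4)                        # (B, H, m, T, sim)

        q, k, v = split(q), split(k), split(v)
        qs, ks = simulate(q, self.expand_q), simulate(k, self.expand_k)
        scores = qs @ ks.transpose(-1, -2) / math.sqrt(self.sim_dim)  # (B, H, m, T, T)
        # aggregate simulated heads by averaging their attention maps
        # (no causal mask here, for brevity)
        attn = F.softmax(scores, dim=-1).mean(dim=2)                  # (B, H, T, T)
        y = (attn @ v).transpose(1, 2).reshape(B, T, self.h * self.hd)
        return self.out(y)
```

Because the expansion projections act on the per-head dimension and are shared across heads, the parameter overhead stays small while the score computation (and FLOPs) grows, which is where my question about compute vs. mechanism comes from.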
#transformer