May 27, 2025
Learning to Reason without External Rewards
(Xuandong Zhao, Zhewei Kang, Aosong Feng, Sergey Levine, Dawn Song)
Training large language models (LLMs) for complex reasoning via Reinforcement Learning with Verifiable Rewards (RLVR) is effective but limited by reliance on costly, domain-specific supervision. We explore Reinforcement Learning from Internal Feedback (RLIF), a framework that enables LLMs to learn from intrinsic signals without external rewards or labeled data. We propose Intuitor, an RLIF method that uses a model's own confidence, termed self-certainty, as its sole reward signal. Intuitor replaces external rewards in Group Relative Policy Optimization (GRPO) with self-certainty scores, enabling fully unsupervised learning. Experiments demonstrate that Intuitor matches GRPO's performance on mathematical benchmarks while achieving superior generalization to out-of-domain tasks like code generation, without requiring gold solutions or test cases. Our findings show that intrinsic model signals can drive effective learning across domains, offering a scalable alternative to RLVR for autonomous AI systems where verifiable rewards are unavailable. Code is available at https://github.com/sunblaze-ucb/Intuitor
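A minimal sketch of the core idea, under my own assumptions about the details (PyTorch; the function names are mine, and the self-certainty formula is read loosely from the abstract as the average KL divergence from a uniform distribution to the model's next-token distribution): score each sampled response by its self-certainty, then normalize within the group exactly where GRPO would normalize an external reward.

```python
import math

import torch
import torch.nn.functional as F


def self_certainty(logits: torch.Tensor) -> torch.Tensor:
    """Average KL(U || p) over the generated tokens of one response, where p is
    the model's next-token distribution and U is uniform over the vocabulary."""
    log_p = F.log_softmax(logits, dim=-1)                  # [T, V]
    vocab_size = logits.size(-1)
    # KL(U || p) at each position = -log(V) - mean_j log p_j
    kl_per_position = -math.log(vocab_size) - log_p.mean(dim=-1)
    return kl_per_position.mean()


def self_certainty_advantages(logits_per_response) -> torch.Tensor:
    """Group-normalized self-certainty scores, standing in for the external
    reward that GRPO would normally normalize within the sampled group."""
    scores = torch.stack([self_certainty(l) for l in logits_per_response])
    return (scores - scores.mean()) / (scores.std() + 1e-6)
```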
Reasoning RL without a verifier, using the model's certainty in its own outputs as the reward (https://arxiv.org/abs/2504.05812). On top of the ongoing debate about whether RL can inject genuinely new capabilities into a model, we will keep seeing attempts to improve performance without any additional data. Is that a good direction?
#rl #reasoning
Hybrid Latent Reasoning via Reinforcement Learning
(Zhenrui Yue, Bowen Jin, Huimin Zeng, Honglei Zhuang, Zhen Qin, Jinsung Yoon, Lanyu Shang, Jiawei Han, Dong Wang)
Recent advances in large language models (LLMs) have introduced latent reasoning as a promising alternative to autoregressive reasoning. By performing internal computation with hidden states from previous steps, latent reasoning benefits from more informative features rather than sampling a discrete chain-of-thought (CoT) path. Yet latent reasoning approaches are often incompatible with LLMs, as their continuous paradigm conflicts with the discrete nature of autoregressive generation. Moreover, these methods rely on CoT traces for training and thus fail to exploit the inherent reasoning patterns of LLMs. In this work, we explore latent reasoning by leveraging the intrinsic capabilities of LLMs via reinforcement learning (RL). To this end, we introduce hybrid reasoning policy optimization (HRPO), an RL-based hybrid latent reasoning approach that (1) integrates prior hidden states into sampled tokens with a learnable gating mechanism, and (2) initializes training with predominantly token embeddings while progressively incorporating more hidden features. This design maintains LLMs' generative capabilities and incentivizes hybrid reasoning using both discrete and continuous representations. In addition, the hybrid HRPO introduces stochasticity into latent reasoning via token sampling, thereby enabling RL-based optimization without requiring CoT trajectories. Extensive evaluations across diverse benchmarks show that HRPO outperforms prior methods in both knowledge- and reasoning-intensive tasks. Furthermore, HRPO-trained LLMs remain interpretable and exhibit intriguing behaviors like cross-lingual patterns and shorter completion lengths, highlighting the potential of our RL-based approach and offering insights for future work in latent reasoning.
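A minimal sketch of what the gated mixing could look like (PyTorch; the module name, the linear projection, and the gate initialization are assumptions rather than the paper's exact design). The gate starts near 1, so the input begins as almost pure token embeddings, and training can progressively let the previous hidden state in.

```python
import torch
import torch.nn as nn


class HybridInputGate(nn.Module):
    """Learnable gate that mixes the sampled token's embedding with the
    previous step's hidden state, projected to the embedding dimension."""
    def __init__(self, hidden_size: int, embed_size: int, init_bias: float = 4.0):
        super().__init__()
        self.proj = nn.Linear(hidden_size, embed_size, bias=False)
        # sigmoid(4.0) ~= 0.98, so training starts with almost pure token
        # embeddings and only gradually lets hidden features in.
        self.gate_logit = nn.Parameter(torch.full((embed_size,), init_bias))

    def forward(self, token_embed: torch.Tensor, prev_hidden: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate_logit)                 # [embed_size]
        return g * token_embed + (1.0 - g) * self.proj(prev_hidden)
```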
The idea of implementing latent thinking as a weighted sum of embeddings is showing up more often than I expected (https://arxiv.org/abs/2505.14827, https://arxiv.org/abs/2505.15778). This work also adds RL training on top.
#rl #reasoning
Enigmata: Scaling Logical Reasoning in Large Language Models with Synthetic Verifiable Puzzles
(Jiangjie Chen, Qianyu He, Siyu Yuan, Aili Chen, Zhicheng Cai, Weinan Dai, Hongli Yu, Qiying Yu, Xuefeng Li, Jiaze Chen, Hao Zhou, Mingxuan Wang)
Large Language Models (LLMs), such as OpenAI's o1 and DeepSeek's R1, excel at advanced reasoning tasks like math and coding via Reinforcement Learning with Verifiable Rewards (RLVR), but still struggle with puzzles solvable by humans without domain knowledge. We introduce Enigmata, the first comprehensive suite tailored for improving LLMs with puzzle reasoning skills. It includes 36 tasks across seven categories, each with 1) a generator that produces unlimited examples with controllable difficulty and 2) a rule-based verifier for automatic evaluation. This generator-verifier design supports scalable, multi-task RL training, fine-grained analysis, and seamless RLVR integration. We further propose Enigmata-Eval, a rigorous benchmark, and develop optimized multi-task RLVR strategies. Our trained model, Qwen2.5-32B-Enigmata, consistently surpasses o3-mini-high and o1 on the puzzle reasoning benchmarks like Enigmata-Eval, ARC-AGI (32.8%), and ARC-AGI 2 (0.6%). It also generalizes well to out-of-domain puzzle benchmarks and mathematical reasoning, with little multi-tasking trade-off. When trained on larger models like Seed1.5-Thinking (20B activated parameters and 200B total parameters), puzzle data from Enigmata further boosts SoTA performance on advanced math and STEM reasoning tasks such as AIME (2024-2025), BeyondAIME and GPQA (Diamond), showing nice generalization benefits of Enigmata. This work offers a unified, controllable framework for advancing logical reasoning in LLMs. Resources of this work can be found at https://seed-enigmata.github.io.
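The generator-verifier contract is easy to picture with a toy example (this sorting task is illustrative only, not one of Enigmata's 36 tasks, and model_generate below is a hypothetical call): a generator with a difficulty knob emits a prompt plus ground truth, and a rule-based verifier maps any answer string to a binary reward usable for RLVR.

```python
import random


def generate_puzzle(difficulty: int, seed=None):
    """Toy generator: the difficulty knob controls how many numbers must be
    sorted; it returns both the prompt and the ground truth the verifier needs."""
    rng = random.Random(seed)
    items = rng.sample(range(1, 100), k=3 + 2 * difficulty)
    prompt = ("Sort these numbers in ascending order and answer as a "
              f"comma-separated list: {items}")
    return prompt, items


def verify(items, answer: str) -> float:
    """Rule-based verifier: 1.0 for an exactly correct answer, else 0.0,
    usable directly as a verifiable (RLVR) reward."""
    try:
        predicted = [int(x) for x in answer.split(",")]
    except ValueError:
        return 0.0
    return 1.0 if predicted == sorted(items) else 0.0


prompt, items = generate_puzzle(difficulty=2, seed=0)
# reward = verify(items, model_generate(prompt))  # model_generate is hypothetical
```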
Training reasoning with puzzles.
#rl #reasoning #synthetic-data
FP4 All the Way: Fully Quantized Training of LLMs
(Brian Chmiel, Maxim Fishman, Ron Banner, Daniel Soudry)
We demonstrate, for the first time, fully quantized training (FQT) of large language models (LLMs) using predominantly 4-bit floating-point (FP4) precision for weights, activations, and gradients on datasets up to 200 billion tokens. We extensively investigate key design choices for FP4, including block sizes, scaling formats, and rounding methods. Our analysis shows that the NVFP4 format, where each block of 16 FP4 values (E2M1) shares a scale represented in E4M3, provides optimal results. We use stochastic rounding for backward and update passes and round-to-nearest for the forward pass to enhance stability. Additionally, we identify a theoretical and empirical threshold for effective quantized training: when the gradient norm falls below approximately sqrt(3) times the quantization noise, quantized training becomes less effective. Leveraging these insights, we successfully train a 7-billion-parameter model on 256 Intel Gaudi2 accelerators. The resulting FP4-trained model achieves downstream task performance comparable to a standard BF16 baseline, confirming that FP4 training is a practical and highly efficient approach for large-scale LLM training. A reference implementation is supplied in https://github.com/Anonymous1252022/fp4-all-the-way .
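To make the block format concrete, here is a simulated quantize-dequantize sketch of a single 16-value block (assumptions: the E2M1 magnitude grid {0, 0.5, 1, 1.5, 2, 3, 4, 6}, a shared per-block scale, and a schematic rounding-mode switch; the E4M3 encoding of the scale and the actual kernels are not modeled).

```python
import torch

# Representable magnitudes of FP4 E2M1 (sign handled separately).
E2M1_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])


def quantize_nvfp4_block(x: torch.Tensor, stochastic: bool = False) -> torch.Tensor:
    """Simulated quantize-dequantize of one 16-element block: a shared scale maps
    the block's max magnitude onto the largest E2M1 value, then each element is
    rounded onto the E2M1 grid, stochastically (backward/update) or to nearest
    (forward)."""
    assert x.numel() == 16
    grid = E2M1_GRID.to(device=x.device, dtype=x.dtype)
    scale = x.abs().max() / grid[-1] + 1e-12          # shared block scale (E4M3 in hardware)
    mag = (x.abs() / scale).clamp(max=grid[-1].item())
    lo_idx = torch.searchsorted(grid, mag, right=True) - 1   # nearest grid point at or below
    lo = grid[lo_idx]
    hi = grid[(lo_idx + 1).clamp(max=len(grid) - 1)]
    frac = torch.where(hi > lo, (mag - lo) / (hi - lo), torch.zeros_like(mag))
    up = (torch.rand_like(frac) < frac) if stochastic else (frac >= 0.5)
    return torch.where(up, hi, lo) * x.sign() * scale
```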
Attempts at block-scaled FP4 (MXFP4/NVFP4-style) training are gradually appearing as well. For now a gap in loss does show up, though.
#quantization
On the Effect of Negative Gradient in Group Relative Deep Reinforcement Optimization
(Wenlong Deng, Yi Ren, Muchen Li, Danica J. Sutherland, Xiaoxiao Li, Christos Thrampoulidis)
Reinforcement learning (RL) has become popular in enhancing the reasoning capabilities of large language models (LLMs), with Group Relative Policy Optimization (GRPO) emerging as a widely used algorithm in recent systems. Despite GRPO's widespread adoption, we identify a previously unrecognized phenomenon we term Lazy Likelihood Displacement (LLD), wherein the likelihood of correct responses marginally increases or even decreases during training. This behavior mirrors a recently discovered misalignment issue in Direct Preference Optimization (DPO), attributed to the influence of negative gradients. We provide a theoretical analysis of GRPO's learning dynamic, identifying the source of LLD as the naive penalization of all tokens in incorrect responses with the same strength. To address this, we develop a method called NTHR, which downweights penalties on tokens contributing to the LLD. Unlike prior DPO-based approaches, NTHR takes advantage of GRPO's group-based structure, using correct responses as anchors to identify influential tokens. Experiments on math reasoning benchmarks demonstrate that NTHR effectively mitigates LLD, yielding consistent performance gains across models ranging from 0.5B to 3B parameters.
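A rough sketch of the kind of token-level reweighting this points toward (the shared-token heuristic below is only a placeholder of my own; NTHR's actual criterion identifies influential tokens using the correct responses as anchors and is more involved).

```python
import torch


def nthr_style_loss(logprobs, advantages, tokens, correct_token_ids, downweight=0.2):
    """GRPO-style policy-gradient loss where tokens of an incorrect response that
    also occur in the group's correct responses receive a reduced penalty.
    logprobs: list of [T_i] per-token log-probs; advantages: [G] group-normalized
    rewards; tokens: list of [T_i] token-id tensors; correct_token_ids: set of ids
    seen in the group's correct responses (the anchors)."""
    losses = []
    for lp, adv, toks in zip(logprobs, advantages, tokens):
        weights = torch.ones_like(lp)
        if adv < 0:  # incorrect response: soften the penalty on anchor-shared tokens
            shared = torch.tensor([int(t) in correct_token_ids for t in toks],
                                  device=lp.device)
            weights = torch.where(shared, torch.full_like(lp, downweight), weights)
        losses.append(-(weights * adv * lp).mean())
    return torch.stack(losses).mean()
```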
An analysis of why the likelihood of correct sequences can also drop under GRPO. The problem is that every token in an incorrect response is penalized with the same strength, so, the authors argue, the reward should be assigned at the token level.
#rl #reasoning