2024년 12월 5일
Free Process Rewards without Process Labels
(Lifan Yuan, Wendi Li, Huayu Chen, Ganqu Cui, Ning Ding, Kaiyan Zhang, Bowen Zhou, Zhiyuan Liu, Hao Peng)
Different from its counterpart outcome reward models (ORMs), which evaluate the entire responses, a process reward model (PRM) scores a reasoning trajectory step by step, providing denser and more fine-grained rewards. However, training a PRM requires labels annotated at every intermediate step, presenting significant challenges for both manual and automatic data collection. This paper aims to address this challenge. Both theoretically and empirically, we show that an implicit PRM can be obtained at no additional cost, by simply training an ORM on the cheaper response-level labels. The only assumption is to parameterize the outcome reward as the log-likelihood ratios of the policy and reference models, which can be optimized regardless of the specific choice of loss objectives. In experiments, we instantiate our implicit PRMs with various objectives and evaluate their performance on MATH. We show that our implicit PRM outperforms a strong MCTS-based baseline à la Math-Shepherd using less than 1/38 of the training data. Its performance can be further improved with majority voting. We further find that scaling up instructions and responses benefits our implicit PRM, and the latter brings a larger gain. Particularly, we find that our implicit PRM, when instantiated with the cross-entropy (CE) loss, is more data-efficient and can keep improving generation models even when trained with only one response per instruction, the setup that suffers from extreme data scarcity and imbalance. Further, instructions should be relevant to downstream tasks while the diversity of responses does not bring gains. Surprisingly, training on extra Math-Shepherd step labels brings no further improvements to our implicit PRM trained on only outcome data. We hope that our work will encourage a rethinking of PRM training approaches and contribute to making training PRMs more accessible.
A study showing that process rewards can be obtained from an outcome reward model that parameterizes the reward as the log-likelihood ratio between the policy and reference models. It can be seen as an extension of earlier results on the relationship between DPO and Q-functions and on token-level credit assignment. (https://arxiv.org/abs/2404.12358)
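To make the parameterization concrete, here is a minimal sketch (my own, not the authors' code) of reading per-step rewards off per-token log-probabilities. It assumes two Hugging Face-style causal LMs and known step boundaries; `beta`, `step_ends`, and the helper names are illustrative.

```python
# Minimal sketch of reading process rewards out of an outcome-reward parameterization
# r(y) = beta * log(pi(y|x) / pi_ref(y|x)); the model interface and the step-boundary
# convention below are illustrative assumptions, not the paper's released implementation.
import torch
import torch.nn.functional as F

def token_logprobs(model, input_ids):
    """Per-token log-probabilities of a causal LM over the given sequence."""
    with torch.no_grad():
        logits = model(input_ids).logits[:, :-1]  # position t predicts token t+1
    logp = F.log_softmax(logits, dim=-1)
    return logp.gather(-1, input_ids[:, 1:].unsqueeze(-1)).squeeze(-1)  # (batch, T-1)

def implicit_process_rewards(policy, ref, input_ids, step_ends, beta=1.0):
    """Credit for step k = beta * (cumulative log-ratio at step k's last token
    minus the cumulative log-ratio at the previous step boundary)."""
    ratio = token_logprobs(policy, input_ids) - token_logprobs(ref, input_ids)
    cumulative = beta * ratio.cumsum(dim=-1)        # running implicit outcome reward
    boundary_values = cumulative[:, step_ends]      # value at each step boundary
    prev = F.pad(boundary_values[:, :-1], (1, 0))   # first step is baselined at 0
    return boundary_values - prev                   # per-step process rewards
```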
#reward-model
Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset
(Dan Su, Kezhi Kong, Ying Lin, Joseph Jennings, Brandon Norick, Markus Kliegl, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro)
Recent English Common Crawl datasets like FineWeb-Edu and DCLM achieved significant benchmark gains via aggressive model-based filtering, but at the cost of removing 90% of data. This limits their suitability for long token horizon training, such as 15T tokens for Llama 3.1. In this paper, we show how to achieve better trade-offs between accuracy and data quantity by a combination of classifier ensembling, synthetic data rephrasing, and reduced reliance on heuristic filters. When training 8B parameter models for 1T tokens, using a high-quality subset of our data improves MMLU by 5.6 over DCLM, demonstrating the efficacy of our methods for boosting accuracies over a relatively short token horizon. Furthermore, our full 6.3T token dataset matches DCLM on MMLU, but contains four times more unique real tokens than DCLM. This unlocks state-of-the-art training over a long token horizon: an 8B parameter model trained for 15T tokens, of which 7.2T came from our dataset, is better than the Llama 3.1 8B model: +5 on MMLU, +3.1 on ARC-Challenge, and +0.5 on average across ten diverse tasks. The dataset is available at https://data.commoncrawl.org/contrib/Nemotron/Nemotron-CC/index.html
NVIDIA's work on building a pretraining corpus. They address the tendency of extractors like Trafilatura to cut down the amount of extracted text, as well as the over-aggressive filtering done by heuristic filters. Model-based quality filtering keeps gaining momentum.
They also incorporate synthetic data and use global deduplication. The results confirm several of the important issues around pretraining corpora.
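As a rough illustration of the classifier-ensembling idea in the abstract, here is a minimal sketch under my own assumptions: several quality classifiers each score a document in [0, 1], and a document is kept if any of them rates it above a threshold. The max-combination and threshold are illustrative, not the paper's exact recipe.

```python
# Illustrative ensembling of quality-classifier scores: a document is kept if any
# classifier in the ensemble rates it highly, which discards less data than relying
# on a single aggressive filter. The threshold and max-combination are assumptions.
from typing import Callable, Iterable, List

def ensemble_filter(
    docs: Iterable[str],
    classifiers: List[Callable[[str], float]],  # each returns a quality score in [0, 1]
    keep_threshold: float = 0.5,
) -> List[str]:
    kept = []
    for doc in docs:
        score = max(clf(doc) for clf in classifiers)  # optimistic combination
        if score >= keep_threshold:
            kept.append(doc)
    return kept
```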
#corpus #pretraining
Taming Scalable Visual Tokenizer for Autoregressive Image Generation
(Fengyuan Shi, Zhuoyan Luo, Yixiao Ge, Yujiu Yang, Ying Shan, Limin Wang)
Existing vector quantization (VQ) methods struggle with scalability, largely attributed to the instability of the codebook that undergoes partial updates during training. The codebook is prone to collapse as utilization decreases, due to the progressively widening distribution gap between non-activated codes and visual features. To solve the problem, we propose Index Backpropagation Quantization (IBQ), a new VQ method for the joint optimization of all codebook embeddings and the visual encoder. Applying a straight-through estimator on the one-hot categorical distribution between the encoded feature and codebook, all codes are differentiable and maintain a consistent latent space with the visual encoder. IBQ enables scalable training of visual tokenizers and, for the first time, achieves a large-scale codebook (2^18) with high dimension (256) and high utilization. Experiments on the standard ImageNet benchmark demonstrate the scalability and superiority of IBQ, achieving competitive results on both reconstruction (1.00 rFID) and autoregressive visual generation (2.05 gFID). The code and models are available at https://github.com/TencentARC/SEED-Voken.
This addresses the problem in VQ that only the selected codes are updated, so the gap between the non-updated codes and the encoder outputs keeps widening. The concern is similar to that of a recent study (https://arxiv.org/abs/2411.02038). Here they use a method based on the gradient of a softmax over the codebook.
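As I read it, the core trick looks roughly like the sketch below: logits over the codebook, a softmax for gradients, a hard one-hot for the forward pass, and a straight-through estimator so every codebook embedding receives gradient. The shapes and the dot-product similarity are my assumptions, not the released IBQ code.

```python
# Sketch of the index-backpropagation idea: straight-through on the one-hot categorical
# distribution over codes, so gradients reach the whole codebook, not just the winner.
import torch
import torch.nn as nn
import torch.nn.functional as F

class IndexBackpropQuantizer(nn.Module):
    def __init__(self, num_codes: int = 2 ** 14, dim: int = 256):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z: torch.Tensor):
        # z: (batch, tokens, dim) encoder features
        logits = z @ self.codebook.weight.t()             # similarity to every code
        soft = F.softmax(logits, dim=-1)                  # gradients flow to all codes
        hard = F.one_hot(soft.argmax(-1), soft.size(-1)).type_as(soft)
        one_hot = hard + soft - soft.detach()             # straight-through estimator
        quantized = one_hot @ self.codebook.weight        # (batch, tokens, dim)
        return quantized, one_hot.argmax(-1)
```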
#vq
Mind the Gap: Examining the Self-Improvement Capabilities of Large Language Models
(Yuda Song, Hanlin Zhang, Carson Eisenach, Sham Kakade, Dean Foster, Udaya Ghai)
Self-improvement is a mechanism in Large Language Model (LLM) pre-training, post-training and test-time inference. We explore a framework where the model verifies its own outputs, filters or reweights data based on this verification, and distills the filtered data. Despite several empirical successes, a fundamental understanding is still lacking. In this work, we initiate a comprehensive, modular and controlled study on LLM self-improvement. We provide a mathematical formulation for self-improvement, which is largely governed by a quantity which we formalize as the generation-verification gap. Through experiments with various model families and tasks, we discover a scaling phenomenon of self-improvement -- a variant of the generation-verification gap scales monotonically with the model pre-training flops. We also examine when self-improvement is possible, an iterative self-improvement procedure, and ways to improve its performance. Our findings not only advance understanding of LLM self-improvement with practical implications, but also open numerous avenues for future research into its capabilities and boundaries.
Self-improvement is possible because there is a gap between generation and verification capabilities. Here the authors find that this generation-verification gap grows with pre-training FLOPs, and that the gap and diversity shrink rapidly over the course of self-improvement.
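Loosely, the gap can be pictured as the accuracy gained by letting the model verify its own samples, as in the sketch below; the best-of-N formulation and the callables `generate`, `verify_score`, and `is_correct` are my assumptions, not the paper's formal definition.

```python
# Loose illustration of a generation-verification gap: accuracy of verifier-selected
# samples minus plain generation accuracy. The callables are hypothetical stand-ins.
from statistics import mean
from typing import Callable, List

def generation_verification_gap(
    prompts: List[str],
    generate: Callable[[str, int], List[str]],   # returns n candidate answers
    verify_score: Callable[[str, str], float],   # the model's own score for an answer
    is_correct: Callable[[str, str], bool],      # ground-truth checker
    n: int = 8,
) -> float:
    gen_acc, ver_acc = [], []
    for p in prompts:
        candidates = generate(p, n)
        gen_acc.append(mean(is_correct(p, c) for c in candidates))
        best = max(candidates, key=lambda c: verify_score(p, c))
        ver_acc.append(float(is_correct(p, best)))
    return mean(ver_acc) - mean(gen_acc)
```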
#self-improvement #search
Navigation World Models
(Amir Bar, Gaoyue Zhou, Danny Tran, Trevor Darrell, Yann LeCun)
Navigation is a fundamental skill of agents with visual-motor capabilities. We introduce a Navigation World Model (NWM), a controllable video generation model that predicts future visual observations based on past observations and navigation actions. To capture complex environment dynamics, NWM employs a Conditional Diffusion Transformer (CDiT), trained on a diverse collection of egocentric videos of both human and robotic agents, and scaled up to 1 billion parameters. In familiar environments, NWM can plan navigation trajectories by simulating them and evaluating whether they achieve the desired goal. Unlike supervised navigation policies with fixed behavior, NWM can dynamically incorporate constraints during planning. Experiments demonstrate its effectiveness in planning trajectories from scratch or by ranking trajectories sampled from an external policy. Furthermore, NWM leverages its learned visual priors to imagine trajectories in unfamiliar environments from a single input image, making it a flexible and powerful tool for next-generation navigation systems.
Following Google (https://deepmind.google/discover/blog/genie-2-a-large-scale-foundation-world-model/), Meta has also built an image generation model that takes actions as input. To really count as a world model, it has to support interaction.
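The planning-by-simulation loop described in the abstract looks roughly like this sketch; `world_model`, `sample_actions`, and `goal_similarity` are hypothetical stand-ins rather than the NWM API.

```python
# Rough sketch of planning by simulation with an action-conditioned world model:
# sample candidate action sequences, roll each out, and keep the one whose predicted
# final observation best matches the goal.
from typing import Callable, Sequence, TypeVar

Obs = TypeVar("Obs")
Action = TypeVar("Action")

def plan_by_simulation(
    start: Obs,
    goal: Obs,
    world_model: Callable[[Obs, Action], Obs],          # predicts the next observation
    sample_actions: Callable[[int], Sequence[Action]],  # samples a candidate action sequence
    goal_similarity: Callable[[Obs, Obs], float],
    num_candidates: int = 64,
    horizon: int = 16,
) -> Sequence[Action]:
    best_score, best_plan = float("-inf"), None
    for _ in range(num_candidates):
        actions = sample_actions(horizon)
        obs = start
        for a in actions:                                # simulate the trajectory
            obs = world_model(obs, a)
        score = goal_similarity(obs, goal)
        if score > best_score:
            best_score, best_plan = score, actions
    return best_plan
```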
#world-models
The Matrix: Infinite-Horizon World Generation with Real-Time Moving Control
(Ruili Feng, Han Zhang, Zhantao Yang, Jie Xiao, Zhilei Shu, Zhiheng Liu, Andy Zheng, Yukun Huang, Yu Liu, Hongyang Zhang)
We present The Matrix, the first foundational realistic world simulator capable of generating continuous 720p high-fidelity real-scene video streams with real-time, responsive control in both first- and third-person perspectives, enabling immersive exploration of richly dynamic environments. Trained on limited supervised data from AAA games like Forza Horizon 5 and Cyberpunk 2077, complemented by large-scale unsupervised footage from real-world settings like Tokyo streets, The Matrix allows users to traverse diverse terrains -- deserts, grasslands, water bodies, and urban landscapes -- in continuous, uncut hour-long sequences. Operating at 16 FPS, the system supports real-time interactivity and demonstrates zero-shot generalization, translating virtual game environments to real-world contexts where collecting continuous movement data is often infeasible. For example, The Matrix can simulate a BMW X3 driving through an office setting--an environment present in neither gaming data nor real-world sources. This approach showcases the potential of AAA game data to advance robust world models, bridging the gap between simulations and real-world applications in scenarios with limited data.
Another world model targeting games. I get the feeling that games and RL are about to return as major topics in deep learning.
#world-models
Perception Tokens Enhance Visual Reasoning in Multimodal Language Models
(Mahtab Bigverdi, Zelun Luo, Cheng-Yu Hsieh, Ethan Shen, Dongping Chen, Linda G. Shapiro, Ranjay Krishna)
Multimodal language models (MLMs) still face challenges in fundamental visual perception tasks where specialized models excel. Tasks requiring reasoning about 3D structures benefit from depth estimation, and reasoning about 2D object instances benefits from object detection. Yet, MLMs can not produce intermediate depth or boxes to reason over. Finetuning MLMs on relevant data doesn't generalize well and outsourcing computation to specialized vision tools is too compute-intensive and memory-inefficient. To address this, we introduce Perception Tokens, intrinsic image representations designed to assist reasoning tasks where language is insufficient. Perception tokens act as auxiliary reasoning tokens, akin to chain-of-thought prompts in language models. For example, in a depth-related task, an MLM augmented with perception tokens can reason by generating a depth map as tokens, enabling it to solve the problem effectively. We propose AURORA, a training method that augments MLMs with perception tokens for improved reasoning over visual inputs. AURORA leverages a VQVAE to transform intermediate image representations, such as depth maps into a tokenized format and bounding box tokens, which is then used in a multi-task training framework. AURORA achieves notable improvements across counting benchmarks: +10.8% on BLINK, +11.3% on CVBench, and +8.3% on SEED-Bench, outperforming finetuning approaches in generalization across datasets. It also improves on relative depth: over +6% on BLINK. With perception tokens, AURORA expands the scope of MLMs beyond language-based reasoning, paving the way for more effective visual reasoning capabilities.
For certain image tasks, additional information such as a depth map can make the problem much easier to solve. This method has the model generate and predict over that information itself, much like chain of thought. If models are equipped with image generation capabilities, reasoning through images also seems like an interesting direction.
The current trend in chain of thought is not to supervise which text the model should generate in its thoughts; if chain of thought over images were trained in this form, I'm curious what kind of images the model would end up generating.
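Conceptually, the data construction might look something like the sketch below: quantize a depth map with a VQ-VAE and splice the resulting codes into the token sequence as perception tokens. The interfaces, special tokens, and vocabulary offset are my assumptions, not AURORA's actual implementation.

```python
# Rough sketch of the perception-token idea: quantize an intermediate image
# representation (e.g., a depth map) into discrete codes and insert them into the
# language sequence as auxiliary reasoning tokens before the answer.
import torch

def build_training_sequence(
    question_ids: torch.Tensor,       # (L_q,) text token ids
    answer_ids: torch.Tensor,         # (L_a,) text token ids
    depth_map: torch.Tensor,          # (1, 1, H, W) intermediate depth estimate
    vqvae_encoder,                    # maps an image to a grid of codebook indices
    text_vocab_size: int,
    depth_start_id: int,
    depth_end_id: int,
) -> torch.Tensor:
    # Quantize the depth map; offset indices so they live in a reserved vocab range.
    depth_codes = vqvae_encoder(depth_map).flatten() + text_vocab_size
    perception_span = torch.cat([
        torch.tensor([depth_start_id]),
        depth_codes,
        torch.tensor([depth_end_id]),
    ])
    # Question -> perception tokens (the "visual chain of thought") -> final answer.
    return torch.cat([question_ids, perception_span, answer_ids])
```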
#reasoning
PaliGemma 2: A Family of Versatile VLMs for Transfer
(Andreas Steiner, André Susano Pinto, Michael Tschannen, Daniel Keysers, Xiao Wang, Yonatan Bitton, Alexey Gritsenko, Matthias Minderer, Anthony Sherbondy, Shangbang Long, Siyang Qin, Reeve Ingle, Emanuele Bugliarello, Sahar Kazemzadeh, Thomas Mesnard, Ibrahim Alabdulmohsin, Lucas Beyer, Xiaohua Zhai)
PaliGemma 2 is an upgrade of the PaliGemma open Vision-Language Model (VLM) based on the Gemma 2 family of language models. We combine the SigLIP-So400m vision encoder that was also used by PaliGemma with the whole range of Gemma 2 models, from the 2B one all the way up to the 27B model. We train these models at three resolutions (224px, 448px, and 896px) in multiple stages to equip them with broad knowledge for transfer via fine-tuning. The resulting family of base models covering different model sizes and resolutions allows us to investigate factors impacting transfer performance (such as learning rate) and to analyze the interplay between the type of task, model size, and resolution. We further increase the number and breadth of transfer tasks beyond the scope of PaliGemma including different OCR-related tasks such as table structure recognition, molecular structure recognition, music score recognition, as well as long fine-grained captioning and radiography report generation, on which PaliGemma 2 obtains state-of-the-art results.
The Gemma 2 version of PaliGemma (https://arxiv.org/abs/2407.07726) is out, with backbones going all the way up to Gemma 2 27B.
#vision-language
RedStone: Curating General, Code, Math, and QA Data for Large Language Models
(Yaoyao Chang, Lei Cui, Li Dong, Shaohan Huang, Yangyu Huang, Yupan Huang, Scarlett Li, Tengchao Lv, Shuming Ma, Qinzheng Sun, Wenhui Wang, Furu Wei, Ying Xin, Mao Yang, Qiufeng Yin, Xingxing Zhang)
Pre-training Large Language Models (LLMs) on high-quality, meticulously curated datasets is widely recognized as critical for enhancing their performance and generalization capabilities. This study explores the untapped potential of Common Crawl as a comprehensive and flexible resource for pre-training LLMs, addressing both general-purpose language understanding and specialized domain knowledge. We introduce RedStone, an innovative and scalable pipeline engineered to extract and process data from Common Crawl, facilitating the creation of extensive and varied pre-training datasets. Unlike traditional datasets, which often require expensive curation and domain-specific expertise, RedStone leverages the breadth of Common Crawl to deliver datasets tailored to a wide array of domains. In this work, we exemplify its capability by constructing pre-training datasets across multiple fields, including general language understanding, code, mathematics, and question-answering tasks. The flexibility of RedStone allows for easy adaptation to other specialized domains, significantly lowering the barrier to creating valuable domain-specific datasets. Our findings demonstrate that Common Crawl, when harnessed through effective pipelines like RedStone, can serve as a rich, renewable source of pre-training data, unlocking new avenues for domain adaptation and knowledge discovery in LLMs. This work also underscores the importance of innovative data acquisition strategies and highlights the role of web-scale data as a powerful resource in the continued evolution of LLMs. RedStone code and data samples will be publicly available at https://aka.ms/redstone.
Microsoft's pretraining corpus. The Common Crawl processing follows the standard recipe, but it's a bit unusual that they ran it on both the WARC and WET formats. In addition to collecting math and code data, they also collected QA data separately.
#pretraining #corpus
TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation
(Liao Qu, Huichao Zhang, Yiheng Liu, Xu Wang, Yi Jiang, Yiming Gao, Hu Ye, Daniel K. Du, Zehuan Yuan, Xinglong Wu)
We present TokenFlow, a novel unified image tokenizer that bridges the long-standing gap between multimodal understanding and generation. Prior research attempts to employ a single reconstruction-targeted Vector Quantization (VQ) encoder for unifying these two tasks. We observe that understanding and generation require fundamentally different granularities of visual information. This leads to a critical trade-off, particularly compromising performance in multimodal understanding tasks. TokenFlow addresses this challenge through an innovative dual-codebook architecture that decouples semantic and pixel-level feature learning while maintaining their alignment via a shared mapping mechanism. This design enables direct access to both high-level semantic representations crucial for understanding tasks and fine-grained visual features essential for generation through shared indices. Our extensive experiments demonstrate TokenFlow's superiority across multiple dimensions. Leveraging TokenFlow, we demonstrate for the first time that discrete visual input can surpass LLaVA-1.5 13B in understanding performance, achieving a 7.2% average improvement. For image reconstruction, we achieve a strong FID score of 0.63 at 384×384 resolution. Moreover, TokenFlow establishes state-of-the-art performance in autoregressive image generation with a GenEval score of 0.55 at 256×256 resolution, achieving comparable results to SDXL.
An image tokenizer that targets both understanding and generation. It uses a semantic encoder and a pixel encoder, and selects the nearest code by a weighted sum of the distances between each encoder's features and the codes. They additionally attach a semantic decoder and use the distance to BEiT v2 features (https://arxiv.org/abs/2208.06366) as a loss.
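The shared-index lookup, as described, could look roughly like this sketch; the L2 distances, weights, and shapes are assumptions rather than TokenFlow's exact implementation.

```python
# Sketch of a dual-codebook lookup with a shared index: the index is chosen by a
# weighted sum of distances to the semantic and pixel codebooks, and both quantized
# features are read out at that same index.
import torch
import torch.nn as nn

class DualCodebookQuantizer(nn.Module):
    def __init__(self, num_codes: int = 32768, sem_dim: int = 768, pix_dim: int = 256,
                 w_sem: float = 1.0, w_pix: float = 1.0):
        super().__init__()
        self.sem_codes = nn.Embedding(num_codes, sem_dim)
        self.pix_codes = nn.Embedding(num_codes, pix_dim)
        self.w_sem, self.w_pix = w_sem, w_pix

    def forward(self, z_sem: torch.Tensor, z_pix: torch.Tensor):
        # z_sem: (tokens, sem_dim), z_pix: (tokens, pix_dim)
        d_sem = torch.cdist(z_sem, self.sem_codes.weight)   # (tokens, num_codes)
        d_pix = torch.cdist(z_pix, self.pix_codes.weight)
        index = (self.w_sem * d_sem + self.w_pix * d_pix).argmin(dim=-1)  # shared index
        return self.sem_codes(index), self.pix_codes(index), index
```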
#vq #image-generation #image-text