February 10, 2025
QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation
(Yue Zhao, Fuzhao Xue, Scott Reed, Linxi Fan, Yuke Zhu, Jan Kautz, Zhiding Yu, Philipp Krähenbühl, De-An Huang)
We introduce Quantized Language-Image Pretraining (QLIP), a visual tokenization method that combines state-of-the-art reconstruction quality with state-of-the-art zero-shot image understanding. QLIP trains a binary-spherical-quantization-based autoencoder with reconstruction and language-image alignment objectives. We are the first to show that the two objectives do not need to be at odds. We balance the two loss terms dynamically during training and show that a two-stage training pipeline effectively mixes the large-batch requirements of image-language pre-training with the memory bottleneck imposed by the reconstruction objective. We validate the effectiveness of QLIP for multimodal understanding and text-conditioned image generation with a single model. Specifically, QLIP serves as a drop-in replacement for the visual encoder for LLaVA and the image tokenizer for LlamaGen with comparable or even better performance. Finally, we demonstrate that QLIP enables a unified mixed-modality auto-regressive model for understanding and generation.
The authors train a BSQ-based image tokenizer (https://arxiv.org/abs/2406.07548) with a CLIP objective and a reconstruction loss, then use this tokenizer to train an autoregressive image-text model.
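A minimal sketch of how the two objectives might be combined during training. The paper's dynamic loss balancing is abstracted into externally supplied weights, and the function name, MSE reconstruction term, and InfoNCE formulation are illustrative assumptions, not the paper's exact recipe:

```python
import torch
import torch.nn.functional as F

def qlip_style_loss(recon, target, img_emb, txt_emb,
                    w_recon, w_align, temperature=0.07):
    """Hypothetical combination of QLIP's two objectives: pixel
    reconstruction plus CLIP-style contrastive alignment. The paper's
    dynamic balancing is abstracted into w_recon and w_align."""
    # Reconstruction term on the decoded image.
    loss_recon = F.mse_loss(recon, target)

    # Symmetric InfoNCE between image and text embeddings.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    loss_align = (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.t(), labels)) / 2

    return w_recon * loss_recon + w_align * loss_align
```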
#autoregressive-model #clip #vq
Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
(Jonas Geiping, Sean McLeish, Neel Jain, John Kirchenbauer, Siddharth Singh, Brian R. Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, Tom Goldstein)
We study a novel language model architecture that is capable of scaling test-time computation by implicitly reasoning in latent space. Our model works by iterating a recurrent block, thereby unrolling to arbitrary depth at test-time. This stands in contrast to mainstream reasoning models that scale up compute by producing more tokens. Unlike approaches based on chain-of-thought, our approach does not require any specialized training data, can work with small context windows, and can capture types of reasoning that are not easily represented in words. We scale a proof-of-concept model to 3.5 billion parameters and 800 billion tokens. We show that the resulting model can improve its performance on reasoning benchmarks, sometimes dramatically, up to a computation load equivalent to 50 billion parameters.
The Universal Transformer has resurfaced, this time tied to inference-time scaling.
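A minimal sketch of the recurrent-depth idea: a prelude encodes the input, a single weight-tied block is iterated a variable number of times, and a coda maps the final latent to logits. The module shapes and the way the input is re-injected each step are assumptions for illustration, not the paper's exact design:

```python
import torch
import torch.nn as nn

class RecurrentDepthLM(nn.Module):
    """Sketch of a recurrent-depth language model: test-time compute
    scales with the number of iterations of one shared block."""
    def __init__(self, vocab_size, d_model=512, n_heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.prelude = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.core = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.coda = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, num_iterations):
        h = self.prelude(self.embed(tokens))
        s = torch.randn_like(h)  # latent state, randomly initialized
        for _ in range(num_iterations):
            # Re-inject the input encoding at every step; more iterations
            # mean more test-time compute with no additional parameters.
            s = self.core(s + h)
        return self.coda(s)
```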
#transformer #inference-time-scaling
Goku: Flow Based Video Generative Foundation Models
(Shoufa Chen, Chongjian Ge, Yuqi Zhang, Yida Zhang, Fengda Zhu, Hao Yang, Hongxiang Hao, Hui Wu, Zhichao Lai, Yifei Hu, Ting-Che Lin, Shilong Zhang, Fu Li, Chuan Li, Xing Wang, Yanghua Peng, Peize Sun, Ping Luo, Yi Jiang, Zehuan Yuan, Bingyue Peng, Xiaobing Liu)
This paper introduces Goku, a state-of-the-art family of joint image-and-video generation models leveraging rectified flow Transformers to achieve industry-leading performance. We detail the foundational elements enabling high-quality visual generation, including the data curation pipeline, model architecture design, flow formulation, and advanced infrastructure for efficient and robust large-scale training. The Goku models demonstrate superior performance in both qualitative and quantitative evaluations, setting new benchmarks across major tasks. Specifically, Goku achieves 0.76 on GenEval and 83.65 on DPG-Bench for text-to-image generation, and 84.85 on VBench for text-to-video tasks. We believe that this work provides valuable insights and practical advancements for the research community in developing joint image-and-video generation models.
A video generation model based on rectified flow.
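For reference, a generic rectified-flow training objective (a sketch of the formulation the Goku family builds on, not the paper's exact implementation; the model signature and conditioning argument are assumptions):

```python
import torch

def rectified_flow_loss(model, x1, cond):
    """Generic rectified-flow objective: sample t, linearly interpolate
    between noise and data, and regress the model output onto the
    constant velocity (x1 - noise)."""
    noise = torch.randn_like(x1)
    t = torch.rand(x1.size(0), device=x1.device)
    # Broadcast t over all non-batch dims (works for images or videos).
    t_b = t.view(-1, *([1] * (x1.dim() - 1)))
    x_t = (1 - t_b) * noise + t_b * x1  # straight-line path from noise to data
    velocity_target = x1 - noise
    return torch.mean((model(x_t, t, cond) - velocity_target) ** 2)
```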
#video-generation #diffusion
Training Language Models to Reason Efficiently
(Daman Arora, Andrea Zanette)
Scaling model size and training data has led to great advances in the performance of Large Language Models (LLMs). However, the diminishing returns of this approach necessitate alternative methods to improve model capabilities, particularly in tasks requiring advanced reasoning. Large reasoning models, which leverage long chain-of-thoughts, bring unprecedented breakthroughs in problem-solving capabilities but at a substantial deployment cost associated with longer generations. Reducing inference costs is crucial for the economic feasibility, user experience, and environmental sustainability of these models. In this work, we propose to train large reasoning models to reason efficiently. More precisely, we use reinforcement learning (RL) to train reasoning models to dynamically allocate inference-time compute based on task complexity. Our method incentivizes models to minimize unnecessary computational overhead while maintaining accuracy, thereby achieving substantial efficiency gains. It enables the derivation of a family of reasoning models with varying efficiency levels, controlled via a single hyperparameter. Experiments on two open-weight large reasoning models demonstrate significant reductions in inference cost while preserving most of the accuracy.
A length penalty for RL-trained CoT. It is similar to the one used in Kimi k1.5 (https://arxiv.org/abs/2501.12599).
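A sketch of what such a length-penalized reward can look like; the exact shaping, normalization, and the value of alpha here are assumptions rather than the paper's formulation:

```python
def length_penalized_reward(is_correct, num_tokens, max_tokens, alpha=0.2):
    """Illustrative length-penalized RL reward in the spirit of this
    paper and Kimi k1.5: reward correctness, discount long generations."""
    if not is_correct:
        return 0.0  # no length bonus for incorrect answers
    # Correct answers earn more reward when they use fewer tokens;
    # alpha trades off accuracy against generation length.
    return 1.0 - alpha * (num_tokens / max_tokens)
```

Sweeping a single knob like alpha is what yields the paper's family of models at different efficiency levels.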
#reasoning #rl
Unveiling the Mechanisms of Explicit CoT Training: How Chain-of-Thought Enhances Reasoning Generalization
(Xinhao Yao, Ruifeng Ren, Yun Liao, Yong Liu)
Training large language models (LLMs) with high-quality Chain-of-Thought (CoT) annotations has become a widely adopted strategy due to its significant enhancement of reasoning capabilities. To fully comprehend this approach, two questions naturally arise: (Q1) What advantages does training with CoT offer compared to training without CoT? (Q2) If there are advantages, what are the underlying mechanisms of explicit CoT training? Analyzing the advantages and mechanisms of CoT training is challenging due to the many factors involved. To address this, we conduct a detailed analysis using clear and controllable data distributions and, for the first time, reveal that CoT training offers the following advantages: (1) Training with CoT markedly improves reasoning generalization, extending it from in-distribution (ID) to both ID and out-of-distribution (OOD) scenarios, while also speeding up convergence; (2) Even when training with CoT includes a certain range of erroneous reasoning steps, it still enables the model to learn reasoning patterns, leading to systematic generalization. We further explore the underlying mechanisms from a circuit perspective: (1) The data distribution (e.g., ratio λ and pattern) plays a crucial role in influencing the model's systematic generalization; (2) CoT training (with two-hop facts) internalizes reasoning into a two-stage generalizing circuit, where the number of stages corresponds to the explicit reasoning steps during training. Our findings elucidate the mechanisms underlying explicit CoT training and offer critical insights into tuning strategies for LLMs to achieve robust generalization.
An analysis of CoT generalization using knowledge triples. Trained with CoT on two-hop facts, the model generalizes to out-of-distribution compositions; three-hop generalization does not emerge on its own, but the authors suggest it can be induced with a relatively small amount of three-hop data.
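An illustrative construction of two-hop CoT training data from knowledge triples, mimicking the kind of controlled distribution the paper studies (the question/answer formats are hypothetical):

```python
from collections import defaultdict

def make_two_hop_cot_data(triples):
    """Build two-hop examples from (head, relation, tail) triples.
    With CoT, the intermediate entity b is spelled out explicitly;
    without CoT, only the final answer c is supervised."""
    by_head = defaultdict(list)
    for h, r, t in triples:
        by_head[h].append((r, t))
    examples = []
    for a, edges in by_head.items():
        for r1, b in edges:
            for r2, c in by_head.get(b, []):
                question = f"{a} {r1} {r2} ?"
                cot_target = f"{a} {r1} {b} ; {b} {r2} {c} ; answer: {c}"
                examples.append((question, cot_target, c))
    return examples
```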
#reasoning
Joint MoE Scaling Laws: Mixture of Experts Can Be Memory Efficient
(Jan Ludziejewski, Maciej Pióro, Jakub Krajewski, Maciej Stefaniak, Michał Krutul, Jan Małaśnicki, Marek Cygan, Piotr Sankowski, Kamil Adamczewski, Piotr Miłoś, Sebastian Jaszczur)
Mixture of Experts (MoE) architectures have significantly increased computational efficiency in both research and real-world applications of large-scale machine learning models. However, their scalability and efficiency under memory constraints remain relatively underexplored. In this work, we present joint scaling laws for dense and MoE models, incorporating key factors such as the number of active parameters, dataset size, and the number of experts. Our findings provide a principled framework for selecting the optimal MoE configuration under fixed memory and compute budgets. Surprisingly, we show that MoE models can be more memory-efficient than dense models, contradicting conventional wisdom. To derive and validate the theoretical predictions of our scaling laws, we conduct over 280 experiments with up to 2.7B active parameters and up to 5B total parameters. These results offer actionable insights for designing and deploying MoE models in practical large-scale training scenarios.
A scaling law for MoE. It is interesting to compare this with the earlier MoE scaling law (https://arxiv.org/abs/2410.05661). This paper focuses on the optimal number of activated weights and training tokens as the number of experts increases under a fixed compute budget. It again finds that MoE has a larger exponent for the optimal activated weights with respect to training compute.
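A hypothetical functional form for a joint dense/MoE scaling law, to make the setup concrete: a Chinchilla-style power law in active parameters and tokens with a multiplicative gain from the number of experts. All coefficients and exponents below are placeholders, not the paper's fitted values:

```python
def joint_moe_scaling_loss(n_active, n_tokens, n_experts,
                           c=1.0, A=100.0, B=100.0,
                           alpha=0.3, beta=0.3, gamma=0.1):
    """Placeholder joint scaling law: n_experts=1 recovers the dense
    case; larger n_experts shrinks the parameter-limited loss term."""
    expert_gain = n_experts ** (-gamma)  # more experts -> lower loss term
    return c + A * expert_gain / n_active ** alpha + B / n_tokens ** beta
```

Fitting such a form across dense and MoE runs is what lets one pick an optimal expert count under fixed memory and compute budgets, as the paper does with its 280+ experiments.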
#scaling-law #moe