March 18, 2025
Unified Autoregressive Visual Generation and Understanding with Continuous Tokens
(Lijie Fan, Luming Tang, Siyang Qin, Tianhong Li, Xuan Yang, Siyuan Qiao, Andreas Steiner, Chen Sun, Yuanzhen Li, Tao Zhu, Michael Rubinstein, Michalis Raptis, Deqing Sun, Radu Soricut)
We present UniFluid, a unified autoregressive framework for joint visual generation and understanding leveraging continuous visual tokens. Our unified autoregressive architecture processes multimodal image and text inputs, generating discrete tokens for text and continuous tokens for image. We find though there is an inherent trade-off between the image generation and understanding task, a carefully tuned training recipe enables them to improve each other. By selecting an appropriate loss balance weight, the unified model achieves results comparable to or exceeding those of single-task baselines on both tasks. Furthermore, we demonstrate that employing stronger pre-trained LLMs and random-order generation during training is important to achieve high-fidelity image generation within this unified framework. Built upon the Gemma model series, UniFluid exhibits competitive performance across both image generation and understanding, demonstrating strong transferability to various downstream tasks, including image editing for generation, as well as visual captioning and question answering for understanding.
The model unifies image generation and understanding and, like Janus (https://arxiv.org/abs/2410.13848), uses separate image encoders for understanding and generation. A diffusion head is used for the generation side.
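Below is a minimal PyTorch sketch of that interface: one autoregressive backbone, a cross-entropy head over discrete text tokens, and a small diffusion head that denoises continuous image tokens conditioned on the backbone's hidden states. This is not the UniFluid implementation; the module names, dimensions, masking details, and the toy noise schedule are all assumptions.

```python
# Minimal sketch (not the released UniFluid code) of an autoregressive backbone that emits
# discrete logits for text positions and continuous tokens for image positions, with a small
# diffusion-style head on the image side. All names, sizes, and the noise schedule are illustrative.
import torch
import torch.nn as nn

class UniFluidSketch(nn.Module):
    def __init__(self, vocab_size=32000, d_model=512, image_token_dim=16):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)   # stand-in for the LLM backbone
        self.text_head = nn.Linear(d_model, vocab_size)              # discrete text tokens (cross-entropy)
        # Diffusion head: predicts the noise added to a continuous image token,
        # conditioned on the backbone's hidden state at that position.
        self.diff_head = nn.Sequential(
            nn.Linear(d_model + image_token_dim + 1, d_model),
            nn.GELU(),
            nn.Linear(d_model, image_token_dim),
        )

    def forward(self, seq_embeds, image_tokens):
        # seq_embeds: (B, L, d_model) already-embedded text+image inputs
        # image_tokens: (B, L_img, image_token_dim) continuous targets for image positions
        h = self.backbone(seq_embeds)
        text_logits = self.text_head(h)                              # text loss computed outside
        h_img = h[:, -image_tokens.shape[1]:]                        # hidden states at image positions
        t = torch.rand(image_tokens.shape[0], image_tokens.shape[1], 1)
        noise = torch.randn_like(image_tokens)
        noisy = (1 - t) * image_tokens + t * noise                   # toy interpolation "schedule"
        eps_pred = self.diff_head(torch.cat([h_img, noisy, t], dim=-1))
        diffusion_loss = ((eps_pred - noise) ** 2).mean()            # denoising loss on image tokens
        return text_logits, diffusion_loss
```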
#image-generation #image-understanding #autoregressive-model #diffusion
SuperBPE: Space Travel for Language Models
(Alisa Liu, Jonathan Hayase, Valentin Hofmann, Sewoong Oh, Noah A. Smith, Yejin Choi)
The assumption across nearly all language model (LM) tokenization schemes is that tokens should be subwords, i.e., contained within word boundaries. While providing a seemingly reasonable inductive bias, is this common practice limiting the potential of modern LMs? Whitespace is not a reliable delimiter of meaning, as evidenced by multi-word expressions (e.g., "by the way"), crosslingual variation in the number of words needed to express a concept (e.g., "spacesuit helmet" in German is "raumanzughelm"), and languages that do not use whitespace at all (e.g., Chinese). To explore the potential of tokenization beyond subwords, we introduce a "superword" tokenizer, SuperBPE, which incorporates a simple pretokenization curriculum into the byte-pair encoding (BPE) algorithm to first learn subwords, then superwords that bridge whitespace. This brings dramatic improvements in encoding efficiency: when fixing the vocabulary size to 200k, SuperBPE encodes a fixed piece of text with up to 33% fewer tokens than BPE on average. In experiments, we pretrain 8B transformer LMs from scratch while fixing the model size, vocabulary size, and train compute, varying only the algorithm for learning the vocabulary. Our model trained with SuperBPE achieves an average +4.0% absolute improvement over the BPE baseline across 30 downstream tasks (including +8.2% on MMLU), while simultaneously requiring 27% less compute at inference time. In analysis, we find that SuperBPE results in segmentations of text that are more uniform in per-token difficulty. Qualitatively, this may be because SuperBPE tokens often capture common multi-word expressions that function semantically as a single unit. SuperBPE is a straightforward, local modification to tokenization that improves both encoding efficiency and downstream performance, yielding better language models overall.
The authors build a superword tokenizer by adding a second stage to BPE training in which pretokenization is skipped, so merges can cross whitespace.
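As a toy illustration of the two-stage idea (not the paper's actual curriculum, vocabulary sizes, or byte-level details), the sketch below first runs BPE merges under whitespace pretokenization to learn subwords, then continues merging over whole lines so tokens can bridge spaces:

```python
# Toy two-stage BPE: stage 1 with whitespace pretokenization (subwords), stage 2 on whole
# lines so merges can cross spaces (superwords). Corpus and merge counts are made up.
from collections import Counter

def most_frequent_pair(sequences):
    pairs = Counter()
    for seq in sequences:
        pairs.update(zip(seq, seq[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def apply_merge(seq, pair):
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1]); i += 2
        else:
            out.append(seq[i]); i += 1
    return out

def train_bpe(sequences, num_merges, merges=None):
    merges = list(merges or [])
    for _ in range(num_merges):
        pair = most_frequent_pair(sequences)
        if pair is None:
            break
        merges.append(pair)
        sequences = [apply_merge(s, pair) for s in sequences]
    return sequences, merges

corpus = ["by the way it works", "by the way it is fine", "on the way home"]

# Stage 1 (standard BPE): pretokenize on whitespace, so merges never cross word boundaries.
word_seqs = [list(w) for text in corpus for w in text.split()]
_, merges = train_bpe(word_seqs, num_merges=15)

# Stage 2 (SuperBPE-style): re-segment whole lines (space kept as a character), replay the
# stage-1 merges, then keep merging so frequent multi-word expressions can fuse into one token.
line_seqs = [list(text) for text in corpus]
for pair in merges:
    line_seqs = [apply_merge(s, pair) for s in line_seqs]
_, merges = train_bpe(line_seqs, num_merges=15, merges=merges)
print(merges[-5:])  # later merges typically glue whole phrases like "by the way" together
```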
#tokenizer
A Multi-Power Law for Loss Curve Prediction Across Learning Rate Schedules
(Kairong Luo, Haodong Wen, Shengding Hu, Zhenbo Sun, Zhiyuan Liu, Maosong Sun, Kaifeng Lyu, Wenguang Chen)
Training large models is both resource-intensive and time-consuming, making it crucial to understand the quantitative relationship between model performance and hyperparameters. In this paper, we present an empirical law that describes how the pretraining loss of large language models evolves under different learning rate schedules, such as constant, cosine, and step decay schedules. Our proposed law takes a multi-power form, combining a power law based on the sum of learning rates and additional power laws to account for a loss reduction effect induced by learning rate decay. We extensively validate this law on various model sizes and architectures, and demonstrate that after fitting on a few learning rate schedules, the law accurately predicts the loss curves for unseen schedules of different shapes and horizons. Moreover, by minimizing the predicted final pretraining loss across learning rate schedules, we are able to find a schedule that outperforms the widely used cosine learning rate schedule. Interestingly, this automatically discovered schedule bears some resemblance to the recently proposed Warmup-Stable-Decay (WSD) schedule (Hu et al, 2024) but achieves a slightly lower final loss. We believe these results could offer valuable insights for understanding the dynamics of pretraining and designing learning rate schedules to improve efficiency.
A scaling law for learning-rate schedules, claimed to be more accurate than a previously proposed one (https://arxiv.org/abs/2408.11029). Using it to estimate an optimal schedule yields something similar in shape to WSD.
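As a rough illustration of the kind of functional form the abstract describes, the snippet below scores two schedules with a simplified multi-power expression: a power law in the cumulative learning-rate sum minus an extra power-law reduction tied to how far the LR has decayed. The parameterization and all constants are my own placeholders, not the paper's fitted law.

```python
# Hedged sketch of the general shape: loss = power law in the cumulative LR sum, minus an
# additional power-law term unlocked by LR decay. Constants below are illustrative only.
import numpy as np

def predicted_loss(lrs, L0=2.0, A=0.5, alpha=0.4, B=300.0, beta=0.4):
    lrs = np.asarray(lrs, dtype=float)
    S = np.cumsum(lrs)                          # cumulative LR sum up to each step
    base = L0 + A * S ** (-alpha)               # power law in the "area under the schedule"
    decay = np.maximum(lrs[0] - lrs, 0.0)       # how far the LR has decayed from its peak
    reduction = B * decay * (1.0 - (1.0 + S) ** (-beta))  # extra loss drop induced by decay
    return base - reduction

steps = np.arange(1, 10001)
cosine = 3e-4 * 0.5 * (1 + np.cos(np.pi * steps / steps[-1]))
wsd = np.where(steps < 8000, 3e-4, 3e-4 * (steps[-1] - steps) / 2000)
print(predicted_loss(cosine)[-1], predicted_loss(wsd)[-1])  # compare predicted final losses
```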
#scaling-law #hyperparameter
Training Video Foundation Models with NVIDIA NeMo
(Zeeshan Patel, Ethan He, Parth Mannan, Xiaowei Ren, Ryan Wolf, Niket Agarwal, Jacob Huffman, Zhuoyao Wang, Carl Wang, Jack Chang, Yan Bai, Tommy Huang, Linnan Wang, Sahil Jain, Shanmugam Ramasamy, Joseph Jennings, Ekaterina Sirazitdinova, Oleg Sudakov, Mingyuan Ma, Bobby Chen, Forrest Lin, Hao Wang, Vasanth Rao Naik Sabavat, Sriharsha Niverty, Rong Ou, Pallab Bhattacharya, David Page, Nima Tajbakhsh, Ashwath Aithal)
Video Foundation Models (VFMs) have recently been used to simulate the real world to train physical AI systems and develop creative visual experiences. However, there are significant challenges in training large-scale, high quality VFMs that can generate high-quality videos. We present a scalable, open-source VFM training pipeline with NVIDIA NeMo, providing accelerated video dataset curation, multimodal data loading, and parallelized video diffusion model training and inference. We also provide a comprehensive performance analysis highlighting best practices for efficient VFM training and inference.
An introduction to the NeMo pipeline for training video foundation models; the main focus here is video diffusion.
#video-generation #efficiency
DAPO: an Open-Source LLM Reinforcement Learning System at Scale
(ByteDance Seed, Institute for AI Industry Research (AIR), Tsinghua University, The University of Hong Kong, SIA-Lab of Tsinghua AIR and ByteDance Seed)
Inference scaling empowers LLMs with unprecedented reasoning ability, with reinforcement learning as the core technique to elicit complex reasoning. However, key technical details of state-of-the-art reasoning LLMs are concealed (such as in OpenAI o1 blog and DeepSeek R1 technical report), thus the community still struggles to reproduce their RL training results. We propose the Decoupled Clip and Dynamic sAmpling Policy Optimization (DAPO) algorithm, and fully open-source a state-of-the-art large-scale RL system that achieves 50 points on AIME 2024 using Qwen2.5-32B base model. Unlike previous works that withhold training details, we introduce four key techniques of our algorithm that make large-scale LLM RL a success. In addition, we open-source our training code, which is built on the verl framework, along with a carefully curated and processed dataset. These components of our open-source system enhance reproducibility and support future research in large-scale LLM RL.
ByteDance's LLM RL algorithm. Starting from GRPO, they drop the KL penalty, decouple the upper and lower clipping bounds, filter out samples whose accuracy is exactly 0 or 1, weight the loss by sample length, and add a penalty for truncated samples.
All sorts of algorithms will no doubt keep appearing, but this one is more interesting than the DPO variants.
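A minimal sketch of how those loss-side changes could look (not the official DAPO/verl code): group-normalized advantages without a KL term, decoupled clipping bounds, skipping groups whose rewards are all identical, and token-level aggregation. The truncation penalty would be applied on the reward side and is omitted here; epsilon values and tensor layouts are assumptions.

```python
# Sketch of a DAPO-style per-group loss; shapes and hyperparameters are illustrative.
import torch

def dapo_style_loss(logp_new, logp_old, rewards, mask, eps_low=0.2, eps_high=0.28):
    # logp_new / logp_old: (G, T) per-token log-probs for G sampled responses to one prompt
    # rewards: (G,) scalar rewards (e.g. 0/1 accuracy); mask: (G, T) 1 for real tokens
    if rewards.min() == rewards.max():              # dynamic sampling: group accuracy 0 or 1
        return None                                 # no learning signal in this group; resample
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)   # GRPO-style group advantage
    ratio = torch.exp(logp_new - logp_old)
    adv = adv[:, None].expand_as(ratio)
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high) # decoupled clip bounds
    per_token = -torch.minimum(ratio * adv, clipped * adv)      # note: no KL penalty term
    # Token-level aggregation: normalize by the group's total token count, so longer
    # responses contribute in proportion to their length.
    return (per_token * mask).sum() / mask.sum()
```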
#reasoning #rl
xLSTM 7B: A Recurrent LLM for Fast and Efficient Inference
(Maximilian Beck, Korbinian Pöppel, Phillip Lippe, Richard Kurle, Patrick M. Blies, Günter Klambauer, Sebastian Böck, Sepp Hochreiter)
Recent breakthroughs in solving reasoning, math and coding problems with Large Language Models (LLMs) have been enabled by investing substantial computation budgets at inference time. Therefore, inference speed is one of the most critical properties of LLM architectures, and there is a growing need for LLMs that are efficient and fast at inference. Recently, LLMs built on the xLSTM architecture have emerged as a powerful alternative to Transformers, offering linear compute scaling with sequence length and constant memory usage, both highly desirable properties for efficient inference. However, such xLSTM-based LLMs have yet to be scaled to larger models and assessed and compared with respect to inference speed and efficiency. In this work, we introduce xLSTM 7B, a 7-billion-parameter LLM that combines xLSTM's architectural benefits with targeted optimizations for fast and efficient inference. Our experiments demonstrate that xLSTM 7B achieves performance on downstream tasks comparable to other similar-sized LLMs, while providing significantly faster inference speeds and greater efficiency compared to Llama- and Mamba-based LLMs. These results establish xLSTM 7B as the fastest and most efficient 7B LLM, offering a solution for tasks that require large amounts of test-time computation. Our work highlights xLSTM's potential as a foundational architecture for methods building on heavy use of LLM inference. Our model weights, model code and training code are open-source.
Results of training xLSTM at scale. In the end they dropped sLSTM, which is hard to parallelize, and adopted only mLSTM.
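For reference, a naive sequential sketch of the mLSTM recurrence that underlies the model: a constant-size matrix memory updated by rank-1 outer products and read out with a normalized query. The actual model's exponential input gating with stabilization, multi-head structure, and chunkwise-parallel formulation are omitted; dimensions and gate parameterization are illustrative.

```python
# Naive per-step mLSTM recurrence (for clarity only; the real implementation is parallelized).
import torch

def mlstm_step(C, n, q, k, v, i_gate, f_gate):
    # C: (d, d) matrix memory, n: (d,) normalizer state, q/k/v: (d,), gates: scalars
    C = f_gate * C + i_gate * torch.outer(v, k)        # rank-1 update of the matrix memory
    n = f_gate * n + i_gate * k
    h = (C @ q) / torch.clamp((n @ q).abs(), min=1.0)  # normalized memory read-out
    return C, n, h

d = 8
C, n = torch.zeros(d, d), torch.zeros(d)
outputs = []
for t in range(16):                                    # toy sequence of random projections
    q, k, v = torch.randn(d), torch.randn(d) / d**0.5, torch.randn(d)
    i_gate, f_gate = torch.sigmoid(torch.randn(())), torch.sigmoid(torch.randn(()))
    C, n, h = mlstm_step(C, n, q, k, v, i_gate, f_gate)
    outputs.append(h)
print(torch.stack(outputs).shape)  # (16, 8)
```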
#state-space-model
ThinkPatterns-21k: A Systematic Study on the Impact of Thinking Patterns in LLMs
(Pengcheng Wen, Jiaming Ji, Chi-Min Chan, Juntao Dai, Donghai Hong, Yaodong Yang, Sirui Han, Yike Guo)
Large language models (LLMs) have demonstrated enhanced performance through the Thinking then Responding paradigm, where models generate internal thoughts before final responses (aka, System 2 thinking). However, existing research lacks a systematic understanding of the mechanisms underlying how thinking patterns affect performance across model sizes. In this work, we conduct a comprehensive analysis of the impact of various thinking types on model performance and introduce ThinkPatterns-21k, a curated dataset comprising 21k instruction-response pairs (QA) collected from existing instruction-following datasets with five thinking types. For each pair, we augment it with five distinct internal thinking patterns: one unstructured thinking (monologue) and four structured variants (decomposition, self-ask, self-debate and self-critic), while maintaining the same instruction and response. Through extensive evaluation across different model sizes (3B-32B parameters), we have two key findings: (1) smaller models (<30B parameters) can benefit from most of structured thinking patterns, while larger models (32B) with structured thinking like decomposition would degrade performance and (2) unstructured monologue demonstrates broad effectiveness across different model sizes. Finally, we released all of our datasets, checkpoints, training logs of diverse thinking patterns to reproducibility, aiming to facilitate further research in this direction.
The authors designed five thinking patterns, generated responses that follow each pattern, and evaluated models trained on them. The evaluation is on chat benchmarks, and the average quality of the responses that can be generated for each pattern is a potential confounder, but it feels natural that the unstructured thinking pattern comes out ahead.
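A hypothetical example of how a single instruction-response pair might be augmented with the five patterns; the field names, the <think> tags, and the sample thoughts are illustrative assumptions, not the actual ThinkPatterns-21k format.

```python
# Hypothetical augmentation of one QA pair with the five thinking patterns (illustrative only).
base = {"instruction": "Why is the sky blue?", "response": "Because of Rayleigh scattering..."}
patterns = {
    "monologue":     "Hmm, the question is about the sky's color; sunlight scatters off air molecules...",
    "decomposition": "Step 1: what happens to sunlight in the atmosphere? Step 2: why blue more than red?",
    "self_ask":      "Q: What interacts with sunlight? A: Air molecules. Q: Which wavelengths scatter most? ...",
    "self_debate":   "Claim: it's reflection from the ocean. Counter: the sky is blue far inland too, so ...",
    "self_critic":   "Draft: the sky reflects the sea. Critique: that can't explain inland skies; revise to scattering.",
}
examples = [
    {"instruction": base["instruction"],
     "thinking": f"<think>{thought}</think>",   # same instruction and response, different internal thought
     "response": base["response"]}
    for thought in patterns.values()
]
```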
#reasoning
Measuring In-Context Computation Complexity via Hidden State Prediction
(Vincent Herrmann, Róbert Csordás, Jürgen Schmidhuber)
Detecting when a neural sequence model does "interesting" computation is an open problem. The next token prediction loss is a poor indicator: Low loss can stem from trivially predictable sequences that are uninteresting, while high loss may reflect unpredictable but also irrelevant information that can be ignored by the model. We propose a better metric: measuring the model's ability to predict its own future hidden states. We show empirically that this metric -- in contrast to the next token prediction loss -- correlates with the intuitive interestingness of the task. To measure predictability, we introduce the architecture-agnostic "prediction of hidden states" (PHi) layer that serves as an information bottleneck on the main pathway of the network (e.g., the residual stream in Transformers). We propose a novel learned predictive prior that enables us to measure the novel information gained in each computation step, which serves as our metric. We show empirically that our metric predicts the description length of formal languages learned in-context, the complexity of mathematical reasoning problems, and the correctness of self-generated reasoning chains.
An attempt to measure the complexity of the computation happening inside a model. Since token perplexity can be high simply because of noise, the authors instead measure how well the model can predict its own future hidden states. Predicting hidden states also brings to mind the test-time-training line of work (https://arxiv.org/abs/2407.04620, https://arxiv.org/abs/2501.00663).
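A hedged sketch of what a PHi-style bottleneck could look like: a posterior encodes each hidden state into a latent, a learned autoregressive prior predicts that latent from the previous ones, and the per-token KL between the two serves as the "new information per step" signal. The Gaussian parameterization, module shapes, and how the latent re-enters the residual stream are assumptions.

```python
# Sketch of an information bottleneck on the residual stream with a learned predictive prior.
import torch
import torch.nn as nn

class PHiSketch(nn.Module):
    def __init__(self, d_model=512, d_latent=64):
        super().__init__()
        self.posterior = nn.Linear(d_model, 2 * d_latent)           # q(z_t | h_t)
        self.prior = nn.GRU(d_latent, d_latent, batch_first=True)   # p(z_t | z_<t), learned predictive prior
        self.prior_out = nn.Linear(d_latent, 2 * d_latent)
        self.decode = nn.Linear(d_latent, d_model)                  # put the bottlenecked state back

    def forward(self, h):                                           # h: (B, T, d_model)
        mu_q, logvar_q = self.posterior(h).chunk(2, dim=-1)
        z = mu_q + torch.randn_like(mu_q) * (0.5 * logvar_q).exp()  # reparameterized sample
        z_prev = torch.cat([torch.zeros_like(z[:, :1]), z[:, :-1]], dim=1)  # condition prior on z_<t
        mu_p, logvar_p = self.prior_out(self.prior(z_prev)[0]).chunk(2, dim=-1)
        # Per-token KL(q || p) between diagonal Gaussians: "novel information" at each step.
        kl = 0.5 * (logvar_p - logvar_q
                    + ((mu_q - mu_p) ** 2 + logvar_q.exp()) / logvar_p.exp() - 1).sum(-1)
        return self.decode(z), kl                                   # kl: (B, T) interestingness signal
```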
#mechanistic-interpretation