January 31, 2025
Diffusion Autoencoders are Scalable Image Tokenizers
(Yinbo Chen, Rohit Girdhar, Xiaolong Wang, Sai Saketh Rambhatla, Ishan Misra)
Tokenizing images into compact visual representations is a key step in learning efficient and high-quality image generative models. We present a simple diffusion tokenizer (DiTo) that learns compact visual representations for image generation models. Our key insight is that a single learning objective, diffusion L2 loss, can be used for training scalable image tokenizers. Since diffusion is already widely used for image generation, our insight greatly simplifies training such tokenizers. In contrast, current state-of-the-art tokenizers rely on an empirically found combination of heuristics and losses, thus requiring a complex training recipe that relies on non-trivially balancing different losses and pretrained supervised models. We show design decisions, along with theoretical grounding, that enable us to scale DiTo for learning competitive image representations. Our results show that DiTo is a simpler, scalable, and self-supervised alternative to the current state-of-the-art image tokenizer which is supervised. DiTo achieves competitive or better quality than state-of-the-art in image reconstruction and downstream image generation tasks.
A method that uses the output of an encoder trained with flow matching as image tokens. Very interesting. Given studies showing that diffusion models can also function as image encoders, this might also help with the recently raised problem of image tokens lacking semantic information.
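As a rough illustration of the single-objective idea, here is a minimal PyTorch sketch: the encoder is trained end-to-end through nothing but a flow-matching (rectified-flow) L2 loss on a velocity-predicting decoder conditioned on the encoder's latent. The module names, architectures, and 32x32 image shapes (ToyEncoder, ToyVelocityDecoder) are assumptions for illustration, not the paper's actual design.

```python
import torch
import torch.nn as nn

class ToyEncoder(nn.Module):
    """Maps an image to a compact latent that serves as its 'token' (toy stand-in)."""
    def __init__(self, latent_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.SiLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, latent_dim),
        )

    def forward(self, x):
        return self.net(x)

class ToyVelocityDecoder(nn.Module):
    """Predicts the flow-matching velocity from the noisy image, time, and latent."""
    def __init__(self, latent_dim=64):
        super().__init__()
        self.cond = nn.Linear(latent_dim + 1, 3 * 32 * 32)
        self.net = nn.Sequential(
            nn.Conv2d(6, 64, 3, padding=1), nn.SiLU(),
            nn.Conv2d(64, 3, 3, padding=1),
        )

    def forward(self, x_t, t, z):
        cond = self.cond(torch.cat([z, t[:, None]], dim=1)).view(-1, 3, 32, 32)
        return self.net(torch.cat([x_t, cond], dim=1))

encoder, decoder = ToyEncoder(), ToyVelocityDecoder()
opt = torch.optim.AdamW(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-4)

x1 = torch.randn(8, 3, 32, 32)                        # stand-in batch of images
x0 = torch.randn_like(x1)                             # noise endpoint
t = torch.rand(8)                                     # times in [0, 1]
x_t = (1 - t)[:, None, None, None] * x0 + t[:, None, None, None] * x1
target_v = x1 - x0                                    # rectified-flow velocity target

z = encoder(x1)                                       # the compact "token"
loss = ((decoder(x_t, t, z) - target_v) ** 2).mean()  # the single diffusion L2 objective
loss.backward()
opt.step()
```

At sampling time the latent z would condition an ODE solve from noise back to an image, which is what makes the encoder output usable as a token for downstream generation.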
#diffusion #tokenizer #image-generation
Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs
(Yue Wang, Qiuzhi Liu, Jiahao Xu, Tian Liang, Xingyu Chen, Zhiwei He, Linfeng Song, Dian Yu, Juntao Li, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, Dong Yu)
Large language models (LLMs) such as OpenAI's o1 have demonstrated remarkable abilities in complex reasoning tasks by scaling test-time compute and exhibiting human-like deep thinking. However, we identify a phenomenon we term underthinking, where o1-like LLMs frequently switch between different reasoning thoughts without sufficiently exploring promising paths to reach a correct solution. This behavior leads to inadequate depth of reasoning and decreased performance, particularly on challenging mathematical problems. To systematically analyze this issue, we conduct experiments on three challenging test sets and two representative open-source o1-like models, revealing that frequent thought switching correlates with incorrect responses. We introduce a novel metric to quantify underthinking by measuring token efficiency in incorrect answers. To address underthinking, we propose a decoding strategy with thought switching penalty TIP that discourages premature transitions between thoughts, encouraging deeper exploration of each reasoning path. Experimental results demonstrate that our approach improves accuracy across challenging datasets without requiring model fine-tuning. Our findings contribute to understanding reasoning inefficiencies in o1-like LLMs and offer a practical solution to enhance their problem-solving capabilities.
An analysis showing that reasoning models frequently switch between different lines of thought during reasoning. Even when a thought is headed in the right direction, the model often switches away without exploring it deeply enough. Length penalties and the like might help here.
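A hedged sketch of what a decoding-time thought-switching penalty (in the spirit of the paper's TIP) could look like: the logits of tokens that typically open a new line of thought are reduced for a window of steps after the last switch. The token ids, penalty value, and window size are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def apply_thought_switch_penalty(logits, steps_since_last_switch,
                                 switch_token_ids, penalty=3.0, window=256):
    """logits: (vocab,) tensor for the next token; returns penalized logits."""
    if steps_since_last_switch < window:
        logits = logits.clone()
        logits[switch_token_ids] -= penalty  # discourage switching too early
    return logits

# toy usage with a random "vocabulary"
vocab_size = 1000
switch_ids = torch.tensor([17, 42, 99])  # stand-ins for tokens like "Alternatively", "Wait"
logits = torch.randn(vocab_size)
penalized = apply_thought_switch_penalty(logits, steps_since_last_switch=50,
                                         switch_token_ids=switch_ids)
next_token = torch.argmax(penalized).item()
```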
#reasoning
Scaling Inference-Efficient Language Models
(Song Bian, Minghao Yan, Shivaram Venkataraman)
Scaling laws are powerful tools to predict the performance of large language models. However, current scaling laws fall short of accounting for inference costs. In this work, we first show that model architecture affects inference latency, where models of the same size can have up to 3.5x difference in latency. To tackle this challenge, we modify the Chinchilla scaling laws to co-optimize the model parameter count, the number of training tokens, and the model architecture. Due to the reason that models of similar training loss exhibit gaps in downstream evaluation, we also propose a novel method to train inference-efficient models based on the revised scaling laws. We perform extensive empirical studies to fit and evaluate our inference-aware scaling laws. We vary model parameters from 80M to 1B, training tokens from 1.6B to 30B, and model shapes, training a total of 63 models. Guided by our inference-efficient scaling law and model selection method, we release the Morph-1B model, which improves inference latency by 1.8x while maintaining accuracy on downstream tasks compared to open-source models, pushing the Pareto frontier of accuracy-latency tradeoff.
A scaling law that incorporates the depth-to-width ratio in order to optimize inference-time efficiency.
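To make the idea concrete, here is a sketch of fitting a Chinchilla-style loss law extended with an architecture term. The functional form below, which adds a dependence on a depth-to-width ratio R, and the synthetic data are assumptions for illustration, not the paper's actual parameterization.

```python
import numpy as np
from scipy.optimize import curve_fit

def loss_law(X, E, A, alpha, B, beta, gamma):
    # N: parameter count, D: training tokens, R: depth/width aspect ratio
    N, D, R = X
    return E + A / N**alpha + B / D**beta + gamma * np.log(R)

# synthetic observations generated from the law itself, so the fit is well posed
rng = np.random.default_rng(0)
N = rng.uniform(8e7, 1e9, size=16)
D = rng.uniform(1.6e9, 3e10, size=16)
R = rng.uniform(0.5, 2.0, size=16)
true_params = (1.7, 400.0, 0.34, 410.0, 0.28, 0.05)
L = loss_law((N, D, R), *true_params) + rng.normal(0.0, 0.01, size=16)

popt, _ = curve_fit(loss_law, (N, D, R), L,
                    p0=[1.7, 380.0, 0.33, 400.0, 0.27, 0.04], maxfev=50000)
print(dict(zip(["E", "A", "alpha", "B", "beta", "gamma"], np.round(popt, 3))))
```

Once fitted, the law can be minimized under a latency budget (latency being a function of depth, width, and parameter count) rather than under a pure FLOPs budget, which is what shifts the optimum toward inference-efficient shapes.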
#efficiency #scaling-law
SANA 1.5: Efficient Scaling of Training-Time and Inference-Time Compute in Linear Diffusion Transformer
(Enze Xie, Junsong Chen, Yuyang Zhao, Jincheng Yu, Ligeng Zhu, Yujun Lin, Zhekai Zhang, Muyang Li, Junyu Chen, Han Cai, Bingchen Liu, Daquan Zhou, Song Han)
This paper presents SANA-1.5, a linear Diffusion Transformer for efficient scaling in text-to-image generation. Building upon SANA-1.0, we introduce three key innovations: (1) Efficient Training Scaling: A depth-growth paradigm that enables scaling from 1.6B to 4.8B parameters with significantly reduced computational resources, combined with a memory-efficient 8-bit optimizer. (2) Model Depth Pruning: A block importance analysis technique for efficient model compression to arbitrary sizes with minimal quality loss. (3) Inference-time Scaling: A repeated sampling strategy that trades computation for model capacity, enabling smaller models to match larger model quality at inference time. Through these strategies, SANA-1.5 achieves a text-image alignment score of 0.72 on GenEval, which can be further improved to 0.80 through inference scaling, establishing a new SoTA on GenEval benchmark. These innovations enable efficient model scaling across different compute budgets while maintaining high quality, making high-quality image generation more accessible.
Efficient training of diffusion models via depth upscaling, and inference-time scaling by increasing the number of samples.
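The inference-time scaling piece amounts to repeated (best-of-N) sampling with an external scorer. Below is a hedged sketch in which generate_image and score_alignment are hypothetical placeholders for the diffusion sampler and the text-image alignment judge.

```python
import torch

def generate_image(prompt: str, seed: int) -> torch.Tensor:
    # placeholder: a real implementation would run the diffusion sampler
    g = torch.Generator().manual_seed(seed)
    return torch.rand(3, 64, 64, generator=g)

def score_alignment(prompt: str, image: torch.Tensor) -> float:
    # placeholder: a real implementation would use a VLM/CLIP-style judge
    return float(image.mean())

def best_of_n(prompt: str, n: int = 8) -> torch.Tensor:
    """Trade extra sampling compute for quality: keep the highest-scoring candidate."""
    candidates = [generate_image(prompt, seed=i) for i in range(n)]
    scores = [score_alignment(prompt, img) for img in candidates]
    return candidates[max(range(n), key=lambda i: scores[i])]

image = best_of_n("a cat riding a bicycle", n=8)
```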
#diffusion #efficient-training #inference-time-scaling
Streaming DiLoCo with overlapping communication: Towards a Distributed Free Lunch
(Arthur Douillard, Yanislav Donchev, Keith Rush, Satyen Kale, Zachary Charles, Zachary Garrett, Gabriel Teston, Dave Lacey, Ross McIlroy, Jiajun Shen, Alexandre Ramé, Arthur Szlam, Marc'Aurelio Ranzato, Paul Barham)
Training of large language models (LLMs) is typically distributed across a large number of accelerators to reduce training time. Since internal states and parameter gradients need to be exchanged at each and every single gradient step, all devices need to be co-located using low-latency high-bandwidth communication links to support the required high volume of exchanged bits. Recently, distributed algorithms like DiLoCo have relaxed such co-location constraint: accelerators can be grouped into workers, where synchronizations between workers only occur infrequently. This in turn means that workers can afford being connected by lower bandwidth communication links without affecting learning quality. However, in these methods, communication across workers still requires the same peak bandwidth as before, as the synchronizations require all parameters to be exchanged across all workers. In this paper, we improve DiLoCo in three ways. First, we synchronize only subsets of parameters in sequence, rather than all at once, which greatly reduces peak bandwidth. Second, we allow workers to continue training while synchronizing, which decreases wall clock time. Third, we quantize the data exchanged by workers, which further reduces bandwidth across workers. By properly combining these modifications, we show experimentally that we can distribute training of billion-scale parameters and reach similar quality as before, but reducing required bandwidth by two orders of magnitude.
An attempt to improve DiLoCo by synchronizing only a fragment of the parameters at a time rather than all of them at once, and overlapping the communication with computation. It would be nice to see whether this matches fully synchronized training at serious scale, but such results might not be published even if they were obtained.
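A minimal sketch of the streaming synchronization idea, assuming a round-robin schedule over parameter fragments and two simulated in-process workers. The real method also overlaps the exchange with ongoing compute and quantizes what is sent, which this toy omits.

```python
import torch
import torch.nn as nn

def make_worker():
    return nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 16))

workers = [make_worker(), make_worker()]
num_fragments = 4

def fragment_names(model, num_fragments):
    # split the parameter names into round-robin fragments
    names = [n for n, _ in model.named_parameters()]
    return [names[i::num_fragments] for i in range(num_fragments)]

fragments = fragment_names(workers[0], num_fragments)

def sync_fragment(workers, param_names):
    # average only the chosen fragment across workers (stand-in for an all-reduce)
    with torch.no_grad():
        for name in param_names:
            tensors = [dict(w.named_parameters())[name] for w in workers]
            avg = torch.stack([t.data for t in tensors]).mean(dim=0)
            for t in tensors:
                t.data.copy_(avg)

for outer_step in range(8):
    # ... each worker would run many local inner steps here ...
    sync_fragment(workers, fragments[outer_step % num_fragments])
```

Because only one fragment crosses the slow link per outer step, peak bandwidth drops roughly by the number of fragments, which is the point of the "streaming" variant.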
#efficient-training