March 13, 2025
Alias-Free Latent Diffusion Models: Improving Fractional Shift Equivariance of Diffusion Latent Space
(Yifan Zhou, Zeqi Xiao, Shuai Yang, Xingang Pan)
Latent Diffusion Models (LDMs) are known to have an unstable generation process, where even small perturbations or shifts in the input noise can lead to significantly different outputs. This hinders their applicability in applications requiring consistent results. In this work, we redesign LDMs to enhance consistency by making them shift-equivariant. While introducing anti-aliasing operations can partially improve shift-equivariance, significant aliasing and inconsistency persist due to the unique challenges in LDMs, including 1) aliasing amplification during VAE training and multiple U-Net inferences, and 2) self-attention modules that inherently lack shift-equivariance. To address these issues, we redesign the attention modules to be shift-equivariant and propose an equivariance loss that effectively suppresses the frequency bandwidth of the features in the continuous domain. The resulting alias-free LDM (AF-LDM) achieves strong shift-equivariance and is also robust to irregular warping. Extensive experiments demonstrate that AF-LDM produces significantly more consistent results than vanilla LDM across various applications, including video editing and image-to-image translation. Code is available at: https://github.com/SingleZombie/AFLDM
This was a topic I had been mulling over, and the results are already out: antialiased latent diffusion. They modify the model into an antialiased architecture and add an equivariance loss.
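A minimal sketch of what a shift-equivariance loss of this form could look like, assuming a Fourier-domain fractional shift and an arbitrary feature-map module `f`; the actual loss, shift operator, and where it is applied in AF-LDM follow the paper and released code, not this sketch.

```python
import torch
import torch.fft as fft

def fractional_shift(x, dx, dy):
    """Shift a (B, C, H, W) feature map by a fractional number of pixels
    via a phase shift in the Fourier domain (circular boundary)."""
    B, C, H, W = x.shape
    fy = fft.fftfreq(H, device=x.device).view(1, 1, H, 1)
    fx = fft.fftfreq(W, device=x.device).view(1, 1, 1, W)
    phase = torch.exp(-2j * torch.pi * (fy * dy + fx * dx))
    return fft.ifft2(fft.fft2(x) * phase).real

def equivariance_loss(f, z, max_shift=4.0):
    """L_eq = || f(shift(z)) - shift(f(z)) ||^2 for a random fractional shift.
    `f` is any feature-map-to-feature-map module (e.g. a VAE encoder or U-Net block);
    the choice of module and shift range here are illustrative assumptions."""
    dx, dy = ((torch.rand(2) * 2 - 1) * max_shift).tolist()
    out_of_shifted_input = f(fractional_shift(z, dx, dy))
    shifted_output = fractional_shift(f(z), dx, dy)
    return torch.mean((out_of_shifted_input - shifted_output) ** 2)
```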
#diffusion
Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models
(Marianne Arriola, Aaron Gokaslan, Justin T Chiu, Zhihan Yang, Zhixuan Qi, Jiaqi Han, Subham Sekhar Sahoo, Volodymyr Kuleshov)
Diffusion language models offer unique benefits over autoregressive models due to their potential for parallelized generation and controllability, yet they lag in likelihood modeling and are limited to fixed-length generation. In this work, we introduce a class of block diffusion language models that interpolate between discrete denoising diffusion and autoregressive models. Block diffusion overcomes key limitations of both approaches by supporting flexible-length generation and improving inference efficiency with KV caching and parallel token sampling. We propose a recipe for building effective block diffusion models that includes an efficient training algorithm, estimators of gradient variance, and data-driven noise schedules to minimize the variance. Block diffusion sets a new state-of-the-art performance among diffusion models on language modeling benchmarks and enables generation of arbitrary-length sequences. We provide the code, along with the model weights and blog post on the project page: https://m-arriola.com/bd3lms/
With autoregression and diffusion both in the spotlight, block-level autoregression/diffusion was bound to appear. That said, it can no longer resolve the left-to-right ordering constraint, which is usually why diffusion attracts interest in the first place.
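To make the "autoregressive over blocks, diffusion within blocks" idea concrete, here is a hypothetical generation loop. The `denoiser` signature, `MASK_ID`, and the unmasking schedule are assumptions for illustration; the paper's actual sampler (BD3-LM) uses KV caching and its own data-driven noise schedules.

```python
import torch

MASK_ID = 0  # hypothetical id of the [MASK] token used by the discrete diffusion process

@torch.no_grad()
def generate_block_diffusion(denoiser, prompt, num_blocks=4, block_len=32, num_steps=16):
    """Blocks are produced left to right (autoregressive over blocks), while
    tokens within a block are filled in by iterative denoising.
    `denoiser(tokens)` is assumed to return per-position logits of shape (B, T, V)."""
    seq = prompt  # (B, T0) token ids
    for _ in range(num_blocks):
        # start the new block fully masked
        block = torch.full((seq.size(0), block_len), MASK_ID,
                           dtype=torch.long, device=seq.device)
        for step in range(num_steps):
            logits = denoiser(torch.cat([seq, block], dim=1))[:, -block_len:]
            pred = logits.argmax(dim=-1)
            # un-mask a growing fraction of positions each step; a fixed left-to-right
            # fraction keeps the sketch simple (real samplers use other schedules)
            k = int(block_len * (step + 1) / num_steps)
            block[:, :k] = pred[:, :k]
        seq = torch.cat([seq, block], dim=1)  # in practice the prefix would be KV-cached
    return seq
```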
#autoregressive-model #diffusion
Cost-Optimal Grouped-Query Attention for Long-Context LLMs
(Yingfa Chen, Yutong Wu, Xu Han, Zhiyuan Liu, Maosong Sun)
Building effective and efficient Transformer-based large language models (LLMs) has recently become a research focus, requiring maximizing model language capabilities and minimizing training and deployment costs. Existing efforts have primarily described complex relationships among model performance, parameter size, and data size, as well as searched for the optimal compute allocation to train LLMs. However, they overlook the impacts of context length and attention head configuration (the number of query and key-value heads in grouped-query attention) on training and inference. In this paper, we systematically compare models with different parameter sizes, context lengths, and attention head configurations in terms of model performance, computational cost, and memory cost. Then, we extend the existing scaling methods, which are based solely on parameter size and training compute, to guide the construction of cost-optimal LLMs during both training and inference. Our quantitative scaling studies show that, when processing sufficiently long sequences, a larger model with fewer attention heads can achieve a lower loss while incurring lower computational and memory costs. Our findings provide valuable insights for developing practical LLMs, especially in long-context processing scenarios. We will publicly release our code and data.
An attempt to decouple the number of attention heads from the embedding dimension and find the optimum between head count and model size. In practice, the GQA group size and the ratio of window attention already seem to have become important design decisions.
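A back-of-the-envelope sketch of why the KV-head count matters at long context: the usual KV-cache and attention-cost formulas, applied to a hypothetical LLaMA-7B-like configuration. The paper's cost model is more detailed than this.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """KV cache size: 2 (K and V) * layers * kv_heads * head_dim * seq_len * batch."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

def attention_flops(n_layers, n_q_heads, head_dim, seq_len):
    """Approximate FLOPs of the score and value matmuls for one full-sequence pass:
    2 matmuls * 2 FLOPs/MAC * heads * T^2 * head_dim per layer."""
    return n_layers * n_q_heads * 2 * 2 * seq_len ** 2 * head_dim

# Hypothetical 32-layer, head_dim-128 model at 32k context:
# going from MHA (32 KV heads) to GQA with 8 KV heads cuts the cache by 4x.
print(kv_cache_bytes(32, 32, 128, 32_768) / 2**30, "GiB (MHA)")    # 16.0 GiB
print(kv_cache_bytes(32, 8, 128, 32_768) / 2**30, "GiB (GQA-8)")   # 4.0 GiB
```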
#scaling-law #transformer
Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning
(Bowen Jin, Hansi Zeng, Zhenrui Yue, Dong Wang, Hamed Zamani, Jiawei Han)
Efficiently acquiring external knowledge and up-to-date information is essential for effective reasoning and text generation in large language models (LLMs). Retrieval augmentation and tool-use training approaches where a search engine is treated as a tool lack complex multi-turn retrieval flexibility or require large-scale supervised data. Prompting advanced LLMs with reasoning capabilities during inference to use search engines is not optimal, since the LLM does not learn how to optimally interact with the search engine. This paper introduces Search-R1, an extension of the DeepSeek-R1 model where the LLM learns -- solely through reinforcement learning (RL) -- to autonomously generate (multiple) search queries during step-by-step reasoning with real-time retrieval. Search-R1 optimizes LLM rollouts with multi-turn search interactions, leveraging retrieved token masking for stable RL training and a simple outcome-based reward function. Experiments on seven question-answering datasets show that Search-R1 improves performance by 26% (Qwen2.5-7B), 21% (Qwen2.5-3B), and 10% (LLaMA3.2-3B) over SOTA baselines. This paper further provides empirical insights into RL optimization methods, LLM choices, and response length dynamics in retrieval-augmented reasoning. The code and model checkpoints are available at https://github.com/PeterGriffinJin/Search-R1.
Following R1-Searcher (https://arxiv.org/abs/2503.05592), Search-R1 is out. Building agents through RL is the natural progression. The important next step, though, is a design that can produce long-form reports rather than just short answers.
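A sketch of the retrieved-token-masking idea, assuming a simple policy-gradient objective with per-token advantages; Search-R1's actual training uses an outcome-based reward with its own RL recipe, and this only shows where the mask would enter the loss.

```python
import torch

def masked_policy_gradient_loss(logprobs, advantages, generated_mask):
    """`generated_mask` is 1 for tokens the policy actually sampled and 0 for
    tokens pasted in from the search engine (retrieved passages), so the RL
    objective only covers the model's own tokens. All shapes are (B, T)."""
    per_token = -logprobs * advantages * generated_mask
    return per_token.sum() / generated_mask.sum().clamp(min=1)
```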
#rl #agent