2024년 9월 9일
Open-MAGVIT2: An Open-Source Project Toward Democratizing Auto-regressive Visual Generation
(Zhuoyan Luo, Fengyuan Shi, Yixiao Ge, Yujiu Yang, Limin Wang, Ying Shan)
We present Open-MAGVIT2, a family of auto-regressive image generation models ranging from 300M to 1.5B. The Open-MAGVIT2 project produces an open-source replication of Google's MAGVIT-v2 tokenizer, a tokenizer with a super-large codebook (i.e., 2^18 codes), and achieves the state-of-the-art reconstruction performance (1.17 rFID) on ImageNet 256×256. Furthermore, we explore its application in plain auto-regressive models and validate scalability properties. To assist auto-regressive models in predicting with a super-large vocabulary, we factorize it into two sub-vocabularies of different sizes by asymmetric token factorization, and further introduce "next sub-token prediction" to enhance sub-token interaction for better generation quality. We release all models and codes to foster innovation and creativity in the field of auto-regressive visual generation.
Autoregressive image generation models with large-vocabulary vector quantization have finally arrived. Using Lookup-Free Quantization, they increase the vocabulary to 2^18 = 262K.
That said, they do not model the 262K vocabulary directly: the vocabulary is factorized into sub-vocabularies, the input embedding is the sum of the factorized sub-token embeddings, and at the output the factorized sub-tokens are generated sequentially.
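A rough sketch of how this factorization could look in code. The 2^6/2^12 split, the model width, and the simple two-head output below are illustrative assumptions on my part, not the paper's exact modules:

```python
# Sketch of asymmetric token factorization + "next sub-token prediction".
# V1/V2 split and D are assumptions for illustration; V1 * V2 = 2**18.
import torch
import torch.nn as nn

V1, V2 = 2 ** 6, 2 ** 12          # asymmetric sub-vocabularies
D = 1024                          # hypothetical model width

emb1 = nn.Embedding(V1, D)
emb2 = nn.Embedding(V2, D)

def factorize(code: torch.Tensor):
    """Split a code in [0, 2**18) into two sub-tokens."""
    return code // V2, code % V2

def embed(code: torch.Tensor):
    """Input side: the embedding is the sum of the two sub-token embeddings."""
    c1, c2 = factorize(code)
    return emb1(c1) + emb2(c2)

# Output side: predict sub-token 1 from the hidden state h, then predict
# sub-token 2 conditioned on h and the sampled sub-token 1 (here via a simple
# concatenation; the paper uses a small intra-token module instead).
head1 = nn.Linear(D, V1)
head2 = nn.Linear(2 * D, V2)

h = torch.randn(1, D)             # hidden state for one spatial position
c1 = head1(h).argmax(-1)          # greedy sampling just for illustration
c2 = head2(torch.cat([h, emb1(c1)], dim=-1)).argmax(-1)
code = c1 * V2 + c2               # recombine into the full 2**18 code
```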
#vq #autoregressive-model #image-generation
Theory, Analysis, and Best Practices for Sigmoid Self-Attention
(Jason Ramapuram, Federico Danieli, Eeshan Dhekane, Floris Weers, Dan Busbridge, Pierre Ablin, Tatiana Likhomanenko, Jagrit Digani, Zijin Gu, Amitis Shidani, Russ Webb)
Attention is a key part of the transformer architecture. It is a sequence-to-sequence mapping that transforms each sequence element into a weighted sum of values. The weights are typically obtained as the softmax of dot products between keys and queries. Recent work has explored alternatives to softmax attention in transformers, such as ReLU and sigmoid activations. In this work, we revisit sigmoid attention and conduct an in-depth theoretical and empirical analysis. Theoretically, we prove that transformers with sigmoid attention are universal function approximators and benefit from improved regularity compared to softmax attention. Through detailed empirical analysis, we identify stabilization of large initial attention norms during the early stages of training as a crucial factor for the successful training of models with sigmoid attention, outperforming prior attempts. We also introduce FLASHSIGMOID, a hardware-aware and memory-efficient implementation of sigmoid attention yielding a 17% inference kernel speed-up over FLASHATTENTION2 on H100 GPUs. Experiments across language, vision, and speech show that properly normalized sigmoid attention matches the strong performance of softmax attention on a wide range of domains and scales, which previous attempts at sigmoid attention were unable to fully achieve. Our work unifies prior art and establishes best practices for sigmoid attention as a drop-in softmax replacement in transformers.
A transformer that uses sigmoid instead of softmax in attention. As for why one would use sigmoid, the argument is that it is somewhat faster under FlashAttention-style optimizations and that its Lipschitz constant has a smaller upper bound than that of softmax attention. That said, the main point seems to be that they found a setting in which sigmoid attention matches the performance of softmax attention.
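A minimal sketch of the attention variant, assuming a sigmoid with a -log(n) bias in place of the softmax row normalization (treat the exact bias form as an assumption here, not the paper's definitive recipe):

```python
# Sigmoid attention sketch: element-wise sigmoid of scaled scores with a
# -log(n) bias, instead of row-wise softmax normalization.
import math
import torch

def sigmoid_attention(q, k, v):
    """q, k, v: (batch, heads, seq, head_dim)."""
    n, d = q.shape[-2], q.shape[-1]
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)
    weights = torch.sigmoid(scores - math.log(n))   # assumed bias choice
    return weights @ v

q = k = v = torch.randn(2, 8, 128, 64)
out = sigmoid_attention(q, k, v)                    # (2, 8, 128, 64)
```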
#transformer
VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation
(Yecheng Wu, Zhuoyang Zhang, Junyu Chen, Haotian Tang, Dacheng Li, Yunhao Fang, Ligeng Zhu, Enze Xie, Hongxu Yin, Li Yi, Song Han, Yao Lu)
VILA-U is a Unified foundation model that integrates Video, Image, Language understanding and generation. Traditional visual language models (VLMs) use separate modules for understanding and generating visual content, which can lead to misalignment and increased complexity. In contrast, VILA-U employs a single autoregressive next-token prediction framework for both tasks, eliminating the need for additional components like diffusion models. This approach not only simplifies the model but also achieves near state-of-the-art performance in visual language understanding and generation. The success of VILA-U is attributed to two main factors: the unified vision tower that aligns discrete visual tokens with textual inputs during pretraining, which enhances visual perception, and autoregressive image generation can achieve similar quality as diffusion models with high-quality dataset. This allows VILA-U to perform comparably to more complex models using a fully token-based autoregressive framework.
This one also tackles autoregressive image generation together with recognition. The emphasis, however, is on making the vision encoder handle both image-text alignment and reconstruction. That does not work out of the box; they had to start from CLIP and additionally train it for reconstruction.
They use RQ-VAE for quantization, so here too generation proceeds sequentially over the quantization depths.
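For reference, a minimal sketch of generic residual quantization, the mechanism behind the depth-wise sequential generation (codebook size and depth below are made up for illustration, not VILA-U's configuration):

```python
# Residual quantization sketch: quantize the feature, subtract the chosen code,
# and quantize the residual again for each depth; the sum over depths
# approximates the original feature.
import torch

def residual_quantize(z, codebook, depth=4):
    """z: (n, d) features; codebook: (K, d). Returns per-depth code indices."""
    codes, residual = [], z
    for _ in range(depth):
        dist = torch.cdist(residual, codebook)        # (n, K) distances
        idx = dist.argmin(dim=-1)                     # nearest code per feature
        codes.append(idx)
        residual = residual - codebook[idx]           # quantize what is left
    return codes                                      # predicted depth-by-depth at generation time

z = torch.randn(16, 32)
codebook = torch.randn(256, 32)
codes = residual_quantize(z, codebook)                # list of 4 index tensors
recon = sum(codebook[idx] for idx in codes)           # sum over depths ≈ z
```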
#clip #autoregressive-model #image-generation
MiniCPM3-4B
(OpenBMB)
MiniCPM3 is out. Its distinctive traits are that it is very deep (62 layers) and that it uses MLA (https://arxiv.org/abs/2405.04434).
MLA seems like an attractive choice in that it is lighter than GQA while also performing well. The downside is the constraint it places on the positional-encoding dimension. In particular, for long context I suspect it would help to use a large dimension of around 256, Gemma-style.
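A shape-level sketch of MLA-style KV compression, which is where the savings over GQA come from (all sizes below are illustrative assumptions, not MiniCPM3's actual configuration):

```python
# MLA sketch: keys and values are reconstructed from a small shared latent, so
# only the latent needs to be cached instead of full per-head K/V.
import torch
import torch.nn as nn

d_model, n_heads, d_head, d_latent = 2560, 40, 64, 256    # assumed sizes

w_dkv = nn.Linear(d_model, d_latent, bias=False)          # down-project to latent
w_uk = nn.Linear(d_latent, n_heads * d_head, bias=False)  # up-project to keys
w_uv = nn.Linear(d_latent, n_heads * d_head, bias=False)  # up-project to values

h = torch.randn(1, 128, d_model)                          # (batch, seq, d_model)
c_kv = w_dkv(h)                                           # only this latent is cached
k = w_uk(c_kv).view(1, 128, n_heads, d_head)
v = w_uv(c_kv).view(1, 128, n_heads, d_head)
# RoPE is applied to a separate small "decoupled" query/key slice rather than to
# the compressed part, which is why the positional-encoding dimension ends up
# constrained independently of the head dimension.
```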
#llm