2024년 11월 18일
M-VAR: Decoupled Scale-wise Autoregressive Modeling for High-Quality Image Generation
(Sucheng Ren, Yaodong Yu, Nataniel Ruiz, Feng Wang, Alan Yuille, Cihang Xie)
There exists recent work in computer vision, named VAR, that proposes a new autoregressive paradigm for image generation. Diverging from the vanilla next-token prediction, VAR structurally reformulates the image generation into a coarse-to-fine next-scale prediction. In this paper, we show that this scale-wise autoregressive framework can be effectively decoupled into intra-scale modeling, which captures local spatial dependencies within each scale, and inter-scale modeling, which models cross-scale relationships progressively from coarse-to-fine scales. This decoupling structure allows us to rebuild VAR in a more computationally efficient manner. Specifically, for intra-scale modeling -- crucial for generating high-fidelity images -- we retain the original bidirectional self-attention design to ensure comprehensive modeling; for inter-scale modeling, which semantically connects different scales but is computationally intensive, we apply linear-complexity mechanisms like Mamba to substantially reduce computational overhead. We term this new framework M-VAR. Extensive experiments demonstrate that our method outperforms existing models in both image quality and generation speed. For example, our 1.5B model, with fewer parameters and faster inference speed, outperforms the largest VAR-d30-2B. Moreover, our largest model M-VAR-d32 impressively registers 1.78 FID on ImageNet 256×256 and outperforms the prior-art autoregressive models LlamaGen/VAR by 0.4/0.19 and popular diffusion models LDM/DiT by 1.82/0.49, respectively. Code is available at https://github.com/OliverRensu/MVAR.
They decouple intra-scale and inter-scale modeling in Visual Autoregression (https://arxiv.org/abs/2404.02905): within each scale they keep bidirectional self-attention, and across scales they use Mamba to improve efficiency.
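To make the decoupling concrete, here is a minimal PyTorch sketch (not the authors' code) of one block: bidirectional self-attention applied separately to the tokens of each scale, followed by a linear-complexity gated recurrence over the whole coarse-to-fine sequence as a stand-in for the paper's Mamba layer. The scale sizes, dimensions, and the particular recurrence are illustrative assumptions.

```python
# Minimal sketch of the M-VAR decoupling idea (not the authors' implementation):
# bidirectional self-attention *within* each scale, and a linear-complexity
# recurrence *across* the coarse-to-fine token sequence in place of Mamba.
import torch
import torch.nn as nn


class MVARBlockSketch(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # Intra-scale: full bidirectional attention, restricted to one scale at a time.
        self.intra_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Inter-scale stand-in for Mamba: a gated elementwise linear recurrence, O(L) in length.
        self.decay = nn.Parameter(torch.zeros(dim))
        self.gate = nn.Linear(dim, dim)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, scale_sizes: list) -> torch.Tensor:
        # x: (batch, total_tokens, dim), tokens ordered coarse-to-fine, scale by scale.
        # 1) Intra-scale modeling: attention only among tokens of the same scale.
        chunks, start = [], 0
        for n in scale_sizes:
            chunk = x[:, start:start + n]
            attn_out, _ = self.intra_attn(chunk, chunk, chunk)
            chunks.append(chunk + attn_out)
            start += n
        h = self.norm1(torch.cat(chunks, dim=1))

        # 2) Inter-scale modeling: a causal linear recurrence over the whole sequence.
        alpha = torch.sigmoid(self.decay)          # per-channel decay in (0, 1)
        state = torch.zeros_like(h[:, 0])
        outs = []
        for t in range(h.size(1)):
            state = alpha * state + (1 - alpha) * h[:, t]
            outs.append(state)
        rec = torch.stack(outs, dim=1) * torch.sigmoid(self.gate(h))  # gated output
        return self.norm2(h + rec)


# Usage: three scales with 1, 4, and 16 tokens each (coarse to fine).
block = MVARBlockSketch(dim=64)
tokens = torch.randn(2, 1 + 4 + 16, 64)
out = block(tokens, scale_sizes=[1, 4, 16])
print(out.shape)  # torch.Size([2, 21, 64])
```

The point of the split is visible in the cost: attention is quadratic only within each (small) scale, while the cross-scale pass over the full token sequence stays linear.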
#image-generation #autoregressive-model
MARS: Unleashing the Power of Variance Reduction for Training Large Models
(Huizhuo Yuan, Yifeng Liu, Shuang Wu, Xun Zhou, Quanquan Gu)
Training deep neural networks--and more recently, large models--demands efficient and scalable optimizers. Adaptive gradient algorithms like Adam, AdamW, and their variants have been central to this task. Despite the development of numerous variance reduction algorithms in the past decade aimed at accelerating stochastic optimization in both convex and nonconvex settings, variance reduction has not found widespread success in training deep neural networks or large language models. Consequently, it has remained a less favored approach in modern AI. In this paper, to unleash the power of variance reduction for efficient training of large models, we propose a unified optimization framework, MARS (Make vAriance Reduction Shine), which reconciles preconditioned gradient methods with variance reduction via a scaled stochastic recursive momentum technique. Within our framework, we introduce three instances of MARS that leverage preconditioned gradient updates based on AdamW, Lion, and Shampoo, respectively. We also draw a connection between our algorithms and existing optimizers. Experimental results on training GPT-2 models indicate that MARS consistently outperforms AdamW by a large margin.
Combining preconditioning with variance-reduction techniques like SVRG.
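A minimal sketch of the general recipe, assuming an AdamW-style preconditioner: correct the stochastic gradient with a scaled difference from the previous step's gradient (STORM-style recursive momentum), clip the corrected gradient, and feed it into the usual first/second-moment updates. This is not the official MARS implementation; the exact update (e.g. same-minibatch gradient differences), clipping rule, and hyperparameters may differ.

```python
# Sketch of variance-reduced momentum + AdamW-style preconditioning.
# gamma, the clipping threshold, and the use of the previous step's cached
# gradient are simplifying assumptions, not the paper's exact algorithm.
import torch


class MarsLikeAdamW(torch.optim.Optimizer):
    def __init__(self, params, lr=3e-4, betas=(0.9, 0.95), eps=1e-8,
                 weight_decay=0.1, gamma=0.025):
        defaults = dict(lr=lr, betas=betas, eps=eps,
                        weight_decay=weight_decay, gamma=gamma)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            beta1, beta2 = group["betas"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                g = p.grad
                state = self.state[p]
                if len(state) == 0:
                    state["step"] = 0
                    state["prev_grad"] = torch.zeros_like(p)
                    state["m"] = torch.zeros_like(p)
                    state["v"] = torch.zeros_like(p)
                state["step"] += 1
                t = state["step"]

                # Variance-reduced correction: add a scaled difference between the
                # current gradient and the previous step's gradient.
                c = g + group["gamma"] * (beta1 / (1 - beta1)) * (g - state["prev_grad"])
                # Clip the corrected gradient so the correction term cannot explode.
                c_norm = c.norm()
                if c_norm > 1.0:
                    c = c / c_norm
                state["prev_grad"] = g.clone()

                # AdamW-style preconditioning applied to the corrected gradient.
                state["m"].mul_(beta1).add_(c, alpha=1 - beta1)
                state["v"].mul_(beta2).addcmul_(c, c, value=1 - beta2)
                m_hat = state["m"] / (1 - beta1 ** t)
                v_hat = state["v"] / (1 - beta2 ** t)

                p.mul_(1 - group["lr"] * group["weight_decay"])  # decoupled weight decay
                p.addcdiv_(m_hat, v_hat.sqrt().add_(group["eps"]), value=-group["lr"])
```

Swapping the moment updates for Lion- or Shampoo-style preconditioning gives the other two instances the paper describes; only the corrected gradient construction is shared.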
#optimizer
Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization
(Weiyun Wang, Zhe Chen, Wenhai Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Jinguo Zhu, Xizhou Zhu, Lewei Lu, Yu Qiao, Jifeng Dai)
Existing open-source multimodal large language models (MLLMs) generally follow a training process involving pre-training and supervised fine-tuning. However, these models suffer from distribution shifts, which limit their multimodal reasoning, particularly in the Chain-of-Thought (CoT) performance. To address this, we introduce a preference optimization (PO) process to enhance the multimodal reasoning capabilities of MLLMs. Specifically, (1) on the data side, we design an automated preference data construction pipeline to create MMPR, a high-quality, large-scale multimodal reasoning preference dataset; and (2) on the model side, we explore integrating PO with MLLMs, developing a simple yet effective method, termed Mixed Preference Optimization (MPO), which boosts multimodal CoT performance. Our approach demonstrates improved performance across multiple benchmarks, particularly in multimodal reasoning tasks. Notably, our model, InternVL2-8B-MPO, achieves an accuracy of 67.0 on MathVista, outperforming InternVL2-8B by 8.7 points and achieving performance comparable to the 10x larger InternVL2-76B. We hope this study could inspire further advancements in MLLMs. Code, data, and model shall be publicly released.
Preference tuning for VLMs. They build the preference dataset in two ways: for tasks with verifiable answers, preference is determined by correctness; for tasks without definitive answers, the response generated with the image is the positive, and the negative is formed by taking a prefix of that response and letting the model complete the rest without the image.
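Roughly how the two branches could look in code; `model.generate`, `is_correct`, the sample count, and the 50% truncation point are hypothetical stand-ins, not the paper's actual MMPR pipeline.

```python
# Illustrative sketch of the two preference-data branches described above.
import random


def is_correct(response: str, reference: str) -> bool:
    # Placeholder check; a real pipeline would extract and match the final answer.
    return reference.strip().lower() in response.lower()


def build_preference_pair(model, image, question, reference_answer=None):
    if reference_answer is not None:
        # Branch 1: verifiable tasks. Sample several image-conditioned responses
        # and use correctness against the reference answer as the preference signal.
        candidates = [model.generate(image=image, prompt=question) for _ in range(8)]
        chosen = [c for c in candidates if is_correct(c, reference_answer)]
        rejected = [c for c in candidates if not is_correct(c, reference_answer)]
        if not chosen or not rejected:
            return None  # correctness doesn't separate the samples; skip
        return {"chosen": random.choice(chosen), "rejected": random.choice(rejected)}

    # Branch 2: open-ended tasks. The image-conditioned response is the positive;
    # the negative truncates it and lets the model finish *without* the image,
    # which tends to drift away from the visual content.
    with_image = model.generate(image=image, prompt=question)
    cut = max(1, len(with_image) // 2)  # illustrative truncation point
    prefix = with_image[:cut]
    completion = model.generate(image=None, prompt=question + "\n" + prefix)
    return {"chosen": with_image, "rejected": prefix + completion}
```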
#preference #vision-language