September 27, 2024
MIO: A Foundation Model on Multimodal Tokens
(Zekun Wang, King Zhu, Chunpu Xu, Wangchunshu Zhou, Jiaheng Liu, Yibo Zhang, Jiashuo Wang, Ning Shi, Siyu Li, Yizhi Li, Haoran Que, Zhaoxiang Zhang, Yuanxing Zhang, Ge Zhang, Ke Xu, Jie Fu, Wenhao Huang)
In this paper, we introduce MIO, a novel foundation model built on multimodal tokens, capable of understanding and generating speech, text, images, and videos in an end-to-end, autoregressive manner. While the emergence of large language models (LLMs) and multimodal large language models (MM-LLMs) propels advancements in artificial general intelligence through their versatile capabilities, they still lack true any-to-any understanding and generation. Recently, the release of GPT-4o has showcased the remarkable potential of any-to-any LLMs for complex real-world tasks, enabling omnidirectional input and output across images, speech, and text. However, it is closed-source and does not support the generation of multimodal interleaved sequences. To address this gap, we present MIO, which is trained on a mixture of discrete tokens across four modalities using causal multimodal modeling. MIO undergoes a four-stage training process: (1) alignment pre-training, (2) interleaved pre-training, (3) speech-enhanced pre-training, and (4) comprehensive supervised fine-tuning on diverse textual, visual, and speech tasks. Our experimental results indicate that MIO exhibits competitive, and in some cases superior, performance compared to previous dual-modal baselines, any-to-any model baselines, and even modality-specific baselines. Moreover, MIO demonstrates advanced capabilities inherent to its any-to-any feature, such as interleaved video-text generation, chain-of-visual-thought reasoning, visual guideline generation, instructional image editing, etc.
An autoregressive generative model for images and speech. Both modalities use discrete tokens.
Quite a lot of work in this line is coming out these days. There is still plenty that is unclear: whether discrete tokens are sufficient or we need to move to continuous tokens, autoregressive vs. diffusion, and if diffusion is used, how to combine it with an autoregressive model, and so on.
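For reference, a minimal sketch of what causal multimodal modeling over a unified discrete vocabulary looks like; the vocabulary sizes, offsets, and tiny transformer below are placeholders of my own, not MIO's actual configuration:

```python
# Minimal sketch: next-token prediction over one vocabulary that concatenates
# text, image, and speech codebook indices. Sizes/model are placeholders.
import torch
import torch.nn as nn

TEXT_VOCAB, IMAGE_CODES, SPEECH_CODES = 32000, 8192, 4096
IMAGE_OFFSET = TEXT_VOCAB                   # image tokens live after text tokens
SPEECH_OFFSET = TEXT_VOCAB + IMAGE_CODES    # speech tokens after image tokens
UNIFIED_VOCAB = TEXT_VOCAB + IMAGE_CODES + SPEECH_CODES

def to_unified(text_ids, image_ids, speech_ids):
    """Map per-modality discrete tokens into one shared id space and concatenate."""
    return torch.cat([text_ids,
                      image_ids + IMAGE_OFFSET,
                      speech_ids + SPEECH_OFFSET], dim=-1)

class TinyCausalLM(nn.Module):
    def __init__(self, d_model=256, n_layers=2, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(UNIFIED_VOCAB, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, UNIFIED_VOCAB)

    def forward(self, ids):
        causal = nn.Transformer.generate_square_subsequent_mask(ids.size(1))
        h = self.blocks(self.embed(ids), mask=causal)
        return self.head(h)

# One training step: predict the next token regardless of its modality.
text = torch.randint(0, TEXT_VOCAB, (2, 16))
image = torch.randint(0, IMAGE_CODES, (2, 32))    # e.g. VQ indices from an image tokenizer
speech = torch.randint(0, SPEECH_CODES, (2, 24))  # e.g. indices from a speech tokenizer
seq = to_unified(text, image, speech)
logits = TinyCausalLM()(seq[:, :-1])
loss = nn.functional.cross_entropy(logits.reshape(-1, UNIFIED_VOCAB),
                                   seq[:, 1:].reshape(-1))
```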
#image-generation #audio-generation #multimodal #vq #autoregressive-model
EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions
(Kai Chen, Yunhao Gou, Runhui Huang, Zhili Liu, Daxin Tan, Jing Xu, Chunwei Wang, Yi Zhu, Yihan Zeng, Kuo Yang, Dingdong Wang, Kun Xiang, Haoyuan Li, Haoli Bai, Jianhua Han, Xiaohui Li, Weike Jin, Nian Xie, Yu Zhang, James T. Kwok, Hengshuang Zhao, Xiaodan Liang, Dit-Yan Yeung, Xiao Chen, Zhenguo Li, Wei Zhang, Qun Liu, Lanqing Hong, Lu Hou, Hang Xu)
GPT-4o, an omni-modal model that enables vocal conversations with diverse emotions and tones, marks a milestone for omni-modal foundation models. However, empowering Large Language Models to perceive and generate images, text, and speech end-to-end with publicly available data remains challenging in the open-source community. Existing vision-language models rely on external tools for speech processing, while speech-language models still suffer from limited, or even absent, vision-understanding abilities. To address this gap, we propose EMOVA (EMotionally Omni-present Voice Assistant) to enable Large Language Models with end-to-end speech capabilities while maintaining leading vision-language performance. With a semantic-acoustic disentangled speech tokenizer, we notice, surprisingly, that omni-modal alignment can further enhance vision-language and speech abilities compared with the corresponding bi-modal aligned counterparts. Moreover, a lightweight style module is proposed for flexible speech style controls (e.g., emotions and pitches). For the first time, EMOVA achieves state-of-the-art performance on both vision-language and speech benchmarks while supporting omni-modal spoken dialogue with vivid emotions.
This one also adds speech generation. Here, image generation is dropped and continuous tokens are used. The most interesting part is probably that vision-text alignment and speech-text alignment showed a synergistic effect even when trained sequentially. Synergy across modalities is something everyone has been hoping would actually materialize.
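As a point of contrast with the discrete-token route above, here is a rough sketch of the usual adapter-style recipe for feeding continuous vision features into an LLM; the encoder, projector, and all sizes are placeholders of mine, not EMOVA's actual modules:

```python
# Rough sketch of continuous vision "tokens": project patch features into the LLM's
# embedding space and concatenate with text embeddings. Placeholder modules only.
import torch
import torch.nn as nn

D_VISION, D_MODEL, VOCAB = 1024, 2048, 32000

vision_encoder = nn.Sequential(nn.Conv2d(3, D_VISION, kernel_size=14, stride=14),
                               nn.Flatten(2))        # stand-in for a ViT patch encoder
projector = nn.Linear(D_VISION, D_MODEL)             # maps vision features to LLM width
text_embed = nn.Embedding(VOCAB, D_MODEL)

image = torch.randn(1, 3, 224, 224)
text_ids = torch.randint(0, VOCAB, (1, 16))

patch_feats = vision_encoder(image).transpose(1, 2)  # (1, 256, D_VISION), continuous
vision_tokens = projector(patch_feats)               # continuous "tokens" in LLM space
inputs_embeds = torch.cat([vision_tokens, text_embed(text_ids)], dim=1)
# inputs_embeds would then be fed to the LLM; only the speech side uses discrete tokens.
```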
Since this came out right as ChatGPT Advanced Voice was rolling out, interest in this line of research was probably even higher. That said, I am curious how Advanced Voice is actually being used by real users.
#speech #multimodal
Lotus: Diffusion-based Visual Foundation Model for High-quality Dense Prediction
(Jing He, Haodong Li, Wei Yin, Yixun Liang, Leheng Li, Kaiqiang Zhou, Hongbo Liu, Bingbing Liu, Ying-Cong Chen)
Leveraging the visual priors of pre-trained text-to-image diffusion models offers a promising solution to enhance zero-shot generalization in dense prediction tasks. However, existing methods often uncritically use the original diffusion formulation, which may not be optimal due to the fundamental differences between dense prediction and image generation. In this paper, we provide a systematic analysis of the diffusion formulation for dense prediction, focusing on both quality and efficiency. We find that the original parameterization type for image generation, which learns to predict noise, is harmful for dense prediction; the multi-step noising/denoising diffusion process is also unnecessary and challenging to optimize. Based on these insights, we introduce Lotus, a diffusion-based visual foundation model with a simple yet effective adaptation protocol for dense prediction. Specifically, Lotus is trained to directly predict annotations instead of noise, thereby avoiding harmful variance. We also reformulate the diffusion process into a single-step procedure, simplifying optimization and significantly boosting inference speed. Additionally, we introduce a novel tuning strategy called detail preserver, which achieves more accurate and fine-grained predictions. Without scaling up the training data or model capacity, Lotus achieves SoTA performance in zero-shot depth and normal estimation across various datasets. It also significantly enhances efficiency, being hundreds of times faster than most existing diffusion-based methods.
Attempts to use diffusion for image prediction tasks keep appearing from time to time. The main differences here are using x-prediction as the objective with a shortened schedule (down to a single step), and jointly training image -> label and label -> image.
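To make the parameterization difference concrete, here is a toy comparison of noise (epsilon) prediction against direct annotation (x) prediction at a single fixed timestep, as the abstract describes; the tiny conditional network and the schedule are placeholders, not Lotus itself:

```python
# Toy contrast between epsilon-prediction and single-step x-prediction for
# dense prediction (e.g. depth). Network and schedule are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDenoiser(nn.Module):
    """Placeholder conditional network: concatenates the noisy map with the RGB image."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(4, 1, kernel_size=3, padding=1)
    def forward(self, x, image, t):
        return self.net(torch.cat([x, image], dim=1))

def eps_prediction_loss(model, image, depth, t):
    """Standard diffusion objective: predict the noise added to the annotation."""
    noise = torch.randn_like(depth)
    alpha_bar = (torch.cos(t * torch.pi / 2) ** 2).view(-1, 1, 1, 1)  # toy schedule
    noisy = alpha_bar.sqrt() * depth + (1 - alpha_bar).sqrt() * noise
    return F.mse_loss(model(noisy, image, t), noise)

def x0_single_step_loss(model, image, depth):
    """x-prediction at one fixed timestep: predict the annotation itself
    from pure noise, conditioned on the image."""
    noise = torch.randn_like(depth)
    t = torch.ones(depth.shape[0])
    return F.mse_loss(model(noise, image, t), depth)

model = TinyDenoiser()
image = torch.randn(2, 3, 64, 64)
depth = torch.randn(2, 1, 64, 64)
t = torch.rand(2)
print(eps_prediction_loss(model, image, depth, t).item(),
      x0_single_step_loss(model, image, depth).item())
```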
#diffusion
MaskLLM: Learnable Semi-Structured Sparsity for Large Language Models
(Gongfan Fang, Hongxu Yin, Saurav Muralidharan, Greg Heinrich, Jeff Pool, Jan Kautz, Pavlo Molchanov, Xinchao Wang)
Large Language Models (LLMs) are distinguished by their massive parameter counts, which typically result in significant redundancy. This work introduces MaskLLM, a learnable pruning method that establishes Semi-structured (or ``N:M'') Sparsity in LLMs, aimed at reducing computational overhead during inference. Instead of developing a new importance criterion, MaskLLM explicitly models N:M patterns as a learnable distribution through Gumbel Softmax sampling. This approach facilitates end-to-end training on large-scale datasets and offers two notable advantages: 1) High-quality Masks - our method effectively scales to large datasets and learns accurate masks; 2) Transferability - the probabilistic modeling of mask distribution enables the transfer learning of sparsity across domains or tasks. We assessed MaskLLM using 2:4 sparsity on various LLMs, including LLaMA-2, Nemotron-4, and GPT-3, with sizes ranging from 843M to 15B parameters, and our empirical results show substantial improvements over state-of-the-art methods. For instance, leading approaches achieve a perplexity (PPL) of 10 or greater on Wikitext compared to the dense model's 5.12 PPL, but MaskLLM achieves a significantly lower 6.72 PPL solely by learning the masks with frozen weights. Furthermore, MaskLLM's learnable nature allows customized masks for lossless application of 2:4 sparsity to downstream tasks or domains. Code is available at \url{https://github.com/NVlabs/MaskLLM}.
A method for learning the masks needed for semi-structured (e.g., 2:4) sparsity. It uses the Gumbel-Softmax trick to learn which mask to apply to each weight. Just as quantization-aware training may be the best answer to the quality loss from quantization, I wonder whether sparsity likewise calls for sparsity-aware training... though that would also be quite a difficult problem.
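A toy re-implementation of the core idea (not the authors' code): every group of 4 weights keeps learnable logits over the 6 possible "keep 2 of 4" patterns, and a differentiable mask is drawn with Gumbel-Softmax so the pattern choice can be trained end-to-end (in the paper, with the underlying LLM weights frozen):

```python
# Toy sketch of learning 2:4 masks via Gumbel-Softmax. Sizes are placeholders.
import itertools
import torch
import torch.nn as nn
import torch.nn.functional as F

# The 6 candidate masks for 2:4 sparsity (choose which 2 of every 4 weights survive).
CANDIDATES = torch.tensor(
    [[1.0 if i in keep else 0.0 for i in range(4)]
     for keep in itertools.combinations(range(4), 2)])                  # (6, 4)

class MaskedLinear2to4(nn.Module):
    def __init__(self, in_features, out_features, tau=4.0):
        super().__init__()
        assert in_features % 4 == 0
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        # One categorical distribution over the 6 patterns per group of 4 weights.
        self.mask_logits = nn.Parameter(torch.zeros(out_features, in_features // 4, 6))
        self.tau = tau

    def current_mask(self, hard=False):
        probs = F.gumbel_softmax(self.mask_logits, tau=self.tau, hard=hard)  # (O, G, 6)
        mask = probs @ CANDIDATES                                            # (O, G, 4)
        return mask.reshape_as(self.weight)

    def forward(self, x):
        return F.linear(x, self.weight * self.current_mask())

layer = MaskedLinear2to4(8, 4)
out = layer(torch.randn(2, 8))
# After training, take hard=True (or the argmax pattern per group) to get an exact 2:4 mask.
```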
#sparsity