Sep 30, 2024

Emu3: Next-Token Prediction is All You Need

(Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, Yingli Zhao, Yulong Ao, Xuebin Min, Tao Li, Boya Wu, Bo Zhao, Bowen Zhang, Liangdong Wang, Guang Liu, Zheqi He, Xi Yang, Jingjing Liu, Yonghua Lin, Tiejun Huang, Zhongyuan Wang)

While next-token prediction is considered a promising path towards artificial general intelligence, it has struggled to excel in multimodal tasks, which are still dominated by diffusion models (e.g., Stable Diffusion) and compositional approaches (e.g., CLIP combined with LLMs). In this paper, we introduce Emu3, a new suite of state-of-the-art multimodal models trained solely with next-token prediction. By tokenizing images, text, and videos into a discrete space, we train a single transformer from scratch on a mixture of multimodal sequences. Emu3 outperforms several well-established task-specific models in both generation and perception tasks, surpassing flagship models such as SDXL and LLaVA-1.6, while eliminating the need for diffusion or compositional architectures. Emu3 is also capable of generating high-fidelity video via predicting the next token in a video sequence. We simplify complex multimodal model designs by converging on a singular focus: tokens, unlocking great potential for scaling both during training and inference. Our results demonstrate that next-token prediction is a promising path towards building general multimodal intelligence beyond language. We open-source key techniques and models to support further research in this direction.

Autoregressive generation of images, video, and text. The tokenizer is a VQ model with a 32K-entry vocabulary, where each token covers an 8x8x4 patch (8x8 spatially, 4 frames temporally). It makes me think that the choice between discrete and continuous tokens may not be a problem in itself, only a matter of efficiency.
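To make the efficiency point concrete, here is a rough sketch of how many tokens this tokenization would produce. It assumes that "8x8x4" means 8x8 spatial and 4x temporal compression, that a single image is tokenized at one token per 8x8 patch, and that "32K" means 32,768 codebook entries. The function names and example resolutions are hypothetical, chosen only for illustration.

```python
import math

# Assumptions read off the post, not confirmed implementation details:
# each token covers an 8x8 spatial patch across 4 frames,
# and the VQ codebook has 32K entries (taken here as 32,768).
SPATIAL = 8
TEMPORAL = 4
VOCAB = 32_768


def image_tokens(height: int, width: int) -> int:
    """Token count for one image, assuming one token per 8x8 patch."""
    return math.ceil(height / SPATIAL) * math.ceil(width / SPATIAL)


def video_tokens(frames: int, height: int, width: int) -> int:
    """Token count for a clip, assuming every 4 frames share one token grid."""
    return math.ceil(frames / TEMPORAL) * image_tokens(height, width)


# Information carried by one discrete token: log2 of the vocabulary size.
bits_per_token = math.log2(VOCAB)

print(image_tokens(512, 512))      # 4096
print(video_tokens(16, 512, 512))  # 16384
print(bits_per_token)              # 15.0
```

Under these assumptions a 512x512 image already costs 4,096 tokens at 15 bits each, and a 16-frame clip costs four times that, so sequence length, rather than the discrete-versus-continuous choice, looks like the main cost.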

#autoregressive-model #text-to-image #text-to-video
