December 12, 2024
Multimodal Latent Language Modeling with Next-Token Diffusion
(Yutao Sun, Hangbo Bao, Wenhui Wang, Zhiliang Peng, Li Dong, Shaohan Huang, Jianyong Wang, Furu Wei)
Multimodal generative models require a unified approach to handle both discrete data (e.g., text and code) and continuous data (e.g., image, audio, video). In this work, we propose Latent Language Modeling (LatentLM), which seamlessly integrates continuous and discrete data using causal Transformers. Specifically, we employ a variational autoencoder (VAE) to represent continuous data as latent vectors and introduce next-token diffusion for autoregressive generation of these vectors. Additionally, we develop σ-VAE to address the challenges of variance collapse, which is crucial for autoregressive modeling. Extensive experiments demonstrate the effectiveness of LatentLM across various modalities. In image generation, LatentLM surpasses Diffusion Transformers in both performance and scalability. When integrated into multimodal large language models, LatentLM provides a general-purpose interface that unifies multimodal generation and understanding. Experimental results show that LatentLM achieves favorable performance compared to Transfusion and vector quantized models in the setting of scaling up training tokens. In text-to-speech synthesis, LatentLM outperforms the state-of-the-art VALL-E 2 model in speaker similarity and robustness, while requiring 10x fewer decoding steps. The results establish LatentLM as a highly effective and scalable approach to advance large multimodal models.
The paper represents continuous data as latent tokens with a VAE and then generates those tokens autoregressively with a diffusion head. The idea is in the same vein as Transfusion (https://arxiv.org/abs/2408.11039) and https://arxiv.org/abs/2406.11838, but here it is cast as the simplest form of next-token prediction.
I'm curious about the inference-speed characteristics, but overall this also looks like a promising direction.
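A minimal sketch (my own, not the authors' code) of what next-token diffusion could look like: a causal Transformer supplies a conditioning hidden state per position, and a small MLP diffusion head denoises the next VAE-latent token from noise. The module sizes, noise schedule, and ε-prediction loss here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class NextTokenDiffusionHead(nn.Module):
    def __init__(self, d_model=256, d_latent=16, n_steps=50):
        super().__init__()
        self.n_steps = n_steps
        # Predicts the noise added to the next latent token, conditioned on the
        # Transformer hidden state and a normalized diffusion timestep.
        self.eps_net = nn.Sequential(
            nn.Linear(d_latent + d_model + 1, 512), nn.SiLU(),
            nn.Linear(512, d_latent),
        )
        betas = torch.linspace(1e-4, 0.02, n_steps)
        self.register_buffer("alpha_bar", torch.cumprod(1.0 - betas, dim=0))

    def loss(self, z_next, h):
        # z_next: clean VAE latent of the next token; h: causal hidden state at the current position.
        t = torch.randint(0, self.n_steps, (z_next.size(0),), device=z_next.device)
        a = self.alpha_bar[t].unsqueeze(-1)
        eps = torch.randn_like(z_next)
        z_t = a.sqrt() * z_next + (1 - a).sqrt() * eps            # forward (noising) process
        t_feat = t.float().unsqueeze(-1) / self.n_steps
        eps_hat = self.eps_net(torch.cat([z_t, h, t_feat], dim=-1))
        return nn.functional.mse_loss(eps_hat, eps)

    @torch.no_grad()
    def sample(self, h):
        # DDIM-style deterministic reverse process: denoise pure noise into the next latent token.
        d_latent = self.eps_net[-1].out_features
        z = torch.randn(h.size(0), d_latent, device=h.device)
        for t in reversed(range(self.n_steps)):
            a_t = self.alpha_bar[t]
            a_prev = self.alpha_bar[t - 1] if t > 0 else torch.tensor(1.0, device=h.device)
            t_feat = torch.full((h.size(0), 1), t / self.n_steps, device=h.device)
            eps_hat = self.eps_net(torch.cat([z, h, t_feat], dim=-1))
            x0_hat = (z - (1 - a_t).sqrt() * eps_hat) / a_t.sqrt()
            z = a_prev.sqrt() * x0_hat + (1 - a_prev).sqrt() * eps_hat
        return z

# Dummy usage: h would come from a causal Transformer, z_next from the VAE encoder.
head = NextTokenDiffusionHead()
h = torch.randn(4, 256)
z_next = torch.randn(4, 16)
print(head.loss(z_next, h).item(), head.sample(h).shape)
```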
#autoregressive-model #vae #diffusion
LAION-SG: An Enhanced Large-Scale Dataset for Training Complex Image-Text Models with Structural Annotations
(Zejian Li, Chenye Meng, Yize Li, Ling Yang, Shengyuan Zhang, Jiarui Ma, Jiayi Li, Guang Yang, Changyuan Yang, Zhiyuan Yang, Jinxiong Chang, Lingyun Sun)
Recent advances in text-to-image (T2I) generation have shown remarkable success in producing high-quality images from text. However, existing T2I models show decayed performance in compositional image generation involving multiple objects and intricate relationships. We attribute this problem to limitations in existing datasets of image-text pairs, which lack precise inter-object relationship annotations with prompts only. To address this problem, we construct LAION-SG, a large-scale dataset with high-quality structural annotations of scene graphs (SG), which precisely describe attributes and relationships of multiple objects, effectively representing the semantic structure in complex scenes. Based on LAION-SG, we train a new foundation model SDXL-SG to incorporate structural annotation information into the generation process. Extensive experiments show advanced models trained on our LAION-SG boast significant performance improvements in complex scene generation over models on existing datasets. We also introduce CompSG-Bench, a benchmark that evaluates models on compositional image generation, establishing a new standard for this domain.
To address the issue that text-to-image models fail to faithfully render multiple objects and their relationships, this work builds a scene graph dataset and designs a model that takes scene graphs as input. Recently there was also an attempt to construct instruction data using scene graphs (https://arxiv.org/abs/2412.07012). Producing good scene graphs is a problem in itself, but I think this is an interesting attempt.
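For concreteness, here is a hypothetical example (not the actual LAION-SG schema) of what a scene-graph annotation might contain, plus a deliberately naive serialization into a compositional prompt; the real dataset format and SDXL-SG conditioning pipeline will differ.

```python
# Hypothetical scene-graph annotation: objects carry attributes,
# and edges are (subject, relation, object) triplets.
scene_graph = {
    "objects": [
        {"id": 0, "name": "cat", "attributes": ["black"]},
        {"id": 1, "name": "sofa", "attributes": ["red", "leather"]},
        {"id": 2, "name": "lamp", "attributes": ["tall"]},
    ],
    "relations": [
        {"subject": 0, "predicate": "sitting on", "object": 1},
        {"subject": 2, "predicate": "next to", "object": 1},
    ],
}

def graph_to_prompt(sg: dict) -> str:
    """Serialize a scene graph into a compositional text prompt (illustration only)."""
    names = {o["id"]: " ".join(o["attributes"] + [o["name"]]) for o in sg["objects"]}
    clauses = [f'{names[r["subject"]]} {r["predicate"]} {names[r["object"]]}'
               for r in sg["relations"]]
    return ", ".join(clauses)

print(graph_to_prompt(scene_graph))
# -> "black cat sitting on red leather sofa, tall lamp next to red leather sofa"
```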
#text-to-image #synthetic-data
Forking Paths in Neural Text Generation
(Eric Bigelow, Ari Holtzman, Hidenori Tanaka, Tomer Ullman)
Estimating uncertainty in Large Language Models (LLMs) is important for properly evaluating LLMs, and ensuring safety for users. However, prior approaches to uncertainty estimation focus on the final answer in generated text, ignoring intermediate steps that might dramatically impact the outcome. We hypothesize that there exist key forking tokens, such that re-sampling the system at those specific tokens, but not others, leads to very different outcomes. To test this empirically, we develop a novel approach to representing uncertainty dynamics across individual tokens of text generation, and applying statistical models to test our hypothesis. Our approach is highly flexible: it can be applied to any dataset and any LLM, without fine tuning or accessing model weights. We use our method to analyze LLM responses on 7 different tasks across 4 domains, spanning a wide range of typical use cases. We find many examples of forking tokens, including surprising ones such as punctuation marks, suggesting that LLMs are often just a single token away from saying something very different.
The hypothesis is that there are tokens in the generation process which, if replaced by other tokens, would drastically change the output. For each token position, the authors sample alternative tokens, generate the rest of the text, and measure how much the outcomes differ. They then use change point detection and survival analysis to test whether there are points in a sequence that cause large shifts in outcomes. There was also a recent study arguing that critical tokens exist in reasoning (https://arxiv.org/abs/2411.19943). An interesting observation.
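A rough sketch of the resampling idea as I read it (my simplification, not the paper's estimator): resample the token at each position, complete the generation several times, track how often the base answer survives, and flag the sharpest drop as a candidate forking token. `resample_outcomes` is a toy stand-in for the actual LLM calls and returns fabricated data.

```python
import random
from collections import Counter

def resample_outcomes(base_tokens, t, n_samples=20):
    # Toy stand-in: in practice, re-sample token t from the LLM, decode the rest
    # of the text, and extract the final answer from each completion.
    random.seed(t)
    flip = 0.05 if t < 6 else 0.6   # fabricated: position 6 acts like a forking token
    return ["A" if random.random() > flip else "B" for _ in range(n_samples)]

def outcome_stability(base_tokens, base_answer="A"):
    # Fraction of resampled completions per position that still reach the base answer.
    stab = []
    for t in range(len(base_tokens)):
        outs = resample_outcomes(base_tokens, t)
        stab.append(Counter(outs)[base_answer] / len(outs))
    return stab

tokens = "The answer , after checking , is 42 .".split()
stab = outcome_stability(tokens)
# Crude change-point heuristic: the position with the largest drop in stability.
drops = [stab[t] - stab[t + 1] for t in range(len(stab) - 1)]
fork = max(range(len(drops)), key=lambda i: drops[i])
print([f"{tok}:{s:.2f}" for tok, s in zip(tokens, stab)])
print("candidate forking token:", tokens[fork + 1])
```

The paper's actual analysis replaces the "largest drop" heuristic with proper change point detection and survival analysis over these per-token uncertainty trajectories.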
#llm