November 13, 2024
JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation
(Yiyang Ma, Xingchao Liu, Xiaokang Chen, Wen Liu, Chengyue Wu, Zhiyu Wu, Zizheng Pan, Zhenda Xie, Haowei Zhang, Xingkai Yu, Liang Zhao, Yisong Wang, Jiaying Liu, Chong Ruan)
We present JanusFlow, a powerful framework that unifies image understanding and generation in a single model. JanusFlow introduces a minimalist architecture that integrates autoregressive language models with rectified flow, a state-of-the-art method in generative modeling. Our key finding demonstrates that rectified flow can be straightforwardly trained within the large language model framework, eliminating the need for complex architectural modifications. To further improve the performance of our unified model, we adopt two key strategies: (i) decoupling the understanding and generation encoders, and (ii) aligning their representations during unified training. Extensive experiments show that JanusFlow achieves comparable or superior performance to specialized models in their respective domains, while significantly outperforming existing unified approaches across standard benchmarks. This work represents a step toward more efficient and versatile vision-language models.
DeepSeek has made a further attempt at separating image understanding and generation (https://arxiv.org/abs/2410.13848). Once understanding and generation are decoupled, there is no real need for the generation side to be autoregressive, so they use rectified flow for generation instead.
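As a reference for what "training rectified flow within the LLM framework" amounts to, here is a minimal sketch of the standard rectified flow objective in PyTorch; `velocity_model` and `cond` are placeholders (e.g. a generation head conditioned on LLM hidden states), not JanusFlow's actual modules.

```python
import torch
import torch.nn.functional as F

def rectified_flow_loss(velocity_model, x1, cond):
    """Standard rectified flow objective: regress the constant velocity
    (x1 - x0) of the straight path x_t = (1 - t) * x0 + t * x1."""
    x0 = torch.randn_like(x1)                       # noise endpoint
    t = torch.rand(x1.shape[0], device=x1.device)   # uniform time in [0, 1]
    t_b = t.view(-1, *([1] * (x1.dim() - 1)))       # broadcast t over latent dims
    x_t = (1.0 - t_b) * x0 + t_b * x1               # linear interpolation
    target_v = x1 - x0                              # velocity of the straight path
    pred_v = velocity_model(x_t, t, cond)           # predicted velocity field
    return F.mse_loss(pred_v, target_v)
```

At sampling time the model just integrates the learned velocity field from noise to data with a few Euler steps, which is why no autoregressive decoding is needed on the generation side.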
#image-generation #multimodal #flow-matching
Wavelet Latent Diffusion (Wala): Billion-Parameter 3D Generative Model with Compact Wavelet Encodings
(Aditya Sanghi, Aliasghar Khani, Pradyumna Reddy, Arianna Rampini, Derek Cheung, Kamal Rahimi Malekshan, Kanika Madan, Hooman Shayani)
Large-scale 3D generative models require substantial computational resources yet often fall short in capturing fine details and complex geometries at high resolutions. We attribute this limitation to the inefficiency of current representations, which lack the compactness required to model the generative models effectively. To address this, we introduce a novel approach called Wavelet Latent Diffusion, or WaLa, that encodes 3D shapes into wavelet-based, compact latent encodings. Specifically, we compress a 256^3 signed distance field into a 12^3×4 latent grid, achieving an impressive 2427x compression ratio with minimal loss of detail. This high level of compression allows our method to efficiently train large-scale generative networks without increasing the inference time. Our models, both conditional and unconditional, contain approximately one billion parameters and successfully generate high-quality 3D shapes at 256^3 resolution. Moreover, WaLa offers rapid inference, producing shapes within two to four seconds depending on the condition, despite the model's scale. We demonstrate state-of-the-art performance across multiple datasets, with significant improvements in generation quality, diversity, and computational efficiency. We open-source our code and, to the best of our knowledge, release the largest pretrained 3D generative models across different modalities.
https://autodeskailab.github.io/WaLaProject/
Wavelets seem to be regaining popularity these days. This model learns a VQ autoencoder in the wavelet space of the SDF and trains a 1B-parameter latent diffusion model on top of it.
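To make the compression arithmetic concrete, here is a toy sketch using PyWavelets on a random stand-in for the SDF volume; only the 256^3 input and the 12^3×4 latent shape come from the abstract, the rest is illustrative.

```python
import numpy as np
import pywt

# Toy stand-in for a 256^3 signed distance field.
sdf = np.random.randn(256, 256, 256).astype(np.float32)

# One level of a 3D discrete wavelet transform splits the volume into
# 8 subbands of size 128^3: one low-pass band ('aaa') plus 7 detail bands.
coeffs = pywt.dwtn(sdf, "haar")
print(sorted(coeffs.keys()))     # ['aaa', 'aad', ..., 'ddd']
print(coeffs["aaa"].shape)       # (128, 128, 128)

# Compression ratio quoted in the abstract: 256^3 scalars reduced to a
# 12^3 x 4 latent grid.
print(256 ** 3 / (12 ** 3 * 4))  # ~2427.3
```

The paper's autoencoder of course does far more than a single fixed wavelet level, but this is the basic idea of working in a wavelet-transformed space before compressing into the latent grid.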
#3d-generation #diffusion
BLIP3-KALE: Knowledge Augmented Large-Scale Dense Captions
(Anas Awadalla, Le Xue, Manli Shu, An Yan, Jun Wang, Senthil Purushwalkam, Sheng Shen, Hannah Lee, Oscar Lo, Jae Sung Park, Etash Guha, Silvio Savarese, Ludwig Schmidt, Yejin Choi, Caiming Xiong, Ran Xu)
We introduce BLIP3-KALE, a dataset of 218 million image-text pairs that bridges the gap between descriptive synthetic captions and factual web-scale alt-text. KALE augments synthetic dense image captions with web-scale alt-text to generate factually grounded image captions. Our two-stage approach leverages large vision-language models and language models to create knowledge-augmented captions, which are then used to train a specialized VLM for scaling up the dataset. We train vision-language models on KALE and demonstrate improvements on vision-language tasks. Our experiments show the utility of KALE for training more capable and knowledgeable multimodal models. We release the KALE dataset at https://huggingface.co/datasets/Salesforce/blip3-kale
Recaptioning by combining captions generated with CogVLM and web alt-text. That said, the gap over the CogVLM captions alone doesn't seem all that large.
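A rough sketch of what the first stage (fusing a dense synthetic caption with web alt-text) could look like as an instruction prompt; the prompt wording and the helper below are hypothetical, not the paper's actual pipeline.

```python
# Hypothetical fusion prompt: keep the visual detail of the dense caption
# while grounding it with factual context from the alt-text.
FUSION_PROMPT = (
    "Dense caption: {caption}\n"
    "Alt-text: {alt_text}\n"
    "Rewrite the dense caption so it keeps its visual detail but also "
    "incorporates any factual information (names, places, events) from the alt-text."
)

def build_fusion_prompt(caption: str, alt_text: str) -> str:
    return FUSION_PROMPT.format(caption=caption, alt_text=alt_text)

prompt = build_fusion_prompt(
    caption="A man in a dark suit stands at a podium in front of a blue banner.",
    alt_text="The president speaking at the 2015 climate summit in Paris.",
)
print(prompt)  # what an LLM would receive to produce the fused, knowledge-grounded caption
```

The second stage then trains a specialized VLM on these fused captions so the process can be scaled up to the full 218M pairs.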
#synthetic-data #image-text #captioning