January 10, 2025
An Empirical Study of Autoregressive Pre-training from Videos
(Jathushan Rajasegaran, Ilija Radosavovic, Rahul Ravishankar, Yossi Gandelsman, Christoph Feichtenhofer, Jitendra Malik)
We empirically study autoregressive pre-training from videos. To perform our study, we construct a series of autoregressive video models, called Toto. We treat videos as sequences of visual tokens and train transformer models to autoregressively predict future tokens. Our models are pre-trained on a diverse dataset of videos and images comprising over 1 trillion visual tokens. We explore different architectural, training, and inference design choices. We evaluate the learned visual representations on a range of downstream tasks including image recognition, video classification, object tracking, and robotics. Our results demonstrate that, despite minimal inductive biases, autoregressive pre-training leads to competitive performance across all benchmarks. Finally, we find that scaling our video models results in similar scaling curves to those seen in language models, albeit with a different rate. More details at https://brjathu.github.io/toto/
An autoregressive video model built on a VQ tokenizer (dVAE) and a Llama-style transformer.
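A minimal sketch of the training objective, under stated assumptions: `tokenize` stands in for the dVAE encoder and `TinyCausalLM` for the Llama-style backbone; the codebook size, tokens per frame, and all hyperparameters below are illustrative, not taken from the paper.

```python
# Sketch: autoregressive next-token pre-training on discrete video tokens.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB_SIZE = 8192        # dVAE codebook size (assumed)
TOKENS_PER_FRAME = 256   # e.g. a 16x16 latent grid per frame (assumed)


def tokenize(frames: torch.Tensor) -> torch.Tensor:
    """Stand-in for the dVAE encoder: maps video frames to discrete token ids."""
    b, t = frames.shape[:2]
    return torch.randint(0, VOCAB_SIZE, (b, t * TOKENS_PER_FRAME))


class TinyCausalLM(nn.Module):
    """Toy decoder-only transformer standing in for the Llama-style backbone."""

    def __init__(self, vocab=VOCAB_SIZE, dim=512, layers=4, heads=8, max_len=4096):
        super().__init__()
        self.tok = nn.Embedding(vocab, dim)
        self.pos = nn.Embedding(max_len, dim)
        block = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, layers)
        self.head = nn.Linear(dim, vocab)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        n = ids.shape[1]
        x = self.tok(ids) + self.pos(torch.arange(n, device=ids.device))
        causal = nn.Transformer.generate_square_subsequent_mask(n).to(ids.device)
        return self.head(self.blocks(x, mask=causal))


model = TinyCausalLM()
frames = torch.randn(2, 8, 3, 128, 128)   # (batch, time, C, H, W)
ids = tokenize(frames)                    # (batch, sequence of visual tokens)
logits = model(ids[:, :-1])               # predict token t+1 from tokens <= t
loss = F.cross_entropy(logits.reshape(-1, VOCAB_SIZE), ids[:, 1:].reshape(-1))
loss.backward()
```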
#video #vq #autoregressive-model #scaling-law
Can MLLMs Reason in Multimodality? EMMA: An Enhanced MultiModal ReAsoning Benchmark
(Yunzhuo Hao, Jiawei Gu, Huichen Will Wang, Linjie Li, Zhengyuan Yang, Lijuan Wang, Yu Cheng)
The ability to organically reason over and with both text and images is a pillar of human intelligence, yet the ability of Multimodal Large Language Models (MLLMs) to perform such multimodal reasoning remains under-explored. Existing benchmarks often emphasize text-dominant reasoning or rely on shallow visual cues, failing to adequately assess integrated visual and textual reasoning. We introduce EMMA (Enhanced MultiModal reAsoning), a benchmark targeting organic multimodal reasoning across mathematics, physics, chemistry, and coding. EMMA tasks demand advanced cross-modal reasoning that cannot be addressed by reasoning independently in each modality, offering an enhanced test suite for MLLMs' reasoning capabilities. Our evaluation of state-of-the-art MLLMs on EMMA reveals significant limitations in handling complex multimodal and multi-step reasoning tasks, with even advanced techniques like Chain-of-Thought prompting and test-time compute scaling underperforming. These findings underscore the need for improved multimodal architectures and training paradigms to close the gap between human and model reasoning in multimodality.
A multimodal reasoning benchmark. A breakthrough in multimodal reasoning would also be an important milestone.
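For concreteness, a minimal sketch of the kind of evaluation harness such a benchmark implies: prompt an MLLM with an image and question (optionally with Chain-of-Thought), extract the final option letter, and score accuracy. `query_mllm` and the item format are hypothetical placeholders, not EMMA's actual code.

```python
# Sketch of a multiple-choice multimodal evaluation loop (hypothetical interfaces).
import re
from typing import Callable, Optional

COT_SUFFIX = "\nThink step by step, then answer with a single option letter."
PLAIN_SUFFIX = "\nAnswer with a single option letter."


def extract_choice(response: str) -> Optional[str]:
    """Pull the last standalone option letter (A-D) out of a free-form response."""
    matches = re.findall(r"\b([A-D])\b", response.upper())
    return matches[-1] if matches else None


def evaluate(items: list, query_mllm: Callable[[str, str], str], cot: bool = True) -> float:
    """items: dicts with 'image' (path), 'question' (str), 'answer' ('A'..'D')."""
    correct = 0
    for item in items:
        prompt = item["question"] + (COT_SUFFIX if cot else PLAIN_SUFFIX)
        pred = extract_choice(query_mllm(item["image"], prompt))
        correct += int(pred == item["answer"])
    return correct / max(len(items), 1)
```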
#multimodal #reasoning #benchmark
Progressive Growing of Video Tokenizers for Highly Compressed Latent Spaces
(Aniruddha Mahapatra, Long Mai, Yitian Zhang, David Bourgin, Feng Liu)
Video tokenizers are essential for latent video diffusion models, converting raw video data into spatiotemporally compressed latent spaces for efficient training. However, extending state-of-the-art video tokenizers to achieve a temporal compression ratio beyond 4x without increasing channel capacity poses significant challenges. In this work, we propose an alternative approach to enhance temporal compression. We find that the reconstruction quality of temporally subsampled videos from a low-compression encoder surpasses that of high-compression encoders applied to original videos. This indicates that high-compression models can leverage representations from lower-compression models. Building on this insight, we develop a bootstrapped high-temporal-compression model that progressively trains high-compression blocks atop well-trained lower-compression models. Our method includes a cross-level feature-mixing module to retain information from the pretrained low-compression model and guide higher-compression blocks to capture the remaining details from the full video sequence. Evaluation on video benchmarks shows that our method significantly improves reconstruction quality while increasing temporal compression compared to direct extensions of existing video tokenizers. Furthermore, the resulting compact latent space effectively trains a video diffusion model for high-quality video generation with a reduced token budget.
A method for increasing the temporal compression ratio of video tokenizers. The starting observation is that 16x compression does not work out of the box, yet a 4x-compression model can successfully compress frames that have been 4x temporally subsampled. In other words, the larger amount of motion that comes with a longer temporal window is not itself the problem. So they devised a method that uses the encoding of the subsampled video as an additional input to guide the higher-compression blocks.
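A rough sketch of the bootstrapping idea under stated assumptions: a frozen stand-in for the pretrained 4x-temporal-compression encoder produces both full-rate latents and a guide from the 4x-subsampled video, and a new high-compression stage fuses them through a simple concat-based mixing layer. Shapes, layers, and the fusion mechanism are illustrative, not the paper's implementation.

```python
# Sketch: training a higher-compression stage on top of a frozen low-compression encoder.
import torch
import torch.nn as nn


class HighCompressionStage(nn.Module):
    """Extra temporal-compression block trained atop a frozen low-compression encoder."""

    def __init__(self, dim=256):
        super().__init__()
        # further 4x temporal downsampling of the full-rate latents
        self.down = nn.Conv3d(dim, dim, kernel_size=(4, 3, 3), stride=(4, 1, 1), padding=(0, 1, 1))
        # cross-level feature mixing: fuse guide features from the subsampled branch
        self.mix = nn.Conv3d(2 * dim, dim, kernel_size=1)

    def forward(self, full_latents: torch.Tensor, guide_latents: torch.Tensor) -> torch.Tensor:
        z = self.down(full_latents)  # (B, C, T/4, H, W): now matches the guide's time axis
        return self.mix(torch.cat([z, guide_latents], dim=1))


low_enc = nn.Conv3d(3, 256, kernel_size=(4, 8, 8), stride=(4, 8, 8))  # stand-in 4x temporal encoder
for p in low_enc.parameters():
    p.requires_grad_(False)                       # keep the pretrained tokenizer frozen

video = torch.randn(1, 3, 16, 64, 64)             # (B, C, T, H, W)
full_latents = low_enc(video)                     # 4x temporal compression of the full video
guide_latents = low_enc(video[:, :, ::4])         # 4x model on 4x-subsampled frames (16x overall)
z16 = HighCompressionStage(256)(full_latents, guide_latents)  # 16x temporally compressed latents
```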
#video #tokenizer