December 18, 2024
VidTok: A Versatile and Open-Source Video Tokenizer
(Anni Tang, Tianyu He, Junliang Guo, Xinle Cheng, Li Song, Jiang Bian)
Encoding video content into compact latent tokens has become a fundamental step in video generation and understanding, driven by the need to address the inherent redundancy in pixel-level representations. Consequently, there is a growing demand for high-performance, open-source video tokenizers as video-centric research gains prominence. We introduce VidTok, a versatile video tokenizer that delivers state-of-the-art performance in both continuous and discrete tokenizations. VidTok incorporates several key advancements over existing approaches: 1) model architecture such as convolutional layers and up/downsampling modules; 2) to address the training instability and codebook collapse commonly associated with conventional Vector Quantization (VQ), we integrate Finite Scalar Quantization (FSQ) into discrete video tokenization; 3) improved training strategies, including a two-stage training process and the use of reduced frame rates. By integrating these advancements, VidTok achieves substantial improvements over existing methods, demonstrating superior performance across multiple metrics, including PSNR, SSIM, LPIPS, and FVD, under standardized evaluation settings.
A video tokenizer. Most of the 3D convolutions are replaced with 2D/1D convolutions. The ablations comparing FSQ with the continuous VAE variant are interesting.
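Since the headline change on the discrete side is replacing VQ with FSQ, here is a minimal NumPy sketch of the FSQ quantization step itself. The level counts and names are illustrative rather than VidTok's actual configuration, and in training the rounding would additionally be paired with a straight-through gradient estimator:

```python
import numpy as np

def fsq_quantize(z, levels=(8, 8, 8, 5, 5, 5)):
    """Quantize each latent channel to one of `levels[i]` fixed values.
    The implicit codebook is the product grid (8*8*8*5*5*5 = 64000 codes),
    so there is no learned codebook and hence no codebook collapse."""
    levels = np.asarray(levels, dtype=float)
    half_l = (levels - 1) / 2.0
    # even level counts need a half-step offset so rounding yields exactly L values
    offset = np.where(levels % 2 == 0, 0.5, 0.0)
    shift = np.arctanh(offset / half_l)
    z_bounded = np.tanh(z + shift) * half_l - offset   # each channel now spans L bins
    z_q = np.round(z_bounded)                          # per-channel rounding (straight-through in training)
    # flatten the per-channel integers into a single token index
    digits = (z_q + levels // 2).astype(int)           # shift into [0, levels[i] - 1]
    strides = np.cumprod(np.concatenate(([1.0], levels[:-1]))).astype(int)
    token_ids = (digits * strides).sum(axis=-1)
    return z_q, token_ids

z = np.random.randn(4, 6)                              # 4 latent vectors, 6 channels each
codes, token_ids = fsq_quantize(z)
print(codes.shape, token_ids)                          # (4, 6) and 4 integer token ids
```

Because the "codebook" is just this fixed grid, the usual commitment losses and codebook-usage tricks of VQ are unnecessary, which is presumably where the training-stability benefit comes from.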
#vq
Are Your LLMs Capable of Stable Reasoning?
(Junnan Liu, Hongwei Liu, Linchen Xiao, Ziyi Wang, Kuikun Liu, Songyang Gao, Wenwei Zhang, Songyang Zhang, Kai Chen)
The rapid advancement of Large Language Models (LLMs) has demonstrated remarkable progress in complex reasoning tasks. However, a significant discrepancy persists between benchmark performances and real-world applications. We identify this gap as primarily stemming from current evaluation protocols and metrics, which inadequately capture the full spectrum of LLM capabilities, particularly in complex reasoning tasks where both accuracy and consistency are crucial. This work makes two key contributions. First, we introduce G-Pass@k, a novel evaluation metric that provides a continuous assessment of model performance across multiple sampling attempts, quantifying both the model's peak performance potential and its stability. Second, we present LiveMathBench, a dynamic benchmark comprising challenging, contemporary mathematical problems designed to minimize data leakage risks during evaluation. Through extensive experiments using G-Pass@k on state-of-the-art LLMs with LiveMathBench, we provide comprehensive insights into both their maximum capabilities and operational consistency. Our findings reveal substantial room for improvement in LLMs' "realistic" reasoning capabilities, highlighting the need for more robust evaluation methods. The benchmark and detailed results are available at: https://github.com/open-compass/GPassK.
The idea is to measure not just whether a model can produce a correct answer at all, but how stably it produces correct answers relative to incorrect ones.
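As a rough illustration of the metric, the sketch below computes a G-Pass@k-style score under the hypergeometric formulation I read out of the abstract: given n sampled answers per problem with c of them correct, it estimates the probability that at least ⌈τ·k⌉ of k drawn answers are correct. Function names and the toy numbers are mine:

```python
from math import ceil, comb

def g_pass_at_k(n, c, k, tau):
    """Probability that at least ceil(tau*k) of k answers drawn without
    replacement from n samples (c of them correct) are correct.
    A small tau recovers plain pass@k; tau = 1.0 demands all k correct."""
    need = ceil(tau * k)
    total = comb(n, k)
    return sum(comb(c, j) * comb(n - c, k - j) for j in range(need, k + 1)) / total

# toy usage: 16 samples per problem, per-problem correct counts, k = 4
correct_counts = [12, 4, 16, 0, 9]
for tau in (0.25, 0.5, 1.0):
    score = sum(g_pass_at_k(16, c, 4, tau) for c in correct_counts) / len(correct_counts)
    print(f"G-Pass@4 (tau={tau}): {score:.3f}")
```

Sweeping τ from small values toward 1.0 is what separates "can occasionally solve it" from "solves it consistently", which is the stability axis the paper argues current benchmarks miss.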
#benchmark #metric
SWAN: Preprocessing SGD Enables Adam-Level Performance On LLM Training With Significant Memory Reduction
(Chao Ma, Wenbo Gong, Meyer Scetbon, Edward Meeds)
Adaptive optimizers such as Adam (Kingma & Ba, 2015) have been central to the success of large language models. However, they maintain additional moving average states throughout training, which results in memory requirements several times greater than the model. This overhead imposes constraints on scalability and computational efficiency. On the other hand, while stochastic gradient descent (SGD) is optimal in terms of memory efficiency, their capability in LLM training is limited (Zhao et al., 2024b). To address this dilemma, we show that pre-processing SGD is sufficient to reach Adam-level performance on LLMs. Specifically, we propose to preprocess the instantaneous stochastic gradients with two simple operators: GradNorm and GradWhitening. GradNorm stabilizes gradient distributions, and GradWhitening counteracts the local curvature of the loss landscape, respectively. This results in SWAN (SGD with Whitening And Normalization), a stochastic optimizer that eliminates the need to store any accumulative state variables. Empirically, SWAN has the same memory footprint as SGD, achieving ≈ 50% reduction on total end-to-end memory compared to Adam. In language modeling tasks, SWAN demonstrates the same or even a substantial improvement over Adam. Specifically, when pre-training the LLaMa model with 350M and 1.3B parameters, SWAN achieves a 2x speedup by reaching the same evaluation perplexity in less than half tokens seen.
The idea is to drop all optimizer states and instead normalize and whiten the instantaneous gradients. The cost of the whitening operation would seem to be the main concern, but the paper reports that it is manageable.
Now that the sheer size of optimizer states has itself become a bottleneck, it feels like optimizer research is, if anything, becoming more active.
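A rough sketch of the two preprocessing steps as I understand them from the abstract: GradNorm as a per-row standardization of the gradient matrix, and GradWhitening as left-multiplication by (GGᵀ)^(-1/2). The exact normalization axes, the inverse-square-root solver, and the update rule in the paper may differ:

```python
import numpy as np

def grad_norm(G, eps=1e-8):
    """GradNorm (as I read it): standardize the gradient distribution,
    here per row, so every row has zero mean and unit variance."""
    mu = G.mean(axis=1, keepdims=True)
    sigma = G.std(axis=1, keepdims=True)
    return (G - mu) / (sigma + eps)

def grad_whiten(G, eps=1e-8):
    """GradWhitening (as I read it): left-multiply by (G G^T)^{-1/2} to
    counteract local curvature. Computed here with an eigendecomposition;
    a cheaper iterative scheme (e.g. Newton-Schulz) could be used instead."""
    C = G @ G.T
    w, V = np.linalg.eigh(C)
    inv_sqrt = V @ np.diag(1.0 / np.sqrt(np.maximum(w, eps))) @ V.T
    return inv_sqrt @ G

def swan_step(W, G, lr=1e-3):
    """One SGD step on the preprocessed gradient; no accumulated state is kept."""
    return W - lr * grad_whiten(grad_norm(G))

# toy usage on a 256 x 512 weight matrix
rng = np.random.default_rng(0)
W = rng.standard_normal((256, 512)) * 0.02
G = rng.standard_normal((256, 512))
W = swan_step(W, G)
```

The appeal is that everything above is a function of the current gradient only, so the memory footprint matches plain SGD; the whitening cost scales with the smaller matrix dimension, which is presumably why it stays tolerable for transformer weight shapes.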
#optimizer