December 16, 2024
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
(Zhiyu Wu, Xiaokang Chen, Zizheng Pan, Xingchao Liu, Wen Liu, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, Zhenda Xie, Yu Wu, Kai Hu, Jiawei Wang, Yaofeng Sun, Yukun Li, Yishi Piao, Kang Guan, Aixin Liu, Xin Xie, Yuxiang You, Kai Dong, Xingkai Yu, Haowei Zhang, Liang Zhao, Yisong Wang, Chong Ruan)
We present DeepSeek-VL2, an advanced series of large Mixture-of-Experts (MoE) Vision-Language Models that significantly improves upon its predecessor, DeepSeek-VL, through two key major upgrades. For the vision component, we incorporate a dynamic tiling vision encoding strategy designed for processing high-resolution images with different aspect ratios. For the language component, we leverage DeepSeekMoE models with the Multi-head Latent Attention mechanism, which compresses Key-Value cache into latent vectors, to enable efficient inference and high throughput. Trained on an improved vision-language dataset, DeepSeek-VL2 demonstrates superior capabilities across various tasks, including but not limited to visual question answering, optical character recognition, document/table/chart understanding, and visual grounding. Our model series is composed of three variants: DeepSeek-VL2-Tiny, DeepSeek-VL2-Small and DeepSeek-VL2, with 1.0B, 2.8B and 4.5B activated parameters respectively. DeepSeek-VL2 achieves competitive or state-of-the-art performance with similar or fewer activated parameters compared to existing open-source dense and MoE-based models. Codes and pre-trained models are publicly accessible at https://github.com/deepseek-ai/DeepSeek-VL2.
DeepSeek has released a new vision-language model. The architecture is DeepSeek-style MoE + MLA, but with auxiliary-loss-free load balancing (https://arxiv.org/abs/2408.15664). I thought that was an interesting method when it came out, and it seems DeepSeek trusts it internally as well.
The dataset composition has also been changed quite a bit, with tasks like grounding added. That said, they rely on open-source data more than I expected.
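For reference, a minimal sketch of what auxiliary-loss-free load balancing looks like in an MoE router, as I understand it from the cited paper: each expert carries a bias that only affects top-k selection and is nudged after every batch to even out the load, instead of training against an auxiliary balancing loss. The class name, hyperparameters, and update rule details below are my own illustrative assumptions, not DeepSeek's code.

```python
# Sketch of auxiliary-loss-free load balancing for MoE routing (arXiv:2408.15664).
# A per-expert bias is added to routing scores only for expert *selection*; after
# each batch it is pushed down for overloaded experts and up for underloaded ones.

import torch

class BiasBalancedRouter(torch.nn.Module):
    def __init__(self, dim: int, n_experts: int, top_k: int = 2, bias_lr: float = 0.01):
        super().__init__()
        self.gate = torch.nn.Linear(dim, n_experts, bias=False)
        self.register_buffer("expert_bias", torch.zeros(n_experts))
        self.top_k = top_k
        self.bias_lr = bias_lr  # update speed of the balancing bias

    def forward(self, x: torch.Tensor):
        # x: (tokens, dim)
        scores = self.gate(x)                          # token-to-expert affinities
        biased = scores + self.expert_bias             # bias used only for selection
        topk_idx = biased.topk(self.top_k, dim=-1).indices
        # Gating weights come from the unbiased scores of the selected experts.
        gates = torch.softmax(scores.gather(-1, topk_idx), dim=-1)

        if self.training:
            # Count how many tokens each expert received in this batch.
            load = torch.bincount(topk_idx.flatten(), minlength=scores.size(-1)).float()
            # Push bias down for overloaded experts, up for underloaded ones.
            self.expert_bias -= self.bias_lr * torch.sign(load - load.mean())
        return topk_idx, gates
```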
#vision-language #moe
Apollo: An Exploration of Video Understanding in Large Multimodal Models
(Orr Zohar, Xiaohan Wang, Yann Dubois, Nikhil Mehta, Tong Xiao, Philippe Hansen-Estruch, Licheng Yu, Xiaofang Wang, Felix Juefei-Xu, Ning Zhang, Serena Yeung-Levy, Xide Xia)
Despite the rapid integration of video perception capabilities into Large Multimodal Models (LMMs), the underlying mechanisms driving their video understanding remain poorly understood. Consequently, many design decisions in this domain are made without proper justification or analysis. The high computational cost of training and evaluating such models, coupled with limited open research, hinders the development of video-LMMs. To address this, we present a comprehensive study that helps uncover what effectively drives video understanding in LMMs. We begin by critically examining the primary contributors to the high computational requirements associated with video-LMM research and discover Scaling Consistency, wherein design and training decisions made on smaller models and datasets (up to a critical size) effectively transfer to larger models. Leveraging these insights, we explored many video-specific aspects of video-LMMs, including video sampling, architectures, data composition, training schedules, and more. For example, we demonstrated that fps sampling during training is vastly preferable to uniform frame sampling and which vision encoders are the best for video representation. Guided by these findings, we introduce Apollo, a state-of-the-art family of LMMs that achieve superior performance across different model sizes. Our models can perceive hour-long videos efficiently, with Apollo-3B outperforming most existing 7B models with an impressive 55.1 on LongVideoBench. Apollo-7B is state-of-the-art compared to 7B LMMs with a 70.9 on MLVU, and 63.3 on Video-MME.
A study on design choices for video-language models. There are countless knobs, such as the video sampling method, and it is hard to test all of them at large scale. So the authors analyzed at what model and data scale experimental results correlate well with results after scaling up, and used that to set the minimum scale for their ablations. Quite a clever approach.
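To make the fps-vs-uniform sampling point from the abstract concrete, here is a minimal sketch of the two schemes: uniform sampling picks a fixed number of frames regardless of video length, while fps sampling picks frames at a fixed temporal rate up to a budget. The function names, the 2 fps default, and the capping rule are illustrative assumptions, not Apollo's actual pipeline.

```python
import numpy as np

def uniform_sample(n_video_frames: int, n_samples: int) -> np.ndarray:
    """Pick n_samples frame indices spread evenly over the whole video."""
    return np.linspace(0, n_video_frames - 1, num=n_samples).round().astype(int)

def fps_sample(n_video_frames: int, video_fps: float,
               target_fps: float = 2.0, max_frames: int = 256) -> np.ndarray:
    """Pick frames at target_fps regardless of video length, up to max_frames."""
    step = max(int(round(video_fps / target_fps)), 1)
    idx = np.arange(0, n_video_frames, step)
    if len(idx) > max_frames:  # fall back to an even subsample if over budget
        idx = idx[np.linspace(0, len(idx) - 1, num=max_frames).round().astype(int)]
    return idx

# e.g. a 60 s clip at 30 fps: uniform_sample(1800, 32) vs fps_sample(1800, 30.0)
```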
#video-language #multimodal #scaling-law
Memory Layers at Scale
(Vincent-Pierre Berges, Barlas Oğuz, Daniel Haziza, Wen-tau Yih, Luke Zettlemoyer, Gargi Ghosh)
Memory layers use a trainable key-value lookup mechanism to add extra parameters to a model without increasing FLOPs. Conceptually, sparsely activated memory layers complement compute-heavy dense feed-forward layers, providing dedicated capacity to store and retrieve information cheaply. This work takes memory layers beyond proof-of-concept, proving their utility at contemporary scale. On downstream tasks, language models augmented with our improved memory layer outperform dense models with more than twice the computation budget, as well as mixture-of-expert models when matched for both compute and parameters. We find gains are especially pronounced for factual tasks. We provide a fully parallelizable memory layer implementation, demonstrating scaling laws with up to 128B memory parameters, pretrained to 1 trillion tokens, comparing to base models with up to 8B parameters.
Work on sparse key-value memory keeps appearing (https://arxiv.org/abs/2407.04153, https://arxiv.org/abs/2411.12364). This one likewise uses product key memory. Storing knowledge in a separate memory is interesting, but considering the problems with sparse updates reported on the VQ side, it seems like it could be tricky (https://arxiv.org/abs/2411.02038, https://arxiv.org/abs/2412.02692).
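For readers unfamiliar with product key memory, a rough sketch of the lookup: the query is split in half, each half is scored against √N sub-keys, and the Cartesian product of the two top-k lists indexes an N-slot value table, so only a handful of values are touched per token. The dimensions and scoring details below are assumptions for illustration, not the paper's implementation.

```python
import torch

class ProductKeyMemory(torch.nn.Module):
    def __init__(self, dim: int = 512, n_sub: int = 512, top_k: int = 8, value_dim: int = 512):
        super().__init__()
        half = dim // 2
        self.sub_keys_1 = torch.nn.Parameter(torch.randn(n_sub, half) / half**0.5)
        self.sub_keys_2 = torch.nn.Parameter(torch.randn(n_sub, half) / half**0.5)
        # N = n_sub**2 values, but only top_k of them are read per query (sparse).
        self.values = torch.nn.Embedding(n_sub * n_sub, value_dim)
        self.n_sub, self.top_k = n_sub, top_k

    def forward(self, q: torch.Tensor) -> torch.Tensor:
        # q: (batch, dim) -> split into halves, score each against its own sub-keys
        q1, q2 = q.chunk(2, dim=-1)
        s1, i1 = (q1 @ self.sub_keys_1.T).topk(self.top_k, dim=-1)   # (batch, top_k)
        s2, i2 = (q2 @ self.sub_keys_2.T).topk(self.top_k, dim=-1)
        # Cartesian product of the two top-k lists: top_k**2 candidate slots.
        scores = s1[:, :, None] + s2[:, None, :]                      # (batch, k, k)
        slots = i1[:, :, None] * self.n_sub + i2[:, None, :]          # flat value indices
        best_scores, best = scores.flatten(1).topk(self.top_k, dim=-1)
        best_slots = slots.flatten(1).gather(1, best)
        w = torch.softmax(best_scores, dim=-1)                        # (batch, top_k)
        return (w[:, :, None] * self.values(best_slots)).sum(dim=1)   # (batch, value_dim)
```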
#moe #sparsity
CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models
(Zhihao Du, Yuxuan Wang, Qian Chen, Xian Shi, Xiang Lv, Tianyu Zhao, Zhifu Gao, Yexin Yang, Changfeng Gao, Hui Wang, Fan Yu, Huadai Liu, Zhengyan Sheng, Yue Gu, Chong Deng, Wen Wang, Shiliang Zhang, Zhijie Yan, Jingren Zhou)
In our previous work, we introduced CosyVoice, a multilingual speech synthesis model based on supervised discrete speech tokens. By employing progressive semantic decoding with two popular generative models, language models (LMs) and Flow Matching, CosyVoice demonstrated high prosody naturalness, content consistency, and speaker similarity in speech in-context learning. Recently, significant progress has been made in multi-modal large language models (LLMs), where the response latency and real-time factor of speech synthesis play a crucial role in the interactive experience. Therefore, in this report, we present an improved streaming speech synthesis model, CosyVoice 2, which incorporates comprehensive and systematic optimizations. Specifically, we introduce finite-scalar quantization to improve the codebook utilization of speech tokens. For the text-speech LM, we streamline the model architecture to allow direct use of a pre-trained LLM as the backbone. In addition, we develop a chunk-aware causal flow matching model to support various synthesis scenarios, enabling both streaming and non-streaming synthesis within a single model. By training on a large-scale multilingual dataset, CosyVoice 2 achieves human-parity naturalness, minimal response latency, and virtually lossless synthesis quality in the streaming mode. We invite readers to listen to the demos at https://funaudiollm.github.io/cosyvoice2.
The method uses FSQ tokens, extracted from a model trained for ASR, as semantic tokens for autoregressive text-to-speech modeling, and generates the mel spectrogram with flow matching conditioned on a reference utterance. As the paper itself notes, this structure does not allow the style of the generated speech to be specified through text input.
The constraint that output must stream in real time makes the modeling choices interesting. There has been a lot of progress on autoregressive generation on the image side as well, so it would be fun to think about combining the two.
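For context, a minimal sketch of finite scalar quantization (FSQ, https://arxiv.org/abs/2309.15505), the kind of quantizer the paper uses to improve codebook utilization: each latent dimension is squashed, rounded to a few fixed levels, and the per-dimension codes are packed into a single token id. The level configuration and straight-through details below are my own assumptions, not CosyVoice 2's code.

```python
import torch

class FSQ(torch.nn.Module):
    def __init__(self, levels=(8, 5, 5, 5)):
        super().__init__()
        self.register_buffer("levels", torch.tensor(levels, dtype=torch.float32))
        # basis for packing per-dimension codes into a single flat token id
        self.register_buffer(
            "basis", torch.cumprod(torch.tensor((1,) + levels[:-1]), dim=0)
        )

    def forward(self, z: torch.Tensor):
        # z: (..., len(levels)) continuous latents from the encoder
        half = (self.levels - 1) / 2
        bounded = torch.tanh(z) * half + half          # squash into [0, levels-1]
        codes = bounded.round()
        # straight-through estimator so gradients flow through the rounding
        quantized = bounded + (codes - bounded).detach()
        token_id = (codes * self.basis).sum(dim=-1).long()
        return quantized, token_id
```

With levels (8, 5, 5, 5) this gives a 1000-entry implicit codebook, and because every rounded combination is a valid code, utilization problems of learned VQ codebooks largely disappear.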
#autoregressive-model #text-to-speech #vq #flow-matching