2024년 12월 9일
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
(Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, Lixin Gu, Xuehui Wang, Qingyun Li, Yimin Ren, Zixuan Chen, Jiapeng Luo, Jiahao Wang, Tan Jiang, Bo Wang, Conghui He, Botian Shi, Xingcheng Zhang, Han Lv, Yi Wang, Wenqi Shao, Pei Chu, Zhongying Tu, Tong He, Zhiyong Wu, Huipeng Deng, Jiaye Ge, Kai Chen, Min Dou, Lewei Lu, Xizhou Zhu, Tong Lu, Dahua Lin, Yu Qiao, Jifeng Dai, Wenhai Wang)
We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series that builds upon InternVL 2.0, maintaining its core model architecture while introducing significant enhancements in training and testing strategies as well as data quality. In this work, we delve into the relationship between model scaling and performance, systematically exploring the performance trends in vision encoders, language models, dataset sizes, and test-time configurations. Through extensive evaluations on a wide range of benchmarks, including multi-discipline reasoning, document understanding, multi-image / video understanding, real-world comprehension, multimodal hallucination detection, visual grounding, multilingual capabilities, and pure language processing, InternVL 2.5 exhibits competitive performance, rivaling leading commercial models such as GPT-4o and Claude-3.5-Sonnet. Notably, our model is the first open-source MLLM to surpass 70% on the MMMU benchmark, achieving a 3.7-point improvement through Chain-of-Thought (CoT) reasoning and showcasing strong potential for test-time scaling. We hope this model contributes to the open-source community by setting new standards for developing and applying multimodal AI systems. See the HuggingFace demo at https://huggingface.co/spaces/OpenGVLab/InternVL
A VLM from the InternLM team. The report is remarkably detailed, going all the way down to issues like sequence packing. The main points are using a large ViT, improving efficiency by first tuning it alongside a small LLM and then attaching the tuned ViT to a large LLM, and fixing data issues such as repetitive patterns.
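A minimal PyTorch-style sketch of that staged recipe, purely as an illustration: the large ViT is first aligned together with a small LLM, then the tuned ViT is reused with a larger LLM. All module names, dimensions, and the MLP projector below are assumptions, not the actual InternVL 2.5 code.

```python
import torch
import torch.nn as nn

class VLM(nn.Module):
    """Generic ViT + projector + LLM wrapper (illustrative, not InternVL's code)."""
    def __init__(self, vit: nn.Module, llm: nn.Module, vit_dim: int, llm_dim: int):
        super().__init__()
        self.vit = vit
        self.llm = llm
        # Simple MLP projector mapping vision features into the LLM embedding space.
        self.projector = nn.Sequential(
            nn.Linear(vit_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )

    def forward(self, pixels: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        vision_tokens = self.projector(self.vit(pixels))         # (B, N_img, llm_dim)
        inputs = torch.cat([vision_tokens, text_embeds], dim=1)  # prepend image tokens
        return self.llm(inputs)

# Stage 1: tune the large ViT together with a *small* LLM -- cheaper iterations
# over alignment data (hypothetical model names and sizes).
# stage1 = VLM(big_vit, small_llm, vit_dim=3200, llm_dim=2048)

# Stage 2: reuse the already-tuned ViT and attach the *large* LLM; the vision
# encoder can be kept frozen while the projector and LLM side are trained.
# stage2 = VLM(stage1.vit, large_llm, vit_dim=3200, llm_dim=8192)
# for p in stage2.vit.parameters():
#     p.requires_grad = False
```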
#multimodal #vision-language
CompCap: Improving Multimodal Large Language Models with Composite Captions
(Xiaohui Chen, Satya Narayan Shukla, Mahmoud Azab, Aashu Singh, Qifan Wang, David Yang, ShengYun Peng, Hanchao Yu, Shen Yan, Xuewen Zhang, Baosheng He)
How well can Multimodal Large Language Models (MLLMs) understand composite images? Composite images (CIs) are synthetic visuals created by merging multiple visual elements, such as charts, posters, or screenshots, rather than being captured directly by a camera. While CIs are prevalent in real-world applications, recent MLLM developments have primarily focused on interpreting natural images (NIs). Our research reveals that current MLLMs face significant challenges in accurately understanding CIs, often struggling to extract information or perform complex reasoning based on these images. We find that existing training data for CIs are mostly formatted for question-answer tasks (e.g., in datasets like ChartQA and ScienceQA), while high-quality image-caption datasets, critical for robust vision-language alignment, are only available for NIs. To bridge this gap, we introduce Composite Captions (CompCap), a flexible framework that leverages Large Language Models (LLMs) and automation tools to synthesize CIs with accurate and detailed captions. Using CompCap, we curate CompCap-118K, a dataset containing 118K image-caption pairs across six CI types. We validate the effectiveness of CompCap-118K by supervised fine-tuning MLLMs of three sizes: xGen-MM-inst.-4B and LLaVA-NeXT-Vicuna-7B/13B. Empirical results show that CompCap-118K significantly enhances MLLMs' understanding of CIs, yielding average gains of 1.7%, 2.0%, and 2.9% across eleven benchmarks, respectively.
Caption generation for composite images such as charts and screenshots. Since existing VLMs do not handle this kind of data well, they synthesize both the images and their captions from metadata.
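A toy sketch of the general idea, under the assumption that composite images are rendered programmatically so their captions can be derived from the known layout. The collage function and caption template are hypothetical; the actual CompCap pipeline additionally uses an LLM to merge per-element captions into a fluent description.

```python
from PIL import Image

def make_collage(items, cell=256, cols=2):
    """items: list of (PIL.Image, caption) pairs whose captions are already trusted."""
    rows = (len(items) + cols - 1) // cols
    canvas = Image.new("RGB", (cols * cell, rows * cell), "white")
    layout = []
    for i, (img, caption) in enumerate(items):
        r, c = divmod(i, cols)
        canvas.paste(img.resize((cell, cell)), (c * cell, r * cell))
        layout.append(f"position (row {r + 1}, column {c + 1}): {caption}")
    # Because the layout is constructed by us, the composite caption is correct
    # by design; an LLM can then rewrite this raw description into fluent text.
    composite_caption = f"A {rows}x{cols} collage. " + " ".join(layout)
    return canvas, composite_caption

# Example usage with hypothetical source images and captions:
# collage, caption = make_collage([
#     (Image.open("dog.jpg"), "a dog running on grass"),
#     (Image.open("chart.png"), "a bar chart of monthly sales"),
# ])
```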
#dataset #captioning #vision-language
Transformers Struggle to Learn to Search
(Abulhair Saparov, Srushti Pawar, Shreyas Pimpalgaonkar, Nitish Joshi, Richard Yuanzhe Pang, Vishakh Padmakumar, Seyed Mehran Kazemi, Najoung Kim, He He)
Search is an ability foundational in many important tasks, and recent studies have shown that large language models (LLMs) struggle to perform search robustly. It is unknown whether this inability is due to a lack of data, insufficient model parameters, or fundamental limitations of the transformer architecture. In this work, we use the foundational graph connectivity problem as a testbed to generate effectively limitless high-coverage data to train small transformers and test whether they can learn to perform search. We find that, when given the right training distribution, the transformer is able to learn to search. We analyze the algorithm that the transformer has learned through a novel mechanistic interpretability technique that enables us to extract the computation graph from the trained model. We find that for each vertex in the input graph, transformers compute the set of vertices reachable from that vertex. Each layer then progressively expands these sets, allowing the model to search over a number of vertices exponential in the number of layers. However, we find that as the input graph size increases, the transformer has greater difficulty in learning the task. This difficulty is not resolved even as the number of parameters is increased, suggesting that increasing model scale will not lead to robust search abilities. We also find that performing search in-context (i.e., chain-of-thought) does not resolve this inability to learn to search on larger graphs.
An analysis of whether transformers can learn graph search. Given the right training distribution, they can learn a generalizable algorithm (merging the sets of vertices reachable from each vertex), although learning becomes harder as the graph grows. A study on the limits of next-token prediction tackled a similar problem. (https://arxiv.org/abs/2403.06963)
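The extracted algorithm is easy to state in plain code: each layer takes the union of the reachability sets of the vertices already reached, so the search horizon roughly doubles per layer. A small illustrative sketch of that mechanism (not the paper's interpretability code):

```python
def layerwise_reachability(edges, num_vertices, num_layers):
    # Layer 0: each vertex reaches itself and its direct successors.
    reach = {v: {v} | {w for (u, w) in edges if u == v} for v in range(num_vertices)}
    for _ in range(num_layers):
        # One "layer": union in the reachability sets of every vertex already reached.
        reach = {v: set().union(*(reach[w] for w in reach[v])) for v in reach}
    return reach

# On a path graph 0 -> 1 -> ... -> 8, the reachable distance grows ~2^layers:
edges = [(i, i + 1) for i in range(8)]
print(layerwise_reachability(edges, 9, 1)[0])  # {0, 1, 2}        -- distance 2
print(layerwise_reachability(edges, 9, 2)[0])  # {0, 1, 2, 3, 4}  -- distance 4
print(layerwise_reachability(edges, 9, 3)[0])  # all 9 vertices   -- distance 8
```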
#transformer #mechanistic-interpretation
Transformers Can Navigate Mazes With Multi-Step Prediction
(Niklas Nolte, Ouail Kitouni, Adina Williams, Mike Rabbat, Mark Ibrahim)
Despite their remarkable success in language modeling, transformers trained to predict the next token in a sequence struggle with long-term planning. This limitation is particularly evident in tasks requiring foresight to plan multiple steps ahead such as maze navigation. The standard next single token prediction objective, however, offers no explicit mechanism to predict multiple steps ahead - or revisit the path taken so far. Consequently, in this work we study whether explicitly predicting multiple steps ahead (and backwards) can improve transformers' maze navigation. We train parameter-matched transformers from scratch, under identical settings, to navigate mazes of varying types and sizes with standard next token prediction and MLM-U, an objective explicitly predicting multiple steps ahead and backwards. We find that MLM-U considerably improves transformers' ability to navigate mazes compared to standard next token prediction across maze types and complexities. We also find MLM-U training is 4x more sample efficient and converges 2x faster in terms of GPU training hours relative to next token training. Finally, for more complex mazes we find MLM-U benefits from scaling to larger transformers. Remarkably, we find transformers trained with MLM-U outperform larger transformers trained with next token prediction using additional supervision from A* search traces. We hope these findings underscore the promise of learning objectives to advance transformers' capacity for long-term planning.
Shows that transformers can learn maze navigation when trained with an MLM-style objective that predicts N tokens ahead and behind instead of plain next-token prediction, which is similar in spirit to the study above. They also touch on the numerical precision problem of RoPE. (https://arxiv.org/abs/2411.13476)
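A rough sketch of what an MLM-U-style objective can look like, assuming a bidirectional model over tokenized maze/path sequences: a uniformly sampled fraction of positions is masked and all of them are predicted at once, forcing the model to reason both forwards and backwards along the path. The masking scheme, mask id, and model interface are illustrative, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

MASK_ID = 0  # assumed reserved mask token id

def mlm_u_loss(model, tokens: torch.Tensor) -> torch.Tensor:
    """tokens: (batch, seq_len) token ids encoding a maze and its solution path."""
    batch, seq_len = tokens.shape
    # Sample a masking ratio uniformly per example, then mask that fraction of positions.
    ratio = torch.rand(batch, 1)
    mask = torch.rand(batch, seq_len) < ratio
    inputs = tokens.masked_fill(mask, MASK_ID)
    logits = model(inputs)                      # (batch, seq_len, vocab), bidirectional model
    # Compute loss only on masked positions; unmasked positions serve as context.
    targets = tokens.masked_fill(~mask, -100)   # -100 is ignored by cross_entropy
    return F.cross_entropy(logits.transpose(1, 2), targets, ignore_index=-100)
```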
#transformer #autoregressive-model