January 25, 2024
Fuyu-Heavy
The successor to Fuyu-8B is out. On multimodal benchmarks it scores roughly on par with Gemini Pro. There is no description of the model details, though.
#llm #vision-language
ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models
(Rohan Wadhawan, Hritik Bansal, Kai-Wei Chang, Nanyun Peng)
Recent advancements in AI have led to the development of large multimodal models (LMMs) capable of processing complex tasks involving joint reasoning over text and visual content in the image (e.g., navigating maps in public places). This paper introduces ConTextual, a novel benchmark comprising instructions designed explicitly to evaluate LMMs' ability to perform context-sensitive text-rich visual reasoning. ConTextual emphasizes diverse real-world scenarios (e.g., time-reading, navigation, shopping and more) demanding a deeper understanding of the interactions between textual and visual elements. Our findings reveal a significant performance gap of 30.8% between the best-performing LMM, GPT-4V(ision), and human capabilities using human evaluation indicating substantial room for improvement in context-sensitive text-rich visual reasoning. Notably, while GPT-4V excelled in abstract categories like meme and quote interpretation, its overall performance still lagged behind humans. In addition to human evaluations, we also employed automatic evaluation metrics using GPT-4, uncovering similar trends in performance disparities. We also perform a fine-grained evaluation across diverse visual contexts and provide qualitative analysis which provides a robust framework for future advancements in the LMM design.
https://con-textual.github.io/
A benchmark built from text-rich images, with questions that cannot be answered from the text alone. These are cases where simply extracting the text with OCR is not enough; what matters is reasoning about the relationships between the text and the other elements in the image.
It does look fairly challenging for current models, though we will probably see it tackled with prompting as well.
#benchmark #vision-language #ocr
InstructDoc: A Dataset for Zero-Shot Generalization of Visual Document Understanding with Instructions
(Ryota Tanaka, Taichi Iki, Kyosuke Nishida, Kuniko Saito, Jun Suzuki)
We study the problem of completing various visual document understanding (VDU) tasks, e.g., question answering and information extraction, on real-world documents through human-written instructions. To this end, we propose InstructDoc, the first large-scale collection of 30 publicly available VDU datasets, each with diverse instructions in a unified format, which covers a wide range of 12 tasks and includes open document types/formats. Furthermore, to enhance the generalization performance on VDU tasks, we design a new instruction-based document reading and understanding model, InstructDr, that connects document images, image encoders, and large language models (LLMs) through a trainable bridging module. Experiments demonstrate that InstructDr can effectively adapt to new VDU datasets, tasks, and domains via given instructions and outperforms existing multimodal LLMs and ChatGPT without specific training.
They collected and organized existing document-related datasets into a single unified dataset, and ran experiments with a model that takes OCR results as input.
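To make the "trainable bridging module" idea concrete, here is a minimal sketch of how frozen image-encoder features might be pooled and projected into an LLM's embedding space before being combined with the instruction and OCR text; the module name, dimensions, and query-token pooling below are assumptions for illustration, not InstructDr's actual design.

```python
# Hypothetical sketch of a bridging module between a frozen image encoder and an LLM.
import torch
import torch.nn as nn

class BridgingModule(nn.Module):
    """Pools patch features with learned query tokens and projects them into the
    LLM embedding space, so they can be prepended to instruction + OCR tokens."""
    def __init__(self, vision_dim=1024, llm_dim=4096, num_query_tokens=32):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_query_tokens, vision_dim))
        self.attn = nn.MultiheadAttention(vision_dim, num_heads=8, batch_first=True)
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, image_feats):                   # (B, num_patches, vision_dim)
        b = image_feats.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        pooled, _ = self.attn(q, image_feats, image_feats)
        return self.proj(pooled)                      # (B, num_query_tokens, llm_dim)

bridge = BridgingModule()
visual_prefix = bridge(torch.randn(2, 196, 1024))
print(visual_prefix.shape)                            # torch.Size([2, 32, 4096])
```

In this kind of setup the pooled visual prefix would typically be concatenated with the embedded instruction and OCR tokens before being fed to the LLM.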
#dataset #ocr
ChatterBox: Multi-round Multimodal Referring and Grounding
(Yunjie Tian, Tianren Ma, Lingxi Xie, Jihao Qiu, Xi Tang, Yuan Zhang, Jianbin Jiao, Qi Tian, Qixiang Ye)
In this study, we establish a baseline for a new task named multimodal multi-round referring and grounding (MRG), opening up a promising direction for instance-level multimodal dialogues. We present a new benchmark and an efficient vision-language model for this purpose. The new benchmark, named CB-300K, spans challenges including multi-round dialogue, complex spatial relationships among multiple instances, and consistent reasoning, which are beyond those shown in existing benchmarks. The proposed model, named ChatterBox, utilizes a two-branch architecture to collaboratively handle vision and language tasks. By tokenizing instance regions, the language branch acquires the ability to perceive referential information. Meanwhile, ChatterBox feeds a query embedding in the vision branch to a token receiver for visual grounding. A two-stage optimization strategy is devised, making use of both CB-300K and auxiliary external data to improve the model's stability and capacity for instance-level understanding. Experiments show that ChatterBox outperforms existing models in MRG both quantitatively and qualitatively, paving a new path towards multimodal dialogue scenarios with complicated and precise interactions. Code, data, and model are available at: https://github.com/sunsmarterjie/ChatterBox.
A dataset for multi-turn visual grounding and referring. Questions and answers were generated with GPT-4 from the bounding boxes, relationships, and object attributes in the Visual Genome dataset, with each question constructed to build on the previous one (a rough sketch of this kind of generation setup is included below).
Given how many datasets have already been built, adopting or transforming them into new tasks for training seems like a promising way to improve vision-language capabilities.
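For illustration, a minimal sketch of how multi-round referring/grounding QA might be generated from Visual Genome-style annotations along the lines described above; the annotation fields, prompt wording, and `build_prompt` helper are hypothetical, not the actual CB-300K pipeline.

```python
# Hypothetical: serialize Visual Genome-style annotations into a GPT-4 prompt that asks
# for a multi-round dialogue where each question refers back to the previous round.
ANNOTATION = {
    "objects": [
        {"id": 1, "name": "dog", "attributes": ["brown"], "bbox": [34, 50, 120, 210]},
        {"id": 2, "name": "frisbee", "attributes": ["red"], "bbox": [150, 80, 60, 60]},
    ],
    "relationships": [
        {"subject": 1, "predicate": "chasing", "object": 2},
    ],
}

def build_prompt(ann, num_rounds=3):
    lines = ["Image annotations:"]
    for obj in ann["objects"]:
        attrs = ", ".join(obj["attributes"])
        lines.append(f'- object {obj["id"]}: {attrs} {obj["name"]}, bbox={obj["bbox"]}')
    for rel in ann["relationships"]:
        lines.append(f'- relation: object {rel["subject"]} {rel["predicate"]} object {rel["object"]}')
    lines.append(
        f"Write a {num_rounds}-round dialogue about this image. Each question must refer "
        "to an object mentioned in the previous round, and each answer must include the "
        "bounding box of the referred object."
    )
    return "\n".join(lines)

print(build_prompt(ANNOTATION))  # send this to GPT-4 to obtain one dialogue sample
```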
#dataset #vision-language #instruction-tuning #visual_grounding #benchmark
VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks
(Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Ruslan Salakhutdinov, Daniel Fried)
Autonomous agents capable of planning, reasoning, and executing actions on the web offer a promising avenue for automating computer tasks. However, the majority of existing benchmarks primarily focus on text-based agents, neglecting many natural tasks that require visual information to effectively solve. Given that most computer interfaces cater to human perception, visual information often augments textual data in ways that text-only models struggle to harness effectively. To bridge this gap, we introduce VisualWebArena, a benchmark designed to assess the performance of multimodal web agents on realistic visually grounded tasks. VisualWebArena comprises of a set of diverse and complex web-based tasks that evaluate various capabilities of autonomous multimodal agents. To perform on this benchmark, agents need to accurately process image-text inputs, interpret natural language instructions, and execute actions on websites to accomplish user-defined objectives. We conduct an extensive evaluation of state-of-the-art LLM-based autonomous agents, including several multimodal models. Through extensive quantitative and qualitative analysis, we identify several limitations of text-only LLM agents, and reveal gaps in the capabilities of state-of-the-art multimodal language agents. VisualWebArena provides a framework for evaluating multimodal autonomous language agents, and offers insights towards building stronger autonomous agents for the web. Our code, baseline models, and data is publicly available at https://jykoh.com/vwa.
A benchmark for evaluating agent capabilities on websites. Since the target websites must not change over time, they built self-contained websites that can be hosted independently.
There seems to be quite a bit of interest in agent models that operate on the web or mobile apps. I'm not sure how practically useful they will be (https://x.com/astralwave/status/1744968755565023437), but it is an interesting way to test a model's capabilities.
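For context, a toy version of the observe-act loop such web agents run on the benchmark's self-hosted sites; every function and the action format here are illustrative stubs rather than VisualWebArena's actual interface.

```python
# Hypothetical observe-act loop for a multimodal web agent; all functions are stubs.
from typing import Any

def capture_observation() -> dict[str, Any]:
    # Stub: a real agent would grab a screenshot plus the page's text/accessibility tree.
    return {"screenshot": b"", "page_text": "<button id=3>Add to cart</button>"}

def query_multimodal_lm(objective: str, obs: dict, history: list) -> dict:
    # Stub: a real agent would prompt a vision-language model with the observation.
    return {"type": "stop", "success": True}

def execute_action(action: dict) -> None:
    # Stub: a real agent would click/type/scroll inside the self-hosted website.
    pass

def run_episode(objective: str, max_steps: int = 20) -> bool:
    history: list[dict] = []
    for _ in range(max_steps):
        obs = capture_observation()
        action = query_multimodal_lm(objective, obs, history)
        if action["type"] == "stop":          # agent declares the task finished
            return action.get("success", False)
        execute_action(action)
        history.append(action)
    return False

print(run_episode("Buy the cheapest red mug"))
```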
#benchmark #agent
MambaByte: Token-free Selective State Space Model
(Junxiong Wang, Tushaar Gangavarapu, Jing Nathan Yan, Alexander M Rush)
Token-free language models learn directly from raw bytes and remove the bias of subword tokenization. Operating on bytes, however, results in significantly longer sequences, and standard autoregressive Transformers scale poorly in such settings. We experiment with MambaByte, a token-free adaptation of the Mamba state space model, trained autoregressively on byte sequences. Our experiments indicate the computational efficiency of MambaByte compared to other byte-level models. We also find MambaByte to be competitive with and even outperform state-of-the-art subword Transformers. Furthermore, owing to linear scaling in length, MambaByte benefits from fast inference compared to Transformers. Our findings establish the viability of MambaByte in enabling token-free language modeling.
Lately there have been a lot of results testing Mamba on images and other modalities. This one tries it as a byte-level LM, and the scores are quite good.
Tests at larger scales and on benchmarks such as associative recall will still be needed, but the fact that there are settings where the scores come out this well suggests that Mamba/SSMs may have some genuinely useful properties. Looking forward to more interesting results.
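To make the "token-free" setup concrete, a minimal sketch of the byte-level input pipeline a model like MambaByte operates on; the extra EOS id is an assumption, and the 256 byte values make up the entire vocabulary.

```python
# Minimal byte-level "tokenizer": the vocabulary is just the 256 byte values (+ EOS).
class ByteTokenizer:
    vocab_size = 256 + 1  # assumed: reserve one extra id for end-of-sequence
    EOS = 256

    def encode(self, text: str) -> list[int]:
        return list(text.encode("utf-8")) + [self.EOS]

    def decode(self, ids: list[int]) -> str:
        return bytes(i for i in ids if i < 256).decode("utf-8", errors="replace")

tok = ByteTokenizer()
ids = tok.encode("안녕하세요, MambaByte!")
# Byte sequences are much longer than subword sequences, which is why a linear-time
# state space model is attractive here compared to a quadratic-attention Transformer.
print(len(ids), tok.decode(ids))
```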
#state-space-model #lm
SpacTor-T5: Pre-training T5 Models with Span Corruption and Replaced Token Detection
(Ke Ye, Heinrich Jiang, Afshin Rostamizadeh, Ayan Chakrabarti, Giulia DeSalvo, Jean-François Kagy, Lazaros Karydas, Gui Citovsky, Sanjiv Kumar)
Pre-training large language models is known to be extremely resource intensive and often times inefficient, under-utilizing the information encapsulated in the training text sequences. In this paper, we present SpacTor, a new training procedure consisting of (1) a hybrid objective combining span corruption (SC) and token replacement detection (RTD), and (2) a two-stage curriculum that optimizes the hybrid objective over the initial τ iterations, then transitions to standard SC loss. We show empirically that the effectiveness of the hybrid objective is tied to the two-stage pre-training schedule, and provide extensive analysis on why this is the case. In our experiments with encoder-decoder architectures (T5) on a variety of NLP tasks, SpacTor-T5 yields the same downstream performance as standard SC pre-training, while enabling a 50% reduction in pre-training iterations and 40% reduction in total FLOPs. Alternatively, given the same amount of computing budget, we find that SpacTor results in significantly improved downstream benchmark performance.
Google really does love T5. They add an ELECTRA-style replaced token detection objective on top of span corruption: the encoder detects tokens produced by a generator, while the decoder predicts the spans. The tricky part is that this hybrid objective cannot be used throughout training; it is only applied in the initial phase, after which the rest of training proceeds with span corruption alone.
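A rough sketch of that two-stage schedule, assuming a simple additive weighting of the two losses (the actual SpacTor objective and its weighting are in the paper):

```python
# Two-stage curriculum sketch: hybrid SC + RTD loss for the first tau steps, then SC only.
def spactor_loss(step: int, tau: int, sc_loss: float, rtd_loss: float,
                 rtd_weight: float = 1.0) -> float:
    """Stage 1 (step < tau): span corruption plus replaced token detection.
    Stage 2 (step >= tau): plain span corruption, as in standard T5 pre-training."""
    if step < tau:
        return sc_loss + rtd_weight * rtd_loss
    return sc_loss

# e.g. with tau = 120_000 the RTD term is simply dropped after 120k iterations
print(spactor_loss(step=1_000, tau=120_000, sc_loss=2.3, rtd_loss=0.7))    # 3.0
print(spactor_loss(step=200_000, tau=120_000, sc_loss=2.1, rtd_loss=0.6))  # 2.1
```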
#mlm
Machine Learning Engineering
Stas Bekman's (https://stasosphere.com/machine-learning/) ML engineering material. Worth a look.
#engineering