May 2, 2024
Vibe-Eval: A hard evaluation suite for measuring progress of multimodal language models
(Piotr Padlewski, Max Bain, Matthew Henderson, Zhongkai Zhu, Nishant Relan, Hai Pham, Donovan Ong, Kaloyan Aleksiev, Aitor Ormazabal, Samuel Phua, Ethan Yeo, Eugenie Lamprecht, Qi Liu, Yuqi Wang, Eric Chen, Deyu Fu, Lei Li, Che Zheng, Cyprien de Masson d’Autume, Dani Yogatama, Mikel Artetxe, Yi Tay)
We introduce Vibe-Eval: a new open benchmark and framework for evaluating multimodal chat models. Vibe-Eval consists of 269 visual understanding prompts, including 100 of hard difficulty, and complete with gold-standard responses authored by experts. Vibe-Eval is open-ended and challenging with dual objectives: (i) vibe checking multimodal chat models for day-to-day tasks and (ii) deeply challenging and probing the capabilities of present frontier models. To this end, our hard set contains > 50% questions that all frontier models answer incorrectly. We explore the nuances of designing, evaluating, and ranking models on ultra challenging prompts. We also discuss trade-offs between human and automatic evaluation, and show that automatic model evaluation using Reka Core roughly correlates to human judgment. We have made API access free-of-charge for the purpose of lightweight evaluation with the intention to run formal human evaluations for public models that do well on the Vibe-Eval automatic scores. We release the evaluation code and data at github.com/reka-ai/reka-vibe-eval.
A hard vision-language benchmark from Reka. They even found samples exhibiting inverse scaling.
As models get stronger, the need for benchmarks that are hard, high-quality, and valid keeps growing. But building such benchmarks is correspondingly more difficult. There has also been some chatter lately about GPQA, an interesting case in point. (https://x.com/typedfemale/status/1783951432590188916)
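The paper reports that automatic evaluation with Reka Core roughly correlates with human judgment. A minimal LLM-as-judge loop gives the flavor of such a protocol; this is only a sketch of the general pattern, not Reka's actual grading setup, and the judge prompt, the 1-5 scale, and the call_judge callable are all hypothetical placeholders.

```python
import re

# Hypothetical grading prompt; Vibe-Eval's actual judge prompt differs.
JUDGE_TEMPLATE = """You are grading a model's answer about an image.
Question: {question}
Gold reference answer: {gold}
Model answer: {candidate}
Rate the model answer from 1 (wrong) to 5 (matches the reference).
Reply with 'Score: N'."""

def score_with_judge(call_judge, question, gold, candidate):
    """call_judge: callable str -> str, e.g. a thin wrapper around a chat API."""
    reply = call_judge(JUDGE_TEMPLATE.format(
        question=question, gold=gold, candidate=candidate))
    match = re.search(r"Score:\s*([1-5])", reply)
    return int(match.group(1)) if match else None

# Toy judge so the sketch runs end-to-end without an API key.
def dummy_judge(prompt):
    return "Score: 4"

print(score_with_judge(dummy_judge, "What is on the plate?",
                       "Two croissants.", "A pair of croissants."))
```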
#benchmark
A Careful Examination of Large Language Model Performance on Grade School Arithmetic
(Hugh Zhang, Jeff Da, Dean Lee, Vaughn Robinson, Catherine Wu, Will Song, Tiffany Zhao, Pranav Raja, Dylan Slack, Qin Lyu, Sean Hendryx, Russell Kaplan, Michele (Mike) Lunati, Summer Yue)
Large language models (LLMs) have achieved impressive success on many benchmarks for mathematical reasoning. However, there is growing concern that some of this performance actually reflects dataset contamination, where data closely resembling benchmark questions leaks into the training data, instead of true reasoning ability. To investigate this claim rigorously, we commission Grade School Math 1000 (GSM1k). GSM1k is designed to mirror the style and complexity of the established GSM8k benchmark, the gold standard for measuring elementary mathematical reasoning. We ensure that the two benchmarks are comparable across important metrics such as human solve rates, number of steps in solution, answer magnitude, and more. When evaluating leading open- and closed-source LLMs on GSM1k, we observe accuracy drops of up to 13%, with several families of models (e.g., Phi and Mistral) showing evidence of systematic overfitting across almost all model sizes. At the same time, many models, especially those on the frontier, (e.g., Gemini/GPT/Claude) show minimal signs of overfitting. Further analysis suggests a positive relationship (Spearman's r^2=0.32) between a model's probability of generating an example from GSM8k and its performance gap between GSM8k and GSM1k, suggesting that many models may have partially memorized GSM8k.
They built 1,250 new problems mirroring GSM8K and tested models on them. The models split dramatically into groups. The authors speculate that, even absent outright contamination, data similar to GSM8K may have made it into training; but given that GSM1k itself was built to be extremely similar to GSM8K, that explanation doesn't feel entirely sufficient either (if mere exposure to similar data explained the GSM8K scores, the gains should transfer to the near-identical GSM1k as well).
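The reported Spearman's r^2 = 0.32 is just a rank correlation between a per-model memorization proxy (the model's probability of generating GSM8k examples, e.g. something like mean per-token log-likelihood) and that model's GSM8k-GSM1k accuracy gap. A toy computation of the statistic, with made-up numbers standing in for the paper's measurements:

```python
from scipy.stats import spearmanr

# Made-up per-model values, standing in for the paper's measurements:
# a memorization proxy and the accuracy gap between GSM8k and GSM1k.
log_likelihood = [-2.1, -1.8, -1.5, -1.2, -0.9]   # higher = more memorized
accuracy_gap   = [0.01, 0.03, 0.02, 0.08, 0.11]   # GSM8k acc - GSM1k acc

r, p = spearmanr(log_likelihood, accuracy_gap)
print(f"Spearman r^2 = {r**2:.2f} (p = {p:.3f})")
```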
#benchmark #contamination
A Primer on the Inner Workings of Transformer-based Language Models
(Javier Ferrando, Gabriele Sarti, Arianna Bisazza, Marta R. Costa-jussà)
The rapid progress of research aimed at interpreting the inner workings of advanced language models has highlighted a need for contextualizing the insights gained from years of work in this area. This primer provides a concise technical introduction to the current techniques used to interpret the inner workings of Transformer-based language models, focusing on the generative decoder-only architecture. We conclude by presenting a comprehensive overview of the known internal mechanisms implemented by these models, uncovering connections across popular approaches and active research directions in this area.
A survey of techniques for analyzing Transformer models. Interpretability, with mechanistic interpretability at the fore, has never been as popular as it is now.
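One representative technique from this literature is the logit lens: decode each layer's residual stream through the model's own final LayerNorm and unembedding matrix to watch the predicted token take shape across depth. A minimal sketch with GPT-2 via Hugging Face transformers (the example prompt is arbitrary):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tok = GPT2Tokenizer.from_pretrained("gpt2")

ids = tok("The Eiffel Tower is located in the city of",
          return_tensors="pt").input_ids
with torch.no_grad():
    out = model(ids, output_hidden_states=True)

# hidden_states: tuple of (n_layers + 1) tensors of shape [batch, seq, d_model];
# index 0 is the embedding output, the rest are residual streams after each block.
for layer, h in enumerate(out.hidden_states):
    # The "logit lens": project the last position's residual stream through
    # the final LayerNorm and the unembedding to get intermediate predictions.
    logits = model.lm_head(model.transformer.ln_f(h[:, -1]))
    print(f"layer {layer:2d}: {tok.decode(logits.argmax(-1))!r}")
```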
#transformer