May 9, 2024

What's up with Llama 3? Arena data analysis

(Lisa Dunlap, Evan Frick, Tianle Li, Isaac Ong, Joseph E. Gonzalez, Wei-Lin Chiang)

LMSYS has analyzed Llama 3's win rates in Chatbot Arena. As one might expect, its win rate against GPT-4, Claude 3 Opus, and Gemini 1.5 Pro drops as the prompts get harder. Conversely, its win rate is relatively high on creative tasks.

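It makes you think about what chatbots should be targeting right now. Even if a model is somewhat better at harder problems, that edge may not mean much while its problem-solving ability is still below a sufficient level. In that case, being more creative or having a better style would be the more desirable goal.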

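However, once capability on high-difficulty tasks becomes meaningful and useful, a gap in that capability will make an enormous difference, because the question then shifts to whether a model can do the task at all.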

In fact, rankings from crowdsourced evaluation in an uncontrolled setting are bound to have limitations, and this analysis also feels like a pointer on how to approach that problem.

Llama 3 has a distinctive style, and that style may well have worked in its favor in the evaluation. The style we want to give a chatbot is an important factor too, and I suspect it has not received enough attention.
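As a rough illustration of the kind of breakdown discussed here, per-category win rates could be computed from pairwise battle records along these lines. This is only a sketch: the record fields (`model_a`, `model_b`, `category`, `winner`), the convention of counting a tie as half a win, and the four toy battles are all assumptions made for the example, not the actual Arena schema or data.

```python
from collections import defaultdict


def win_rate_by_category(battles, model):
    """Return {category: win rate of `model`}, counting a tie as half a win."""
    wins = defaultdict(float)
    totals = defaultdict(int)
    for b in battles:
        if model not in (b["model_a"], b["model_b"]):
            continue  # this battle does not involve the model of interest
        totals[b["category"]] += 1
        if b["winner"] == "tie":
            wins[b["category"]] += 0.5
        elif b[b["winner"]] == model:  # winner is "model_a" or "model_b"
            wins[b["category"]] += 1.0
    return {c: wins[c] / totals[c] for c in totals}


# Hypothetical toy records, invented purely for illustration (not Arena data).
battles = [
    {"model_a": "llama-3-70b", "model_b": "gpt-4", "category": "creative", "winner": "model_a"},
    {"model_a": "gpt-4", "model_b": "llama-3-70b", "category": "creative", "winner": "tie"},
    {"model_a": "llama-3-70b", "model_b": "gpt-4", "category": "hard", "winner": "model_b"},
    {"model_a": "gpt-4", "model_b": "llama-3-70b", "category": "hard", "winner": "tie"},
]
rates = win_rate_by_category(battles, "llama-3-70b")
print(rates)
```

Slicing by prompt category or difficulty in this way is what lets a single aggregate ranking be read as several separate comparisons.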

Introducing the Model Spec

(OpenAI)

https://cdn.openai.com/spec/model-spec-2024-05-08.html

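OpenAI has published a spec describing the behavioral guidelines its models should follow. Beyond serving as publicity for how thoroughly it pursues safety, the document seems valuable as part of an annotation guideline for RLHF. It is worth consulting if you are interested in preference annotation work.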

You Only Cache Once: Decoder-Decoder Architectures for Language Models

(Yutao Sun, Li Dong, Yi Zhu, Shaohan Huang, Wenhui Wang, Shuming Ma, Quanlu Zhang, Jianyong Wang, Furu Wei)

We introduce a decoder-decoder architecture, YOCO, for large language models, which only caches key-value pairs once. It consists of two components, i.e., a cross-decoder stacked upon a self-decoder. The self-decoder efficiently encodes global key-value (KV) caches that are reused by the cross-decoder via cross-attention. The overall model behaves like a decoder-only Transformer, although YOCO only caches once. The design substantially reduces GPU memory demands, yet retains global attention capability. Additionally, the computation flow enables prefilling to early exit without changing the final output, thereby significantly speeding up the prefill stage. Experimental results demonstrate that YOCO achieves favorable performance compared to Transformer in various settings of scaling up model size and number of training tokens. We also extend YOCO to 1M context length with near-perfect needle retrieval accuracy. The profiling results show that YOCO improves inference memory, prefill latency, and throughput by orders of magnitude across context lengths and model sizes. Code is available at https://aka.ms/YOCO.

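So the encoder-decoder makes a comeback in this form. The property that an encoder-decoder uses only the encoder's final-layer embeddings is exploited here as a kind of compression of the Key/Value cache.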

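With causal masking and efficient attention, half of the decoder is made to output embeddings the way an encoder would. The other half of the decoder is built from cross-attention over those embeddings, so that the whole stack works as a single decoder.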
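To make the cache saving concrete, here is a back-of-the-envelope count of cached key/value vectors. It is a sketch under stated assumptions, not the paper's accounting: the efficient attention in the first half is assumed to be sliding-window attention with a fixed `window`, the layers are assumed to split evenly between the two halves, and the layer count and sequence length are arbitrary example values.

```python
def kv_cache_entries_transformer(num_layers, seq_len):
    """Standard decoder: every layer caches one K and one V vector per token."""
    return 2 * num_layers * seq_len


def kv_cache_entries_yoco(num_layers, seq_len, window):
    """YOCO-style split (assumed even): the first half keeps a bounded local
    cache per layer, and the second half shares one global K/V cache."""
    self_layers = num_layers // 2
    local = 2 * self_layers * min(window, seq_len)  # bounded by the window
    shared_global = 2 * seq_len                     # cached once, reused by all cross-attention layers
    return local + shared_global


# Example values chosen only for illustration.
baseline = kv_cache_entries_transformer(num_layers=32, seq_len=65536)
yoco = kv_cache_entries_yoco(num_layers=32, seq_len=65536, window=1024)
print(baseline, yoco, baseline / yoco)
```

Under these assumptions the standard decoder's cache grows with layers times sequence length, while the YOCO-style count grows with sequence length only once, plus a term that does not depend on sequence length.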

#efficiency

Imagine Flash: Accelerating Emu Diffusion Models with Backward Distillation

(Jonas Kohler, Albert Pumarola, Edgar Schönfeld, Artsiom Sanakoyeu, Roshan Sumbaly, Peter Vajda, Ali Thabet)

Diffusion models are a powerful generative framework, but come with expensive inference. Existing acceleration methods often compromise image quality or fail under complex conditioning when operating in an extremely low-step regime. In this work, we propose a novel distillation framework tailored to enable high-fidelity, diverse sample generation using just one to three steps. Our approach comprises three key components: (i) Backward Distillation, which mitigates training-inference discrepancies by calibrating the student on its own backward trajectory; (ii) Shifted Reconstruction Loss that dynamically adapts knowledge transfer based on the current time step; and (iii) Noise Correction, an inference-time technique that enhances sample quality by addressing singularities in noise prediction. Through extensive experiments, we demonstrate that our method outperforms existing competitors in quantitative metrics and human evaluations. Remarkably, it achieves performance comparable to the teacher model using only three denoising steps, enabling efficient high-quality generation.

Distillation for Emu (https://arxiv.org/abs/2309.15807). It tackles the gap between training and sampling that has been pointed out before (https://arxiv.org/abs/2305.08891, https://arxiv.org/abs/2312.00210). The basic approach is to have the teacher model denoise the student model's denoised output and use the result as the target. In addition, noise is re-applied to the student's denoised output so that the time step the distillation focuses on can be shifted.
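The core idea, training on states the student itself produces rather than on forward-noised data, can be sketched with scalar toy models. Everything here is hypothetical: `student_pred` and `teacher_pred` are stand-in functions, the next state is simply taken to be the student's prediction (no re-noising), and the squared-error loss is an assumption, so this illustrates only the data flow, not the paper's actual objective.

```python
def student_trajectory(student_pred, x_T, timesteps):
    """Run the student backward from noise, recording the state seen at each step."""
    states = []
    x = x_T
    for t in timesteps:
        states.append((t, x))
        x = student_pred(x, t)  # simplification: next state is the student's own output
    return states


def backward_distillation_loss(student_pred, teacher_pred, x_T, timesteps):
    """Mean squared gap between student and teacher on the student's own states."""
    states = student_trajectory(student_pred, x_T, timesteps)
    total = 0.0
    for t, x_t in states:
        target = teacher_pred(x_t, t)  # teacher denoises what the student produced
        total += (student_pred(x_t, t) - target) ** 2
    return total / len(states)


# Hypothetical scalar "denoisers", used only to exercise the loop.
teacher_pred = lambda x, t: 0.5 * x
student_pred = lambda x, t: 0.4 * x

loss = backward_distillation_loss(student_pred, teacher_pred, x_T=1.0, timesteps=[3, 2, 1])
print(loss)
```

Because the inputs at each step come from the student's own rollout, the states seen in training match the states seen at inference, which is the discrepancy the method is meant to close.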

#efficiency #diffusion #distillation

THRONE: An Object-based Hallucination Benchmark for the Free-form Generations of Large Vision-Language Models

(Prannay Kaul, Zhizhong Li, Hao Yang, Yonatan Dukler, Ashwin Swaminathan, C. J. Taylor, Stefano Soatto)

Mitigating hallucinations in large vision-language models (LVLMs) remains an open problem. Recent benchmarks do not address hallucinations in open-ended free-form responses, which we term "Type I hallucinations". Instead, they focus on hallucinations responding to very specific question formats -- typically a multiple-choice response regarding a particular object or attribute -- which we term "Type II hallucinations". Additionally, such benchmarks often require external API calls to models which are subject to change. In practice, we observe that a reduction in Type II hallucinations does not lead to a reduction in Type I hallucinations but rather that the two forms of hallucinations are often anti-correlated. To address this, we propose THRONE, a novel object-based automatic framework for quantitatively evaluating Type I hallucinations in LVLM free-form outputs. We use public language models (LMs) to identify hallucinations in LVLM responses and compute informative metrics. By evaluating a large selection of recent LVLMs using public datasets, we show that an improvement in existing metrics does not lead to a reduction in Type I hallucinations, and that established benchmarks for measuring Type I hallucinations are incomplete. Finally, we provide a simple and effective data augmentation method to reduce Type I and Type II hallucinations as a strong baseline.

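A hallucination benchmark method for open-ended question/generation settings with vision-language models. Captions are first generated for a dataset that has labels for the objects in each image, such as an object detection dataset, and then an LLM is used to count whether objects that are or are not in the image appear in the caption. It seems similar to methods that extract atomic facts from natural language and then check each fact.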
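The counting step could look roughly like the following. This is a simplified per-image sketch: it assumes the LLM has already turned the caption into a list of mentioned objects, the function name and the example object lists are made up for illustration, and the benchmark's actual aggregation of these counts may differ.

```python
def object_hallucination_scores(mentioned, present):
    """Compare objects mentioned in a caption against labeled objects in the image."""
    mentioned, present = set(mentioned), set(present)
    hallucinated = mentioned - present  # mentioned but not in the image
    correct = mentioned & present       # mentioned and actually in the image
    precision = len(correct) / len(mentioned) if mentioned else 1.0
    recall = len(correct) / len(present) if present else 1.0
    return {
        "hallucinated": sorted(hallucinated),
        "precision": precision,
        "recall": recall,
    }


# Hypothetical example: objects an LLM extracted from a caption vs. ground-truth labels.
scores = object_hallucination_scores(
    mentioned=["dog", "frisbee", "bench"],
    present=["dog", "frisbee", "person", "tree"],
)
print(scores)
```

Precision penalizes hallucinated objects, while recall captures how many of the labeled objects the caption covers, so a model cannot score well simply by saying very little.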

#benchmark #hallucination