March 28, 2024

DBRX, Databricks' open SoTA MoE model.

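36B active parameters out of 132B total, trained on 12T tokens. Hidden dimension 6144 with 40 layers, and 16 experts with top-4 routing. It uses the cl100k_base tokenizer and has a 32K context length. They say they also put a lot of work into the data: in tests with a 7B model, the improved data reportedly halved the amount of training needed to reach the same performance.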
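To make "16 experts, top-4" concrete, here is a minimal sketch of top-k routing for a single token. Only the numbers 16 and 4 come from the post. The function names, the toy scalar experts, and the choice to renormalize the softmax over the selected experts are all assumptions for illustration, not a description of how DBRX is actually implemented.

```python
import math
import random

NUM_EXPERTS = 16  # from the post: 16 experts
TOP_K = 4         # from the post: top-4 routing


def top_k_route(router_logits, k=TOP_K):
    """Pick the k highest-scoring experts and softmax-normalize their weights.

    Assumption: weights are renormalized over the selected experts only.
    The post does not say how DBRX normalizes them.
    """
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    # Subtract the largest logit before exponentiating for numerical stability.
    peak = router_logits[chosen[0]]
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


def moe_layer(token, router_logits, experts, k=TOP_K):
    """Weighted sum of the outputs of the selected experts for one token."""
    return sum(w * experts[i](token) for i, w in top_k_route(router_logits, k))


# Hypothetical toy experts: expert i just scales its scalar input by (i + 1).
experts = [lambda x, s=i + 1: s * x for i in range(NUM_EXPERTS)]

rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
routing = top_k_route(logits)
output = moe_layer(1.0, logits, experts)

# Number of distinct expert subsets a single token can be routed to.
num_combinations = math.comb(NUM_EXPERTS, TOP_K)

print(routing)
print(output, num_combinations)
```

Only 4 of the 16 expert functions run per token, which is why the active parameter count is much smaller than the total. Choosing 4 out of 16 gives 1820 possible expert subsets per token.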

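On benchmarks it does rank near the top among open models. However, since we still don't know what size of dense model a given MoE model corresponds to, it is hard to judge how cost-effective the training was. Even so, the fact that it took a training run as large as 36B/132B parameters on 12T tokens seems to confirm once again that training an LLM well is not easy.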
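For a rough sense of the scale involved, the sketch below applies the common rule of thumb of about 6 FLOPs per parameter per training token. The parameter and token counts are from the post. The 6·N·D approximation, and the choice to plug in active parameters for the MoE case, are assumptions on my part and not figures reported by Databricks.

```python
ACTIVE_PARAMS = 36 * 10**9    # from the post: 36B active parameters
TOTAL_PARAMS = 132 * 10**9    # from the post: 132B total parameters
TRAIN_TOKENS = 12 * 10**12    # from the post: 12T training tokens


def approx_train_flops(params, tokens):
    """Rule-of-thumb training compute: about 6 FLOPs per parameter per token.

    Assumption: this is the usual C ~ 6*N*D approximation for dense
    transformers. It ignores router overhead and attention cost.
    """
    return 6 * params * tokens


moe_flops = approx_train_flops(ACTIVE_PARAMS, TRAIN_TOKENS)
dense_flops = approx_train_flops(TOTAL_PARAMS, TRAIN_TOKENS)

print(f"MoE, counting active parameters only: {moe_flops:.3e} FLOPs")
print(f"Dense model with the same total size: {dense_flops:.3e} FLOPs")
print(f"Ratio: {dense_flops / moe_flops:.2f}x")
```

Under these assumptions the run costs on the order of 2.6e24 FLOPs, versus roughly 9.5e24 for a hypothetical dense 132B model on the same tokens. Which dense size actually matches DBRX in quality is exactly the open question above, so these two numbers are not a cost-efficiency verdict.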

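It seems increasingly clear that it is hard to build a competitive model without knowing the recipe. Yet I still can't get a feel for what that recipe actually is. If it is the data, what kind of difference in the data is it?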

#llm

Long-form factuality in large language models

(Jerry Wei, Chengrun Yang, Xinying Song, Yifeng Lu, Nathan Hu, Dustin Tran, Daiyi Peng, Ruibo Liu, Da Huang, Cosmo Du, Quoc V. Le)

Large language models (LLMs) often generate content that contains factual errors when responding to fact-seeking prompts on open-ended topics. To benchmark a model's long-form factuality in open domains, we first use GPT-4 to generate LongFact, a prompt set comprising thousands of questions spanning 38 topics. We then propose that LLM agents can be used as automated evaluators for long-form factuality through a method which we call Search-Augmented Factuality Evaluator (SAFE). SAFE utilizes an LLM to break down a long-form response into a set of individual facts and to evaluate the accuracy of each fact using a multi-step reasoning process comprising sending search queries to Google Search and determining whether a fact is supported by the search results. Furthermore, we propose extending F1 score as an aggregated metric for long-form factuality. To do so, we balance the percentage of supported facts in a response (precision) with the percentage of provided facts relative to a hyperparameter representing a user's preferred response length (recall). Empirically, we demonstrate that LLM agents can achieve superhuman rating performance - on a set of ~16k individual facts, SAFE agrees with crowdsourced human annotators 72% of the time, and on a random subset of 100 disagreement cases, SAFE wins 76% of the time. At the same time, SAFE is more than 20 times cheaper than human annotators. We also benchmark thirteen language models on LongFact across four model families (Gemini, GPT, Claude, and PaLM-2), finding that larger language models generally achieve better long-form factuality. LongFact, SAFE, and all experimental code are available at https://github.com/google-deepmind/long-form-factuality.

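A method for evaluating the factual accuracy of long generated answers. The pipeline uses an LLM to split an answer into individual facts, then has an LLM agent verify each fact using Google Search. They say this can be more accurate than plain human annotation.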

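Beyond the conclusion that larger models are more accurate, I think the absolute numbers of supported, unsupported, and irrelevant facts in an answer could be meaningful in their own right, since they can tell us something about the style of the answer.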
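The extended F1 score described in the abstract can be sketched from those per-response counts. This is my reading of the abstract: precision is the share of supported facts, and recall is the number of supported facts relative to a preferred count K, capped at 1. The function name, the cap, the zero case, and the example counts are assumptions and should be checked against the paper and its released code.

```python
def f1_at_k(supported, not_supported, k):
    """Extended F1 for one long-form response (a sketch, not the official code).

    supported: number of facts rated as supported
    not_supported: number of facts rated as not supported
    k: hyperparameter for the user's preferred number of supported facts
    Assumption: irrelevant facts are excluded from both counts.
    """
    if supported == 0:
        return 0.0
    precision = supported / (supported + not_supported)
    recall = min(supported / k, 1.0)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for two answer styles, scored with K = 64.
terse = f1_at_k(supported=10, not_supported=0, k=64)
verbose = f1_at_k(supported=60, not_supported=20, k=64)
print(f"terse: {terse:.3f}, verbose: {verbose:.3f}")
```

With these made-up counts the terse answer has perfect precision but scores lower because it supplies far fewer than K supported facts, while the verbose answer scores higher despite containing 20 unsupported facts. That is one way the raw counts reflect answer style and not just accuracy.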

#benchmark

Mechanisms of non-factual hallucinations in language models

(Lei Yu, Meng Cao, Jackie Chi Kit Cheung, Yue Dong)

State-of-the-art language models (LMs) sometimes generate non-factual hallucinations that misalign with world knowledge. Despite extensive efforts to detect and mitigate hallucinations, understanding their internal mechanisms remains elusive. Our study investigates the mechanistic causes of hallucination, specifically non-factual ones where the LM incorrectly predicts object attributes in response to subject-relation queries. With causal mediation analysis and embedding space projection, we identify two general mechanistic causes of hallucinations shared across LMs of various scales and designs: 1) insufficient subject attribute knowledge in lower layer MLPs, and 2) failing to select the correct object attribute in upper layer attention heads and MLPs. These two mechanisms exhibit varying degrees of subject-object association, predictive uncertainty and perturbation robustness. Additionally, we scrutinize LM pre-training checkpoints, revealing distinct learning dynamics for the two mechanistic causes of hallucinations. We also highlight how attribution features from our causal analysis can effectively construct hallucination detectors. Our work proposes a mechanistic understanding of LM factual errors.

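Analyzing the mechanisms behind hallucination, they find that hallucinations arising in the early layers of the model differ from those arising in the later layers. Hallucinations from the early-layer MLPs occur because the model simply lacks the information, so they produce completely off-base answers. Hallucinations from the later-layer attention heads and MLPs occur because the model fails to pick out the context-relevant fact from among the facts retrieved in the early layers, so these produce answers that are still related to the context. Interesting.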

#hallucination
