February 1, 2024
Infini-gram: Scaling Unbounded n-gram Language Models to a Trillion Tokens
(Jiacheng Liu, Sewon Min, Luke Zettlemoyer, Yejin Choi, Hannaneh Hajishirzi)
Are n-gram language models still relevant in this era of neural large language models (LLMs)? Our answer is yes, and we show their values in both text analysis and improving neural LLMs. Yet this necessitates modernizing n-gram models in two aspects. First, we train them at the same data scale as neural LLMs -- 1.4 trillion tokens. This is the largest n-gram model ever built. Second, existing n-gram models use small n which hinders their performance; we instead allow n to be arbitrarily large, by introducing a new ∞-gram LM with backoff. Instead of pre-computing n-gram count tables (which would be very expensive), we develop an engine named infini-gram -- powered by suffix arrays -- that can compute ∞-gram (as well as n-gram with arbitrary n) probabilities with millisecond-level latency. The ∞-gram framework and infini-gram engine enable us to conduct many novel and interesting analyses of human-written and machine-generated text: we find that the ∞-gram LM has fairly high accuracy for next-token prediction (47%), and can complement neural LLMs to greatly reduce their language modeling perplexities. When analyzing machine-generated text, we also observe irregularities in the machine--∞-gram agreement level with respect to the suffix length, which indicates deficiencies in neural LLM pretraining and the positional embeddings of Transformers. We open-source our infini-gram engine in the hopes of enabling more study on how to best use verbatim information retrieved from large text corpora.
This is fun. It's an n-gram LM that allows n to be unbounded. In effect, the problem reduces to counting the occurrences of the longest suffix of the prompt that appears in the dataset. The core contribution of this paper is building a system that can do this efficiently.
The comparison with LLMs is also interesting. There are cases where the LLM predicts a low probability but the n-gram LM assigns a fairly high one. I think this could become a useful tool for dataset exploration and processing.
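To make the counting view concrete, here is a minimal sketch of ∞-gram estimation with backoff over a toy in-memory suffix array. This is not the paper's infini-gram engine (which works over on-disk suffix arrays of trillion-token corpora); the naive construction and the function names are purely illustrative.

```python
def build_suffix_array(tokens):
    # Naive O(n^2 log n) construction over a small token list -- illustration only,
    # not the disk-backed, trillion-token-scale index the paper builds.
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])

def count_occurrences(tokens, sa, query):
    # Count how often `query` occurs in `tokens` by binary-searching the suffix
    # array for the block of suffixes that start with `query`.
    def prefix_at(k):
        return tokens[sa[k]:sa[k] + len(query)]
    lo, hi = 0, len(sa)
    while lo < hi:                      # left boundary
        mid = (lo + hi) // 2
        if prefix_at(mid) < query:
            lo = mid + 1
        else:
            hi = mid
    left = lo
    lo, hi = left, len(sa)
    while lo < hi:                      # right boundary
        mid = (lo + hi) // 2
        if prefix_at(mid) <= query:
            lo = mid + 1
        else:
            hi = mid
    return lo - left

def infini_gram_prob(tokens, sa, prompt, next_token):
    # Back off to the longest suffix of the prompt that occurs in the corpus, then
    # estimate P(next_token | suffix) = cnt(suffix + next_token) / cnt(suffix).
    for start in range(len(prompt)):
        suffix = prompt[start:]
        denom = count_occurrences(tokens, sa, suffix)
        if denom > 0:
            numer = count_occurrences(tokens, sa, suffix + [next_token])
            return numer / denom, len(suffix)
    # No suffix matches: fall back to the unigram frequency.
    return tokens.count(next_token) / len(tokens), 0

# Example: corpus as a list of tokens (word-level here for readability)
corpus = "the cat sat on the mat the cat sat on the rug".split()
sa = build_suffix_array(corpus)
print(infini_gram_prob(corpus, sa, ["the", "cat", "sat", "on", "the"], "mat"))  # (0.5, 5)
```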
#lm #dataset
Navigating the OverKill in Large Language Models
(Chenyu Shi, Xiao Wang, Qiming Ge, Songyang Gao, Xianjun Yang, Tao Gui, Qi Zhang, Xuanjing Huang, Xun Zhao, Dahua Lin)
Large language models are meticulously aligned to be both helpful and harmless. However, recent research points to a potential overkill which means models may refuse to answer benign queries. In this paper, we investigate the factors for overkill by exploring how models handle and determine the safety of queries. Our findings reveal the presence of shortcuts within models, leading to an over-attention of harmful words like 'kill' and prompts emphasizing safety will exacerbate overkill. Based on these insights, we introduce Self-Contrastive Decoding (Self-CD), a training-free and model-agnostic strategy, to alleviate this phenomenon. We first extract such over-attention by amplifying the difference in the model's output distributions when responding to system prompts that either include or omit an emphasis on safety. Then we determine the final next-token predictions by downplaying the over-attention from the model via contrastive decoding. Empirical results indicate that our method has achieved an average reduction of the refusal rate by 20% while having almost no impact on safety.
Whether a model refuses when asked to kill a process has become a rite of passage for today's LLMs. Here the authors suspect the problem is that the LLM fixates too much on individual words regardless of context, so they test a method that suppresses this tendency by contrasting against the response obtained when a safety prompt is added.
As the fact that it hinges on individual words rather than context suggests, I strongly suspect this is a problem related to model capacity. Of course, using instructions with more varied contexts would also help.
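A rough sketch of the contrastive idea under my reading of the abstract: run the model twice, with and without a safety-emphasizing system prompt, and subtract the amplified difference from the plain-prompt logits. The system prompt wording, the `alpha` scale, and the greedy argmax are assumptions; the paper's exact formulation may differ.

```python
import torch

@torch.no_grad()
def self_cd_next_token(model, tokenizer, user_query, alpha=1.0):
    # Build the same user query with and without a safety-emphasizing system prompt.
    # (The system prompt text and alpha are illustrative choices, not the paper's.)
    plain_ids = tokenizer.apply_chat_template(
        [{"role": "user", "content": user_query}],
        add_generation_prompt=True, return_tensors="pt")
    safety_ids = tokenizer.apply_chat_template(
        [{"role": "system", "content": "Always prioritize safety and refuse harmful requests."},
         {"role": "user", "content": user_query}],
        add_generation_prompt=True, return_tensors="pt")

    logits_plain = model(plain_ids).logits[0, -1]
    logits_safety = model(safety_ids).logits[0, -1]

    # Treat the gap between the two runs as the "over-attention" to safety cues,
    # and downplay it by subtracting it from the plain-prompt logits.
    adjusted = logits_plain - alpha * (logits_safety - logits_plain)
    return int(adjusted.argmax())
```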
#safety
LongAlign: A Recipe for Long Context Alignment of Large Language Models
(Yushi Bai, Xin Lv, Jiajie Zhang, Yuze He, Ji Qi, Lei Hou, Jie Tang, Yuxiao Dong, Juanzi Li)
Extending large language models to effectively handle long contexts requires instruction fine-tuning on input sequences of similar length. To address this, we present LongAlign -- a recipe of the instruction data, training, and evaluation for long context alignment. First, we construct a long instruction-following dataset using Self-Instruct. To ensure the data diversity, it covers a broad range of tasks from various long context sources. Second, we adopt the packing and sorted batching strategies to speed up supervised fine-tuning on data with varied length distributions. Additionally, we develop a loss weighting method to balance the contribution to the loss across different sequences during packing training. Third, we introduce the LongBench-Chat benchmark for evaluating instruction-following capabilities on queries of 10k-100k in length. Experiments show that LongAlign outperforms existing recipes for LLMs in long context tasks by up to 30%, while also maintaining their proficiency in handling short, generic tasks. The code, data, and long-aligned models are open-sourced at https://github.com/THUDM/LongAlign.
Experimental results on instruction tuning with long instruction data. Using long instruction data helps on tasks that require long contexts, without hurting performance on short-context tasks, and batching sequences of similar length together, or packing plus length-aware loss weighting, seem to be reasonable approaches.
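On the loss weighting for packed training, a minimal sketch of one natural reading: average the loss within each packed sequence first, then across sequences, so that short sequences are not drowned out by long ones. The exact weighting used in the paper may differ; `packed_sft_loss` and `seq_ids` are illustrative names.

```python
import torch
import torch.nn.functional as F

def packed_sft_loss(logits, labels, seq_ids):
    # logits:  (total_tokens, vocab) -- next-token logits for one packed batch
    # labels:  (total_tokens,)       -- target ids, -100 on tokens to ignore
    # seq_ids: (total_tokens,)       -- index of the original sequence each token came from
    #
    # A plain token-level mean over the pack lets long sequences dominate; instead,
    # compute each sequence's mean loss and average those.
    token_loss = F.cross_entropy(logits, labels, reduction="none", ignore_index=-100)
    valid = labels != -100

    per_seq = []
    for sid in torch.unique(seq_ids):
        mask = (seq_ids == sid) & valid
        if mask.any():
            per_seq.append(token_loss[mask].mean())
    return torch.stack(per_seq).mean()
```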
#instruction-tuning #long-context