June 13, 2024
An Empirical Study of Mamba-based Language Models
(Roger Waleffe, Wonmin Byeon, Duncan Riach, Brandon Norick, Vijay Korthikanti, Tri Dao, Albert Gu, Ali Hatamizadeh, Sudhakar Singh, Deepak Narayanan, Garvit Kulshreshtha, Vartika Singh, Jared Casper, Jan Kautz, Mohammad Shoeybi, Bryan Catanzaro)
Selective state-space models (SSMs) like Mamba overcome some of the shortcomings of Transformers, such as quadratic computational complexity with sequence length and large inference-time memory requirements from the key-value cache. Moreover, recent studies have shown that SSMs can match or exceed the language modeling capabilities of Transformers, making them an attractive alternative. In a controlled setting (e.g., same data), however, studies so far have only presented small scale experiments comparing SSMs to Transformers. To understand the strengths and weaknesses of these architectures at larger scales, we present a direct comparison between 8B-parameter Mamba, Mamba-2, and Transformer models trained on the same datasets of up to 3.5T tokens. We also compare these models to a hybrid architecture consisting of 43% Mamba-2, 7% attention, and 50% MLP layers (Mamba-2-Hybrid). Using a diverse set of tasks, we answer the question of whether Mamba models can match Transformers at larger training budgets. Our results show that while pure SSMs match or exceed Transformers on many tasks, they lag behind Transformers on tasks which require strong copying or in-context learning abilities (e.g., 5-shot MMLU, Phonebook) or long-context reasoning. In contrast, we find that the 8B Mamba-2-Hybrid exceeds the 8B Transformer on all 12 standard tasks we evaluated (+2.65 points on average) and is predicted to be up to 8x faster when generating tokens at inference time. To validate long-context capabilities, we provide additional experiments evaluating variants of the Mamba-2-Hybrid and Transformer extended to support 16K, 32K, and 128K sequences. On an additional 23 long-context tasks, the hybrid model continues to closely match or exceed the Transformer on average. To enable further study, we release the checkpoints as well as the code used to train our models as part of NVIDIA's Megatron-LM project.
NVIDIA has trained and evaluated Mamba, Mamba-2 (https://arxiv.org/abs/2405.21060), and a hybrid model. The code is reportedly released as part of Megatron-LM.
They rediscover that Mamba alone struggles, and confirm that the hybrid model can nearly match or exceed the Transformer. However, they also find that the hybrid model is context-sensitive, i.e., sensitive to text that does not fit the context and to prompt formatting. They suggest that the common training practice of concatenating unrelated documents to fill out the sequence length may not suit SSMs. Something like a state reset might be needed.
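Just to make the state-reset idea concrete, here is a toy sketch (not anything from the paper): a plain linear recurrence where the hidden state is zeroed at packed-document boundaries so unrelated documents in one training sequence cannot leak into each other's state. The scan_with_resets helper, the fixed A/B matrices, and the doc_start mask are all invented for the example.

```python
import torch

def scan_with_resets(x, A, B, doc_start):
    """x: (T, d_in), A: (d_state, d_state), B: (d_state, d_in),
    doc_start: (T,) bool mask marking the first token of each packed document."""
    T = x.shape[0]
    h = torch.zeros(A.shape[0])
    states = []
    for t in range(T):
        if doc_start[t]:          # reset hidden state at document boundaries
            h = torch.zeros_like(h)
        h = A @ h + B @ x[t]      # simple linear state-space update
        states.append(h)
    return torch.stack(states)

# Two unrelated documents packed into one sequence of length 8.
T, d_in, d_state = 8, 4, 6
x = torch.randn(T, d_in)
A = 0.9 * torch.eye(d_state)
B = torch.randn(d_state, d_in)
doc_start = torch.tensor([1, 0, 0, 0, 1, 0, 0, 0], dtype=torch.bool)
print(scan_with_resets(x, A, B, doc_start).shape)  # torch.Size([8, 6])
```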
#state-space-model
OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text
(Qingyun Li, Zhe Chen, Weiyun Wang, Wenhai Wang, Shenglong Ye, Zhenjiang Jin, Guanzhou Chen, Yinan He, Zhangwei Gao, Erfei Cui, Jiashuo Yu, Hao Tian, Jiasheng Zhou, Chao Xu, Bin Wang, Xingjian Wei, Wei Li, Wenjian Zhang, Bo Zhang, Pinlong Cai, Licheng Wen, Xiangchao Yan, Pei Chu, Yi Wang, Min Dou, Changyao Tian, Xizhou Zhu, Lewei Lu, Yushi Chen, Junjun He, Tong Lu, Yali Wang, Limin Wang, Dahua Lin, Yu Qiao, Botian Shi, Conghui He, Jifeng Dai)
Image-text interleaved data, consisting of multiple images and texts arranged in a natural document format, aligns with the presentation paradigm of internet data and closely resembles human reading habits. Recent studies have shown that such data aids multimodal in-context learning and maintains the capabilities of large language models during multimodal fine-tuning. However, the limited scale and diversity of current image-text interleaved data restrict the development of multimodal large language models. In this paper, we introduce OmniCorpus, a 10 billion-scale image-text interleaved dataset. Using an efficient data engine, we filter and extract large-scale high-quality documents, which contain 8.6 billion images and 1,696 billion text tokens. Compared to counterparts (e.g., MMC4, OBELICS), our dataset 1) has 15 times larger scales while maintaining good data quality; 2) features more diverse sources, including both English and non-English websites as well as video-centric websites; 3) is more flexible, easily degradable from an image-text interleaved format to pure text corpus and image-text pairs. Through comprehensive analysis and experiments, we validate the quality, usability, and effectiveness of the proposed dataset. We hope this could provide a solid data foundation for future multimodal model research. Code and data are released at https://github.com/OpenGVLab/OmniCorpus.
An interleaved image-text dataset: 8.6B images and about 1.7T text tokens, with even YouTube videos included among the data sources.
Every detail of the data preprocessing pipeline is interesting. They say they used an improved Trafilatura for main text extraction; I'd love to have that code.
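For reference, a minimal sketch of main-text extraction with the off-the-shelf Trafilatura library; the paper's improved variant is not public, so this only shows the baseline such a pipeline presumably starts from, and extract_main_text is just a hypothetical wrapper.

```python
import trafilatura

def extract_main_text(url: str) -> str | None:
    downloaded = trafilatura.fetch_url(url)   # raw HTML (None on failure)
    if downloaded is None:
        return None
    # Drop comments and boilerplate, keep the main body text.
    return trafilatura.extract(downloaded, include_comments=False)

print(extract_main_text("https://example.com"))
```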
#dataset
Collective Constitutional AI: Aligning a Language Model with Public Input
(Saffron Huang, Divya Siddarth, Liane Lovitt, Thomas I. Liao, Esin Durmus, Alex Tamkin, Deep Ganguli)
There is growing consensus that language model (LM) developers should not be the sole deciders of LM behavior, creating a need for methods that enable the broader public to collectively shape the behavior of LM systems that affect them. To address this need, we present Collective Constitutional AI (CCAI): a multi-stage process for sourcing and integrating public input into LMs-from identifying a target population to sourcing principles to training and evaluating a model. We demonstrate the real-world practicality of this approach by creating what is, to our knowledge, the first LM fine-tuned with collectively sourced public input and evaluating this model against a baseline model trained with established principles from a LM developer. Our quantitative evaluations demonstrate several benefits of our approach: the CCAI-trained model shows lower bias across nine social dimensions compared to the baseline model, while maintaining equivalent performance on language, math, and helpful-harmless evaluations. Qualitative comparisons of the models suggest that the models differ on the basis of their respective constitutions, e.g., when prompted with contentious topics, the CCAI-trained model tends to generate responses that reframe the matter positively instead of a refusal. These results demonstrate a promising, tractable pathway toward publicly informed development of language models.
An experiment in which the public participates in building a constitution. An interesting initiative. The differences from the constitution Anthropic itself set up do seem subtle, though: hard to capture quantitatively, with qualitative differences showing up here and there.
#safety #alignment