January 15, 2025
MiniMax-01: Scaling Foundation Models with Lightning Attention
(MiniMax, Aonian Li, Bangwei Gong, Bo Yang, Boji Shan, Chang Liu, Cheng Zhu, Chunhao Zhang, Congchao Guo, Da Chen, Dong Li, Enwei Jiao, Gengxin Li, Guojun Zhang, Haohai Sun, Houze Dong, Jiadai Zhu, Jiaqi Zhuang, Jiayuan Song, Jin Zhu, Jingtao Han, Jingyang Li, Junbin Xie, Junhao Xu, Junjie Yan, Kaishun Zhang, Kecheng Xiao, Kexi Kang, Le Han, Leyang Wang, Lianfei Yu, Liheng Feng, Lin Zheng, Linbo Chai, Long Xing, Meizhi Ju, Mingyuan Chi, Mozhi Zhang, Peikai Huang, Pengcheng Niu, Pengfei Li, Pengyu Zhao, Qi Yang, Qidi Xu, Qiexiang Wang, Qin Wang, Qiuhui Li, Ruitao Leng, Shengmin Shi, Shuqi Yu, Sichen Li, Songquan Zhu, Tao Huang, Tianrun Liang, Weigao Sun, Weixuan Sun, Weiyu Cheng, Wenkai Li, Xiangjun Song, Xiao Su, Xiaodong Han, Xinjie Zhang, Xinzhu Hou, Xu Min, Xun Zou, Xuyang Shen, Yan Gong, Yingjie Zhu, Yipeng Zhou, Yiran Zhong, Yongyi Hu, Yuanxiang Fan, Yue Yu, Yufeng Yang, Yuhao Li, Yunan Huang, Yunji Li, Yunpeng Huang, Yunzhi Xu, Yuxin Mao, Zehan Li, Zekang Li, Zewei Tao, Zewen Ying, Zhaoyang Cong, Zhen Qin, Zhenhua Fan, Zhihang Yu, Zhuo Jiang, Zijia Wu)
We introduce MiniMax-01 series, including MiniMax-Text-01 and MiniMax-VL-01, which are comparable to top-tier models while offering superior capabilities in processing longer contexts. The core lies in lightning attention and its efficient scaling. To maximize computational capacity, we integrate it with Mixture of Experts (MoE), creating a model with 32 experts and 456 billion total parameters, of which 45.9 billion are activated for each token. We develop an optimized parallel strategy and highly efficient computation-communication overlap techniques for MoE and lightning attention. This approach enables us to conduct efficient training and inference on models with hundreds of billions of parameters across contexts spanning millions of tokens. The context window of MiniMax-Text-01 can reach up to 1 million tokens during training and extrapolate to 4 million tokens during inference at an affordable cost. Our vision-language model, MiniMax-VL-01 is built through continued training with 512 billion vision-language tokens. Experiments on both standard and in-house benchmarks show that our models match the performance of state-of-the-art models like GPT-4o and Claude-3.5-Sonnet while offering 20-32 times longer context window. We publicly release MiniMax-01 at https://github.com/MiniMax-AI.
MiniMax's MoE LLM with 456B total parameters, of which 46B are activated per token, and a 1M-token context length. It appears to have been trained on roughly 11T tokens. In NIAH (needle-in-a-haystack) experiments, they successfully extrapolated to a 4M-token context length.
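To make the parameter accounting concrete, here is a minimal sketch of generic top-k MoE routing, illustrating why only a fraction of the total parameters is activated per token. The expert count (32) follows the abstract, but the top-k value, hidden sizes, and routing details below are illustrative assumptions, not MiniMax's actual configuration.

```python
# Sketch of generic top-k MoE routing (assumed config, not MiniMax-01's):
# each token is routed to k of E experts, so per-token compute touches only
# roughly k/E of the expert parameters plus the shared (attention) weights.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=32, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)])

    def forward(self, x):                        # x: (tokens, d_model)
        scores = self.router(x)                  # (tokens, n_experts)
        topv, topi = scores.topk(self.k, dim=-1)
        gates = F.softmax(topv, dim=-1)          # renormalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):               # each token visits only k experts
            for e, expert in enumerate(self.experts):
                mask = topi[:, slot] == e
                if mask.any():
                    gate = gates[mask, slot].unsqueeze(-1)
                    out[mask] = out[mask] + gate * expert(x[mask])
        return out


moe = TopKMoE()
total = sum(p.numel() for p in moe.experts.parameters())
active = total * moe.k / len(moe.experts)        # expert params touched per token
print(f"expert params: {total/1e6:.1f}M total, ~{active/1e6:.1f}M active per token")
```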
From an architectural standpoint, it's extremely interesting: the model is a hybrid of linear (lightning) attention and softmax attention, and notably it uses post-norm (!) (https://arxiv.org/abs/2203.00555). They've also implemented numerous optimizations for long-context training and inference. This is a strong demonstration that hybrid models can match state-of-the-art performance.
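Below is a minimal sketch of what such a hybrid, post-norm decoder stack can look like. A generic causal linear attention (elu feature map plus prefix sums) stands in for lightning attention, which in the paper is a tiled, block-wise implementation; the layer ratio, dimensions, and module names are illustrative assumptions, not the authors' code.

```python
# Hybrid decoder sketch: most blocks use linear attention, every Nth block
# uses standard softmax attention, and residuals follow post-norm ordering.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LinearAttention(nn.Module):
    """Causal linear attention: O(n) via running sums of k^T v."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.out = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):                        # x: (B, T, D)
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.reshape(B, T, self.n_heads, self.d_head).transpose(1, 2)
                   for t in (q, k, v))
        q, k = F.elu(q) + 1, F.elu(k) + 1        # positive feature map
        # Causal prefix sums replace the O(n^2) attention score matrix.
        kv = torch.cumsum(torch.einsum("bhtd,bhte->bhtde", k, v), dim=2)
        z = torch.cumsum(k, dim=2)
        num = torch.einsum("bhtd,bhtde->bhte", q, kv)
        den = torch.einsum("bhtd,bhtd->bht", q, z).clamp(min=1e-6)
        y = (num / den.unsqueeze(-1)).transpose(1, 2).reshape(B, T, D)
        return self.out(y)


class SoftmaxAttention(nn.Module):
    """Standard causal softmax attention (O(n^2))."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        T = x.size(1)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), 1)
        y, _ = self.attn(x, x, x, attn_mask=mask)
        return y


class PostNormBlock(nn.Module):
    """Post-norm residual block: LayerNorm applied AFTER the residual add."""
    def __init__(self, d_model, n_heads, use_softmax):
        super().__init__()
        self.mix = (SoftmaxAttention(d_model, n_heads) if use_softmax
                    else LinearAttention(d_model, n_heads))
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        x = self.norm1(x + self.mix(x))          # post-norm, not pre-norm
        x = self.norm2(x + self.ffn(x))
        return x


def build_hybrid_stack(n_layers=8, d_model=256, n_heads=4, softmax_every=8):
    # e.g. 7 linear-attention blocks followed by 1 softmax-attention block.
    return nn.Sequential(*[
        PostNormBlock(d_model, n_heads, use_softmax=((i + 1) % softmax_every == 0))
        for i in range(n_layers)])


if __name__ == "__main__":
    model = build_hybrid_stack()
    x = torch.randn(2, 128, 256)                 # (batch, seq_len, d_model)
    print(model(x).shape)                        # torch.Size([2, 128, 256])
```

The general idea of such hybrids is that most layers run in time linear in sequence length, while the occasional softmax layer restores the exact token-to-token retrieval that pure linear attention tends to lose.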
The data experimentation is equally intriguing: they use power analysis to determine how large an experiment must be to reliably detect the effect of a data change.
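As a rough illustration of that kind of calculation (all numbers below are assumed, not taken from the paper): given an expected benchmark gain and the run-to-run noise of the metric, a standard power analysis yields the number of runs needed to detect the effect at a chosen significance level and power.

```python
# Minimal power-analysis sketch: how many runs per arm are needed to detect
# an assumed benchmark improvement, given assumed run-to-run noise?
from statsmodels.stats.power import TTestIndPower

expected_gain = 0.5      # expected benchmark delta, in points (assumed)
run_to_run_std = 1.2     # std-dev of the metric across runs/seeds (assumed)
effect_size = expected_gain / run_to_run_std     # Cohen's d

n_per_arm = TTestIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,          # tolerated false-positive rate
    power=0.8,           # desired chance of detecting a real effect
    alternative="two-sided")
print(f"Cohen's d = {effect_size:.2f}; need ~{n_per_arm:.0f} runs per arm")
```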
#llm #state-space-model #efficient-attention #moe