March 27, 2024
InternLM2 Technical Report
(Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, Xiaoyi Dong, Haodong Duan, Qi Fan, Zhaoye Fei, Yang Gao, Jiaye Ge, Chenya Gu, Yuzhe Gu, Tao Gui, Aijia Guo, Qipeng Guo, Conghui He, Yingfan Hu, Ting Huang, Tao Jiang, Penglong Jiao, Zhenjiang Jin, Zhikai Lei, Jiaxing Li, Jingwen Li, Linyang Li, Shuaibin Li, Wei Li, Yining Li, Hongwei Liu, Jiangning Liu, Jiawei Hong, Kaiwen Liu, Kuikun Liu, Xiaoran Liu, Chengqi Lv, Haijun Lv, Kai Lv, Li Ma, Runyuan Ma, Zerun Ma, Wenchang Ning, Linke Ouyang, Jiantao Qiu, Yuan Qu, Fukai Shang, Yunfan Shao, Demin Song, Zifan Song, Zhihao Sui, Peng Sun, Yu Sun, Huanze Tang, Bin Wang, Guoteng Wang, Jiaqi Wang, Jiayu Wang, Rui Wang, Yudong Wang, Ziyi Wang, Xingjian Wei, Qizhen Weng, Fan Wu, Yingtong Xiong, Chao Xu, Ruiliang Xu, Hang Yan, Yirong Yan, Xiaogui Yang, Haochen Ye, Huaiyuan Ying, Jia Yu, Jing Yu, Yuhang Zang, Chuyu Zhang, Li Zhang, Pan Zhang, Peng Zhang, Ruijie Zhang, Shuo Zhang, Songyang Zhang, Wenjian Zhang, Wenwei Zhang, Xingcheng Zhang, Xinyue Zhang, Hui Zhao, Qian Zhao, Xiaomeng Zhao, Fengzhe Zhou, Zaida Zhou, Jingming Zhuo, Yicheng Zou, Xipeng Qiu, Yu Qiao, Dahua Lin)
The evolution of Large Language Models (LLMs) like ChatGPT and GPT-4 has sparked discussions on the advent of Artificial General Intelligence (AGI). However, replicating such advancements in open-source models has been challenging. This paper introduces InternLM2, an open-source LLM that outperforms its predecessors in comprehensive evaluations across 6 dimensions and 30 benchmarks, long-context modeling, and open-ended subjective evaluations through innovative pre-training and optimization techniques. The pre-training process of InternLM2 is meticulously detailed, highlighting the preparation of diverse data types including text, code, and long-context data. InternLM2 efficiently captures long-term dependencies, initially trained on 4k tokens before advancing to 32k tokens in pre-training and fine-tuning stages, exhibiting remarkable performance on the 200k "Needle-in-a-Haystack" test. InternLM2 is further aligned using Supervised Fine-Tuning (SFT) and a novel Conditional Online Reinforcement Learning from Human Feedback (COOL RLHF) strategy that addresses conflicting human preferences and reward hacking. By releasing InternLM2 models in different training stages and model sizes, we provide the community with insights into the model's evolution.
InternLM2's technical report. The pre-training section is relatively detailed. In particular, it describes how the data was constructed with the 32K context in mind (e.g., a dependency sorting strategy for code data).
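The report doesn't spell out its implementation, but the basic idea of dependency sorting, concatenating files in a repository so that each file appears after the files it imports, can be sketched roughly as follows. The regex-based import detection and the function names here are my own simplification, not the paper's pipeline:

```python
import re
from collections import defaultdict, deque

def dependency_sort(files: dict[str, str]) -> list[str]:
    """Order files so that each file comes after the files it imports.

    `files` maps a module name (e.g. "utils") to its source code.
    Import detection via regex is a simplification for illustration.
    """
    imports = {
        name: {m for m in re.findall(r"^\s*(?:from|import)\s+(\w+)", src, re.M)
               if m in files and m != name}
        for name, src in files.items()
    }
    # Kahn's algorithm: repeatedly emit files whose dependencies are already emitted.
    indegree = {name: len(deps) for name, deps in imports.items()}
    dependents = defaultdict(list)
    for name, deps in imports.items():
        for dep in deps:
            dependents[dep].append(name)
    queue = deque(sorted(n for n, d in indegree.items() if d == 0))
    order = []
    while queue:
        name = queue.popleft()
        order.append(name)
        for dependent in dependents[name]:
            indegree[dependent] -= 1
            if indegree[dependent] == 0:
                queue.append(dependent)
    # Files caught in import cycles are appended at the end in arbitrary order.
    emitted = set(order)
    order.extend(n for n in files if n not in emitted)
    return order

# Example: concatenating in this order puts "utils" before "train".
repo = {"train": "import utils\n...", "utils": "import os\n..."}
print(dependency_sort(repo))  # ['utils', 'train']
```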
There is also quite a bit of information on the alignment side. SFT was done at a scale of about 10M samples. Notable features include a conditional reward model that uses system prompts, a ranking loss inspired by focal loss along with a penalty on the reward scores, and online RLHF run in three cycles. In online RLHF, they split the process into two paths: a fast path that quickly adds samples to block a reward hack once it appears, and a long-term path aimed at improving overall quality.
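As a rough illustration of the reward-model objective described above, here is a minimal sketch of a focal-style pairwise ranking loss combined with a penalty that keeps reward magnitudes bounded. The exact form, the coefficients, and the names are my assumptions, not the formulas from the report:

```python
import torch
import torch.nn.functional as F

def reward_ranking_loss(r_chosen, r_rejected, gamma=2.0, bound=5.0, penalty_weight=0.02):
    """Focal-style ranking loss for a reward model plus a score-magnitude penalty.

    r_chosen / r_rejected: reward scores for preferred / rejected responses.
    gamma, bound, penalty_weight are illustrative values, not the paper's.
    """
    # Standard pairwise ranking term: P(chosen > rejected).
    p = torch.sigmoid(r_chosen - r_rejected)
    # Focal-style modulation: pairs the model already ranks confidently
    # (p close to 1) contribute less, hard pairs contribute more.
    ranking = -((1.0 - p) ** gamma) * torch.log(p.clamp_min(1e-8))
    # Penalty discouraging reward scores from drifting outside [-bound, bound].
    penalty = F.relu(r_chosen.abs() - bound) ** 2 + F.relu(r_rejected.abs() - bound) ** 2
    return (ranking + penalty_weight * penalty).mean()

# Usage: scores from a reward-model head for a batch of preference pairs.
loss = reward_ranking_loss(torch.randn(8), torch.randn(8))
```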
One odd thing in the evaluation section is that the 20B and 7B models sometimes show similar performance, or even identical numbers. The identical numbers are probably a mistake, but I am curious which problems cause the cases where the performance comes out merely similar.
#pretraining #alignment
The Unreasonable Ineffectiveness of the Deeper Layers
(Andrey Gromov, Kushal Tirumala, Hassan Shapourian, Paolo Glorioso, Daniel A. Roberts)
We empirically study a simple layer-pruning strategy for popular families of open-weight pretrained LLMs, finding minimal degradation of performance on different question-answering benchmarks until after a large fraction (up to half) of the layers are removed. To prune these models, we identify the optimal block of layers to prune by considering similarity across layers; then, to "heal" the damage, we perform a small amount of finetuning. In particular, we use parameter-efficient finetuning (PEFT) methods, specifically quantization and Low Rank Adapters (QLoRA), such that each of our experiments can be performed on a single A100 GPU. From a practical perspective, these results suggest that layer pruning methods can complement other PEFT strategies to further reduce computational resources of finetuning on the one hand, and can improve the memory and latency of inference on the other hand. From a scientific perspective, the robustness of these LLMs to the deletion of layers implies either that current pretraining methods are not properly leveraging the parameters in the deeper layers of the network or that the shallow layers play a critical role in storing knowledge.
An attempt to measure the similarity between each layer's input and output, prune the layers with high similarity, and then use QLoRA to reduce the impact of the layer pruning. A short while ago there was a similar attempt that targeted self-attention, pruned redundant layers, and used adapters to offset the effect of the pruning (https://arxiv.org/abs/2403.15226).
Of course, a layer whose input and output are similar may simply be undertrained (https://huggingface.co/blog/lorinma/yi-9b-divedeep). So further training might make pruning harder, although judging from the plots it still works reasonably well even on Mistral. Even with overtraining, I doubt the redundant layers would disappear entirely.
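A minimal sketch of the block-selection step, loosely following the paper's idea of scoring each candidate block of n consecutive layers by the angular distance between the representation entering and leaving it; the function below and its calibration setup are illustrative assumptions:

```python
import torch

def select_prune_block(hidden_states: list[torch.Tensor], n: int) -> int:
    """Pick the start index of the n-layer block whose removal should hurt least.

    `hidden_states` is the per-layer list of activations on some calibration data
    (each of shape [tokens, hidden]), e.g. from a model run with
    output_hidden_states=True. If the representation entering a block is nearly
    identical to the one leaving it, the block is largely redundant.
    """
    scores = []
    for l in range(len(hidden_states) - n):
        x_in, x_out = hidden_states[l], hidden_states[l + n]
        cos = torch.nn.functional.cosine_similarity(x_in, x_out, dim=-1)
        angle = torch.arccos(cos.clamp(-1.0, 1.0))  # per-token angular distance
        scores.append(angle.mean())
    return int(torch.stack(scores).argmin())

# After dropping layers [start, start + n), a small QLoRA finetune "heals" the model.
```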
#pruning
Mechanistic Design and Scaling of Hybrid Architectures
(Michael Poli, Armin W Thomas, Eric Nguyen, Pragaash Ponnusamy, Björn Deiseroth, Kristian Kersting, Taiji Suzuki, Brian Hie, Stefano Ermon, Christopher Ré, Ce Zhang, Stefano Massaroli)
The development of deep learning architectures is a resource-demanding process, due to a vast design space, long prototyping times, and high compute costs associated with at-scale model training and evaluation. We set out to simplify this process by grounding it in an end-to-end mechanistic architecture design (MAD) pipeline, encompassing small-scale capability unit tests predictive of scaling laws. Through a suite of synthetic token manipulation tasks such as compression and recall, designed to probe capabilities, we identify and test new hybrid architectures constructed from a variety of computational primitives. We experimentally validate the resulting architectures via an extensive compute-optimal and a new state-optimal scaling law analysis, training over 500 language models between 70M to 7B parameters. Surprisingly, we find MAD synthetics to correlate with compute-optimal perplexity, enabling accurate evaluation of new architectures via isolated proxy tasks. The new architectures found via MAD, based on simple ideas such as hybridization and sparsity, outperform state-of-the-art Transformer, convolutional, and recurrent architectures (Transformer++, Hyena, Mamba) in scaling, both at compute-optimal budgets and in overtrained regimes. Overall, these results provide evidence that performance on curated synthetic tasks can be predictive of scaling laws, and that an optimal architecture should leverage specialized layers via a hybrid topology.
Scaling laws for hybrids of state space models and Transformers. A model with roughly 25% attention layers mixed in turned out to be optimal. Moreover, the hybrid models were advantageous in the regime of training smaller models for longer.
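As a toy illustration of what a roughly 25%-attention hybrid topology could look like, here is a hypothetical layer schedule; the even spacing and the block names are my own simplification, not the architectures the paper actually searches over:

```python
def hybrid_layer_schedule(n_layers: int = 24, attention_ratio: float = 0.25) -> list[str]:
    """Interleave attention blocks among SSM (e.g. Mamba/Hyena) blocks.

    The ~25% ratio reflects the paper's finding; spacing and names are illustrative.
    """
    n_attn = max(1, round(n_layers * attention_ratio))
    stride = n_layers / n_attn
    attn_positions = {round(i * stride) for i in range(n_attn)}
    return ["attention" if i in attn_positions else "ssm" for i in range(n_layers)]

print(hybrid_layer_schedule())  # 6 attention layers spread across 24 blocks
```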
Tasks like associative recall tell us a lot about the characteristics of sequence models. The paper tests how such synthetic token-manipulation tasks connect to scaling laws, and a linear relationship shows up between performance on the token-manipulation tasks and the perplexity of compute-optimal models. In other words, performance on small-scale synthetic tasks could be used to identify effective architectures.
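For reference, a minimal sketch of generating one associative-recall example of the kind used in such synthetic suites; the vocabulary layout and sizes here are illustrative assumptions, not MAD's exact task specification:

```python
import random

def associative_recall_example(vocab_size: int = 64, n_pairs: int = 8, seed: int | None = None):
    """Generate one associative-recall example.

    The sequence lists key-value pairs, then queries one of the keys; the model
    must output the matching value.
    """
    rng = random.Random(seed)
    keys = rng.sample(range(vocab_size), n_pairs)
    values = rng.sample(range(vocab_size, 2 * vocab_size), n_pairs)
    context = [tok for kv in zip(keys, values) for tok in kv]
    query_idx = rng.randrange(n_pairs)
    # Input: k1 v1 k2 v2 ... kN vN q ; target: the value paired with q.
    return context + [keys[query_idx]], values[query_idx]

inputs, target = associative_recall_example(seed=0)
```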
Hybrid architectures seem quite promising as a successor to the Transformer. Beyond being more efficient, I suspect state space models may also contribute desirable properties to the model's behavior.
#scaling-law #transformer #state-space-model