September 4, 2024
OLMoE: Open Mixture-of-Experts Language Models
(Niklas Muennighoff, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Jacob Morrison, Sewon Min, Weijia Shi, Pete Walsh, Oyvind Tafjord, Nathan Lambert, Yuling Gu, Shane Arora, Akshita Bhagia, Dustin Schwenk, David Wadden, Alexander Wettig, Binyuan Hui, Tim Dettmers, Douwe Kiela, Ali Farhadi, Noah A. Smith, Pang Wei Koh, Amanpreet Singh, Hannaneh Hajishirzi)
We introduce OLMoE, a fully open, state-of-the-art language model leveraging sparse Mixture-of-Experts (MoE). OLMoE-1B-7B has 7 billion (B) parameters but uses only 1B per input token. We pretrain it on 5 trillion tokens and further adapt it to create OLMoE-1B-7B-Instruct. Our models outperform all available models with similar active parameters, even surpassing larger ones like Llama2-13B-Chat and DeepSeekMoE-16B. We present various experiments on MoE training, analyze routing in our model showing high specialization, and open-source all aspects of our work: model weights, training data, code, and logs.
A 1B-activated / 7B-total-weight MoE model from the OLMo team. Unprecedentedly detailed.
For this one, they moved from Dolma to DCLM. They adopt the DeepSeek-style design of many small experts (64 experts, 8 activated) but do not use a shared expert. They use token-choice routing, a router z-loss, and QK-Norm; for QK-Norm in particular they also found a performance benefit.
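As a reference point, here is a minimal PyTorch sketch of token-choice top-k routing with a router z-loss, plus QK-Norm applied to queries and keys. This is my own simplified illustration, not OLMoE's actual code; the module names, the z-loss coefficient, and the parameter-free LayerNorm variant of QK-Norm are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKRouter(nn.Module):
    """Token-choice top-k router with an auxiliary z-loss (simplified sketch)."""

    def __init__(self, d_model: int, n_experts: int = 64, k: int = 8, z_loss_coef: float = 1e-3):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.k = k
        self.z_loss_coef = z_loss_coef

    def forward(self, x: torch.Tensor):
        # x: (num_tokens, d_model) -> router logits: (num_tokens, n_experts)
        logits = self.gate(x)
        # Router z-loss: penalize large router logits for numerical stability.
        z_loss = self.z_loss_coef * torch.logsumexp(logits, dim=-1).pow(2).mean()
        probs = logits.softmax(dim=-1)
        # Each token chooses its top-k experts (token choice, no shared expert).
        topk_probs, topk_idx = probs.topk(self.k, dim=-1)
        topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)
        return topk_idx, topk_probs, z_loss


def qk_norm(q: torch.Tensor, k: torch.Tensor):
    """QK-Norm, sketched here as parameter-free LayerNorm over the head dimension."""
    q = F.layer_norm(q, q.shape[-1:])
    k = F.layer_norm(k, k.shape[-1:])
    return q, k
```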
The part about truncated normal initialization is more interesting. It may be because they use a fixed init std, but they observed cases where training blows up with a plain normal init yet stays stable with a truncated normal. They truncate at 3 std, so only about 0.27% of the parameters would otherwise have exceeded 3 std, most of them only slightly, and even that apparently made a big difference. This is worth thinking about. (As an aside, DeepSeek reportedly used a normal init with std 0.006.)
Since the parameter count is so large, the absolute number of parameters beyond 3 std is not small. (https://en.wikipedia.org/wiki/68–95–99.7_rule) I had been overlooking this, but it seems worth paying attention to.
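A quick back-of-the-envelope check (my own sketch; the 7B count is only used for the arithmetic, and std 0.02 is a placeholder rather than OLMoE's actual init std):

```python
import torch
from torch.distributions import Normal

# Fraction of a standard normal beyond +/- 3 std: roughly 0.27%.
frac = (2 * (1 - Normal(0.0, 1.0).cdf(torch.tensor(3.0)))).item()
print(f"fraction beyond 3 std: {frac:.4%}")        # ~0.2700%

# At ~7B total parameters that is still on the order of 19M weights.
print(f"weights beyond 3 std: {frac * 7e9:,.0f}")  # ~18,900,000

# A truncated normal init simply never produces those tail values.
w = torch.empty(4096, 4096)
torch.nn.init.trunc_normal_(w, mean=0.0, std=0.02, a=-3 * 0.02, b=3 * 0.02)
```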
#moe
Imitating Language via Scalable Inverse Reinforcement Learning
(Markus Wulfmeier, Michael Bloesch, Nino Vieillard, Arun Ahuja, Jorg Bornschein, Sandy Huang, Artem Sokolov, Matt Barnes, Guillaume Desjardins, Alex Bewley, Sarah Maria Elisabeth Bechtle, Jost Tobias Springenberg, Nikola Momchev, Olivier Bachem, Matthieu Geist, Martin Riedmiller)
The majority of language model training builds on imitation learning. It covers pretraining, supervised fine-tuning, and affects the starting conditions for reinforcement learning from human feedback (RLHF). The simplicity and scalability of maximum likelihood estimation (MLE) for next token prediction led to its role as predominant paradigm. However, the broader field of imitation learning can more effectively utilize the sequential structure underlying autoregressive generation. We focus on investigating the inverse reinforcement learning (IRL) perspective to imitation, extracting rewards and directly optimizing sequences instead of individual token likelihoods and evaluate its benefits for fine-tuning large language models. We provide a new angle, reformulating inverse soft-Q-learning as a temporal difference regularized extension of MLE. This creates a principled connection between MLE and IRL and allows trading off added complexity with increased performance and diversity of generations in the supervised fine-tuning (SFT) setting. We find clear advantages for IRL-based imitation, in particular for retaining diversity while maximizing task performance, rendering IRL a strong alternative on fixed SFT datasets even without online data generation. Our analysis of IRL-extracted reward functions further indicates benefits for more robust reward functions via tighter integration of supervised and preference-based LLM post-training.
A method that improves SFT from an imitation learning perspective. It feels like MLE plus a regularization term based on a value function.
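To make that concrete, here is a rough, hypothetical sketch of what "MLE plus a TD/value-function regularizer" could look like when LM logits are treated as soft Q-values. This is my guess at the general shape of such an objective, not the paper's actual inverse soft-Q-learning formulation; the function name, `gamma`, and `lam` are assumptions.

```python
import torch
import torch.nn.functional as F


def td_regularized_mle_loss(logits, targets, gamma: float = 0.99, lam: float = 0.1):
    """Hypothetical sketch: next-token MLE plus a temporal-difference-style penalty,
    treating LM logits as soft Q-values. logits: (batch, seq, vocab), targets: (batch, seq).
    """
    # Standard next-token MLE term.
    nll = F.cross_entropy(logits[:, :-1].transpose(1, 2), targets[:, 1:])

    # Soft value V(s_t) = logsumexp_a Q(s_t, a); Q of the token actually taken.
    v = torch.logsumexp(logits, dim=-1)                                     # (batch, seq)
    q_taken = logits[:, :-1].gather(-1, targets[:, 1:, None]).squeeze(-1)   # (batch, seq-1)

    # Penalize the implied per-step reward Q(s_t, a_t) - gamma * V(s_{t+1}),
    # which pushes the Q-values (logits) toward temporal consistency.
    td_residual = q_taken - gamma * v[:, 1:].detach()
    td_penalty = td_residual.pow(2).mean()

    return nll + lam * td_penalty
```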
#rl #alignment