February 24, 2025
Muon is Scalable for LLM Training
(Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, Yanru Chen, Huabin Zheng, Yibo Liu, Shaowei Liu, Bohong Yin, Weiran He, Han Zhu, Yuzhi Wang, Jianzhou Wang, Mengnan Dong, Zheng Zhang, Yongsheng Kang, Hao Zhang, Xinran Xu, Yutao Zhang, Yuxin Wu, Xinyu Zhou, Zhilin Yang)
Recently, the Muon optimizer (K. Jordan et al. 2024) based on matrix orthogonalization has demonstrated strong results in training small-scale language models, but the scalability to larger models has not been proven. We identify two crucial techniques for scaling up Muon: (1) adding weight decay and (2) carefully adjusting the per-parameter update scale. These techniques allow Muon to work out-of-the-box on large-scale training without the need of hyper-parameter tuning. Scaling law experiments indicate that Muon achieves ∼ 2× computational efficiency compared to AdamW with compute optimal training. Based on these improvements, we introduce Moonlight, a 3B/16B-parameter Mixture-of-Expert (MoE) model trained with 5.7T tokens using Muon. Our model improves the current Pareto frontier, achieving better performance with much fewer training FLOPs compared to prior models. We open-source our distributed Muon implementation that is memory optimal and communication efficient. We also release the pretrained, instruction-tuned, and intermediate checkpoints to support future research.
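To make the two modifications concrete, here is a minimal sketch of a single Muon update step: Newton-Schulz orthogonalization of the momentum (coefficients taken from Keller Jordan's public Muon implementation), followed by a 0.2·√max(n, m) rescaling to roughly match AdamW's update RMS and an AdamW-style decoupled weight decay, as described in the abstract. The function names, defaults, and the exact scaling constant are illustrative assumptions, not the authors' released code.

```python
import torch

def newton_schulz(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    # Quintic Newton-Schulz iteration that approximately orthogonalizes G
    # (coefficients as in Keller Jordan's public Muon implementation).
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.to(torch.bfloat16)
    X = X / (X.norm() + eps)          # Frobenius normalization bounds the spectral norm by 1
    transposed = X.size(0) > X.size(1)
    if transposed:                    # iterate on the "wide" orientation for efficiency
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    if transposed:
        X = X.T
    return X.to(G.dtype)

@torch.no_grad()
def muon_step(param: torch.Tensor, grad: torch.Tensor, momentum_buf: torch.Tensor,
              lr: float = 2e-2, momentum: float = 0.95, weight_decay: float = 0.1) -> None:
    # One Muon update for a 2D weight matrix:
    # momentum -> orthogonalize -> rescale toward AdamW's update RMS -> decoupled weight decay.
    momentum_buf.mul_(momentum).add_(grad)        # heavy-ball momentum accumulation
    update = newton_schulz(momentum_buf)
    n, m = param.shape
    update = update * (0.2 * max(n, m) ** 0.5)    # keep update RMS near AdamW's typical ~0.2
    param.mul_(1 - lr * weight_decay)             # AdamW-style decoupled weight decay
    param.add_(update, alpha=-lr)
```

The per-shape rescaling is what lets the same learning rate and weight decay be reused across matrices of different sizes, which is the "works out-of-the-box without hyper-parameter tuning" claim in the abstract.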
Moonshot AI has released large-scale training experiments and scaling-law estimates for the Muon optimizer (https://kellerjordan.github.io/posts/muon/). They report roughly 2x the computational efficiency of AdamW at compute-optimal training, and their distributed implementation uses a simple scheme that reconstructs the full update matrices with all-gather operations.
It was already known that Google uses second-order optimizers, but detailed results along these lines are now starting to appear publicly. This might change the standard recipe for LLM training.
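As a rough sketch of what such an all-gather-based distributed step could look like (my reading of the description, not the released implementation): optimizer state stays sharded across data-parallel ranks as in ZeRO-1, the momentum update is done locally on each shard, and the full matrix that Newton-Schulz needs is reassembled with an all-gather before the update is applied. The sharding layout, helper names, and hyperparameters below are illustrative assumptions; `newton_schulz` is the sketch above.

```python
import torch
import torch.distributed as dist

@torch.no_grad()
def distributed_muon_step(param: torch.Tensor, grad_shard: torch.Tensor,
                          momentum_shard: torch.Tensor, group=None,
                          lr: float = 2e-2, momentum: float = 0.95,
                          weight_decay: float = 0.1) -> None:
    # ZeRO-1-style sketch: `param` is the full (replicated) 2D weight,
    # `grad_shard`/`momentum_shard` are this rank's flat 1/world_size slices
    # (e.g. after a reduce-scatter of the gradient). Illustrative only.
    world_size = dist.get_world_size(group)

    # 1. Local momentum update on the shard keeps optimizer memory sharded.
    momentum_shard.mul_(momentum).add_(grad_shard)

    # 2. All-gather the shards so every rank can rebuild the full momentum
    #    matrix, which Newton-Schulz needs as a whole (unlike elementwise AdamW).
    gathered = [torch.empty_like(momentum_shard) for _ in range(world_size)]
    dist.all_gather(gathered, momentum_shard, group=group)
    full_momentum = torch.cat(gathered).view_as(param)

    # 3. Orthogonalize and rescale as in the single-device sketch above.
    update = newton_schulz(full_momentum)
    n, m = param.shape
    update = update * (0.2 * max(n, m) ** 0.5)

    # 4. Decoupled weight decay plus the update, applied identically on every rank.
    param.mul_(1 - lr * weight_decay)
    param.add_(update, alpha=-lr)
```

The appeal of this layout is that Muon's extra cost over AdamW is just one all-gather plus a few small matrix multiplies per weight matrix, while the per-rank optimizer memory stays at the sharded momentum buffer.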
#optimizer #scaling-law