December 3, 2024
Yi-Lightning Technical Report
(01.AI: Alan Wake, Albert Wang, Bei Chen, C.X. Lv, Chao Li, Chengen Huang, Chenglin Cai, Chujie Zheng, Daniel Cooper, Ethan Dai, Fan Zhou, Feng Hu, Heng Ji, Howard Qiu, Jiangcheng Zhu, Jun Tian, Katherine Su, Lihuan Zhang, Liying Li, Ming Song, Mou Li, Peng Liu, Qichen Hu, Shawn Wang, Shijun Zhou, Shiyong Li, Tianhang Zhu, Wen Xie, Xiang He, Xiaobo Chen, Xiaohui Hu, Xiaoyi Ren, Xinyao Niu, Yanpeng Li, Yongke Zhao, Yongzhen Luo, Yuchi Xu, Yuxuan Sha, Zhaodong Yan, Zhiyuan Liu, Zirui Zhang)
This technical report presents Yi-Lightning, our latest flagship large language model (LLM). It achieves exceptional performance, ranking 6th overall on Chatbot Arena, with particularly strong results (2nd to 4th place) in specialized categories including Chinese, Math, Coding, and Hard Prompts. Yi-Lightning leverages an enhanced Mixture-of-Experts (MoE) architecture, featuring advanced expert segmentation and routing mechanisms coupled with optimized KV-caching techniques. Our development process encompasses comprehensive pre-training, supervised fine-tuning (SFT), and reinforcement learning from human feedback (RLHF), where we devise deliberate strategies for multi-stage training, synthetic data construction, and reward modeling. Furthermore, we implement RAISE (Responsible AI Safety Engine), a four-component framework to address safety issues across pre-training, post-training, and serving phases. Empowered by our scalable super-computing infrastructure, all these innovations substantially reduce training, deployment and inference costs while maintaining high-performance standards. With further evaluations on public academic benchmarks, Yi-Lightning demonstrates competitive performance against top-tier LLMs, while we observe a notable disparity between traditional, static benchmark results and real-world, dynamic human preferences. This observation prompts a critical reassessment of conventional benchmarks' utility in guiding the development of more intelligent and powerful AI systems for practical applications. Yi-Lightning is now available through our developer platform at
https://platform.lingyiwanwu.com
.
Technical report on Yi-Lightning. They employed fine-grained MoE in the style of DeepSeekMoE, combined sliding-window attention with global attention, reused KV caches, collected mathematical data following the DeepSeekMath approach, and used in-context pretraining. (https://arxiv.org/abs/2401.06066, https://arxiv.org/abs/2402.03300, https://arxiv.org/abs/2310.10638) It seems these methods have become standard choices nowadays.
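To make the fine-grained MoE idea concrete, here is a minimal sketch of a DeepSeekMoE-style layer with many small experts and top-k routing; the dimensions, expert counts, and the omission of shared experts and load-balancing losses are my own illustrative choices, not Yi-Lightning's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FineGrainedMoE(nn.Module):
    """Sketch of a DeepSeekMoE-style fine-grained MoE layer: experts are kept
    small and several are activated per token, giving the router a richer
    combinatorial space than a few large experts. Shared (always-on) experts
    and auxiliary load-balancing losses are omitted for brevity."""

    def __init__(self, d_model=1024, n_experts=64, top_k=6, d_expert=512):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.w_in = nn.Parameter(torch.randn(n_experts, d_model, d_expert) * 0.02)
        self.w_out = nn.Parameter(torch.randn(n_experts, d_expert, d_model) * 0.02)

    def forward(self, x):  # x: (tokens, d_model)
        gate = self.router(x).softmax(dim=-1)             # (tokens, n_experts)
        weights, idx = gate.topk(self.top_k, dim=-1)      # route each token to top-k small experts
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            e = idx[:, k]                                 # chosen expert id per token
            h = torch.einsum('td,tdf->tf', x, self.w_in[e])            # expert up-projection
            h = torch.einsum('tf,tfd->td', F.silu(h), self.w_out[e])   # expert down-projection
            out = out + weights[:, k:k+1] * h
        return out

# Toy usage: 8 tokens through the layer.
y = FineGrainedMoE()(torch.randn(8, 1024))
```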
#llm
Advancing Myopia To Holism: Fully Contrastive Language-Image Pre-training
(Haicheng Wang, Chen Ju, Weixiong Lin, Shuai Xiao, Mengting Chen, Yixuan Huang, Chang Liu, Mingshuai Yao, Jinsong Lan, Ying Chen, Qingwen Liu, Yanfeng Wang)
In the rapidly evolving field of vision-language models (VLMs), contrastive language-image pre-training (CLIP) has made significant strides, becoming the foundation for various downstream tasks. However, relying on a one-to-one (image, text) contrastive paradigm to learn alignment from large-scale messy web data, CLIP faces a serious myopic dilemma, resulting in biases towards monotonous short texts and shallow visual expressivity. To overcome these issues, this paper advances CLIP into a novel holistic paradigm, by updating both diverse data and alignment optimization. To obtain colorful data at low cost, we use image-to-text captioning to generate multi-texts for each image, from multiple perspectives, granularities, and hierarchies. Two gadgets are proposed to encourage textual diversity. To match such (image, multi-texts) pairs, we modify the CLIP image encoder into a multi-branch design, and propose multi-to-multi contrastive optimization for image-text part-to-part matching. As a result, diverse visual embeddings are learned for each image, bringing good interpretability and generalization. Extensive experiments and ablations across over ten benchmarks indicate that our holistic CLIP significantly outperforms the existing myopic CLIP, including image-text retrieval, open-vocabulary classification, and dense visual tasks.
Using multiple captions generated from different perspectives for each image in CLIP training.
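As a rough illustration of training against (image, multi-texts) pairs, below is a simple multi-positive contrastive loss in which each image has M captions as positives; the paper itself uses a multi-branch image encoder with part-to-part matching, which this sketch collapses into a single image embedding.

```python
import torch
import torch.nn.functional as F

def multi_caption_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Multi-positive contrastive loss sketch.

    image_emb: (B, D)    one embedding per image (the paper actually uses a
                         multi-branch image encoder; collapsed here)
    text_emb:  (B, M, D) M captions per image, generated from different
                         perspectives / granularities
    """
    B, M, D = text_emb.shape
    img = F.normalize(image_emb, dim=-1)
    txt = F.normalize(text_emb.reshape(B * M, D), dim=-1)

    # Image-to-text: each image's own M captions are positives, the rest negatives.
    logits_i2t = img @ txt.t() / temperature                 # (B, B*M)
    pos_mask = torch.zeros(B, B * M, dtype=torch.bool, device=logits_i2t.device)
    for i in range(B):
        pos_mask[i, i * M:(i + 1) * M] = True
    loss_i2t = -logits_i2t.log_softmax(dim=-1)[pos_mask].mean()

    # Text-to-image: each caption's positive is its source image.
    logits_t2i = txt @ img.t() / temperature                 # (B*M, B)
    targets = torch.arange(B, device=img.device).repeat_interleave(M)
    loss_t2i = F.cross_entropy(logits_t2i, targets)
    return 0.5 * (loss_i2t + loss_t2i)

# Toy usage: batch of 4 images, 3 captions each, 256-dim embeddings.
loss = multi_caption_contrastive_loss(torch.randn(4, 256), torch.randn(4, 3, 256))
```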
#captioning #clip #synthetic-data
RandAR: Decoder-only Autoregressive Visual Generation in Random Orders
(Ziqi Pang, Tianyuan Zhang, Fujun Luan, Yunze Man, Hao Tan, Kai Zhang, William T. Freeman, Yu-Xiong Wang)
We introduce RandAR, a decoder-only visual autoregressive (AR) model capable of generating images in arbitrary token orders. Unlike previous decoder-only AR models that rely on a predefined generation order, RandAR removes this inductive bias, unlocking new capabilities in decoder-only generation. Our essential design enables random order by inserting a "position instruction token" before each image token to be predicted, representing the spatial location of the next image token. Trained on randomly permuted token sequences -- a more challenging task than fixed-order generation -- RandAR achieves comparable performance to its conventional raster-order counterpart. More importantly, decoder-only transformers trained on random orders acquire new capabilities. To address the efficiency bottleneck of AR models, RandAR adopts parallel decoding with KV-Cache at inference time, enjoying a 2.5x acceleration without sacrificing generation quality. Additionally, RandAR supports inpainting, outpainting and resolution extrapolation in a zero-shot manner. We hope RandAR inspires new directions for decoder-only visual generation models and broadens their applications across diverse scenarios. Our project page is at
https://rand-ar.github.io/
.
Autoregressive image generation with random-order token prediction. (https://arxiv.org/abs/2411.00776) Since the model can predict tokens at arbitrary positions, it can decode multiple tokens in parallel and is naturally suited to tasks like inpainting. This approach seems to be becoming a worthwhile option if you want to do autoregressive image generation.
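A small sketch of how such random-order training data might be assembled, interleaving a position instruction token before each permuted image token; the helper names and token-id layout are hypothetical, not taken from the paper.

```python
import torch

def build_randar_sequence(image_tokens, pos_token_ids, seed=None):
    """Sketch of RandAR-style input construction (names and layout are
    illustrative, not the paper's implementation).

    image_tokens:  (N,) discrete image token ids for one image, raster order.
    pos_token_ids: (N,) special "position instruction token" ids, one per
                   spatial location.

    The sequence interleaves [pos_i, tok_i] pairs in a random permutation, so
    the model is always told where the next image token it predicts belongs.
    """
    g = torch.Generator()
    if seed is not None:
        g.manual_seed(seed)
    perm = torch.randperm(image_tokens.numel(), generator=g)

    seq = torch.stack([pos_token_ids[perm], image_tokens[perm]], dim=1).reshape(-1)
    # Train with next-token prediction, but compute the loss only at image-token
    # positions (odd indices); position instruction tokens are conditioning only.
    target_mask = torch.zeros_like(seq, dtype=torch.bool)
    target_mask[1::2] = True
    return seq, target_mask

# Toy usage: a 16x16 token grid, with position tokens placed after an assumed
# image-token vocabulary of size 4096.
image_tokens = torch.randint(0, 4096, (256,))
pos_token_ids = torch.arange(4096, 4096 + 256)
seq, mask = build_randar_sequence(image_tokens, pos_token_ids, seed=0)
```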
#autoregressive-model #image-generation
Scaling Law for Language Models Training Considering Batch Size
(Xian Shuai, Yiding Wang, Yimeng Wu, Xin Jiang, Xiaozhe Ren)
Large language models (LLMs) have made remarkable advances in recent years, with scaling laws playing a critical role in this rapid progress. In this paper, we empirically investigate how a critical hyper-parameter, i.e., the global batch size, influences the LLM training process. We begin by training language models ranging from 125 million to 2.6 billion parameters, using up to 300 billion high-quality tokens. Through these experiments, we establish a basic scaling law on model size and training data amount. We then examine how varying batch sizes and learning rates affect the convergence and generalization of these models. Our analysis yields batch size scaling laws under two different cases: with a fixed compute budget, and with a fixed amount of training data. Extrapolation experiments on models of increasing sizes validate our predicted laws, which provides guidance for optimizing LLM training strategies under specific resource constraints.
A scaling law for the optimal batch size as compute and data increase. Figuring out how to adjust the numerous knobs as training scale grows is a truly vexing problem. (https://arxiv.org/abs/2409.15156)
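As a toy illustration of how a batch-size scaling law of this kind gets used, the sketch below fits a power law B_opt = a * C^b to hypothetical (compute, best batch size) measurements in log-log space and extrapolates; the numbers are placeholders, not the paper's fitted coefficients.

```python
import numpy as np

# Hypothetical (compute budget in FLOPs, empirically best batch size in tokens)
# pairs from small-scale sweeps; placeholder values, not the paper's data.
compute = np.array([1e19, 3e19, 1e20, 3e20, 1e21])
best_batch = np.array([0.5e6, 0.8e6, 1.3e6, 2.1e6, 3.4e6])

# Fit a power law  B_opt = a * C^b  by linear regression in log-log space.
b, log_a = np.polyfit(np.log(compute), np.log(best_batch), deg=1)
a = np.exp(log_a)

def predict_batch_size(compute_budget):
    return a * compute_budget ** b

print(f"B_opt ~= {a:.3g} * C^{b:.3f}")
print(f"Predicted optimal batch size at 1e22 FLOPs: {predict_batch_size(1e22):.3g} tokens")
```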
#scaling-law #hyperparameter
Optimality of Gerver's Sofa
(Jineon Baek)
We resolve the moving sofa problem by showing that Gerver's construction with 18 curve sections attains the maximum area 2.2195⋯.
Although this is from a different field, a proof has been proposed that Gerver's sofa is optimal for the moving sofa problem (https://en.wikipedia.org/wiki/Moving_sofa_problem). I'm curious to see what the final verdict will be.
#off-topic