May 12, 2025
Seed-Coder: Let the Code Model Curate Data for Itself
(ByteDance Seed)
Code data in large language model (LLM) pretraining is recognized as crucial not only for code-related tasks but also for enhancing the general intelligence of LLMs. Current open-source LLMs often heavily rely on human effort to produce their code pretraining data, such as employing hand-crafted filtering rules tailored to individual programming languages, or using human-annotated data to train quality filters. However, these approaches are inherently limited in scalability, prone to subjective biases, and costly to extend and maintain across diverse programming languages. To address these challenges, we introduce Seed-Coder, a series of open-source LLMs comprising base, instruct and reasoning models of 8B size, minimizing human involvement in data construction. Our code pretraining data is produced by a model-centric data pipeline, which predominantly leverages LLMs for scoring and filtering code data. The instruct model is further trained via supervised fine-tuning and preference optimization, and the reasoning model leverages Long-Chain-of-Thought (LongCoT) reinforcement learning to improve multi-step code reasoning. Seed-Coder achieves state-of-the-art results among open-source models of similar size and even surpasses some much larger models, demonstrating superior performance in code generation, code completion, code editing, code reasoning, and software engineering tasks.
ByteDance's code model. The main idea is to focus on LLM-based filtering instead of heuristic filters (https://arxiv.org/abs/2412.02595). They used DeepSeek V2 for the filtering, which suggests they did not yet have a prepared model of their own at the time.
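A minimal sketch of the kind of model-centric filtering described above, assuming a hypothetical `llm_judge` callable that returns a 0-10 quality score for a given prompt; the actual Seed-Coder prompts, scoring rubric, and threshold are not specified here.

```python
from typing import Callable, Iterable

def filter_code_corpus(
    files: Iterable[tuple[str, str]],      # (path, source_code) pairs
    llm_judge: Callable[[str], float],     # hypothetical: returns a 0-10 quality score
    threshold: float = 6.0,
) -> list[tuple[str, str]]:
    """Keep only code files that an LLM judge scores at or above `threshold`.

    The point is to replace hand-crafted, per-language heuristics with a
    single model-based quality signal, in the spirit of Seed-Coder's pipeline.
    """
    kept = []
    for path, code in files:
        prompt = (
            "Rate the following code file from 0 to 10 for readability, "
            "correctness, and educational value. Answer with a number only.\n\n"
            f"{code}"
        )
        score = llm_judge(prompt)          # e.g. parse the model's numeric reply
        if score >= threshold:
            kept.append((path, code))
    return kept
```

In practice the judge would be a served LLM (the post notes DeepSeek V2 was used), and the scores can also be distilled into a small classifier so the filter scales to billions of files.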
#code #pretraining #corpus #reasoning #rl
Understanding Stragglers in Large Model Training Using What-if Analysis
(Jinkun Lin, Ziheng Jiang, Zuquan Song, Sida Zhao, Menghan Yu, Zhanghan Wang, Chenyuan Wang, Zuocheng Shi, Xiang Shi, Wei Jia, Zherui Liu, Shuguang Wang, Haibin Lin, Xiu Liu, Aurojit Panda, Jinyang Li)
Large language model (LLM) training is one of the most demanding distributed computations today, often requiring thousands of GPUs with frequent synchronization across machines. Such a workload pattern makes it susceptible to stragglers, where training can be stalled by a few slow workers. At ByteDance we find that stragglers are not always caused simply by hardware failures, but can arise from multiple complex factors. This work aims to present a comprehensive study of straggler issues in LLM training, using a five-month trace collected from our ByteDance LLM training cluster. The core methodology is what-if analysis, which simulates the scenario without any stragglers and contrasts it with the actual case. We use this method to study the following questions: (1) how often do stragglers affect training jobs, and what effect do they have on job performance; (2) do stragglers exhibit temporal or spatial patterns; and (3) what are the potential root causes of stragglers?
An analysis of stragglers, the slow workers that degrade performance in large-scale model training. Rather than faults in individual workers, problems inherent to the workload itself, such as imbalanced sequence lengths, and issues like garbage collection turn out to be the bigger contributors.
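A toy illustration of the what-if analysis idea, under the assumption that each synchronous training step is gated by the slowest worker: replace every worker's time with the per-step median to simulate the straggler-free case and compare against the observed times (the paper's actual trace format and simulator are more involved).

```python
import statistics

def whatif_slowdown(per_worker_step_times: list[list[float]]) -> float:
    """Estimate how much stragglers slow down a job.

    per_worker_step_times[s][w] is the time worker w spent on step s.
    Observed step time   = max over workers (synchronous training).
    'What-if' step time  = median over workers, i.e. pretend every worker
                           ran at typical speed for that step.
    """
    observed = sum(max(step) for step in per_worker_step_times)
    no_straggler = sum(statistics.median(step) for step in per_worker_step_times)
    return observed / no_straggler

# Example trace: step 2 has one straggler (e.g. a garbage-collection pause
# or an unusually long sequence on that worker).
trace = [
    [1.0, 1.1, 1.0, 1.0],
    [1.0, 1.0, 3.5, 1.1],   # worker 2 straggles on this step
    [1.0, 1.0, 1.0, 1.2],
]
print(f"estimated slowdown from stragglers: {whatif_slowdown(trace):.2f}x")
```

Contrasting the simulated straggler-free run with the real trace is what lets the authors attribute slowdowns to specific causes (sequence-length imbalance, garbage collection, etc.) rather than blaming hardware by default.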
#efficiency