April 22, 2025
Compute-Optimal LLMs Provably Generalize Better With Scale
(Marc Finzi, Sanyam Kapoor, Diego Granziol, Anming Gu, Christopher De Sa, J. Zico Kolter, Andrew Gordon Wilson)
Why do larger language models generalize better? To investigate this question, we develop generalization bounds on the pretraining objective of large language models (LLMs) in the compute-optimal regime, as described by the Chinchilla scaling laws. We introduce a novel, fully empirical Freedman-type martingale concentration inequality that tightens existing bounds by accounting for the variance of the loss function. This generalization bound can be decomposed into three interpretable components: the number of parameters per token, the loss variance, and the quantization error at a fixed bitrate. As compute-optimal language models are scaled up, the number of parameters per data point remains constant; however, both the loss variance and the quantization error decrease, implying that larger models should have smaller generalization gaps. We examine why larger models tend to be more quantizable from an information theoretic perspective, showing that the rate at which they can integrate new information grows more slowly than their capacity on the compute-optimal frontier. From these findings we produce a scaling law for the generalization gap, with bounds that become predictably stronger with scale.
An analysis of generalization bounds under Chinchilla (compute-optimal) scaling. The bound decomposes into random-guess, loss-variance, smoothing, and quantization terms, and the paper argues that the loss-variance and quantization terms shrink as model size increases.
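For orientation, the Freedman-type inequality the paper builds on has, in its classical (non-empirical) form, the shape below; the paper's fully empirical variant replaces the conditional variances with observed quantities, and its final bound adds the smoothing and quantization terms mentioned above, so treat this only as the general template rather than the paper's statement:

```latex
% Classical Freedman martingale inequality (Freedman, 1975).
% (X_i) is a martingale difference sequence with X_i <= b almost surely.
\[
  \Pr\!\left[\, \sum_{i=1}^{n} X_i \ge t
     \;\;\text{and}\;\;
     \sum_{i=1}^{n} \mathbb{E}\!\left[ X_i^2 \,\middle|\, \mathcal{F}_{i-1} \right] \le v \,\right]
  \;\le\;
  \exp\!\left( -\frac{t^2}{2\,(v + b t / 3)} \right)
\]
```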
#llm #generalization
Efficient Pretraining Length Scaling
(Bohong Wu, Shen Yan, Sijun Zhang, Jianqiao Lu, Yutao Zeng, Ya Wang, Xun Zhou)
Recent advances in large language models have demonstrated the effectiveness of length scaling during post-training, yet its potential in pre-training remains underexplored. We present the Parallel Hidden Decoding Transformer (PHD-Transformer), a novel framework that enables efficient length scaling during pre-training while maintaining inference efficiency. PHD-Transformer achieves this through an innovative KV cache management strategy that distinguishes between original tokens and hidden decoding tokens. By retaining only the KV cache of original tokens for long-range dependencies while immediately discarding hidden decoding tokens after use, our approach maintains the same KV cache size as the vanilla transformer while enabling effective length scaling. To further enhance performance, we introduce two optimized variants: PHD-SWA employs sliding window attention to preserve local dependencies, while PHD-CSWA implements chunk-wise sliding window attention to eliminate linear growth in pre-filling time. Extensive experiments demonstrate consistent improvements across multiple benchmarks.
It seems approaches like reasoning CoT are now being called "length scaling." This paper applies length scaling at pretraining time: each token is duplicated K times, only the first copy is stored in the KV cache, and the remaining copies are used solely for next-token prediction. Quite a clever approach.
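A minimal single-layer, single-head sketch of the cache bookkeeping as I read it from the abstract and the description above; everything here (module names, the learned per-copy offsets, using the last copy's output for prediction) is my assumption, not the paper's implementation:

```python
# Sketch of PHD-style length scaling: each new token is expanded into K copies,
# all K copies go through attention, but only the first ("original") copy's
# key/value is appended to the cache; the K-1 hidden decoding copies are used
# for the prediction and then discarded.
import torch
import torch.nn.functional as F

class PHDAttentionSketch(torch.nn.Module):
    def __init__(self, d_model: int, k_repeat: int):
        super().__init__()
        self.k_repeat = k_repeat
        self.q_proj = torch.nn.Linear(d_model, d_model, bias=False)
        self.k_proj = torch.nn.Linear(d_model, d_model, bias=False)
        self.v_proj = torch.nn.Linear(d_model, d_model, bias=False)
        # Learned offsets distinguishing the K copies of the same token
        # (an assumption about how the copies are made distinguishable).
        self.copy_embed = torch.nn.Parameter(torch.zeros(k_repeat, d_model))

    def decode_step(self, x, kv_cache):
        """x: [d_model] embedding of the new token; kv_cache: (K_cache, V_cache)."""
        copies = x.unsqueeze(0) + self.copy_embed                # [K, d_model]
        q = self.q_proj(copies)
        k_new, v_new = self.k_proj(copies), self.v_proj(copies)

        k_cache, v_cache = kv_cache
        # Attend over cached original tokens plus all K copies of the current
        # token (causal masking within the copies omitted for brevity).
        k_all = torch.cat([k_cache, k_new], dim=0)
        v_all = torch.cat([v_cache, v_new], dim=0)
        attn = F.softmax(q @ k_all.T / k_all.shape[-1] ** 0.5, dim=-1)
        out = attn @ v_all                                       # [K, d_model]

        # Only the first copy's KV is retained; hidden decoding copies are
        # dropped, so the cache size matches a vanilla transformer.
        new_cache = (torch.cat([k_cache, k_new[:1]], dim=0),
                     torch.cat([v_cache, v_new[:1]], dim=0))
        # The last copy's output feeds next-token prediction (an assumption).
        return out[-1], new_cache

layer = PHDAttentionSketch(d_model=64, k_repeat=4)
cache = (torch.zeros(0, 64), torch.zeros(0, 64))
h, cache = layer.decode_step(torch.randn(64), cache)
```

With K copies per step the FLOPs per decoded token grow roughly K-fold, but the KV cache, and therefore the memory-bound part of decoding, stays the same size as in a vanilla transformer, which seems to be the point of the design.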
#pretraining
MoE Parallel Folding: Heterogeneous Parallelism Mappings for Efficient Large-Scale MoE Model Training with Megatron Core
(Dennis Liu, Zijie Yan, Xin Yao, Tong Liu, Vijay Korthikanti, Evan Wu, Shiqing Fan, Gao Deng, Hongxiao Bai, Ashwath Aithal, Michael Andersch, Mohammad Shoeybi, Jiajie Yao, Chandler Zhou, David Wu, Xipeng Li, June Yang)
Mixture of Experts (MoE) models enhance neural network scalability by dynamically selecting relevant experts per input token, enabling larger model sizes while maintaining manageable computation costs. However, efficient training of large-scale MoE models across thousands of GPUs presents significant challenges due to limitations in existing parallelism strategies. We introduce an end-to-end training framework for large-scale MoE models that utilizes five-dimensional hybrid parallelism: Tensor Parallelism, Expert Parallelism, Context Parallelism, Data Parallelism, and Pipeline Parallelism. Central to our approach is MoE Parallel Folding, a novel strategy that decouples the parallelization of attention and MoE layers in Transformer models, allowing each layer type to adopt optimal parallel configurations. Additionally, we develop a flexible token-level dispatcher that supports both token-dropping and token-dropless MoE training across all five dimensions of parallelism. This dispatcher accommodates dynamic tensor shapes and coordinates different parallelism schemes for Attention and MoE layers, facilitating complex parallelism implementations. Our experiments demonstrate significant improvements in training efficiency and scalability. We achieve up to 49.3% Model Flops Utilization (MFU) for the Mixtral 8x22B model and 39.0% MFU for the Qwen2-57B-A14B model on H100 GPUs, outperforming existing methods. The framework scales efficiently up to 1,024 GPUs and maintains high performance with sequence lengths up to 128K tokens, validating its effectiveness for large-scale MoE model training. The code is available in Megatron-Core.
This is the result of Megatron's MoE training optimization work. The main technique introduced is decoupling the parallelization strategies of the attention and MoE layers so that each can be configured independently.
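A toy illustration of what the decoupling buys: attention layers can be sharded with TP x CP while the MoE layers on the same GPUs use EP (with their own expert TP), as long as each factorization, together with its own data-parallel dimension and the shared pipeline dimension, covers the full world size. The arithmetic below is my simplified reading, not Megatron-Core's actual process-group code:

```python
# Toy "parallel folding" check: attention and MoE layers of the same model are
# mapped onto the same GPUs with independent parallelism factorizations; the
# only hard constraint is that both factorizations (times their own data
# parallelism and the shared pipeline parallelism) equal the world size.
from dataclasses import dataclass

@dataclass
class ParallelConfig:
    tp: int   # tensor parallel (attention)
    cp: int   # context parallel (attention)
    etp: int  # expert tensor parallel (MoE)
    ep: int   # expert parallel (MoE)
    pp: int   # pipeline parallel (shared)
    world_size: int

    def data_parallel_sizes(self):
        attn_dp, rem = divmod(self.world_size, self.tp * self.cp * self.pp)
        assert rem == 0, "attention factorization must divide world size"
        moe_dp, rem = divmod(self.world_size, self.etp * self.ep * self.pp)
        assert rem == 0, "MoE factorization must divide world size"
        return attn_dp, moe_dp

# Example: attention runs TP=4 x CP=2, MoE runs EP=8 with no expert TP,
# both folded onto the same 64 GPUs with PP=2.
cfg = ParallelConfig(tp=4, cp=2, etp=1, ep=8, pp=2, world_size=64)
print(cfg.data_parallel_sizes())  # (4, 4): attention DP=4, MoE DP=4
```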
#moe #efficiency
Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models
(Xinlin Zhuang, Jiahui Peng, Ren Ma, Yinfan Wang, Tianyi Bai, Xingjian Wei, Jiantao Qiu, Chi Zhang, Ying Qian, Conghui He)
The composition of pre-training datasets for large language models (LLMs) remains largely undisclosed, hindering transparency and efforts to optimize data quality, a critical driver of model performance. Current data selection methods, such as natural language quality assessments, diversity-based filters, and classifier-based approaches, are limited by single-dimensional evaluation or redundancy-focused strategies. To address these gaps, we propose PRRC to evaluate data quality across Professionalism, Readability, Reasoning, and Cleanliness. We further introduce Meta-rater, a multi-dimensional data selection method that integrates these dimensions with existing quality metrics through learned optimal weightings. Meta-rater employs proxy models to train a regression model that predicts validation loss, enabling the identification of optimal combinations of quality scores. Experiments demonstrate that Meta-rater doubles convergence speed for 1.3B parameter models and improves downstream task performance by 3.23, with scalable benefits observed in 3.3B models trained on 100B tokens. Additionally, we release the annotated SlimPajama-627B dataset, labeled across 25 quality metrics (including PRRC), to advance research in data-centric LLM development. Our work establishes that holistic, multi-dimensional quality integration significantly outperforms conventional single-dimension approaches, offering a scalable paradigm for enhancing pre-training efficiency and model capability.
The idea of filtering pretraining data by combining a range of quality metrics. Perhaps the ensemble of reasonably useful metrics is itself what provides the value.
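A hedged sketch of the selection loop as I read it from the abstract: the quality scores of the data behind each proxy-model run are regressed against that run's validation loss, and the learned weights then rank candidate documents. The synthetic data and names below are purely illustrative:

```python
# Illustrative Meta-rater-style combination: learn weights over quality
# metrics by regressing proxy-run validation loss on per-run average scores,
# then use the fitted model to rank and select candidate documents.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Each row: average quality scores (e.g. professionalism, readability,
# reasoning, cleanliness, plus existing raters) of the data used for one
# proxy-model run; target: that run's validation loss (synthetic here).
n_runs, n_metrics = 64, 25
run_scores = rng.uniform(0, 1, size=(n_runs, n_metrics))
val_loss = (3.0 - run_scores @ rng.uniform(0, 0.1, size=n_metrics)
            + rng.normal(0, 0.01, size=n_runs))

reg = LinearRegression().fit(run_scores, val_loss)

# Score candidate documents with the learned weights: lower predicted loss
# contribution is better; keep the top fraction.
doc_scores = rng.uniform(0, 1, size=(100_000, n_metrics))
predicted = reg.predict(doc_scores)
keep = np.argsort(predicted)[: int(0.3 * len(predicted))]
print(f"selected {len(keep)} documents out of {len(predicted)}")
```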
#pretraining #corpus