March 15, 2025
Communication-Efficient Language Model Training Scales Reliably and Robustly: Scaling Laws for DiLoCo
(Zachary Charles, Gabriel Teston, Lucio Dery, Keith Rush, Nova Fallen, Zachary Garrett, Arthur Szlam, Arthur Douillard)
As we scale to more massive machine learning models, the frequent synchronization demands inherent in data-parallel approaches create significant slowdowns, posing a critical challenge to further scaling. Recent work develops an approach (DiLoCo) that relaxes synchronization demands without compromising model quality. However, these works do not carefully analyze how DiLoCo's behavior changes with model size. In this work, we study the scaling law behavior of DiLoCo when training LLMs under a fixed compute budget. We focus on how algorithmic factors, including number of model replicas, hyperparameters, and token budget affect training in ways that can be accurately predicted via scaling laws. We find that DiLoCo scales both predictably and robustly with model size. When well-tuned, DiLoCo scales better than data-parallel training with model size, and can outperform data-parallel training even at small model sizes. Our results showcase a more general set of benefits of DiLoCo than previously documented, including increased optimal batch sizes, improved downstream generalization with scale, and improved evaluation loss for a fixed token budget.
Scaling laws for DiLoCo. As the training scale grows, the gap with data-parallel training shrinks, and there is even a regime where DiLoCo comes out ahead. A scaling law for local SGD was published previously as well (https://arxiv.org/abs/2409.13198).
DiLoCo is being developed quite intensively. In an era like this, when techniques are increasingly kept as trade secrets, the fact that it keeps being published openly makes it a bit suspicious(?).
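For reference, a minimal sketch of one DiLoCo round as I understand it (a single-process simplification for illustration, not the authors' implementation; names and values such as inner_steps and outer_lr are placeholders): each replica runs a number of local AdamW steps from the shared weights, the averaged difference between the shared weights and the local results is treated as an outer pseudo-gradient, and an outer Nesterov-momentum SGD step is applied to it.

# Hedged sketch of one DiLoCo round, simulated in a single process.
# Assumes all state entries are float parameters (no integer buffers).
import torch

def diloco_round(global_model, replicas, data_shards, loss_fn, outer_buf,
                 inner_steps=50, inner_lr=1e-3, outer_lr=0.7, outer_momentum=0.9):
    # 1) Inner phase: each replica trains locally from the shared global weights.
    for replica, shard in zip(replicas, data_shards):
        replica.load_state_dict(global_model.state_dict())
        opt = torch.optim.AdamW(replica.parameters(), lr=inner_lr)
        for _, (x, y) in zip(range(inner_steps), shard):
            opt.zero_grad()
            loss_fn(replica(x), y).backward()
            opt.step()

    # 2) Outer phase: average the pseudo-gradient (global - local) over replicas
    #    and apply a Nesterov-momentum SGD step to the global parameters.
    with torch.no_grad():
        for name, g_param in global_model.state_dict().items():
            delta = torch.stack([g_param - r.state_dict()[name] for r in replicas]).mean(0)
            buf = outer_buf[name]
            buf.mul_(outer_momentum).add_(delta)                                 # momentum buffer
            g_param.add_(delta.add(buf, alpha=outer_momentum), alpha=-outer_lr)  # Nesterov step

Communication happens only in the outer phase, once every inner_steps steps, which is where the reduced synchronization cost comes from; outer_buf is a dict of zero tensors shaped like the model's state dict, carried across rounds.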
#efficient-training #scaling-law
An Expanded Massive Multilingual Dataset for High-Performance Language Technologies
(Laurie Burchell, Ona de Gibert, Nikolay Arefyev, Mikko Aulamo, Marta Bañón, Pinzhen Chen, Mariia Fedorova, Liane Guillou, Barry Haddow, Jan Hajič, Jindřich Helcl, Erik Henriksson, Mateusz Klimaszewski, Ville Komulainen, Andrey Kutuzov, Joona Kytöniemi, Veronika Laippala, Petter Mæhlum, Bhavitvya Malik, Farrokh Mehryary, Vladislav Mikhailov, Nikita Moghe, Amanda Myntti, Dayyán O'Brien, Stephan Oepen, Proyag Pal, Jousia Piha, Sampo Pyysalo, Gema Ramírez-Sánchez, David Samuel, Pavel Stepachev, Jörg Tiedemann, Dušan Variš, Tereza Vojtěchová, Jaume Zaragoza-Bernabeu)
Training state-of-the-art large language models requires vast amounts of clean and diverse textual data. However, building suitable multilingual datasets remains a challenge. In this work, we present HPLT v2, a collection of high-quality multilingual monolingual and parallel corpora. The monolingual portion of the data contains 8T tokens covering 193 languages, while the parallel data contains 380M sentence pairs covering 51 languages. We document the entire data pipeline and release the code to reproduce it. We provide extensive analysis of the quality and characteristics of our data. Finally, we evaluate the performance of language models and machine translation systems trained on HPLT v2, demonstrating its value.
A corpus containing 8T tokens of multilingual monolingual data along with parallel data. It is interesting that they preprocessed not only Common Crawl but also Internet Archive data.
#corpus #multilingual
Autoregressive Image Generation with Randomized Parallel Decoding
(Haopeng Li, Jinyue Yang, Guoqi Li, Huan Wang)
We introduce ARPG, a novel visual autoregressive model that enables randomized parallel generation, addressing the inherent limitations of conventional raster-order approaches, which hinder inference efficiency and zero-shot generalization due to their sequential, predefined token generation order. Our key insight is that effective random-order modeling necessitates explicit guidance for determining the position of the next predicted token. To this end, we propose a novel guided decoding framework that decouples positional guidance from content representation, encoding them separately as queries and key-value pairs. By directly incorporating this guidance into the causal attention mechanism, our approach enables fully random-order training and generation, eliminating the need for bidirectional attention. Consequently, ARPG readily generalizes to zero-shot tasks such as image inpainting, outpainting, and resolution expansion. Furthermore, it supports parallel inference by concurrently processing multiple queries using a shared KV cache. On the ImageNet-1K 256 benchmark, our approach attains an FID of 1.94 with only 64 sampling steps, achieving over a 20-fold increase in throughput while reducing memory consumption by over 75% compared to representative recent autoregressive models at a similar scale.
Random-order autoregressive image generation. Whereas previous work added or concatenated position tokens (https://arxiv.org/abs/2411.00776, https://arxiv.org/abs/2412.01827), this work uses a cross-attention-style design with position tokens as queries and image tokens as keys/values.
Decoder architectures that use cross attention have appeared before (https://arxiv.org/abs/2405.05254). Interesting.
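Below is a minimal sketch of the decoupled guided-attention idea as I read it from the abstract (an illustration under my own assumptions about shapes and names, not the ARPG code): the query encodes only the position to be predicted next, the keys/values encode the already-generated content, so arbitrary target positions can be queried in parallel against a shared KV cache; the causal masking used during training is omitted here.

# Illustrative sketch: queries from target-position embeddings, keys/values from content tokens.
import torch

def guided_attention(target_pos_emb, content_states, w_q, w_k, w_v):
    # target_pos_emb: (B, T_q, D)  embeddings of the positions to predict next ("where")
    # content_states: (B, T_kv, D) states of already generated tokens ("what")
    q = target_pos_emb @ w_q
    k = content_states @ w_k
    v = content_states @ w_v
    attn = torch.softmax(q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5, dim=-1)
    return attn @ v  # (B, T_q, D) features used to predict the tokens at the queried positions

# Parallel decoding: several randomly chosen target positions share one KV cache.
B, D, T_kv, T_q = 2, 64, 16, 4
w_q, w_k, w_v = (torch.randn(D, D) * D ** -0.5 for _ in range(3))
kv_cache = torch.randn(B, T_kv, D)   # stands in for cached content states
next_pos = torch.randn(B, T_q, D)    # embeddings of 4 target positions chosen at random
print(guided_attention(next_pos, kv_cache, w_q, w_k, w_v).shape)  # torch.Size([2, 4, 64])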
#autoregressive-model #image-generation
Compute Optimal Scaling of Skills: Knowledge vs Reasoning
(Nicholas Roberts, Niladri Chatterji, Sharan Narang, Mike Lewis, Dieuwke Hupkes)
Scaling laws are a critical component of the LLM development pipeline, most famously as a way to forecast training decisions such as 'compute-optimally' trading-off parameter count and dataset size, alongside a more recent growing list of other crucial decisions. In this work, we ask whether compute-optimal scaling behaviour can be skill-dependent. In particular, we examine knowledge and reasoning-based skills such as knowledge-based QA and code generation, and we answer this question in the affirmative: scaling laws are skill-dependent. Next, to understand whether skill-dependent scaling is an artefact of the pretraining datamix, we conduct an extensive ablation of different datamixes and find that, also when correcting for datamix differences, knowledge and code exhibit fundamental differences in scaling behaviour. We conclude with an analysis of how our findings relate to standard compute-optimal scaling using a validation set, and find that a misspecified validation set can impact compute-optimal parameter count by nearly 50%, depending on its skill composition.
This study shows that the compute-optimal allocation between parameters and data shifts depending on whether knowledge-related or code-related data is used as the validation set when fitting scaling laws. A knowledge-based validation set allocates more compute to parameters, while a code-based one allocates more to data. Intriguing results.
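For intuition, a small toy computation (the exponents below are invented for illustration, not the paper's fits): under a Chinchilla-style form L(N, D) = E + A/N^alpha + B/D^beta with compute C proportional to N*D, the compute-optimal point is N* ~ C^(beta/(alpha+beta)) and D* ~ C^(alpha/(alpha+beta)), so skill-dependent exponents directly shift the parameter/data allocation.

# Toy illustration: skill-dependent scaling-law exponents change the compute-optimal split.
# The exponents are hypothetical values for illustration, NOT the paper's fitted numbers.
def compute_optimal_exponents(alpha, beta):
    # Minimizing A/N**alpha + B/D**beta subject to C ~ N*D gives, in closed form,
    # N* ~ C**a and D* ~ C**b with:
    a = beta / (alpha + beta)   # exponent for parameter count
    b = alpha / (alpha + beta)  # exponent for token count
    return a, b

for name, (alpha, beta) in {
    "knowledge-like fit (hypothetical)": (0.30, 0.36),
    "code-like fit (hypothetical)":      (0.36, 0.30),
}.items():
    a, b = compute_optimal_exponents(alpha, beta)
    print(f"{name}: N* ~ C^{a:.2f}, D* ~ C^{b:.2f}")
# A relatively larger data exponent beta pushes the optimum toward parameters and vice versa,
# which is how the skill composition of the validation set can move the compute-optimal point.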
#scaling-law
Transformers without Normalization
(Jiachen Zhu, Xinlei Chen, Kaiming He, Yann LeCun, Zhuang Liu)
Normalization layers are ubiquitous in modern neural networks and have long been considered essential. This work demonstrates that Transformers without normalization can achieve the same or better performance using a remarkably simple technique. We introduce Dynamic Tanh (DyT), an element-wise operation DyT(x) = tanh(αx), as a drop-in replacement for normalization layers in Transformers. DyT is inspired by the observation that layer normalization in Transformers often produces tanh-like, S-shaped input-output mappings. By incorporating DyT, Transformers without normalization can match or exceed the performance of their normalized counterparts, mostly without hyperparameter tuning. We validate the effectiveness of Transformers with DyT across diverse settings, ranging from recognition to generation, supervised to self-supervised learning, and computer vision to language models. These findings challenge the conventional understanding that normalization layers are indispensable in modern neural networks, and offer new insights into their role in deep networks.
A norm-free network that replaces layer norm with tanh. This may suggest that simply bounding the maximum activation values is sufficient for training.
What is particularly interesting about norm-free networks is that their results are almost identical to those of networks with normalization. Considering cases like post-norm, where the normalization is placed seems to matter more than whether normalization is used at all.
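A minimal sketch of DyT as described in the abstract, used as a drop-in replacement for LayerNorm; the learnable per-channel affine (gamma, beta) mirrors LayerNorm's affine and is my assumption here, as is the alpha initialization.

# Sketch of Dynamic Tanh (DyT): element-wise tanh(alpha * x), no mean/variance statistics.
import torch
import torch.nn as nn

class DyT(nn.Module):
    def __init__(self, dim: int, alpha_init: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(alpha_init))   # learnable scalar
        self.gamma = nn.Parameter(torch.ones(dim))             # per-channel scale (assumed affine)
        self.beta = nn.Parameter(torch.zeros(dim))             # per-channel shift (assumed affine)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.gamma * torch.tanh(self.alpha * x) + self.beta

# Usage: swap nn.LayerNorm(dim) for DyT(dim) inside a Transformer block.
x = torch.randn(2, 16, 512)
print(DyT(512)(x).shape)  # torch.Size([2, 16, 512])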
#normalization
Structured Preconditioners in Adaptive Optimization: A Unified Analysis
(Shuo Xie, Tianhao Wang, Sashank Reddi, Sanjiv Kumar, Zhiyuan Li)
We present a novel unified analysis for a broad class of adaptive optimization algorithms with structured (e.g., layerwise, diagonal, and kronecker-factored) preconditioners for both online regret minimization and offline convex optimization. Our analysis not only provides matching rate to several important structured preconditioned algorithms including diagonal AdaGrad, full-matrix AdaGrad, and AdaGrad-Norm, but also gives an improved convergence rate for a one-sided variant of Shampoo over that of original Shampoo. Interestingly, more structured preconditioners (e.g., diagonal Adagrad, AdaGrad-Norm which use less space and compute) are often presented as computationally efficient approximations to full-matrix Adagrad, aiming for improved optimization performance through better approximations. Our unified analysis challenges this prevailing view and reveals, perhaps surprisingly, that more structured preconditioners, despite using less space and computation per step, can outperform their less structured counterparts. To demonstrate this, we show that one-sided Shampoo, which is relatively much cheaper than full-matrix AdaGrad could outperform it both theoretically and experimentally.
An analysis showing that preconditioners with more structure imposed, such as one-sided Shampoo, can actually achieve better convergence rates than the original, less structured methods like full-matrix AdaGrad.
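A rough sketch of what a one-sided Shampoo update looks like for a matrix parameter, to make the "more structure, less compute" point concrete (my simplification; the exponent, damping, and update frequency are not the paper's exact algorithm): only a single m x m Kronecker factor is accumulated and inverted, instead of two factors as in Shampoo or an mn x mn matrix as in full-matrix AdaGrad.

# Illustrative one-sided Shampoo step for a matrix parameter W of shape (m, n).
import torch

def one_sided_shampoo_step(W, grad, L_acc, lr=0.1, eps=1e-6):
    # W: (m, n) parameter, grad: (m, n) gradient,
    # L_acc: (m, m) running sum of grad @ grad.T, the single preconditioner statistic.
    L_acc += grad @ grad.T
    # Inverse matrix square root via eigendecomposition (L_acc is symmetric PSD).
    evals, evecs = torch.linalg.eigh(L_acc + eps * torch.eye(L_acc.shape[0]))
    L_inv_sqrt = evecs @ torch.diag(evals.rsqrt()) @ evecs.T
    W -= lr * (L_inv_sqrt @ grad)   # precondition on the left side only
    return W, L_acc

# Toy usage on the quadratic ||W||_F^2: a 4x4 statistic preconditions a 4x3 parameter.
torch.manual_seed(0)
W, L_acc = torch.randn(4, 3), torch.zeros(4, 4)
for _ in range(100):
    W, L_acc = one_sided_shampoo_step(W, 2.0 * W, L_acc)
print(W.norm())  # noticeably smaller than the initial norm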
#optimizer