September 25, 2024
Rethinking Conventional Wisdom in Machine Learning: From Generalization to Scaling
(Lechao Xiao)
The remarkable success of large language pretraining and the discovery of scaling laws signify a paradigm shift in machine learning. Notably, the primary objective has evolved from minimizing generalization error to reducing approximation error, and the most effective strategy has transitioned from regularization (in a broad sense) to scaling up models. This raises a critical question: Do the established principles that proved successful in the generalization-centric era remain valid in this new era of scaling? This paper examines several influential regularization-based principles that may no longer hold true in the scaling-centric, large language model (LLM) era. These principles include explicit L2 regularization and implicit regularization through small batch sizes and large learning rates. Additionally, we identify a new phenomenon termed "scaling law crossover," where two scaling curves intersect at a certain scale, implying that methods effective at smaller scales may not generalize to larger ones. Together, these observations highlight two fundamental questions within this new paradigm:
• Guiding Principles for Scaling: If regularization is no longer the primary guiding principle for model design, what new principles are emerging to guide scaling?
• Model Comparison at Scale: How to reliably and effectively compare models at the scale where only a single experiment is feasible?
The argument: in the old era, data was scarce and models were large, so training-set performance was not the problem; the gap between training and test performance was (the generalization gap). Now that data has grown to the point where the train-test gap is no longer the issue and training-set performance itself is the bottleneck, i.e., the era of scaling, the intuitions that used to hold may no longer apply. Regularization is no longer necessary and has in fact become harmful.
So, they argue, the idea that a large learning rate or a small batch size is beneficial no longer holds.
They then go over cases where, depending on how the learning rate or weight decay is scaled, methods that caused no problems at small scale run into trouble at large scale. Because of this, they note that comparing models and hyperparameters at large scale becomes difficult.
Lots to think about.
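As a rough illustration of the "scaling law crossover" idea (not from the paper; the coefficients and exponents below are made up), two training recipes can each follow a clean power law and still swap rankings at a compute scale that small-scale experiments never reach:

```python
import numpy as np

# Hypothetical scaling laws L(C) = a * C^(-alpha) for two training recipes.
# Recipe A starts lower but decays more slowly; recipe B starts higher but has
# a steeper exponent. All numbers here are invented for illustration.
a_A, alpha_A = 2.0, 0.05
a_B, alpha_B = 3.0, 0.08

def loss(a, alpha, C):
    return a * C ** (-alpha)

# The curves cross where a_A * C^-alpha_A == a_B * C^-alpha_B.
C_cross = (a_B / a_A) ** (1.0 / (alpha_B - alpha_A))
print(f"crossover at C ~ {C_cross:.2e}")  # roughly 7e5 with these made-up numbers

for C in [1e3, 1e5, 1e7, 1e9]:
    better = "A" if loss(a_A, alpha_A, C) < loss(a_B, alpha_B, C) else "B"
    print(f"C={C:.0e}: A={loss(a_A, alpha_A, C):.3f}  B={loss(a_B, alpha_B, C):.3f}  better={better}")
```

Below the crossover every experiment says A wins; above it B wins, which is exactly why conclusions drawn from single small-scale runs can fail to transfer.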
#scaling-law #hyperparameter
Domino: Eliminating Communication in LLM Training via Generic Tensor Slicing and Overlapping
(Guanhua Wang, Chengming Zhang, Zheyu Shen, Ang Li, Olatunji Ruwase)
Given the popularity of generative AI, Large Language Models (LLMs) often consume hundreds or thousands of GPUs for parallelizing and accelerating the training process. Communication overhead becomes more pronounced when training LLMs at scale. To eliminate communication overhead in distributed LLM training, we propose Domino, which provides a generic scheme to hide communication behind computation. By breaking data dependency of a single batch training into smaller independent pieces, Domino pipelines these independent pieces training and provides generic strategy of fine-grained communication and computation overlapping. Extensive results show that, comparing with Megatron-LM, Domino achieves up to 1.3x speedup for LLM training on Nvidia DGX-H100 GPUs.
An attempt to maximize the overlap between communication and computation in tensor parallelism. If we write a Transformer layer as XAB, the scheme row-shards X and column-shards B, while A is not sharded at all. Not sharding A seems a bit unfortunate, though.
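A minimal numpy sketch of the blocking pattern described above (my reading of the summary, not Domino's actual implementation): with X split into row chunks and B into column chunks while A stays whole, every (i, j) output block is independent, which is what makes it possible to overlap one block's communication with another block's computation:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, h = 8, 16, 32                     # tokens, model dim, hidden dim
X = rng.normal(size=(n, d))             # activations
A = rng.normal(size=(d, h))             # first weight, kept whole (not sharded)
B = rng.normal(size=(h, d))             # second weight, split column-wise

X_rows = np.split(X, 2, axis=0)         # row-shard the input
B_cols = np.split(B, 2, axis=1)         # column-shard the second weight

# Each (i, j) block depends only on X_rows[i] and B_cols[j], so blocks can be
# computed, and their results communicated, in any order / overlapped.
blocks = [[X_rows[i] @ A @ B_cols[j] for j in range(2)] for i in range(2)]
out = np.block(blocks)

assert np.allclose(out, X @ A @ B)      # identical to the unsplit computation
```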
#parallelism #efficient-training
MonoFormer: One Transformer for Both Diffusion and Autoregression
(Chuyang Zhao, Yuxing Song, Wenhao Wang, Haocheng Feng, Errui Ding, Yifan Sun, Xinyan Xiao, Jingdong Wang)
Most existing multimodality methods use separate backbones for autoregression-based discrete text generation and diffusion-based continuous visual generation, or the same backbone by discretizing the visual data to use autoregression for both text and visual generation. In this paper, we propose to study a simple idea: share one transformer for both autoregression and diffusion. The feasibility comes from two main aspects: (i) Transformer is successfully applied to diffusion for visual generation, and (ii) transformer training for autoregression and diffusion is very similar, and the difference merely lies in that diffusion uses bidirectional attention mask and autoregression uses causal attention mask. Experimental results show that our approach achieves comparable image generation performance to current state-of-the-art methods as well as maintains the text generation capability. The project is publicly available at https://monoformer.github.io/.
Autoregressive text generation combined with diffusion-based image generation, similar to Transfusion (https://www.arxiv.org/abs/2408.11039).
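A small sketch of the attention-mask difference the abstract highlights (my own illustration, not MonoFormer's code): in a mixed sequence, text positions attend causally while positions inside an image span attend bidirectionally within that span:

```python
import numpy as np

# 0 = text token (causal), 1 = image token (bidirectional within its span).
# Toy sequence: 3 text tokens, a 4-token image span, then 2 more text tokens.
kinds = np.array([0, 0, 0, 1, 1, 1, 1, 0, 0])
T = len(kinds)

mask = np.tril(np.ones((T, T), dtype=bool))   # start from a causal mask
img = np.where(kinds == 1)[0]
mask[np.ix_(img, img)] = True                 # image tokens see the whole image span

print(mask.astype(int))                       # True = attention allowed
```

The same transformer weights are used either way; only this mask changes between the autoregressive (text) and diffusion (image) parts of the sequence.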
#diffusion #autoregressive-model #text-to-image
MaskBit: Embedding-free Image Generation via Bit Tokens
(Mark Weber, Lijun Yu, Qihang Yu, Xueqing Deng, Xiaohui Shen, Daniel Cremers, Liang-Chieh Chen)
Masked transformer models for class-conditional image generation have become a compelling alternative to diffusion models. Typically comprising two stages - an initial VQGAN model for transitioning between latent space and image space, and a subsequent Transformer model for image generation within latent space - these frameworks offer promising avenues for image synthesis. In this study, we present two primary contributions: Firstly, an empirical and systematic examination of VQGANs, leading to a modernized VQGAN. Secondly, a novel embedding-free generation network operating directly on bit tokens - a binary quantized representation of tokens with rich semantics. The first contribution furnishes a transparent, reproducible, and high-performing VQGAN model, enhancing accessibility and matching the performance of current state-of-the-art methods while revealing previously undisclosed details. The second contribution demonstrates that embedding-free image generation using bit tokens achieves a new state-of-the-art FID of 1.52 on the ImageNet 256x256 benchmark, with a compact generator model of mere 305M parameters.
They improve VQGAN and apply Lookup-Free Quantization (https://arxiv.org/abs/2310.05737). Since LFQ produces a K-dimensional binary code, instead of mapping it through an embedding they use the code itself directly as the feature. They found that this code has structure (i.e., similar codes produce similar outputs), so when generating images with masked image generation they also feed the bit tokens in without an embedding. (The output still seems to be a classification over 2^K classes, though.)
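A minimal sketch of the bit-token idea as I understand it (sign-based binary quantization; not the authors' code): a K-dimensional latent is quantized to ±1 bits, that vector is used directly as the input feature, and the token index for the 2^K-way output classification is just the integer those bits encode:

```python
import numpy as np

K = 12                                           # bits per token -> 2^K possible tokens
rng = np.random.default_rng(0)
z = rng.normal(size=(4, K))                      # continuous encoder outputs for 4 tokens

bits = np.where(z > 0, 1.0, -1.0)                # lookup-free quantization: just the sign
feature = bits                                   # used directly as the feature, no embedding table

# Token index for the 2^K-way classification head: read the bits as a binary number.
token_id = ((bits > 0).astype(int) * (2 ** np.arange(K))).sum(axis=1)
print(token_id)                                  # integers in [0, 2^K)
```

Because nearby codes differ in only a few bits, the feature space inherits the structure the authors observe, which is what makes skipping the embedding reasonable.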
#vq #image-generation #mlm
Looped Transformers for Length Generalization
(Ying Fan, Yilun Du, Kannan Ramchandran, Kangwook Lee)
Recent work has shown that Transformers trained from scratch can successfully solve various arithmetic and algorithmic tasks, such as adding numbers and computing parity. While these Transformers generalize well on unseen inputs of the same length, they struggle with length generalization, i.e., handling inputs of unseen lengths. In this work, we demonstrate that looped Transformers with an adaptive number of steps significantly improve length generalization. We focus on tasks with a known iterative solution, involving multiple iterations of a RASP-L operation - a length-generalizable operation that can be expressed by a finite-sized Transformer. We train looped Transformers using our proposed learning algorithm and observe that they learn highly length-generalizable solutions for various tasks.
A model that, like a Universal Transformer, reuses the same layer repeatedly is trained to execute programs in a restricted form called RASP-L. What enables length generalization here is that the layer is applied as many times as the number of steps needed to execute the program.
In practice there probably aren't many problems where the required number of iterations is known, but... it can be seen as one more example of how injecting a particular structure, as in the Universal Transformer, leads to a model that generalizes.
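A minimal PyTorch sketch of that looping pattern (my own toy, not the paper's architecture or training algorithm): one shared Transformer block applied a number of times that grows with the input length:

```python
import torch
import torch.nn as nn

class LoopedTransformer(nn.Module):
    """One shared block applied `steps` times, instead of stacking distinct layers."""

    def __init__(self, d_model=64, nhead=4):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.readout = nn.Linear(d_model, d_model)

    def forward(self, x):
        # Adaptive depth: for length-n tasks with an iterative RASP-L solution, the
        # number of loop steps scales with n (here simply n itself, as a toy choice).
        steps = x.shape[1]
        for _ in range(steps):
            x = self.block(x)
        return self.readout(x)

x = torch.randn(2, 10, 64)      # a longer sequence at test time just means more loop steps
y = LoopedTransformer()(x)
print(y.shape)                  # torch.Size([2, 10, 64])
```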
#transformer
Reward-Robust RLHF in LLMs
(Yuzi Yan, Xingzhou Lou, Jialian Li, Yiping Zhang, Jian Xie, Chao Yu, Yu Wang, Dong Yan, Yuan Shen)
As Large Language Models (LLMs) continue to progress toward more advanced forms of intelligence, Reinforcement Learning from Human Feedback (RLHF) is increasingly seen as a key pathway toward achieving Artificial General Intelligence (AGI). However, the reliance on reward-model-based (RM-based) alignment methods introduces significant challenges due to the inherent instability and imperfections of Reward Models (RMs), which can lead to critical issues such as reward hacking and misalignment with human intentions. In this paper, we introduce a reward-robust RLHF framework aimed at addressing these fundamental challenges, paving the way for more reliable and resilient learning in LLMs. Our approach introduces a novel optimization objective that carefully balances performance and robustness by incorporating Bayesian Reward Model Ensembles (BRME) to model the uncertainty set of reward functions. This allows the framework to integrate both nominal performance and minimum reward signals, ensuring more stable learning even with imperfect reward models. Empirical results demonstrate that our framework consistently outperforms traditional RLHF across diverse benchmarks, showing improved accuracy and long-term stability. We also provide a theoretical analysis, demonstrating that reward-robust RLHF approaches the stability of constant reward settings, which proves to be effective in a stochastic-case analysis. Together, these contributions highlight the framework potential to enhance both the performance and stability of LLM alignment with RLHF.
To make the reward model robust to noise, the idea is to first train a reward model with a single head, then, keeping the base model fixed, add extra heads trained by regression against the original head's output, which yields an ensemble-like effect.
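A rough PyTorch sketch of that recipe as I read it (names and details are mine, not the paper's): a frozen base encoder with the original reward head plus extra heads regressed onto the original head's detached output; at RLHF time the ensemble can supply both a mean ("nominal") and a minimum ("pessimistic") reward signal:

```python
import torch
import torch.nn as nn

class EnsembleRewardModel(nn.Module):
    def __init__(self, base: nn.Module, hidden: int, n_extra_heads: int = 3):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # base stays fixed while extra heads are trained
            p.requires_grad_(False)
        self.main_head = nn.Linear(hidden, 1)  # trained first, as a standard reward model
        self.extra_heads = nn.ModuleList([nn.Linear(hidden, 1) for _ in range(n_extra_heads)])

    def forward(self, x):
        h = self.base(x)
        rewards = [self.main_head(h)] + [head(h) for head in self.extra_heads]
        return torch.cat(rewards, dim=-1)      # (batch, 1 + n_extra_heads)

    def head_regression_loss(self, x):
        """Train the extra heads to regress the (frozen) main head's reward."""
        h = self.base(x)
        target = self.main_head(h).detach()
        preds = torch.cat([head(h) for head in self.extra_heads], dim=-1)
        return ((preds - target.expand_as(preds)) ** 2).mean()

# Toy usage with a stand-in encoder; during RLHF one could use rewards.mean(-1)
# as the nominal signal and rewards.min(-1).values as the robust (worst-case) one.
model = EnsembleRewardModel(base=nn.Linear(16, 32), hidden=32)
rewards = model(torch.randn(4, 16))
print(rewards.shape, rewards.min(dim=-1).values.shape)
```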
#reward-model