May 24, 2024
Scalable Optimization in the Modular Norm
(Tim Large, Yang Liu, Minyoung Huh, Hyojin Bahng, Phillip Isola, Jeremy Bernstein)
To improve performance in contemporary deep learning, one is interested in scaling up the neural network in terms of both the number and the size of the layers. When ramping up the width of a single layer, graceful scaling of training has been linked to the need to normalize the weights and their updates in the "natural norm" particular to that layer. In this paper, we significantly generalize this idea by defining the modular norm, which is the natural norm on the full weight space of any neural network architecture. The modular norm is defined recursively in tandem with the network architecture itself. We show that the modular norm has several promising applications. On the practical side, the modular norm can be used to normalize the updates of any base optimizer so that the learning rate becomes transferable across width and depth. This means that the user does not need to compute optimizer-specific scale factors in order to scale training. On the theoretical side, we show that for any neural network built from "well-behaved" atomic modules, the gradient of the network is Lipschitz-continuous in the modular norm, with the Lipschitz constant admitting a simple recursive formula. This characterization opens the door to porting standard ideas in optimization theory over to deep learning. We have created a Python package called Modula that automatically normalizes weight updates in the modular norm of the architecture. The package is available via "pip install modula" with source code at https://github.com/jxbz/modula.
They construct a norm in which the loss is Lipschitz continuous with respect to the weights, and in particular one that is easy to compute for a neural network assembled from diverse modules. On the practical side, normalizing weight updates in this norm makes it possible to transfer hyperparameters across models of different depths and widths.
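As a rough illustration of what "normalizing updates in a layer's natural norm" means, here is a minimal PyTorch sketch, not the Modula package API: each 2-D weight update is rescaled by an RMS-to-RMS operator-norm proxy so its size is independent of fan-in and fan-out. The choice of proxy and the function names are assumptions for illustration; the actual modular norm is defined recursively over the whole architecture.

```python
# A minimal sketch of width-aware update normalization, not the Modula API.
# Assumption: a linear layer's "natural" norm is approximated here by its
# spectral norm rescaled to an RMS-to-RMS operator norm; the real modular
# norm is defined recursively over the whole architecture.
import torch

def normalize_update(update: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Rescale a 2-D weight update so its assumed natural norm is 1."""
    d_out, d_in = update.shape
    spectral = torch.linalg.matrix_norm(update, ord=2)   # largest singular value
    natural = spectral * (d_in / d_out) ** 0.5           # RMS -> RMS operator norm proxy
    return update / (natural + eps)

def normalized_sgd_step(params, lr: float = 0.1) -> None:
    """SGD with per-layer normalized updates (illustrative only)."""
    with torch.no_grad():
        for p in params:
            if p.grad is None:
                continue
            g = normalize_update(p.grad) if p.grad.ndim == 2 else p.grad
            p.add_(g, alpha=-lr)
```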
#optimization #neural-network
Base of RoPE Bounds Context Length
(Xin Men, Mingyu Xu, Bingning Wang, Qingyu Zhang, Hongyu Lin, Xianpei Han, Weipeng Chen)
Position embedding is a core component of current Large Language Models (LLMs). Rotary position embedding (RoPE), a technique that encodes the position information with a rotation matrix, has been the de facto choice for position embedding in many LLMs, such as the Llama series. RoPE has been further utilized to extend long context capability, which is roughly based on adjusting the base parameter of RoPE to mitigate out-of-distribution (OOD) problems in position embedding. However, in this paper, we find that LLMs may obtain a superficial long-context ability based on the OOD theory. We revisit the role of RoPE in LLMs and propose a novel property of long-term decay; we derive that the base of RoPE bounds context length: there is an absolute lower bound for the base value to obtain certain context length capability. Our work reveals the relationship between context length and RoPE base both theoretically and empirically, which may shed light on future long context training.
A systematic study of how the base parameter of RoPE should be set for a given context length. Under the requirement that long context actually be exploited, more concretely that the model should focus on nearby tokens and be able to distinguish tokens similar to the query, they derive a lower bound on the base needed to genuinely make use of a given context length. This is something that has been set empirically until now, so it feels like an interesting step forward.
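As a numerical illustration of the direction (a simplified stand-in, not the paper's exact bound), the sketch below computes the standard RoPE frequencies theta_i = base^(-2i/d) and finds the first relative distance at which the mean of cos(m * theta_i), a crude proxy for the long-term-decay condition, turns negative. The proxy criterion, head dimension, and the bases printed at the end are assumptions for illustration.

```python
# Illustration only: a simplified proxy for "the base of RoPE bounds context
# length", not the paper's exact derivation. We look for the first relative
# distance m at which the mean of cos(m * theta_i) over rotary dimensions
# turns negative (an assumed stand-in for the long-term-decay condition).
import numpy as np

def rope_freqs(base: float, head_dim: int) -> np.ndarray:
    i = np.arange(head_dim // 2)
    return base ** (-2.0 * i / head_dim)            # theta_i = base^(-2i/d)

def first_negative_distance(base: float, head_dim: int = 128, max_len: int = 65536) -> int:
    theta = rope_freqs(base, head_dim)
    for m in range(1, max_len + 1):
        if np.cos(m * theta).mean() < 0:            # proxy decay condition violated
            return m
    return max_len                                   # condition holds up to max_len

for base in (1e4, 5e5, 1e7):
    print(f"base={base:.0e}  usable length (proxy) ~ {first_negative_distance(base)}")
```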
#positional-encoding
Computational Tradeoffs in Image Synthesis: Diffusion, Masked-Token, and Next-Token Prediction
(Maciej Kilian, Varun Jampani, Luke Zettlemoyer)
Nearly every recent image synthesis approach, including diffusion, masked-token prediction, and next-token prediction, uses a Transformer network architecture. Despite this common backbone, there has been no direct, compute controlled comparison of how these approaches affect performance and efficiency. We analyze the scalability of each approach through the lens of compute budget measured in FLOPs. We find that token prediction methods, led by next-token prediction, significantly outperform diffusion on prompt following. On image quality, while next-token prediction initially performs better, scaling trends suggest it is eventually matched by diffusion. We compare the inference compute efficiency of each approach and find that next token prediction is by far the most efficient. Based on our findings we recommend diffusion for applications targeting image quality and low latency; and next-token prediction when prompt following or throughput is more important.
A scaling-law comparison of next-token prediction vs. masked-token prediction vs. diffusion for text-to-image generation. Interestingly, next-token prediction comes out on top, although the scaling curve for diffusion does seem to have a steeper slope.
The use of LFQ (https://arxiv.org/abs/2310.05737) is also interesting.
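For reference, LFQ (lookup-free quantization) replaces a learned VQ codebook with per-channel binarization: each latent channel is mapped to plus or minus one, and the token id is the integer encoded by those bits. A minimal sketch of just the quantization step (training-time losses such as the entropy and commitment terms are omitted):

```python
# Minimal sketch of LFQ's quantization step (training losses omitted).
import torch

def lfq_quantize(z: torch.Tensor):
    """Lookup-free quantization of latents z with shape (..., num_bits).

    Each channel is binarized to {-1, +1}; the token id is the integer whose
    binary digits are those signs. A straight-through estimator keeps the
    operation differentiable during training.
    """
    q = torch.where(z > 0, torch.ones_like(z), -torch.ones_like(z))
    q = z + (q - z).detach()                        # straight-through gradient
    bits = (q > 0).long()
    weights = 2 ** torch.arange(z.shape[-1], device=z.device)
    token_ids = (bits * weights).sum(dim=-1)        # implicit codebook of size 2^num_bits
    return q, token_ids

z = torch.randn(2, 16, 10)                          # e.g. 2 images, 16 latents, 10 bits -> 1024-way vocab
codes, ids = lfq_quantize(z)
```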
#autoregressive-model #diffusion #scaling-law
DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data
(Huajian Xin, Daya Guo, Zhihong Shao, Zhizhou Ren, Qihao Zhu, Bo Liu, Chong Ruan, Wenda Li, Xiaodan Liang)
Proof assistants like Lean have revolutionized mathematical proof verification, ensuring high accuracy and reliability. Although large language models (LLMs) show promise in mathematical reasoning, their advancement in formal theorem proving is hindered by a lack of training data. To address this issue, we introduce an approach to generate extensive Lean 4 proof data derived from high-school and undergraduate-level mathematical competition problems. This approach involves translating natural language problems into formal statements, filtering out low-quality statements, and generating proofs to create synthetic data. After fine-tuning the DeepSeekMath 7B model on this synthetic dataset, which comprises 8 million formal statements with proofs, our model achieved whole-proof generation accuracies of 46.3% with 64 samples and 52% cumulatively on the Lean 4 miniF2F test, surpassing the baseline GPT-4 at 23.0% with 64 samples and a tree search reinforcement learning method at 41.0%. Additionally, our model successfully proved 5 out of 148 problems in the Lean 4 Formalized International Mathematical Olympiad (FIMO) benchmark, while GPT-4 failed to prove any. These results demonstrate the potential of leveraging large-scale synthetic data to enhance theorem-proving capabilities in LLMs. Both the synthetic dataset and the model will be made available to facilitate further research in this promising field.
A concrete example has appeared of something that had only been talked about: combining an LLM with a proof assistant like Lean for proof search, data generation, and training. The method converts web data into formal statements, searches for proofs, and trains on the results.
The experiments use a 7B model, which makes me wonder how far a stronger model could reach. DeepSeek's work is always impressive.
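The data generation loop described in the abstract can be sketched roughly as follows. All helper callables here (autoformalize, quality_filter, sample_proofs, lean_verify, finetune) are hypothetical placeholders standing in for the paper's components, not DeepSeek-Prover's actual code.

```python
# Sketch of the iterative synthetic-data loop described in the abstract.
# Every helper passed in is a hypothetical placeholder, not the actual code.
from typing import Callable, Iterable

def build_prover_data(
    model,
    nl_problems: Iterable[str],
    autoformalize: Callable,      # (model, NL problem) -> Lean 4 statement (placeholder)
    quality_filter: Callable,     # statement -> bool (placeholder)
    sample_proofs: Callable,      # (model, statement, n) -> candidate proofs (placeholder)
    lean_verify: Callable,        # (statement, proof) -> bool, runs the Lean checker (placeholder)
    finetune: Callable,           # (model, pairs) -> updated model (placeholder)
    num_rounds: int = 3,
    samples_per_statement: int = 64,
):
    """Iterative autoformalize -> filter -> prove -> verify -> fine-tune loop."""
    dataset = []
    for _ in range(num_rounds):
        statements = [autoformalize(model, p) for p in nl_problems]   # formalize problems
        statements = [s for s in statements if quality_filter(s)]     # drop low-quality statements
        for s in statements:
            for proof in sample_proofs(model, s, samples_per_statement):
                if lean_verify(s, proof):                             # keep only verified proofs
                    dataset.append((s, proof))
                    break
        model = finetune(model, dataset)                              # train on verified pairs and iterate
    return model, dataset
```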
#synthetic-data #search
Dataset Decomposition: Faster LLM Training with Variable Sequence Length Curriculum
(Hadi Pouransari, Chun-Liang Li, Jen-Hao Rick Chang, Pavan Kumar Anasosalu Vasu, Cem Koc, Vaishaal Shankar, Oncel Tuzel)
Large language models (LLMs) are commonly trained on datasets consisting of fixed-length token sequences. These datasets are created by randomly concatenating documents of various lengths and then chunking them into sequences of a predetermined target length. However, this method of concatenation can lead to cross-document attention within a sequence, which is neither a desirable learning signal nor computationally efficient. Additionally, training on long sequences becomes computationally prohibitive due to the quadratic cost of attention. In this study, we introduce dataset decomposition, a novel variable sequence length training technique, to tackle these challenges. We decompose a dataset into a union of buckets, each containing sequences of the same size extracted from a unique document. During training, we use variable sequence length and batch size, sampling simultaneously from all buckets with a curriculum. In contrast to the concat-and-chunk baseline, which incurs a fixed attention cost at every step of training, our proposed method incurs a penalty proportional to the actual document lengths at each step, resulting in significant savings in training time. We train an 8k context-length 1B model at the same cost as a 2k context-length model trained with the baseline approach. Experiments on a web-scale corpus demonstrate that our approach significantly enhances performance on standard language evaluations and long-context benchmarks, reaching target accuracy 3x faster compared to the baseline. Our method not only enables efficient pretraining on long sequences but also scales effectively with dataset size. Lastly, we shed light on a critical yet less studied aspect of training large language models: the distribution and curriculum of sequence lengths, which results in a non-negligible difference in performance.
The frequently recurring problem of batch construction in pretraining. The approach cuts sequences into set lengths and groups sequences of similar length into buckets for training, securing training efficiency while preventing contamination from cross-document attention. They also try a curriculum built on the buckets.
If the focus is on blocking cross-document attention, though, I suspect the eventual answer is to work on applying attention masks efficiently.
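A minimal sketch of how I read the bucketing scheme: each document is split into power-of-two-length chunks that never cross document boundaries, chunks are grouped into per-length buckets, and each training step samples one bucket with the batch size chosen so the token count per step stays constant. The exact decomposition rule, the handling of short tails, and the curriculum over buckets are assumptions here.

```python
# Sketch of dataset decomposition as described in the abstract; details are assumed.
import random
from collections import defaultdict

def decompose_document(tokens, max_len=8192, min_len=256):
    """Split one document into power-of-two chunks (never crossing documents)."""
    chunks, i = [], 0
    while len(tokens) - i >= min_len:
        size = max_len
        while size > len(tokens) - i:                # largest power of two that still fits
            size //= 2
        chunks.append(tokens[i:i + size])
        i += size
    return chunks                                    # a short tail (< min_len) is dropped here

def build_buckets(documents, **kw):
    buckets = defaultdict(list)
    for doc in documents:
        for chunk in decompose_document(doc, **kw):
            buckets[len(chunk)].append(chunk)        # one bucket per sequence length
    return buckets

def sample_batch(buckets, tokens_per_batch=262_144):
    """Variable sequence length: longer sequences get a proportionally smaller batch."""
    seq_len = random.choice(list(buckets))           # a curriculum would bias this choice
    batch_size = tokens_per_batch // seq_len
    return random.choices(buckets[seq_len], k=batch_size)
```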
#efficient-training