November 27, 2024
Low-Bit Quantization Favors Undertrained LLMs: Scaling Laws for Quantized LLMs with 100T Training Tokens
(Xu Ouyang, Tao Ge, Thomas Hartvigsen, Zhisong Zhang, Haitao Mi, Dong Yu)
We reveal that low-bit quantization favors undertrained large language models (LLMs) by observing that models with larger sizes or fewer training tokens experience less quantization-induced degradation (QiD) when applying low-bit quantization, whereas smaller models with extensive training tokens suffer significant QiD. To gain deeper insights into this trend, we study over 1500 quantized LLM checkpoints of various sizes and at different training levels (undertrained or fully trained) in a controlled setting, deriving scaling laws for understanding the relationship between QiD and factors such as the number of training tokens, model size and bit width. With the derived scaling laws, we propose a novel perspective that we can use QiD to measure an LLM's training levels and determine the number of training tokens required for fully training LLMs of various sizes. Moreover, we use the scaling laws to predict the quantization performance of different-sized LLMs trained with 100 trillion tokens. Our projection shows that the low-bit quantization performance of future models, which are expected to be trained with over 100 trillion tokens, may NOT be desirable. This poses a potential challenge for low-bit quantization in the future and highlights the need for awareness of a model's training level when evaluating low-bit quantization research. To facilitate future research on this problem, we release all the 1500+ quantized checkpoints used in this work at https://huggingface.co/Xu-Ouyang.
A scaling law for the loss incurred by quantization. The formula says that the gap between post- and pre-quantization loss grows with the number of training tokens and shrinks with model size and precision (https://arxiv.org/abs/2411.04330). Consequently, undertrained models are more favorable for low-bit quantization. I'm curious where the Pareto-optimal point lies.
The paper also raises a slightly different idea: if the loss gap is small, the model can be considered undertrained; conversely, this might let us estimate how much further a model can usefully be trained.
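As a rough sketch of what this kind of QiD scaling law looks like, here is a toy Python version. The functional form (QiD ∝ D^α / (N^β · P^γ)) follows the description above, but the constant and exponents are placeholders chosen for illustration, not the paper's fitted values.

```python
def quantization_induced_degradation(D, N, P, k=1e-4, alpha=0.5, beta=0.3, gamma=1.0):
    """Toy QiD scaling law: the loss gap grows with training tokens D and
    shrinks with parameter count N and bit width P.
    k, alpha, beta, gamma are placeholder constants, not the paper's fits."""
    return k * D ** alpha / (N ** beta * P ** gamma)

# Example: a 1B-parameter model quantized to 4 bits, after 100B vs. 10T training tokens.
for D in (1e11, 1e13):
    qid = quantization_induced_degradation(D=D, N=1e9, P=4)
    print(f"tokens={D:.0e}  predicted QiD={qid:.3f}")

# Inverting the same form, D = (QiD * N**beta * P**gamma / k) ** (1 / alpha),
# is what turns a measured QiD into an estimate of how far a checkpoint
# could still be trained.
```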
#quantization #scaling-law
Inference Scaling FLaws: The Limits of LLM Resampling with Imperfect Verifiers
(Benedikt Stroebl, Sayash Kapoor, Arvind Narayanan)
Recent research has generated hope that inference scaling could allow weaker language models to match or exceed the accuracy of stronger models, such as by repeatedly sampling solutions to a coding problem until it passes unit tests. The central thesis of this paper is that there is no free lunch for inference scaling: indefinite accuracy improvement through resampling can only be realized if the "verifier" (in this case, a set of unit tests) is perfect. When the verifier is imperfect, as it almost always is in domains such as reasoning or coding (for example, unit tests have imperfect coverage), there is a nonzero probability of false positives: incorrect solutions that pass the verifier. Resampling cannot decrease this probability, so it imposes an upper bound to the accuracy of resampling-based inference scaling even with an infinite compute budget. We find that there is a very strong correlation between the model's single-sample accuracy (i.e. accuracy without unit tests) and its false positive rate on coding benchmarks HumanEval and MBPP, whose unit tests have limited coverage. Therefore, no amount of inference scaling of weaker models can enable them to match the single-sample accuracy of a sufficiently strong model (Fig. 1a). When we consider that false positives have a negative utility compared to abstaining from producing a solution, it bends the inference scaling curve further downward. Empirically, we find that the optimal number of samples can be less than 10 under realistic assumptions (Fig. 1b). Finally, we show that beyond accuracy, false positives may have other undesirable qualities, such as poor adherence to coding style conventions.
The argument is that when the verifier is imperfect, scaling up the amount of sampling can amount to scaling up false positives. Together with the way patterns differ by problem difficulty (https://arxiv.org/abs/2408.03314), I suspect inference scaling laws can exhibit much more complex patterns than training scaling laws.
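A toy Monte Carlo sketch of the paper's central point (my own illustration, not the authors' code): with an imperfect verifier, resampling until something passes saturates at an accuracy ceiling set by the verifier's false-positive rate, no matter how large the sample budget.

```python
import random

# Illustrative numbers, not taken from the paper.
P_CORRECT = 0.2             # single-sample probability of a correct solution
P_PASS_GIVEN_WRONG = 0.05   # verifier false-positive rate on incorrect solutions

def resample(k, rng):
    """Draw up to k samples and return the first one the verifier accepts.
    Returns True if that accepted sample is actually correct, False otherwise
    (including the case where nothing passes)."""
    for _ in range(k):
        correct = rng.random() < P_CORRECT
        passes = correct or rng.random() < P_PASS_GIVEN_WRONG
        if passes:
            return correct
    return False

rng = random.Random(0)
for k in (1, 10, 100, 1000):
    acc = sum(resample(k, rng) for _ in range(20000)) / 20000
    print(f"k={k:4d}  accuracy={acc:.3f}")

# As k grows, accuracy approaches
# P_CORRECT / (P_CORRECT + (1 - P_CORRECT) * P_PASS_GIVEN_WRONG) ~= 0.833 here:
# a ceiling set by the verifier, not by the compute budget.
```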
#search
Scaling Speech-Text Pre-training with Synthetic Interleaved Data
(Aohan Zeng, Zhengxiao Du, Mingdao Liu, Lei Zhang, Shengmin Jiang, Yuxiao Dong, Jie Tang)
Speech language models (SpeechLMs) accept speech input and produce speech output, allowing for more natural human-computer interaction compared to text-based large language models (LLMs). Traditional approaches for developing SpeechLMs are constrained by the limited availability of unsupervised speech data and parallel speech-text data, which are significantly less abundant than text pre-training data, thereby limiting their scalability as LLMs. We propose a novel approach to scaling speech-text pre-training by leveraging large-scale synthetic interleaved data derived from text corpora, eliminating the need for parallel speech-text datasets. Our method efficiently constructs speech-text interleaved data by sampling text spans from existing text corpora and synthesizing corresponding speech spans using a text-to-token model, bypassing the need to generate actual speech. We also employ a supervised speech tokenizer derived from an automatic speech recognition (ASR) model by incorporating a vector-quantized bottleneck into the encoder. This supervised training approach results in discrete speech tokens with strong semantic preservation even at lower sampling rates (e.g. 12.5Hz), while still maintaining speech reconstruction quality. Starting from a pre-trained language model and scaling our pre-training to 1 trillion tokens (with 600B synthetic interleaved speech-text data), we achieve state-of-the-art performance in speech language modeling and spoken question answering, improving performance on spoken questions tasks from the previous SOTA of 13% (Moshi) to 31%. We further demonstrate that by fine-tuning the pre-trained model with speech dialogue data, we can develop an end-to-end spoken chatbot that achieves competitive performance comparable to existing baselines in both conversational abilities and speech quality, even operating exclusively in the speech domain.
The authors build a model that converts text into speech tokens and use it to construct interleaved data by swapping spans of the text for speech tokens, without ever synthesizing actual audio. Interesting.
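A schematic sketch of that data construction step, assuming a hypothetical text_to_speech_tokens stand-in for the paper's text-to-token model; the real pipeline produces discrete ids from a supervised speech tokenizer, not the dummy tokens used here.

```python
import random

def text_to_speech_tokens(words):
    """Placeholder for the paper's text-to-token model: emits dummy <sp_*>
    pseudo-tokens, one per word, purely for illustration."""
    return [f"<sp_{hash(w) % 4096}>" for w in words]

def make_interleaved_example(text, span_ratio=0.3, rng=random):
    """Replace a random contiguous span of the text with synthetic speech tokens,
    yielding a mixed speech-text training sequence (schematic only)."""
    words = text.split()
    span_len = max(1, int(len(words) * span_ratio))
    start = rng.randrange(0, len(words) - span_len + 1)
    speech = text_to_speech_tokens(words[start:start + span_len])
    return words[:start] + speech + words[start + span_len:]

rng = random.Random(0)
print(make_interleaved_example(
    "speech language models accept speech input and produce speech output", rng=rng))
```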
#speech-text #synthetic-data
Towards Precise Scaling Laws for Video Diffusion Transformers
(Yuanyang Yin, Yaqi Zhao, Mingwu Zheng, Ke Lin, Jiarong Ou, Rui Chen, Victor Shea-Jay Huang, Jiahao Wang, Xin Tao, Pengfei Wan, Di Zhang, Baoqun Yin, Wentao Zhang, Kun Gai)
Achieving optimal performance of video diffusion transformers within given data and compute budget is crucial due to their high training costs. This necessitates precisely determining the optimal model size and training hyperparameters before large-scale training. While scaling laws are employed in language models to predict performance, their existence and accurate derivation in visual generation models remain underexplored. In this paper, we systematically analyze scaling laws for video diffusion transformers and confirm their presence. Moreover, we discover that, unlike language models, video diffusion models are more sensitive to learning rate and batch size, two hyperparameters often not precisely modeled. To address this, we propose a new scaling law that predicts optimal hyperparameters for any model size and compute budget. Under these optimal settings, we achieve comparable performance and reduce inference costs by 40.1% compared to conventional scaling methods, within a compute budget of 1e10 TFlops. Furthermore, we establish a more generalized and precise relationship among validation loss, any model size, and compute budget. This enables performance prediction for non-optimal model sizes, which may also be appealing under practical inference cost constraints, achieving a better trade-off.
Scaling laws for video diffusion transformers. What the authors actually put the most weight on is the scaling law for batch size and learning rate, i.e., predicting the optimal hyperparameters for a given model size and compute budget.
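To make the idea concrete, here is a generic power-law hyperparameter predictor of the kind such a scaling law would provide. The functional form and all coefficients below are illustrative assumptions, not the fits reported in the paper.

```python
def predict_hparams(model_size, compute,
                    lr_coef=0.8, lr_exp_n=-0.25, lr_exp_c=-0.05,
                    bs_coef=0.02, bs_exp_n=0.1, bs_exp_c=0.35):
    """Illustrative power-law predictor for (learning rate, batch size) given
    model size N (parameters) and compute budget C (TFLOPs).
    All coefficients and exponents are placeholders, not the paper's fits."""
    lr = lr_coef * model_size ** lr_exp_n * compute ** lr_exp_c
    bs = bs_coef * model_size ** bs_exp_n * compute ** bs_exp_c
    return lr, round(bs)

# Example: 1B parameters, 1e10 TFLOPs (the budget scale mentioned in the abstract).
lr, bs = predict_hparams(model_size=1e9, compute=1e10)
print(f"predicted learning rate={lr:.2e}, batch size={bs}")
```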
#diffusion #video-generation #scaling-law