August 21, 2024
Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
(Chunting Zhou, Lili Yu, Arun Babu, Kushal Tirumala, Michihiro Yasunaga, Leonid Shamis, Jacob Kahn, Xuezhe Ma, Luke Zettlemoyer, Omer Levy)
We introduce Transfusion, a recipe for training a multi-modal model over discrete and continuous data. Transfusion combines the language modeling loss function (next token prediction) with diffusion to train a single transformer over mixed-modality sequences. We pretrain multiple Transfusion models up to 7B parameters from scratch on a mixture of text and image data, establishing scaling laws with respect to a variety of uni- and cross-modal benchmarks. Our experiments show that Transfusion scales significantly better than quantizing images and training a language model over discrete image tokens. By introducing modality-specific encoding and decoding layers, we can further improve the performance of Transfusion models, and even compress each image to just 16 patches. We further demonstrate that scaling our Transfusion recipe to 7B parameters and 2T multi-modal tokens produces a model that can generate images and text on a par with similar scale diffusion models and language models, reaping the benefits of both worlds.
An extension of the idea of modeling text with autoregression and images with diffusion (https://arxiv.org/abs/2403.05196, https://arxiv.org/abs/2406.11838). Here they take it further: the causal mask is dropped for image tokens, and noise is injected directly into the input image tokens themselves. Of course, this could be a problem for captioning, so a cap is placed on the maximum noise level for some samples in each batch.
I think this kind of combination is a promising way to get recognition and generation at the same time. Of course, a simpler method would be even better if one exists. (Then again, this may become a non-issue as scale grows.)
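Below is a minimal sketch (my own, not the authors' code) of the kind of setup described above: a causal mask over the whole sequence with bidirectional attention inside each image block, a next-token loss on text positions plus a diffusion loss on image patches, and timestep capping for a fraction of samples. All names and values are illustrative assumptions.

```python
# A minimal sketch, not the authors' implementation. Assumptions: `is_image` marks
# continuous image-patch positions in a mixed-modality sequence, `block_id` groups
# patches belonging to the same image, and all constants are illustrative.
import torch
import torch.nn.functional as F

def transfusion_attention_mask(is_image: torch.Tensor, block_id: torch.Tensor) -> torch.Tensor:
    """Causal over the whole sequence, but bidirectional within each image block."""
    L = is_image.shape[0]
    causal = torch.tril(torch.ones(L, L)).bool()
    same_image = (block_id[:, None] == block_id[None, :]) & is_image[:, None] & is_image[None, :]
    return causal | same_image  # True = attention allowed

def transfusion_loss(text_logits, text_targets, eps_pred, eps_true, is_image, lam=1.0):
    """Next-token cross-entropy on text positions plus a noise-prediction MSE on image patches."""
    lm_loss = F.cross_entropy(text_logits[~is_image], text_targets[~is_image])
    diffusion_loss = F.mse_loss(eps_pred[is_image], eps_true[is_image])
    return lm_loss + lam * diffusion_loss

def sample_timesteps(batch_size, num_steps=1000, cap_frac=0.2, t_cap=500):
    """Cap the max diffusion timestep for a fraction of samples so captioning
    can still condition on mostly clean images (fractions/values are made up)."""
    t = torch.randint(0, num_steps, (batch_size,))
    capped = torch.rand(batch_size) < cap_frac
    return torch.where(capped, t.clamp(max=t_cap), t)
```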
#autoregressive-model #diffusion
To Code, or Not To Code? Exploring Impact of Code in Pre-training
(Viraat Aryabumi, Yixuan Su, Raymond Ma, Adrien Morisot, Ivan Zhang, Acyr Locatelli, Marzieh Fadaee, Ahmet Üstün, Sara Hooker)
Including code in the pre-training data mixture, even for models not specifically designed for code, has become a common practice in LLMs pre-training. While there has been anecdotal consensus among practitioners that code data plays a vital role in general LLMs' performance, there is only limited work analyzing the precise impact of code on non-code tasks. In this work, we systematically investigate the impact of code data on general performance. We ask "what is the impact of code data used in pre-training on a large variety of downstream tasks beyond code generation". We conduct extensive ablations and evaluate across a broad range of natural language reasoning tasks, world knowledge tasks, code benchmarks, and LLM-as-a-judge win-rates for models with sizes ranging from 470M to 2.8B parameters. Across settings, we find consistent results that code is a critical building block for generalization far beyond coding tasks, and improvements to code quality have an outsized impact across all tasks. In particular, compared to text-only pre-training, the addition of code results in relative increases of up to 8.2% in natural language (NL) reasoning, 4.2% in world knowledge, 6.6% in generative win-rates, and a 12x boost in code performance. Our work suggests that investments in code quality and preserving code during pre-training have positive impacts.
Does using code data improve natural-language performance beyond code-related domains? This is a verification of something practitioners had tacitly assumed. The more interesting question now is "why?"
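As a rough illustration of the kind of knob being ablated (the mixture names and proportions below are my own assumptions, not the paper's recipe), the comparison boils down to what fraction, and what quality, of code goes into the pre-training mixture:

```python
# Illustrative sketch only: mixture names and weights are assumptions, not the paper's setup.
import random

MIXTURES = {
    "text_only":         {"web_text": 1.00, "code": 0.00},
    "text_plus_code":    {"web_text": 0.75, "code": 0.25},
    "high_quality_code": {"web_text": 0.75, "filtered_code": 0.25},
}

def sample_source(mixture_name: str, rng: random.Random) -> str:
    """Pick which corpus the next pre-training document is drawn from."""
    sources, probs = zip(*MIXTURES[mixture_name].items())
    return rng.choices(sources, weights=probs, k=1)[0]

rng = random.Random(0)
print([sample_source("text_plus_code", rng) for _ in range(5)])
```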
#pretraining #llm #code
Scaling Law with Learning Rate Annealing
(Howe Tissue, Venus Wang, Lu Wang)
We find that the cross-entropy loss curves of neural language models empirically adhere to a scaling law with learning rate (LR) annealing over training steps (s): L(s) = L_0 + A·S_1^{-α} − C·S_2, where S_1 is the forward area and S_2 is the learning rate annealing area. This formulation takes into account two factors: (1) the forward scaling defined as in the typical scaling law, and (2) the additional loss drop brought by LR annealing. Therefore, this formulation can describe the full loss curve at each step, rather than the single loss point at the end of training. Applying the scaling law with LR annealing and fitting only one or two training curves, we can accurately predict the loss of language model training at any given step and across any learning rate scheduler (LRS). Furthermore, this equation accurately describes the dynamics during the training process, and provides a theoretical verification and explanation for numerous experimental findings of previous studies, particularly those focusing on LR schedule and LR annealing. The resulting insights also serve as a guide for researchers to select critical LRS in advance by prediction using our equation. Most significantly, since all the points in a full training curve follow the equation, we can achieve accurate loss prediction at any given step across any learning rate scheduler, while expending less than 1% of the computational cost required by the Chinchilla scaling law to fit language modeling loss. This approach greatly democratizes scaling law fitting and predicting in developing large language models.
This is a neat result: the loss trajectory over the course of training, as a function of the learning rate schedule, can be predicted with a scaling law. The law consists of two factors: the total amount of training, represented by the sum of learning rates (the forward area), and an additional loss drop caused by learning rate annealing.
Choosing a learning rate schedule then becomes a matter of balancing these two factors. Using this, they compare WSD and cosine schedules, and predict how the annealing fraction of a WSD schedule should change, what shape the curve in the annealing phase should take, and so on. Really fun.
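Here is a minimal numerical sketch of how the fitted form could be used, assuming made-up constants (L_0, A, C, α) and a simplified annealing area S_2 (just the accumulated LR drop); the paper's exact S_2 definition is more refined, and the constants are fit from one or two real loss curves.

```python
# A minimal sketch of L(s) = L0 + A * S1^(-alpha) - C * S2 with made-up constants.
# S1 = forward area (cumulative sum of the LR), S2 = annealing area (simplified here
# as the accumulated LR drop); schedule parameters are illustrative.
import numpy as np

def wsd_schedule(total_steps, peak_lr=3e-4, warmup=500, anneal_frac=0.1):
    """Warmup-Stable-Decay: linear warmup, constant plateau, linear decay to zero."""
    lr = np.full(total_steps, peak_lr)
    lr[:warmup] = peak_lr * np.arange(warmup) / warmup
    anneal_start = int(total_steps * (1.0 - anneal_frac))
    lr[anneal_start:] = peak_lr * np.linspace(1.0, 0.0, total_steps - anneal_start)
    return lr

def cosine_schedule(total_steps, peak_lr=3e-4, warmup=500, min_lr=0.0):
    """Cosine decay to min_lr after a linear warmup."""
    lr = np.empty(total_steps)
    lr[:warmup] = peak_lr * np.arange(warmup) / warmup
    t = np.arange(total_steps - warmup) / max(total_steps - warmup - 1, 1)
    lr[warmup:] = min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + np.cos(np.pi * t))
    return lr

def predicted_loss(lr, L0=2.0, A=0.4, C=400.0, alpha=0.5):
    """Evaluate L(s) at every step s of a given LR schedule."""
    S1 = np.cumsum(lr)                                    # forward area: total "amount of training"
    lr_drop = np.maximum(-np.diff(lr, prepend=lr[0]), 0)  # per-step LR decrease (warmup ignored)
    S2 = np.cumsum(lr_drop)                               # simplified annealing area
    return L0 + A * np.power(np.maximum(S1, 1e-12), -alpha) - C * S2

steps = 20_000
for name, sched in [("wsd", wsd_schedule(steps)), ("cosine", cosine_schedule(steps))]:
    print(name, "predicted final loss:", round(float(predicted_loss(sched)[-1]), 4))
```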
#scaling-law #optimization