March 26, 2024
Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance
(Jiasheng Ye, Peiju Liu, Tianxiang Sun, Yunhua Zhou, Jun Zhan, Xipeng Qiu)
Pretraining data of large language models composes multiple domains (e.g., web texts, academic papers, codes), whose mixture proportions crucially impact the competence of outcome models. While existing endeavors rely on heuristics or qualitative strategies to tune the proportions, we discover the quantitative predictability of model performance regarding the mixture proportions in function forms, which we refer to as the data mixing laws. Fitting such functions on sample mixtures unveils model performance on unseen mixtures before actual runs, thus guiding the selection of an ideal data mixture. Furthermore, we propose nested use of the scaling laws of training steps, model sizes, and our data mixing law to enable predicting the performance of large models trained on massive data under various mixtures with only small-scale training. Moreover, experimental results verify that our method effectively optimizes the training mixture of a 1B model trained for 100B tokens in RedPajama, reaching a performance comparable to the one trained for 48% more steps on the default mixture. Extending the application of data mixing laws to continual training accurately predicts the critical mixture proportion that avoids catastrophic forgetting and outlooks the potential for dynamic data schedules.
A law relating pretraining data mixture proportions to loss. As the authors themselves note, it is not really a law about scaling up model or data size, so calling it a scaling law is a bit of a stretch.
The basic functional form is Loss(i) = exp(ratio(i)) for domain i, i.e., the loss on each domain is an exponential function of the mixture proportions. On top of this, the validation loss over data aggregating these domains is computed; to make this applicable when the domain proportions of the validation set are unknown, the validation set is treated as a mixture of K implicit domains whose proportions are estimated as well. This does feel like it substantially increases the number of parameters to fit. A minimal fitting sketch is below.
Combining this with the scaling law for downstream tasks (https://arxiv.org/abs/2403.08540) might make it even more useful for pretraining decisions.
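Here is a minimal sketch, under the assumption of an exponential mixing law of the form L(r) ≈ c + k·exp(Σ_j t_j r_j) over training-domain proportions r; it is my own illustration rather than the authors' code, and all mixture proportions, loss values, and initial guesses are hypothetical.

```python
# Minimal sketch: fit an exponential data mixing law to losses measured on a few
# sample mixtures, then predict the loss of an unseen mixture before training on it.
import numpy as np
from scipy.optimize import curve_fit

def mixing_law(r, c, k, *t):
    # r: (n_mixtures, n_train_domains) mixture proportions; t weights each training domain
    return c + k * np.exp(r @ np.array(t))

# toy data: six sample mixtures over three training domains and their measured losses
r_samples = np.array([
    [0.60, 0.30, 0.10],
    [0.30, 0.50, 0.20],
    [0.20, 0.20, 0.60],
    [0.50, 0.10, 0.40],
    [0.10, 0.70, 0.20],
    [0.33, 0.33, 0.34],
])
losses = np.array([2.31, 2.28, 2.40, 2.35, 2.27, 2.30])  # hypothetical values

params, _ = curve_fit(mixing_law, r_samples, losses,
                      p0=[2.0, 0.3, 0.5, -0.5, 1.0], maxfev=20000)
c, k, t = params[0], params[1], params[2:]

# predict the loss of an unseen mixture before running the actual training
new_mix = np.array([[0.40, 0.40, 0.20]])
print(mixing_law(new_mix, c, k, *t))
```

In the paper's setting one such law would be fit per validation domain (or per implicit domain), and the nested scaling laws would supply the small-scale losses used for fitting.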
#scaling-law
Understanding Emergent Abilities of Language Models from the Loss Perspective
(Zhengxiao Du, Aohan Zeng, Yuxiao Dong, Jie Tang)
Recent studies have put into question the belief that emergent abilities in language models are exclusive to large models. This skepticism arises from two observations: 1) smaller models can also exhibit high performance on emergent abilities and 2) there is doubt on the discontinuous metrics used to measure these abilities. In this paper, we propose to study emergent abilities in the lens of pre-training loss, instead of model size or training compute. We demonstrate that the models with the same pre-training loss, but different model and data sizes, generate the same performance on various downstream tasks. We also discover that a model exhibits emergent abilities on certain tasks -- regardless of the continuity of metrics -- when its pre-training loss falls below a specific threshold. Before reaching this threshold, its performance remains at the level of random guessing. This inspires us to redefine emergent abilities as those that manifest in models with lower pre-training losses, highlighting that these abilities cannot be predicted by merely extrapolating the performance trends of models with higher pre-training losses.
The emergent abilities question for LLMs (https://arxiv.org/abs/2206.07682, https://arxiv.org/abs/2304.15004). The authors directly train a range of models on varying token counts and then plot downstream performance against pre-training loss. Even with continuous metrics such as the Brier score, an elbow shows up in the curves.
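As a rough illustration of what such an elbow looks like numerically, here is a sketch (not from the paper) that fits a hinge-shaped curve: flat at chance level above a loss threshold and rising linearly as loss drops below it. The (loss, accuracy) points are made up.

```python
# Minimal sketch: estimate the loss threshold at which performance departs from chance
# by fitting a piecewise-linear (hinge) curve to performance-vs-loss points.
import numpy as np
from scipy.optimize import curve_fit

def hinge(loss, threshold, chance, slope):
    # performance stays at `chance` until loss drops below `threshold`,
    # then improves with `slope` per unit of loss decrease
    return chance + slope * np.maximum(threshold - loss, 0.0)

# hypothetical points from models of different sizes / token counts
loss = np.array([3.2, 3.0, 2.8, 2.6, 2.4, 2.2, 2.0, 1.8])
acc  = np.array([0.25, 0.25, 0.26, 0.25, 0.31, 0.40, 0.52, 0.63])

params, _ = curve_fit(hinge, loss, acc, p0=[2.5, 0.25, 0.5])
print("estimated threshold loss:", params[0])
```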
#scaling-law #llm
DreamLIP: Language-Image Pre-training with Long Captions
(Kecheng Zheng, Yifei Zhang, Wei Wu, Fan Lu, Shuailei Ma, Xin Jin, Wei Chen, Yujun Shen)
Language-image pre-training largely relies on how precisely and thoroughly a text describes its paired image. In practice, however, the contents of an image can be so rich that well describing them requires lengthy captions (e.g., with 10 sentences), which are usually missing in existing datasets. Consequently, there are currently no clear evidence on whether and how language-image pre-training could benefit from long captions. To figure this out, we first re-caption 30M images with detailed descriptions using a pre-trained Multi-modality Large Language Model (MLLM), and then study the usage of the resulting captions under a contrastive learning framework. We observe that, each sentence within a long caption is very likely to describe the image partially (e.g., an object). Motivated by this, we propose to dynamically sample sub-captions from the text label to construct multiple positive pairs, and introduce a grouping loss to match the embeddings of each sub-caption with its corresponding local image patches in a self-supervised manner. Experimental results on a wide range of downstream tasks demonstrate the consistent superiority of our method, termed DreamLIP, over previous alternatives, highlighting its fine-grained representational capacity. It is noteworthy that, on the tasks of image-text retrieval and semantic segmentation, our model trained with 30M image-text pairs achieves on par or even better performance than CLIP trained with 400M pairs. Project page is available at https://zyf0619sjtu.github.io/dream-lip.
Generate long captions, split them into sub-captions, and apply a contrastive loss against those sub-captions; additionally, sparsify the cross-attention weights between captions and image patches and apply a contrastive loss there as well (https://arxiv.org/abs/2401.09865). One design choice that comes up is whether to operate at the text-token level or the sub-caption level.
It would be even more interesting to think about this together with prior work that instead splits the image itself and writes a detailed caption for each region (https://arxiv.org/abs/2312.08578).
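A rough sketch of the sub-caption-level option, as my own simplification rather than the DreamLIP implementation: split a long caption into sentences, sample a few as additional positives for the same image, and apply an InfoNCE-style loss in which every sampled sub-caption counts as a positive for its image. Function names and tensor shapes are assumptions; the grouping loss against local image patches would be a separate term on top of this.

```python
import random
import torch
import torch.nn.functional as F

def sample_subcaptions(long_caption: str, k: int = 3):
    # naive sentence split; each sentence is taken as a partial description of the image
    sentences = [s.strip() for s in long_caption.split(".") if s.strip()]
    return random.sample(sentences, min(k, len(sentences)))

def multi_positive_contrastive(img_emb, txt_emb, txt_to_img, temperature=0.07):
    # img_emb: (N, D) image embeddings; txt_emb: (M, D) sub-caption embeddings
    # txt_to_img: (M,) long tensor mapping each sub-caption to its source image index
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = txt_emb @ img_emb.T / temperature  # (M, N) similarity matrix
    # each sub-caption is pulled toward its own image and pushed away from the others
    return F.cross_entropy(logits, txt_to_img)
```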
#clip #synthetic-data