September 11, 2024
Improving Pretraining Data Using Perplexity Correlations
(Tristan Thrush, Christopher Potts, Tatsunori Hashimoto)
Quality pretraining data is often seen as the key to high-performance language models. However, progress in understanding pretraining data has been slow due to the costly pretraining runs required for data selection experiments. We present a framework that avoids these costs and selects high-quality pretraining data without any LLM training of our own. Our work is based on a simple observation: LLM losses on many pretraining texts are correlated with downstream benchmark performance, and selecting high-correlation documents is an effective pretraining data selection method. We build a new statistical framework for data selection centered around estimates of perplexity-benchmark correlations and perform data selection using a sample of 90 LLMs taken from the Open LLM Leaderboard on texts from tens of thousands of web domains. In controlled pretraining experiments at the 160M parameter scale on 8 benchmarks, our approach outperforms DSIR on every benchmark, while matching the best data selector found in DataComp-LM, a hand-engineered bigram classifier.
The idea: use the correlation between per-domain likelihood and benchmark performance to label the domains most helpful for a benchmark as positives, train a classifier on those labels, and sample data with it. This reads as an extension of earlier work on predicting the relationship between domains and benchmarks (https://arxiv.org/abs/2404.09937).
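A rough sketch of the core correlation step (my reading, not the paper's exact estimator): across a pool of models, rank-correlate each domain's negative loss with the benchmark score and keep the top-correlated domains as positives. The array names and toy data below are placeholders, not the paper's actual setup.

```python
import numpy as np
from scipy.stats import spearmanr

# Toy stand-ins (not the paper's data): bpb[m, d] is the bits-per-byte of
# model m on domain d, scores[m] is model m's downstream benchmark score.
rng = np.random.default_rng(0)
n_models, n_domains = 90, 1000
bpb = rng.normal(1.0, 0.1, size=(n_models, n_domains))
scores = rng.uniform(0.3, 0.7, size=n_models)

# For each domain, rank-correlate negative loss with benchmark score across
# models: domains where a better fit tracks better benchmark performance
# end up with a high correlation.
corr = np.array([spearmanr(-bpb[:, d], scores)[0] for d in range(n_domains)])

# Keep the top-correlated domains as positives; these labels would then be
# used to train the classifier that does the actual sampling.
k = 100
selected = np.argsort(corr)[-k:]
print("highest-correlation domains:", selected[:10])
```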
Personally, I think that building a good LLM requires being indifferent to benchmarks. Still, it is also true that we need better approaches to setting domain mixture ratios and to model-based quality filtering.
#corpus
Geometric-Averaged Preference Optimization for Soft Preference Labels
(Hiroki Furuta, Kuang-Huei Lee, Shixiang Shane Gu, Yutaka Matsuo, Aleksandra Faust, Heiga Zen, Izzeddin Gur)
Many algorithms for aligning LLMs with human preferences assume that human preferences are binary and deterministic. However, it is reasonable to think that they can vary with different individuals, and thus should be distributional to reflect the fine-grained relationship between the responses. In this work, we introduce the distributional soft preference labels and improve Direct Preference Optimization (DPO) with a weighted geometric average of the LLM output likelihood in the loss function. In doing so, the scale of learning loss is adjusted based on the soft labels, and the loss with equally preferred responses would be close to zero. This simple modification can be easily applied to any DPO family and helps the models escape from the over-optimization and objective mismatch prior works suffer from. In our experiments, we simulate the soft preference labels with AI feedback from LLMs and demonstrate that geometric averaging consistently improves performance on standard benchmarks for alignment research. In particular, we observe more preferable responses than binary labels and significant improvements with data where modestly-confident labels are in the majority.
An attempt to apply DPO when preference labels are soft rather than binary 0/1. The soft preference labels here are not something the model estimates during training; they are collected by relabeling existing data with an LLM. In other words, this is a method for the setting where soft-label ground truth is available.
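Reading the weighted geometric average in log space, it amounts to scaling the usual DPO margin by (2p̂ − 1), where p̂ is the soft label: an equally preferred pair (p̂ = 0.5) has zero margin and contributes essentially no gradient. Below is a minimal sketch along those lines; this is my reading, not necessarily the paper's exact loss, and the function name and inputs are made up for illustration.

```python
import torch
import torch.nn.functional as F

def soft_label_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                        ref_chosen_logps, ref_rejected_logps,
                        soft_labels, beta=0.1):
    """DPO-style loss with soft preference labels (sketch).

    soft_labels in [0, 1] is the probability that the "chosen" response is
    actually preferred. The weighted geometric average of the two response
    likelihoods, taken in log space, scales the standard DPO margin by
    (2 * soft_label - 1), so a 0.5 label zeroes out the margin.
    """
    margin = (policy_chosen_logps - policy_rejected_logps) - (
        ref_chosen_logps - ref_rejected_logps
    )
    scaled_margin = (2.0 * soft_labels - 1.0) * margin
    return -F.logsigmoid(beta * scaled_margin).mean()

# Toy usage with hypothetical per-sequence log-probabilities.
b = 4
loss = soft_label_dpo_loss(
    policy_chosen_logps=torch.randn(b), policy_rejected_logps=torch.randn(b),
    ref_chosen_logps=torch.randn(b), ref_rejected_logps=torch.randn(b),
    soft_labels=torch.tensor([1.0, 0.9, 0.6, 0.5]),
)
print(loss)
```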
#alignment #rl