April 23, 2024
FineWeb
15T tokens of Common Crawl data. They ran Trafilatura on the WARC files rather than using the WET files. I have always been personally curious about the difference between CCNet-style and extractor-based approaches.
After URL filtering and language filtering comes heuristic quality filtering (Gopher, C4, plus some extras). Model-based filtering was deliberately excluded. They use fuzzy deduplication, but per dump rather than over the whole dataset as has been discussed lately, and they argue the per-dump variant works better. (https://arxiv.org/abs/2401.02954)
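To make the per-dump deduplication concrete, here is a minimal MinHash sketch in Python. The shingle size, signature length, and similarity threshold are arbitrary choices of mine, and this is not the actual FineWeb/datatrove pipeline, just an illustration of deduplicating within each dump independently rather than across their union.

```python
import hashlib
from itertools import combinations

def shingles(text, n=5):
    # word n-gram shingles of a document (n is an arbitrary choice here)
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}

def minhash(shingle_set, num_perm=64):
    # cheap MinHash signature: minimum of seeded hashes per "permutation"
    return [
        min(
            int.from_bytes(hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(), "big")
            for s in shingle_set
        )
        for seed in range(num_perm)
    ]

def jaccard_estimate(sig_a, sig_b):
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def dedup_per_dump(dumps, threshold=0.8):
    # dumps: {dump_id: [document, ...]}; near-duplicates are only removed
    # within a dump, never across dumps (the setting FineWeb argues for)
    kept = {}
    for dump_id, docs in dumps.items():
        sigs = [minhash(shingles(d)) for d in docs]
        drop = set()
        for i, j in combinations(range(len(docs)), 2):  # O(n^2); real pipelines use LSH buckets
            if j not in drop and jaccard_estimate(sigs[i], sigs[j]) >= threshold:
                drop.add(j)  # keep the earlier document, drop the near-duplicate
        kept[dump_id] = [d for k, d in enumerate(docs) if k not in drop]
    return kept
```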
Fully replacing model-based filtering with heuristic filtering seems to be a direction everyone tries and then struggles with, so I am not sure how it will pan out.
I had also wondered whether C4's heuristics (removing lines that lack terminal punctuation, contain "Javascript", or contain curly braces) are too aggressive. One could imagine filtering with the extractor first, relaxing conditions like the mandatory terminal punctuation, and then separately adding back the data that those conditions would have dropped.
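To make those line-level rules concrete, here is a small sketch of C4-style filters with a relaxed mode, following the two-bucket idea above; the rule set is partial and the function names are mine, not C4's actual implementation.

```python
TERMINAL_PUNCT = (".", "!", "?", '"', "'")

def keep_line(line, relaxed=False):
    # C4-style line heuristics: drop lines containing curly braces or the word
    # "javascript"; unless relaxed, also require terminal punctuation
    stripped = line.strip()
    if "{" in stripped or "}" in stripped:
        return False
    if "javascript" in stripped.lower():
        return False
    if not relaxed and not stripped.endswith(TERMINAL_PUNCT):
        return False
    return True

def split_document(text):
    # returns (strictly kept lines, lines rescued only by relaxing punctuation),
    # so the rescued lines can be added back as a separate data bucket
    kept, rescued = [], []
    for line in text.splitlines():
        if keep_line(line, relaxed=True):
            (kept if keep_line(line, relaxed=False) else rescued).append(line)
    return kept, rescued
```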
Every time new dataset experiment results come out, I find myself thinking that RefinedWeb may not have been so bad after all. Then again, given that Falcon did not get much mileage out of RefinedWeb alone, perhaps the takeaway is that Common Crawl by itself has its limits. Once the report is out, there will be more to think through.
#dataset #corpus
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
(Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat Behl, Alon Benhaim, Misha Bilenko, Johan Bjorck, Sébastien Bubeck, Martin Cai, Caio César Teodoro Mendes, Weizhu Chen, Vishrav Chaudhary, Parul Chopra, Allie Del Giorno, Gustavo de Rosa, Matthew Dixon, Ronen Eldan, Dan Iter, Abhishek Goswami, Suriya Gunasekar, Emman Haider, Junheng Hao, Russell J. Hewett, Jamie Huynh, Mojan Javaheripi, Xin Jin, Piero Kauffmann, Nikos Karampatziakis, Dongwoo Kim, Mahoud Khademi, Lev Kurilenko, James R. Lee, Yin Tat Lee, Yuanzhi Li, Chen Liang, Weishung Liu, Eric Lin, Zeqi Lin, Piyush Madan, Arindam Mitra, Hardik Modi, Anh Nguyen, Brandon Norick, Barun Patra, Daniel Perez-Becker, Thomas Portet, Reid Pryzant, Heyang Qin, Marko Radmilac, Corby Rosset, Sambudha Roy, Olli Saarikivi, Amin Saied, Adil Salim, Michael Santacroce, Shital Shah, Ning Shang, Hiteshi Sharma, Xia Song, Olatunji Ruwase, Xin Wang, Rachel Ward, Guanhua Wang, Philipp Witte, Michael Wyatt, Can Xu, Jiahang Xu, Sonali Yadav, Fan Yang, Ziyi Yang, Donghan Yu, Chengruidong Zhang, Cyril Zhang, Jianwen Zhang, Li Lyna Zhang, Yi Zhang, Yunan Zhang, Xiren Zhou)
We introduce phi-3-mini, a 3.8 billion parameter language model trained on 3.3 trillion tokens, whose overall performance, as measured by both academic benchmarks and internal testing, rivals that of models such as Mixtral 8x7B and GPT-3.5 (e.g., phi-3-mini achieves 69% on MMLU and 8.38 on MT-bench), despite being small enough to be deployed on a phone. The innovation lies entirely in our dataset for training, a scaled-up version of the one used for phi-2, composed of heavily filtered web data and synthetic data. The model is also further aligned for robustness, safety, and chat format. We also provide some initial parameter-scaling results with 7B and 14B models trained for 4.8T tokens, called phi-3-small and phi-3-medium, both significantly more capable than phi-3-mini (e.g., respectively 75% and 78% on MMLU, and 8.7 and 8.9 on MT-bench).
So Phi-3 is here. Models from 3.8B to 14B parameters were trained on 3.3T to 4.8T tokens. The benchmark scores are formidable; the question is how much of that translates into real-world performance. Then again, that has always been the public sentiment toward the Phi models.
#llm
The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions
(Eric Wallace, Kai Xiao, Reimar Leike, Lilian Weng, Johannes Heidecke, Alex Beutel)
Today's LLMs are susceptible to prompt injections, jailbreaks, and other attacks that allow adversaries to overwrite a model's original instructions with their own malicious prompts. In this work, we argue that one of the primary vulnerabilities underlying these attacks is that LLMs often consider system prompts (e.g., text from an application developer) to be the same priority as text from untrusted users and third parties. To address this, we propose an instruction hierarchy that explicitly defines how models should behave when instructions of different priorities conflict. We then propose a data generation method to demonstrate this hierarchical instruction following behavior, which teaches LLMs to selectively ignore lower-privileged instructions. We apply this method to GPT-3.5, showing that it drastically increases robustness -- even for attack types not seen during training -- while imposing minimal degradations on standard capabilities.
A method for instilling a hierarchy of instructions in the model. For instance, the system message should take top priority, and model outputs should rank below it. It also brings to mind Anthropic's Many-shot Jailbreaking. (https://www.anthropic.com/research/many-shot-jailbreaking)
The method is quite interesting. For aligned instructions, they generate a compound instruction and then split it up, training the model so that the output is the same whether the instruction enters as a system prompt or the decomposed instructions are placed at different levels of the hierarchy. For the misaligned case, they generate a system prompt stating constraints along with a prompt that attacks it, then train the model either to produce the same result as when the attack prompt is absent, or on responses sampled until one complies with the constraints.
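A very rough sketch of how such training examples might be assembled, as I read the recipe. The helpers `generate`, `decompose`, and `complies` are hypothetical stand-ins for the data-generation model and its checks, not the paper's actual code.

```python
from dataclasses import dataclass

@dataclass
class Example:
    system: str
    user: str
    target: str  # response the model is trained to produce

def aligned_examples(compound_instruction, query, generate, decompose):
    # same target whether the full instruction sits in the system prompt or its
    # decomposed pieces are spread across privilege levels
    target = generate(system=compound_instruction, user=query)
    parts = decompose(compound_instruction)
    return [
        Example(system=compound_instruction, user=query, target=target),
        Example(system=parts[0], user=" ".join(parts[1:] + [query]), target=target),
    ]

def misaligned_example(constraint_system, attack_prompt, query, generate, complies, max_tries=8):
    # target is the response produced as if the injected attack were absent;
    # if it violates the constraint, resample until a compliant response appears
    target = generate(system=constraint_system, user=query)
    for _ in range(max_tries):
        if complies(target, constraint_system):
            break
        target = generate(system=constraint_system, user=query)
    return Example(system=constraint_system, user=query + "\n" + attack_prompt, target=target)
```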
Fun stuff. This looks like it will be an important result. Going further, I wonder whether it could be extended to downweighting model outputs, reducing the model's dependence on its own generations.
#instruction-tuning #safety
Preference Fine-Tuning of LLMs Should Leverage Suboptimal, On-Policy Data
(Fahim Tajwar, Anikait Singh, Archit Sharma, Rafael Rafailov, Jeff Schneider, Tengyang Xie, Stefano Ermon, Chelsea Finn, Aviral Kumar)
Learning from preference labels plays a crucial role in fine-tuning large language models. There are several distinct approaches for preference fine-tuning, including supervised learning, on-policy reinforcement learning (RL), and contrastive learning. Different methods come with different implementation tradeoffs and performance differences, and existing empirical findings present different conclusions, for instance, some results show that online RL is quite important to attain good fine-tuning results, while others find (offline) contrastive or even purely supervised methods sufficient. This raises a natural question: what kind of approaches are important for fine-tuning with preference data and why? In this paper, we answer this question by performing a rigorous analysis of a number of fine-tuning techniques on didactic and full-scale LLM problems. Our main finding is that, in general, approaches that use on-policy sampling or attempt to push down the likelihood on certain responses (i.e., employ a "negative gradient") outperform offline and maximum likelihood objectives. We conceptualize our insights and unify methods that use on-policy sampling or negative gradient under a notion of mode-seeking objectives for categorical distributions. Mode-seeking objectives are able to alter probability mass on specific bins of a categorical distribution at a fast rate compared to maximum likelihood, allowing them to relocate masses across bins more effectively. Our analysis prescribes actionable insights for preference fine-tuning of LLMs and informs how data should be collected for maximal improvement.
A comprehensive analysis of the points of contention among RLHF methods: on-policy vs. off-policy, and whether to use a negative gradient (pushing down the likelihood of certain responses). The conclusion is that using on-policy sampling and a negative gradient is the more effective choice.
On-policy sampling and negative gradients are effective when the peaks of the reward distribution are far from the reference distribution, because both amount to a reverse KL objective and are therefore mode-seeking. Methods without them end up mode-covering instead.
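For reference, the standard way this distinction is usually written down (my formulation, not copied from the paper): the on-policy / negative-gradient family optimizes reward under a reverse KL penalty against the reference policy, which is mode-seeking, whereas maximum-likelihood objectives minimize a forward KL against the target distribution, which is mode-covering.

```latex
% Reverse KL (mode-seeking): RLHF-style objective with a KL penalty to the reference policy
\max_{\pi_\theta}\;
  \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\!\big[ r(x, y) \big]
  \;-\; \beta\, \mathrm{KL}\!\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)

% Forward KL (mode-covering): maximum likelihood against a target distribution p^*
\min_{\pi_\theta}\;
  \mathrm{KL}\!\big( p^*(\cdot \mid x) \,\|\, \pi_\theta(\cdot \mid x) \big)
  \;=\; -\,\mathbb{E}_{y \sim p^*(\cdot \mid x)}\!\big[ \log \pi_\theta(y \mid x) \big] + \text{const.}
```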
Given, for example, how diversity tends to shrink over the course of PPO training, this is a natural way for the story to go.
#rlhf
SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation
(Yuying Ge, Sijie Zhao, Jinguo Zhu, Yixiao Ge, Kun Yi, Lin Song, Chen Li, Xiaohan Ding, Ying Shan)
The rapid evolution of multimodal foundation model has demonstrated significant progresses in vision-language understanding and generation, e.g., our previous work SEED-LLaMA. However, there remains a gap between its capability and the real-world applicability, primarily due to the model's limited capacity to effectively respond to various user instructions and interact with diverse visual data. In this work, we focus on bridging this gap through integrating two enhanced features: (1) comprehending images of arbitrary sizes and ratios, and (2) enabling multi-granularity image generation. We present a unified and versatile foundation model, namely, SEED-X, which is able to model multi-granularity visual semantics for comprehension and generation tasks. Besides the competitive results on public benchmarks, SEED-X demonstrates its effectiveness in handling real-world applications across various domains after instruction tuning. We hope that our work will inspire future research into what can be achieved by versatile multimodal foundation models in real-world applications. The models, codes, and datasets will be released in https://github.com/AILab-CVC/SEED-X.
A vision-language model capable of both image input and output. It uses a ViT encoder/decoder that accepts a condition input, and when image-generation query tokens are given, the model is trained to predict the corresponding image features. Beyond that, it follows the now-standard recipe of feeding a global image plus crops as input and adding a detection objective.
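A loose sketch of the "predict image features from generation query tokens" part. The module layout, dimensions, and the plain MSE regression loss are my assumptions for illustration, not the actual SEED-X implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageGenQueries(nn.Module):
    def __init__(self, num_queries=64, hidden_dim=4096, feat_dim=1024):
        super().__init__()
        # learnable query embeddings appended after the text tokens
        self.queries = nn.Parameter(torch.randn(num_queries, hidden_dim) * 0.02)
        # projection from LLM hidden states into the ViT feature space the image decoder consumes
        self.proj = nn.Linear(hidden_dim, feat_dim)

    def append_queries(self, text_embeds):
        # text_embeds: (batch, seq_len, hidden_dim)
        q = self.queries.unsqueeze(0).expand(text_embeds.size(0), -1, -1)
        return torch.cat([text_embeds, q], dim=1)

    def regression_loss(self, query_hidden, target_feats):
        # query_hidden: LLM hidden states at the query positions, (batch, num_queries, hidden_dim)
        # target_feats: ViT features of the target image, (batch, num_queries, feat_dim)
        return F.mse_loss(self.proj(query_hidden), target_feats)
```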
#vision-language #image-generation
Self-Supervised Alignment with Mutual Information: Learning to Follow Principles without Preference Labels
(Jan-Philipp Fränken, Eric Zelikman, Rafael Rafailov, Kanishk Gandhi, Tobias Gerstenberg, Noah D. Goodman)
When prompting a language model (LM), users frequently expect the model to adhere to a set of behavioral principles across diverse tasks, such as producing insightful content while avoiding harmful or biased language. Instilling such principles into a model can be resource-intensive and technically challenging, generally requiring human preference labels or examples. We introduce SAMI, a method for teaching a pretrained LM to follow behavioral principles that does not require any preference labels or demonstrations. SAMI is an iterative algorithm that finetunes a pretrained LM to increase the conditional mutual information between constitutions and self-generated responses given queries from a dataset. On single-turn dialogue and summarization, a SAMI-trained mistral-7b outperforms the initial pretrained model, with win rates between 66% and 77%. Strikingly, it also surpasses an instruction-finetuned baseline (mistral-7b-instruct) with win rates between 55% and 57% on single-turn dialogue. SAMI requires a "principle writer" model; to avoid dependence on stronger models, we further evaluate aligning a strong pretrained model (mixtral-8x7b) using constitutions written by a weak instruction-finetuned model (mistral-7b-instruct). The SAMI-trained mixtral-8x7b outperforms both the initial model and the instruction-finetuned model, achieving a 65% win rate on summarization. Our results indicate that a pretrained LM can learn to follow constitutions without using preference labels, demonstrations, or human oversight.
The method generates principles for a task to build a constitution, then aligns the model with a contrastive loss between responses generated from that constitution and query versus responses generated with different constitutions. Interesting. Constitution-based alignment for helpfulness still strikes me as a topic worth watching.
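A minimal sketch of how I read the contrastive, mutual-information-style objective: score each response under each constitution (for the same query) and apply a symmetric cross-entropy so that matching constitution/response pairs come out most likely. The `score` helper and the exact symmetric formulation are my assumptions, not the paper's precise loss.

```python
import torch
import torch.nn.functional as F

def sami_style_loss(logprob_matrix):
    # logprob_matrix[i, j] = log pi_theta(response_j | constitution_i, query),
    # where response_j was generated under constitution_j; matches sit on the diagonal
    n = logprob_matrix.size(0)
    labels = torch.arange(n, device=logprob_matrix.device)
    loss_rows = F.cross_entropy(logprob_matrix, labels)      # pick the right response per constitution
    loss_cols = F.cross_entropy(logprob_matrix.t(), labels)  # pick the right constitution per response
    return 0.5 * (loss_rows + loss_cols)

# Hypothetical usage: score(model, constitution, query, response) would sum the
# model's token log-probabilities of `response` conditioned on (constitution, query).
# logprob_matrix = torch.stack([
#     torch.stack([score(model, c, query, y) for y in responses])
#     for c in constitutions
# ])
```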
#rlaif