2024년 7월 24일

Jul 24, 2024

Llama 3 405B 모델 공개와 함께 기대했던 테크니컬 리포트가 나왔다. Llama 2 리포트보다 훨씬 더 많은 정보가 포함되어 있어 굉장히 흥미롭다.

1. 프리트레이닝

웹 데이터 전처리에 대해 비교적 상세하게 기술하고 있다. 자체 개발한 Main Content Extractor를 사용했다. 전체 데이터에 대한 Deduplication을 수행했다고 하는데 Global Deduplication을 시사하는 것인지 궁금한 점이 있다. 추가적으로 CCNet 스타일의 Line Deduplication으로 Boilerplate를 추가 제거. C4/Gopher 스타일 휴리스틱 필터와 (아마도 LM 기반의) 텍스트 분포에서의 아웃라이어들에 대한 필터링도 사용.

Wikipedia를 레퍼런스로 잡은 fastText, Llama 2 예측 기반의 DistilRobert 분류기로 퀄리티 필터링.

DeepSeek 스타일의 수학과 코드 도메인에 특화한 웹 페이지 추출.

분류기를 사용해 웹 데이터의 도메인을 분류하고 Scaling Law 추정을 통해 데이터 믹스를 결정. 일반 지식 50%, 수학 및 추론 관련 25%, 코드 17%, 다국어 8%.

고품질 데이터로 학습 최종 단계에서 Annealing 적용. 역으로 Annealing을 사용해 데이터셋의 퀄리티를 검증하기도 함.

문서 간 Attention을 차단하기 위한 마스킹 적용. Long Context 추가 학습에 유용했다고 언급. Polyak Averaging도 적용.

다운스트림 과제에 대한 Likelihood에 대해 Scaling Law를 추정한 다음 Likelihood와 과제에 대한 스코어의 함수를 추정하는 방식으로 다운스트림 과제에 대한 Scaling Law를 추정.

Pipeline Parallel의 배치 크기 제약을 완화하고 Ring 대신 All-Gather 기반 Context Parallel을 사용.

2. 포스트트레이닝

Reward Modeling에서 시작해서 Rejection Sampling으로 생성한 데이터로 SFT를 하고 DPO를 하는 흐름. 즉 PPO를 쓰지 않고 SFT 단계에서도 모델 생성 데이터가 주축이 된다.

포스트트레이닝 데이터에 대해서는 강하게 퀄리티 컨트롤을 했다. 수작업으로 데이터를 필터링하고, 모델 기반의 퀄리티 필터링과 난이도에 따른 비율 조정, 그리고 Semantic Deduplication을 적용.

포스트트레이닝 시점에서는 각 도메인에 대해 특화된 데이터 구축 작업들을 진행했다.

2.1. 코드

코드 특화 모델을 구축하는 것에서 시작. 레포 레벨 데이터도 활용했다. 컴파일러 피드백과 모델로 생성한 유닛 테스트를 사용한 피드백을 사용해 데이터를 개선하고 학습.

서로 다른 프로그래밍 언어간 번역, 코드 설명, 생성, 문서화, 디버깅 등의 과제에 대해서 모델이 응답을 생성하게 한 다음 원 코드로 Backtranslation을 하게 하고 출력 결과의 퀄리티를 사용해 필터링.

2.2. 다국어

다국어에 대해서도 특화 모델을 학습. NLP 데이터셋과 사람이 작성한 프롬프트를 사용하고 Rejection Sampling을 적용한 다음 룰 기반 퀄리티 필터링. 기계 번역은 의도적으로 제외하려고 노력.

2.3. 수학

프리트레이닝 데이터와 사람을 통해 프롬프트를 구축. 모델로 CoT 응답을 생성한 다음 모델 기반으로 검증. Process Reward Model을 사용해 필터링하고, 어려운 문제의 경우에는 MCTS와 Process Reward Model을 사용해 응답을 생성.

2.4. 추론

텍스트와 코드를 사용해 추론 문제를 풀도록 학습. 코드 실행 피드백을 사용하고 잘못된 생성 결과를 모델을 통해 오류를 교정.

2.5. Long Context

긴 문서를 청킹한 다음 각 청크에 대해 QA를 생성하고 합치는 전통적인 방식에 마찬가지로 청크에 대해 요약한 다음 이 요약을 요약하는 방법을 사용. 파이썬 코드에 대해 Dependency Sorting을 하고 가장 많이 참조된 파일을 지운 다음 이 파일의 코드를 생성하도록 하는 방법도 사용.

2.6. 도구 사용

웹 검색, 파이썬 인터프리터, Wolfram Alpha에 대한 도구 사용 능력을 학습시켰음. 이쪽은 사람이 직접 데이터를 작성하는 방식. 다만 합성 데이터를 사용해 기본적인 도구 사용 능력을 습득시킨 다음에 시작.

2.7. Factuality

프리트레이닝에서 문서를 가져와서 질문을 생성하게 한 다음 응답을 샘플링. 문서와 응답을 통해 Llama 3 기반으로 정확성과 정보의 품질을 평가. 지속적으로 오답이 발생하는 경우에는 Refusal을 생성.

2.8. Steerability

어노테이터들이 시스템 프롬프트를 만들게 한 다음 대화를 진행하고 어노테이션을 하도록 함.

2.9. Safety

어노테이터를 통해 Adversarial 프롬프트를 구축하고 Automatic Red Teaming도 적용.

3. 비전, 오디오

비전의 경우에는 이미지 인코더에 대한 Cross Attention, 오디오의 경우에는 인코더 출력을 모델 입력으로 바로 Projection.

소감

웹 데이터 처리는 높은 차원에서는 현재 정석적으로 여겨지는 방법들을 채택. DeepSeek 스타일의 코드 및 수학 도메인 데이터 발굴이 중요한 수단이라는 것이 재차 검증됨. 추가적으로 Scaling Law를 사용한 데이터 믹스 결정 등 체계적인 방법도 흥미로운 지점.

안정적이고 효율적인 학습 인프라 구축을 위해서는 통신과 Parallelism의 바닥부터 직접 손을 대어야 한다는 것을 증명.

포스트트레이닝은 더 확연하게 Preference Data로 무게중심이 옮겨짐. SFT도 대부분이 모델 생성 데이터를 Reward Model로 Rejection Sampling한 샘플에 대해 학습시키는 것으로 구성됨. 사실상 온라인 샘플을 통해 학습하게 되면서 PPO 없이 DPO만으로 학습하는 것이 충분히 좋은 선택이 됨.

코드와 수학에 대해 코드 실행/컴파일러 피드백과 Process Reward Model/MCTS가 중요한 구성 요소가 됨.

포스트트레이닝에 언급된 방법 하나하나가 모두 각각 한 편의 논문이 될 수 있는 테크닉들. 그리고 이 모든 테크닉들을 종합해서 포스트트레이닝이 구성됨. 사실 이것은 프리트레이닝과 학습 인프라 구축에서도 마찬가지. 지금 시점의 프런티어 모델은 최첨단을 달리는 다양한 기술들을 종합적으로 모델 하나에 집중해서 만들어지고 있음. GPU 숫자 같은 명시적인 연산력에 밀려 흔히 가려지는 부분이지만 이 수많은 노력들을 효과적으로 집중하는 것이 굉장히 중요한 요소라는 것을 시사하는 것일 것.

#llm #alignment #rlaif #pretraining #posttraining

Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data?

(Jonathan Hayase, Alisa Liu, Yejin Choi, Sewoong Oh, Noah A. Smith)

The pretraining data of today's strongest language models is opaque. In particular, little is known about the proportions of various domains or languages represented. In this work, we tackle a task which we call data mixture inference, which aims to uncover the distributional make-up of training data. We introduce a novel attack based on a previously overlooked source of information -- byte-pair encoding (BPE) tokenizers, used by the vast majority of modern language models. Our key insight is that the ordered list of merge rules learned by a BPE tokenizer naturally reveals information about the token frequencies in its training data: the first merge is the most common byte pair, the second is the most common pair after merging the first token, and so on. Given a tokenizer's merge list along with data samples for each category of interest, we formulate a linear program that solves for the proportion of each category in the tokenizer's training set. Importantly, to the extent to which tokenizer training data is representative of the pretraining data, we indirectly learn about the pretraining data. In controlled experiments, we show that our attack recovers mixture ratios with high precision for tokenizers trained on known mixtures of natural languages, programming languages, and data sources. We then apply our approach to off-the-shelf tokenizers released with recent LMs. We confirm much publicly disclosed information about these models, and also make several new inferences: GPT-4o's tokenizer is much more multilingual than its predecessors, training on 39% non-English data; Llama3 extends GPT-3.5's tokenizer primarily for multilingual (48%) use; GPT-3.5's and Claude's tokenizers are trained on predominantly code (~60%). We hope our work sheds light on current design practices for pretraining data, and inspires continued research into data mixture inference for LMs.

토크나이저를 학습시킨 데이터셋의 비율 알아내기. 추정한 세팅들이 서로 꽤 비슷비슷하다는 생각도 드네요.

#tokenizer

DDK: Distilling Domain Knowledge for Efficient Large Language Models

(Jiaheng Liu, Chenchen Zhang, Jinyang Guo, Yuanxing Zhang, Haoran Que, Ken Deng, Zhiqi Bai, Jie Liu, Ge Zhang, Jiakai Wang, Yanan Wu, Congnan Liu, Wenbo Su, Jiamang Wang, Lin Qu, Bo Zheng)

Despite the advanced intelligence abilities of large language models (LLMs) in various applications, they still face significant computational and storage demands. Knowledge Distillation (KD) has emerged as an effective strategy to improve the performance of a smaller LLM (i.e., the student model) by transferring knowledge from a high-performing LLM (i.e., the teacher model). Prevailing techniques in LLM distillation typically use a black-box model API to generate high-quality pretrained and aligned datasets, or utilize white-box distillation by altering the loss function to better transfer knowledge from the teacher LLM. However, these methods ignore the knowledge differences between the student and teacher LLMs across domains. This results in excessive focus on domains with minimal performance gaps and insufficient attention to domains with large gaps, reducing overall performance. In this paper, we introduce a new LLM distillation framework called DDK, which dynamically adjusts the composition of the distillation dataset in a smooth manner according to the domain performance differences between the teacher and student models, making the distillation process more stable and effective. Extensive evaluations show that DDK significantly improves the performance of student models, outperforming both continuously pretrained baselines and existing knowledge distillation methods by a large margin.

데이터셋의 도메인을 나누고 Teacher와 Student의 차이가 가장 큰 도메인의 샘플링 비율을 높여서 Knowledge Distillation. RHO Loss 같은 느낌도 드는군요. (https://arxiv.org/abs/2206.07137)

#distillation

Boosting Reward Model with Preference-Conditional Multi-Aspect Synthetic Data Generation

(Jiaming Shen, Ran Xu, Yennie Jun, Zhen Qin, Tianqi Liu, Carl Yang, Yi Liang, Simon Baumgartner, Michael Bendersky)

Reward models (RMs) are crucial for aligning large language models (LLMs) with human preferences. They are trained using preference datasets where each example consists of one input prompt, two responses, and a preference label. As curating a high-quality human labeled preference dataset is both time-consuming and expensive, people often rely on existing powerful LLMs for preference label generation. This can potentially introduce noise and impede RM training. In this work, we present RMBoost, a novel synthetic preference data generation paradigm to boost reward model quality. Unlike traditional methods, which generate two responses before obtaining the preference label, RMBoost first generates one response and selects a preference label, followed by generating the second more (or less) preferred response conditioned on the pre-selected preference label and the first response. This approach offers two main advantages. First, RMBoost reduces labeling noise since preference pairs are constructed intentionally. Second, RMBoost facilitates the creation of more diverse responses by incorporating various quality aspects (e.g., helpfulness, relevance, completeness) into the prompts. We conduct extensive experiments across three diverse datasets and demonstrate that RMBoost outperforms other synthetic preference data generation techniques and significantly boosts the performance of four distinct reward models.

모델로 응답을 샘플링한 다음 Preference Label을 하나 정해서 Preference Label을 Condition을 주고 Rewriting한 응답을 Preference Pair로 사용하는 방식. 즉 주어진 응답보다 나은/못한 응답을 재작성으로 만들어내는 방식이네요.

Llama 3가 정말 온갖 방법으로 Synthetic Preference를 만들어내는 사례를 보여줬죠. (물론 LLM Judge가 주된 방법이긴 합니다만) 이런 형태의 Synthetic Preference 방법들을 다시 생각해보면 재미있을 듯 하네요.

#rlaif #alignment

Course-Correction: Safety Alignment Using Synthetic Preferences

(Rongwu Xu, Yishuo Cai, Zhenhong Zhou, Renjie Gu, Haiqin Weng, Yan Liu, Tianwei Zhang, Wei Xu, Han Qiu)

The risk of harmful content generated by large language models (LLMs) becomes a critical concern. This paper presents a systematic study on assessing and improving LLMs' capability to perform the task of course-correction, i.e., the model can steer away from generating harmful content autonomously. To start with, we introduce the C^2-Eval benchmark for quantitative assessment and analyze 10 popular LLMs, revealing varying proficiency of current safety-tuned LLMs in course-correction. To improve, we propose fine-tuning LLMs with preference learning, emphasizing the preference for timely course-correction. Using an automated pipeline, we create C^2-Syn, a synthetic dataset with 750K pairwise preferences, to teach models the concept of timely course-correction through data-driven preference learning. Experiments on 2 LLMs, Llama2-Chat 7B and Qwen2 7B, show that our method effectively enhances course-correction skills without affecting general performance. Additionally, it effectively improves LLMs' safety, particularly in resisting jailbreak attacks.

이쪽도 Rewriting 기반 Synthetic Feedback이네요. Unsafe한 응답을 하던 중에도 Safe한 응답으로 방향을 바꾸는 능력과 관련된 문제입니다. Unsafe한 응답을 쪼개고 LLM으로 응답을 Safe하게 Rewriting한 다음 더 이른 시점의 응답을 Rewriting한 결과를 선호하도록 했네요.

#safety #rlaif #alignment