2024년 7월 30일
Apple Intelligence Foundation Language Models
(Apple)
We present foundation language models developed to power Apple Intelligence features, including a ∼3 billion parameter model designed to run efficiently on devices and a large server-based language model designed for Private Cloud Compute [Apple, 2024b]. These models are designed to perform a wide range of tasks efficiently, accurately, and responsibly. This report describes the model architecture, the data used to train the model, the training process, how the models are optimized for inference, and the evaluation results. We highlight our focus on Responsible AI and how the principles are applied throughout the model development.
Apple's LLMs. They come as an on-device model and a server model; the on-device model's size is disclosed (3B), but nothing is said about the size of the server model.
There are more unusual details here than you might expect. They use QK Norm, and they apply LLM-based filtering (https://arxiv.org/abs/2406.04638) to data collected with their own crawler. They also did additional collection of math data and used licensed datasets; given that the licensed data went into long-context training, you can roughly guess what kind of data it is.
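For reference, a minimal sketch of what QK-Norm looks like inside an attention block. The dimensions and the parameter-free RMS normalization are illustrative assumptions, not Apple's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def rms_norm(x, eps=1e-6):
    # parameter-free RMS normalization over the last (per-head) dimension
    return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)

class QKNormAttention(nn.Module):
    """Attention with QK-Norm: queries and keys are normalized before the dot
    product, which keeps attention logits bounded and stabilizes training."""

    def __init__(self, dim=512, n_heads=8):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.out = nn.Linear(dim, dim, bias=False)

    def forward(self, x):                                   # x: (batch, seq, dim)
        b, s, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        shape = (b, s, self.n_heads, self.head_dim)
        q = rms_norm(q.reshape(shape)).transpose(1, 2)      # QK-Norm applied here
        k = rms_norm(k.reshape(shape)).transpose(1, 2)
        v = v.reshape(shape).transpose(1, 2)
        o = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out(o.transpose(1, 2).reshape(b, s, -1))
```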
They use Simple μP to transfer the LR from a 768-dim proxy model, and given the remark that the transferred LR came out at 0.1x (under μP the LR scales roughly inversely with width, so this suggests a hidden dim around 10x the proxy's), the server model is probably somewhere in the 30-70B range. Both the server and on-device models were trained on 6.3T tokens; the on-device model was obtained by pruning a 6.4B model and then applying knowledge distillation. They use independent weight decay, and for the server model a batch-size estimate led to a 16M-token batch. Strictly speaking, 12M came out as optimal, but since anything up to about 2x causes no real problem, they went with 16M.
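As a generic illustration of the distillation piece (not necessarily the report's exact formulation; `alpha` and `T` are made-up knobs), one could blend next-token cross-entropy with a KL term toward the pruned teacher:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.5, T=1.0):
    """Generic logit-distillation sketch: cross-entropy on the true labels plus
    a KL term pulling the student toward the (pruned) teacher's distribution."""
    s = student_logits.flatten(0, 1)               # (batch*seq, vocab)
    t = teacher_logits.flatten(0, 1)
    ce = F.cross_entropy(s, labels.flatten())
    kd = F.kl_div(F.log_softmax(s / T, dim=-1),
                  F.log_softmax(t / T, dim=-1),
                  reduction="batchmean", log_target=True) * (T * T)
    return (1 - alpha) * ce + alpha * kd
```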
The optimizer is RMSProp with momentum, eps = 1e-30, with the norm of the update itself clipped to 1. Training was done on TPUs.
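A rough sketch of that update rule; only eps comes from the report, while the remaining hyperparameters and the per-tensor clipping granularity are assumptions:

```python
import torch

@torch.no_grad()
def rmsprop_momentum_step(params, state, lr=1e-3, beta=0.95, momentum=0.9, eps=1e-30):
    """RMSProp-with-momentum sketch where each parameter's normalized update is
    clipped to norm 1 before being applied."""
    for p in params:
        if p.grad is None:
            continue
        if p not in state:
            state[p] = {"sq": torch.zeros_like(p), "mom": torch.zeros_like(p)}
        s = state[p]
        s["sq"].mul_(beta).addcmul_(p.grad, p.grad, value=1 - beta)
        update = p.grad / (s["sq"].sqrt() + eps)
        update = update / update.norm().clamp(min=1.0)   # cap the update norm at 1
        s["mom"].mul_(momentum).add_(update)
        p.add_(s["mom"], alpha=-lr)
```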
Post-training: for math, Evol-Instruct-style prompt expansion; for coding, Self-Instruct-style prompt generation plus execution feedback. For preference modeling, a soft-label loss that varies with the strength of the preference (somewhat different from Llama 2's margin-based loss) together with a single-sided grading loss.
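A hedged sketch of what a strength-dependent soft-label pairwise loss plus a single-sided grading term could look like; the strength-to-target mapping below is invented for illustration:

```python
import torch
import torch.nn.functional as F

def soft_preference_loss(r_chosen, r_rejected, strength):
    """Pairwise reward-model loss with a soft target that depends on the
    annotated preference strength. r_chosen / r_rejected: (batch,) scalar
    rewards; strength: (batch,) long tensor, e.g. {0: negligibly, 1: slightly,
    2: better, 3: significantly better}."""
    target = torch.tensor([0.55, 0.65, 0.80, 0.95])[strength]
    p_chosen = torch.sigmoid(r_chosen - r_rejected)
    return F.binary_cross_entropy(p_chosen, target)

def single_sided_grading_loss(r, grade):
    """Regularizer sketch: push each response's scalar reward toward its
    standalone grade (e.g. a rescaled 1-5 rating), independent of the pair."""
    return F.mse_loss(r, grade)
```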
For human preference data collection, they annotate samples from a set of models trained with SFT, rejection sampling, DPO/IPO, and RL, train a reward model on those annotations, apply it back to the same models' samples for rejection sampling, and use the result for further training. The idea was to pool the strengths of models trained in different styles.
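In code form, the committee-style rejection sampling step might look roughly like this (all callables are placeholders, not Apple's pipeline):

```python
def committee_rejection_sampling(prompts, policies, reward_model, k=4):
    """Sample from several differently trained models (SFT / rejection sampling
    / DPO-IPO / RL), score every candidate with one reward model, and keep the
    best response per prompt as new fine-tuning data."""
    data = []
    for prompt in prompts:
        candidates = [policy(prompt) for policy in policies for _ in range(k)]
        best = max(candidates, key=lambda response: reward_model(prompt, response))
        data.append((prompt, best))
    return data
```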
For RLHF, instead of PPO they use a mirror-descent-based method (https://arxiv.org/abs/2005.09814) with a leave-one-out estimator (https://arxiv.org/abs/2402.14740). Every single choice differs from the usual recipe in some way.
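The leave-one-out baseline itself is simple: with k samples per prompt, each sample is baselined against the mean reward of the other k-1 samples, so no learned value network is needed.

```python
import torch

def rloo_advantages(rewards):
    """Leave-one-out (RLOO) advantages. rewards: (num_prompts, k) tensor of
    scalar rewards for k samples drawn per prompt."""
    k = rewards.size(1)
    baseline = (rewards.sum(dim=1, keepdim=True) - rewards) / (k - 1)
    return rewards - baseline

# e.g. 2 prompts, 4 samples each
print(rloo_advantages(torch.tensor([[1.0, 0.0, 0.5, 0.5],
                                    [2.0, 1.0, 1.0, 0.0]])))
```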
They quantize quite aggressively and use LoRA adapters to cover the accuracy loss. On average they report 3.7 bits per weight.
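A minimal sketch of the "quantize the frozen base weight, recover quality with a small trainable LoRA adapter" pattern; the per-row int4 fake quantization here is only illustrative, and the report's mixed-precision scheme averaging ~3.7 bits/weight is more involved:

```python
import torch
import torch.nn as nn

class QuantizedLinearWithLoRA(nn.Module):
    """Frozen fake-quantized base weight plus trainable low-rank adapters that
    are meant to compensate for the quantization error."""

    def __init__(self, linear: nn.Linear, rank=16, bits=4):
        super().__init__()
        w = linear.weight.data
        qmax = 2 ** (bits - 1) - 1
        scale = w.abs().amax(dim=1, keepdim=True) / qmax          # per-output-row scale
        self.register_buffer("scale", scale)
        self.register_buffer(
            "qweight",
            torch.clamp((w / scale).round(), -qmax - 1, qmax).to(torch.int8))
        self.lora_a = nn.Parameter(torch.randn(rank, linear.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(linear.out_features, rank))

    def forward(self, x):
        w = self.qweight.float() * self.scale                     # dequantize on the fly
        return x @ w.T + (x @ self.lora_a.T) @ self.lora_b.T
```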
#llm
SAM 2: Segment Anything in Images and Videos
(Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, Christoph Feichtenhofer)
We present Segment Anything Model 2 (SAM 2 ), a foundation model towards solving promptable visual segmentation in images and videos. We build a data engine, which improves model and data via user interaction, to collect the largest video segmentation dataset to date. Our model is a simple transformer architecture with streaming memory for real-time video processing. SAM 2 trained on our data provides strong performance across a wide range of tasks. In video segmentation, we observe better accuracy, using 3× fewer interactions than prior approaches. In image segmentation, our model is more accurate and 6× faster than the Segment Anything Model (SAM). We believe that our data, model, and insights will serve as a significant milestone for video segmentation and related perception tasks. We are releasing a version of our model, the dataset and an interactive demo.
https://ai.meta.com/blog/segment-anything-2
SAM 2 is out. It's the extension to video: image features and information about the output masks are written to a memory and read back when processing subsequent frames.
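A toy sketch of the streaming-memory idea (not the actual SAM 2 architecture or hyperparameters): current-frame features cross-attend to a small bank of mask-conditioned features from earlier frames.

```python
import torch
import torch.nn as nn

class MemoryAttention(nn.Module):
    """Current-frame tokens read from a bank of past-frame features that were
    fused with each frame's predicted mask embedding before being stored."""

    def __init__(self, dim=256, n_heads=8, max_memories=7):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.max_memories = max_memories
        self.memory = []                       # list of (batch, tokens, dim) tensors

    def forward(self, frame_tokens):           # frame_tokens: (batch, tokens, dim)
        if self.memory:
            bank = torch.cat(self.memory, dim=1)
            fused, _ = self.attn(frame_tokens, bank, bank)
            frame_tokens = frame_tokens + fused
        return frame_tokens

    def write(self, frame_tokens, mask_embedding):
        # store mask-conditioned features and keep only the most recent frames
        self.memory.append((frame_tokens + mask_embedding).detach())
        self.memory = self.memory[-self.max_memories:]
```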
It's remarkable that interactive segmentation for video gets solved in such a clean, simple way, though of course the effort poured into building the dataset is just as remarkable.
#video-segmentation
ByteCheckpoint: A Unified Checkpointing System for LLM Development
(Borui Wan, Mingji Han, Yiyao Sheng, Zhichao Lai, Mofan Zhang, Junda Zhang, Yanghua Peng, Haibin Lin, Xin Liu, Chuan Wu)
The development of real-world Large Language Models (LLMs) necessitates checkpointing of training states in persistent storage to mitigate potential software and hardware failures, as well as to facilitate checkpoint transferring within the training pipeline and across various tasks. Due to the immense size of LLMs, saving and loading checkpoints often incur intolerable minute-level stalls, significantly diminishing training efficiency. Besides, when transferring checkpoints across tasks, checkpoint resharding, defined as loading checkpoints into parallel configurations differing from those used for saving, is often required according to the characteristics and resource quota of specific tasks. Previous checkpointing systems [16,3,33,6] assume consistent parallel configurations, failing to address the complexities of checkpoint transformation during resharding. Furthermore, in the industry platform, developers create checkpoints from different training frameworks[23,36,21,11], each with its own unique storage and I/O logic. This diversity complicates the implementation of unified checkpoint management and optimization. To address these challenges, we introduce ByteCheckpoint, a PyTorch-native multi-framework LLM checkpointing system that supports automatic online checkpoint resharding. ByteCheckpoint employs a data/metadata disaggregated storage architecture, decoupling checkpoint storage from the adopted parallelism strategies and training frameworks. We design an efficient asynchronous tensor merging technique to settle the irregular tensor sharding problem and propose several I/O performance optimizations to significantly enhance the efficiency of checkpoint saving and loading. Experimental results demonstrate ByteCheckpoint's substantial advantages in reducing checkpoint saving (by up to 529.22X) and loading (by up to 3.51X) costs, compared to baseline methods.
As model sizes grow and more diverse parallelization strategies come into use, there's a growing need for a system that can save and load checkpoints quickly and handle different parallel configurations flexibly.
#infrastructure
AutoScale: Automatic Prediction of Compute-optimal Data Composition for Training LLMs
(Feiyang Kang, Yifan Sun, Bingbing Wen, Si Chen, Dawn Song, Rafid Mahmood, Ruoxi Jia)
To ensure performance on a diverse set of downstream tasks, LLMs are pretrained via data mixtures over different domains. In this work, we demonstrate that the optimal data composition for a fixed compute budget varies depending on the scale of the training data, suggesting that the common practice of empirically determining an optimal composition using small-scale experiments will not yield the optimal data mixtures when scaling up to the final model. To address this challenge, we propose AutoScale, an automated tool that finds a compute-optimal data composition for training at any desired target scale. AutoScale first determines the optimal composition at a small scale using a novel bilevel optimization framework, Direct Data Optimization (DDO), and then fits a predictor to estimate the optimal composition at larger scales. The predictor's design is inspired by our theoretical analysis of scaling laws related to data composition, which could be of independent interest. In empirical studies with pre-training 774M Decoder-only LMs (GPT-2 Large) on RedPajama dataset, AutoScale decreases validation perplexity at least 25% faster than any baseline with up to 38% speed up compared to without reweighting, achieving the best overall performance across downstream tasks. On pre-training Encoder-only LMs (BERT) with masked language modeling, DDO is shown to decrease loss on all domains while visibly improving average task performance on GLUE benchmark by 8.7% and on large-scale QA dataset (SQuAD) by 5.9% compared with without reweighting. AutoScale speeds up training by up to 28%. Our codes are open-sourced.
A method for deciding dataset mixture proportions. The basic idea builds on scaling laws: by comparing two training runs that differ in the amount N trained on a given dataset, you can estimate both the loss without that dataset (i.e., how much the other datasets contribute to reducing its loss) and the loss when it is trained on. The mixture proportions are then chosen to optimize the target objective, the sum of losses over all validation sets.
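To make the flavor of the idea concrete, here is a toy version (not the paper's DDO formulation): fit a per-domain power law from two runs that differ in that domain's token count, then choose the token split that minimizes the summed validation loss under a fixed budget. All numbers below are made up.

```python
import numpy as np
from scipy.optimize import minimize

def fit_power_law(n1, loss1, n2, loss2, floor):
    """Fit loss(n) = floor + c * n**(-alpha) from two (tokens, loss) points,
    given an assumed irreducible loss floor for that domain."""
    alpha = np.log((loss1 - floor) / (loss2 - floor)) / np.log(n2 / n1)
    c = (loss1 - floor) * n1 ** alpha
    return c, alpha

# Made-up measurements: validation loss after training on 1B vs 2B domain tokens.
domains = {
    "web":  fit_power_law(1e9, 3.10, 2e9, 3.02, floor=2.6),
    "code": fit_power_law(1e9, 2.40, 2e9, 2.28, floor=1.8),
    "math": fit_power_law(1e9, 2.90, 2e9, 2.85, floor=2.7),
}
budget = 6e9  # total pretraining tokens to split across domains

def total_loss(log_n):
    n = np.exp(log_n)
    return sum(c * n_i ** (-a) for n_i, (c, a) in zip(n, domains.values()))

x0 = np.log(np.full(len(domains), budget / len(domains)))
res = minimize(total_loss, x0,  # SLSQP is selected automatically for the constraint
               constraints={"type": "eq",
                            "fun": lambda x: np.exp(x).sum() / budget - 1.0})
for name, n_i in zip(domains, np.exp(res.x)):
    print(f"{name}: {n_i / budget:.1%} of the token budget")
```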
#dataset #scaling-law
VILA^2: VILA Augmented VILA
(Yunhao Fang, Ligeng Zhu, Yao Lu, Yan Wang, Pavlo Molchanov, Jang Hyun Cho, Marco Pavone, Song Han, Hongxu Yin)
Visual language models (VLMs) have rapidly progressed, driven by the success of large language models (LLMs). While model architectures and training infrastructures advance rapidly, data curation remains under-explored. When data quantity and quality become a bottleneck, existing work either directly crawls more raw data from the Internet that does not have a guarantee of data quality or distills from black-box commercial models (e.g., GPT-4V / Gemini) causing the performance upper bounded by that model. In this work, we introduce a novel approach that includes a self-augment step and a specialist-augment step to iteratively improve data quality and model performance. In the self-augment step, a VLM recaptions its own pretraining data to enhance data quality, and then retrains from scratch using this refined dataset to improve model performance. This process can iterate for several rounds. Once self-augmentation saturates, we employ several specialist VLMs finetuned from the self-augmented VLM with domain-specific expertise, to further infuse specialist knowledge into the generalist VLM through task-oriented recaptioning and retraining. With the combined self-augmented and specialist-augmented training, we introduce VILA^2 (VILA-augmented-VILA), a VLM family that consistently improves the accuracy on a wide range of tasks over prior art, and achieves new state-of-the-art results on MMMU leaderboard among open-sourced models.
A method that uses data recaptioned by a vision-language model for pretraining and then recaptions again with the resulting model. The main strategies are prompting for more detailed captions, preserving the original captions, and building per-task specialists to diversify the captions.
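The self-augment loop, as a placeholder-level sketch (callables and the prompt are stand-ins, not the VILA^2 API):

```python
def self_augment(train_vlm, recaption, images, captions, rounds=2):
    """The current VLM re-captions its own pretraining images, the original
    captions are kept alongside the new ones, and the next model is retrained
    from scratch on the refreshed data."""
    data = list(zip(images, captions))
    model = train_vlm(data)
    for _ in range(rounds):
        new_captions = [recaption(model, img, "Describe the image in detail.")
                        for img, _ in data]
        # keep the original caption so grounded facts from the source are not lost
        data = [(img, f"{orig} {new}") for (img, orig), new in zip(data, new_captions)]
        model = train_vlm(data)  # retrain from scratch each round
    return model
```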
#vision-language #synthetic-data
Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge
(Tianhao Wu, Weizhe Yuan, Olga Golovneva, Jing Xu, Yuandong Tian, Jiantao Jiao, Jason Weston, Sainbayar Sukhbaatar)
Large Language Models (LLMs) are rapidly surpassing human knowledge in many domains. While improving these models traditionally relies on costly human data, recent self-rewarding mechanisms (Yuan et al., 2024) have shown that LLMs can improve by judging their own responses instead of relying on human labelers. However, existing methods have primarily focused on improving model responses rather than judgment capabilities, resulting in rapid saturation during iterative training. To address this issue, we introduce a novel Meta-Rewarding step to the self-improvement process, where the model judges its own judgements and uses that feedback to refine its judgment skills. Surprisingly, this unsupervised approach improves the model's ability to judge {\em and} follow instructions, as demonstrated by a win rate improvement of Llama-3-8B-Instruct from 22.9% to 39.4% on AlpacaEval 2, and 20.6% to 29.1% on Arena-Hard. These results strongly suggest the potential for self-improving models without human supervision.
As in Self-Rewarding (https://arxiv.org/abs/2401.10020), a single model plays both actor and judge, with the judge scoring the actor's generations to produce training data; on top of that, a meta-judge role that evaluates the judge's judgments is added so the judge itself can also be trained.
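A placeholder-level sketch of one iteration (callables are stand-ins, and details such as how many judgments are sampled per response are assumptions, not the paper's exact pipeline):

```python
def meta_rewarding_round(generate, judge, meta_judge, prompts, k=4):
    """judge(prompt, response) -> (score, judgment_text); meta_judge picks the
    better of two judgments of the same response. Judge scores build preference
    pairs for the actor; meta-judge picks build preference pairs for the judge."""
    actor_pairs, judge_pairs = [], []
    for prompt in prompts:
        responses = [generate(prompt) for _ in range(k)]
        # judge each response twice; the mean score ranks responses for actor training
        judgments = [[judge(prompt, r) for _ in range(2)] for r in responses]
        scores = [(a[0] + b[0]) / 2 for a, b in judgments]
        best = max(range(k), key=lambda i: scores[i])
        worst = min(range(k), key=lambda i: scores[i])
        actor_pairs.append((prompt, responses[best], responses[worst]))
        # the meta-judge compares the two judgments of the best response
        (_, j0), (_, j1) = judgments[best]
        winner = meta_judge(prompt, responses[best], j0, j1)   # returns 0 or 1
        judge_pairs.append((prompt, responses[best], [j0, j1][winner], [j0, j1][1 - winner]))
    return actor_pairs, judge_pairs
```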
#rlaif #synthetic-data