November 4, 2024 (2)
SimpleFSDP: Simpler Fully Sharded Data Parallel with torch.compile
(Ruisi Zhang, Tianyu Liu, Will Feng, Andrew Gu, Sanket Purandare, Wanchao Liang, Francisco Massa)
Distributed training of large models consumes enormous computation resources and requires substantial engineering efforts to compose various training techniques. This paper presents SimpleFSDP, a PyTorch-native compiler-based Fully Sharded Data Parallel (FSDP) framework, which has a simple implementation for maintenance and composability, allows full computation-communication graph tracing, and brings performance enhancement via compiler backend optimizations. SimpleFSDP's novelty lies in its unique torch.compile-friendly implementation of collective communications using existing PyTorch primitives, namely parametrizations, selective activation checkpointing, and DTensor. It also features the first-of-its-kind intermediate representation (IR) nodes bucketing and reordering in the TorchInductor backend for effective computation-communication overlapping. As a result, users can employ the aforementioned optimizations to automatically or manually wrap model components for minimal communication exposure. Extensive evaluations of SimpleFSDP on Llama 3 models (including the ultra-large 405B) using TorchTitan demonstrate up to 28.54% memory reduction and 68.67% throughput improvement compared to the most widely adopted FSDP2 eager framework, when composed with other distributed training techniques.
A new PyTorch FSDP implementation. It is built on DTensor, with communication optimizations added as bucketing and reordering of IR nodes in the TorchInductor backend. This brings PyTorch's FSDP closer in spirit to JAX.
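Below is a minimal sketch (mine, not the paper's code) of the core idea as I read the abstract: implement the FSDP all-gather as a parametrization over DTensor-sharded parameters, so torch.compile can trace the collectives together with the computation. It assumes a recent PyTorch with the public DTensor API and launch via torchrun; the Inductor bucketing/reordering passes themselves are not shown.

```python
# Sketch: FSDP-style sharding via parametrization + DTensor, traceable by torch.compile.
# Assumes a recent PyTorch and a torchrun launch; not the paper's actual implementation.
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import distribute_tensor, Replicate, Shard
from torch.nn.utils import parametrize


class AllGatherParametrization(nn.Module):
    """On each parameter access, redistribute the sharded DTensor to a replicated
    layout (an all-gather) and hand the full local tensor to the computation."""

    def forward(self, sharded_param):
        return sharded_param.redistribute(placements=[Replicate()]).to_local()


def simple_fsdp(module: nn.Module, mesh) -> nn.Module:
    for submodule in module.modules():
        for name, param in list(submodule.named_parameters(recurse=False)):
            # Shard each parameter along dim 0 across the mesh.
            sharded = nn.Parameter(distribute_tensor(param.data, mesh, [Shard(0)]))
            setattr(submodule, name, sharded)
            # unsafe=True: the parametrization changes the tensor type (DTensor -> Tensor).
            parametrize.register_parametrization(
                submodule, name, AllGatherParametrization(), unsafe=True
            )
    return module


dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
mesh = init_device_mesh("cuda", (dist.get_world_size(),))
model = simple_fsdp(nn.Linear(4096, 4096).cuda(), mesh)
model = torch.compile(model)  # the all-gathers are now part of the traced graph
```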
#efficiency #parallelism
Mitigating Tail Narrowing in LLM Self-Improvement via Socratic-Guided Sampling
(Yiwen Ding, Zhiheng Xi, Wei He, Zhuoyuan Li, Yitao Zhai, Xiaowei Shi, Xunliang Cai, Tao Gui, Qi Zhang, Xuanjing Huang)
Self-improvement methods enable large language models (LLMs) to generate solutions themselves and iteratively train on filtered, high-quality rationales. This process proves effective and reduces the reliance on human supervision in LLMs' reasoning, but the performance soon plateaus. We delve into the process and find that models tend to over-sample on easy queries and under-sample on queries they have yet to master. As iterations proceed, this imbalance in sampling is exacerbated, leading to a long-tail distribution where solutions to difficult queries almost diminish. This phenomenon limits the performance gain of self-improving models. A straightforward solution is brute-force sampling to balance the distribution, which significantly raises computational costs. In this paper, we introduce Guided Self-Improvement (GSI), a strategy aimed at improving the efficiency of sampling challenging heavy-tailed data. It leverages Socratic-style guidance signals to help LLM reasoning with complex queries, reducing the exploration effort and minimizing computational overhead. Experiments on four models across diverse mathematical tasks show that GSI strikes a balance between performance and efficiency, while also being effective on held-out tasks.
If we sample from the model and then keep only the correct samples for training, the training data will likely become biased toward easy queries that have a high probability of being answered correctly. This seems like an important issue for any sampling-and-filtering method. The proposed solution is to use privileged information to raise the success rate on the harder samples.
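A rough sketch of how I imagine the loop working: queries whose pass rate falls below a threshold are re-sampled with Socratic-style hints built from privileged information (e.g. the reference solution), instead of brute-force sampling more completions. `generate`, `is_correct`, and `make_socratic_hints` are hypothetical helpers, not the authors' API.

```python
from typing import Callable, List, Tuple


def guided_self_improvement_round(
    queries: List[str],
    references: List[str],
    generate: Callable[[str, int], List[str]],       # model sampling: prompt, n -> n rationales
    is_correct: Callable[[str, str], bool],          # checks a rationale against the reference
    make_socratic_hints: Callable[[str, str], str],  # builds guiding sub-questions from the reference
    n_samples: int = 8,
    hard_threshold: float = 0.25,
) -> List[Tuple[str, str]]:
    """Collect (query, correct rationale) pairs for the next fine-tuning iteration."""
    training_pairs = []
    for query, reference in zip(queries, references):
        rationales = generate(query, n_samples)
        correct = [r for r in rationales if is_correct(r, reference)]
        pass_rate = len(correct) / n_samples

        # Tail queries: almost no correct samples -> resample with guidance
        # instead of brute-force sampling many more completions.
        if pass_rate < hard_threshold:
            guided_prompt = query + "\n" + make_socratic_hints(query, reference)
            guided = generate(guided_prompt, n_samples)
            # Keep the original query (without the hint) as the training input.
            correct += [r for r in guided if is_correct(r, reference)]

        training_pairs += [(query, r) for r in correct]
    return training_pairs
```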
#synthetic-data
Self-Evolved Reward Learning for LLMs
(Chenghua Huang, Zhizhen Fan, Lu Wang, Fangkai Yang, Pu Zhao, Zeqi Lin, Qingwei Lin, Dongmei Zhang, Saravan Rajmohan, Qi Zhang)
Reinforcement Learning from Human Feedback (RLHF) is a crucial technique for aligning language models with human preferences, playing a pivotal role in the success of conversational models like GPT-4, ChatGPT, and Llama 2. A core challenge in employing RLHF lies in training a reliable reward model (RM), which relies on high-quality labels typically provided by human experts or advanced AI system. These methods can be costly and may introduce biases that affect the language model's responses. As language models improve, human input may become less effective in further enhancing their performance. In this paper, we propose Self-Evolved Reward Learning (SER), a novel approach where the RM generates additional training data to iteratively improve itself. We conducted extensive experiments on multiple datasets such as HH-RLHF and UltraFeedback, using models like Mistral and Llama 3, and compare SER against various baselines. Our results demonstrate that even with limited human-annotated data, learning from self-feedback can robustly enhance RM performance, thereby boosting the capabilities of large language models (LLMs).
Reward model training via pseudo labeling.
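As a sketch of what such pseudo labeling could look like (my reading of the abstract, not the released code): the current reward model scores unlabeled response pairs, confidently separated pairs become pseudo preference labels, and the RM is retrained on the seed data plus these labels. `reward_model.score` is a hypothetical scalar-reward interface.

```python
def self_evolved_reward_round(reward_model, unlabeled_pairs, seed_data, margin=1.0):
    """unlabeled_pairs: list of (prompt, response_a, response_b) without preference labels.
    Returns the training set for the next RM update: seed data plus pseudo-labeled pairs."""
    pseudo_labeled = []
    for prompt, resp_a, resp_b in unlabeled_pairs:
        score_a = reward_model.score(prompt, resp_a)  # hypothetical scoring interface
        score_b = reward_model.score(prompt, resp_b)
        # Keep only pairs the current RM already separates by a clear margin.
        if abs(score_a - score_b) >= margin:
            chosen, rejected = (resp_a, resp_b) if score_a > score_b else (resp_b, resp_a)
            pseudo_labeled.append((prompt, chosen, rejected))
    # The next iteration trains the RM with a standard pairwise ranking loss on this mix.
    return seed_data + pseudo_labeled
```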
#synthetic-data #reward-model
Randomized Autoregressive Visual Generation
(Qihang Yu, Ju He, Xueqing Deng, Xiaohui Shen, Liang-Chieh Chen)
This paper presents Randomized AutoRegressive modeling (RAR) for visual generation, which sets a new state-of-the-art performance on the image generation task while maintaining full compatibility with language modeling frameworks. The proposed RAR is simple: during a standard autoregressive training process with a next-token prediction objective, the input sequence (typically ordered in raster form) is randomly permuted into different factorization orders with a probability r, where r starts at 1 and linearly decays to 0 over the course of training. This annealing training strategy enables the model to learn to maximize the expected likelihood over all factorization orders and thus effectively improve the model's capability of modeling bidirectional contexts. Importantly, RAR preserves the integrity of the autoregressive modeling framework, ensuring full compatibility with language modeling while significantly improving performance in image generation. On the ImageNet-256 benchmark, RAR achieves an FID score of 1.48, not only surpassing prior state-of-the-art autoregressive image generators but also outperforming leading diffusion-based and masked transformer-based methods. Code and models will be made available at https://github.com/bytedance/1d-tokenizer
This study shows that training an autoregressive image generation model on permuted token sequences improves performance. It's not immediately obvious how exposure to permuted orderings during training translates into better generation. (Though I suspect that seeing patches in varied arrangements and patterns lets the model learn more robust behavior.)
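The training recipe from the abstract is easy to sketch: with probability r, annealed linearly from 1 to 0, the raster-order token sequence is permuted before the usual next-token prediction. A simplified version below (per-batch permutation, no special positional handling, so not the paper's exact implementation):

```python
import torch


def maybe_permute_tokens(tokens: torch.Tensor, step: int, total_steps: int) -> torch.Tensor:
    """tokens: (batch, seq_len) image tokens in raster order."""
    r = max(0.0, 1.0 - step / total_steps)  # permutation probability, annealed 1 -> 0
    if torch.rand(()) < r:
        perm = torch.randperm(tokens.shape[1], device=tokens.device)
        tokens = tokens[:, perm]  # train on a random factorization order
    return tokens


# Inside a standard autoregressive training loop (next_token_cross_entropy is hypothetical):
#   tokens = maybe_permute_tokens(tokens, step, total_steps)
#   loss = next_token_cross_entropy(model(tokens[:, :-1]), targets=tokens[:, 1:])
```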
#image-generation #autoregressive-model