July 12, 2024
FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision
(Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao)
Attention, as a core layer of the ubiquitous Transformer architecture, is the bottleneck for large language models and long-context applications. FlashAttention elaborated an approach to speed up attention on GPUs through minimizing memory reads/writes. However, it has yet to take advantage of new capabilities present in recent hardware, with FlashAttention-2 achieving only 35% utilization on the H100 GPU. We develop three main techniques to speed up attention on Hopper GPUs: exploiting asynchrony of the Tensor Cores and TMA to (1) overlap overall computation and data movement via warp-specialization and (2) interleave block-wise matmul and softmax operations, and (3) block quantization and incoherent processing that leverages hardware support for FP8 low-precision. We demonstrate that our method, FlashAttention-3, achieves speedup on H100 GPUs by 1.5-2.0× with FP16 reaching up to 740 TFLOPs/s (75% utilization), and with FP8 reaching close to 1.2 PFLOPs/s. We validate that FP8 FlashAttention-3 achieves 2.6× lower numerical error than a baseline FP8 attention.
FlashAttention-3 is here. It reaches 75% utilization on the H100 with FP16, and FP8 goes 1.6× beyond that, at 1.2 PFLOPS.
WGMMA and TMA are the key ingredients, the very operations the ThunderKittens authors (https://hazyresearch.stanford.edu/blog/2024-05-12-tk) suffered through. Of course, all of this is Hopper-only.
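Of the three techniques, the FP8 incoherent processing is the easiest to illustrate outside CUDA: rotate Q and K by the same random orthogonal matrix M = D·H (random ±1 signs D times a Hadamard matrix H). Since M is orthogonal, QKᵀ is unchanged, but outlier values get spread across the head dimension, which shrinks quantization error. A minimal NumPy sketch of the idea, with illustrative function names (not the paper's actual code):

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Normalized n x n Hadamard matrix (n must be a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def incoherent_transform(q, k, rng):
    """Rotate Q and K by M = diag(signs) @ H.

    M is orthogonal (M @ M.T = I), so (Q M)(K M)^T equals Q K^T exactly,
    while the rotation spreads outliers out before FP8 quantization.
    """
    d = q.shape[-1]
    signs = rng.choice([-1.0, 1.0], size=d)
    M = np.diag(signs) @ hadamard(d)
    return q @ M, k @ M

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
k = rng.normal(size=(4, 8))
qt, kt = incoherent_transform(q, k, rng)
print(np.allclose(qt @ kt.T, q @ k.T))  # True: attention scores are preserved
```

The actual kernel fuses this rotation into the preceding layer's epilogue; here it is shown as a standalone matmul purely to make the invariance visible.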
#efficiency
Arena Learning: Build Data Flywheel for LLMs Post-training via Simulated Chatbot Arena
(Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Qingwei Lin, Jianguang Lou, Shifeng Chen, Yansong Tang, Weizhu Chen)
Recent work demonstrates that post-training large language models with instruction following data has achieved colossal success. Simultaneously, human Chatbot Arena has emerged as one of the most reasonable benchmarks for model evaluation and developmental guidance. However, on the one hand, accurately selecting high-quality training sets from the constantly increasing amount of data relies heavily on intuitive experience and rough statistics. On the other hand, utilizing human annotation and evaluation of LLMs is both expensive and priority limited. To address the above challenges and build an efficient data flywheel for LLMs post-training, we propose a new method named Arena Learning: we simulate iterative arena battles among various state-of-the-art models on a large scale of instruction data, subsequently leveraging the AI-annotated battle results to constantly enhance the target model in both supervised fine-tuning and reinforcement learning. For evaluation, we also introduce WizardArena, which can efficiently predict accurate Elo rankings between different models based on a carefully constructed offline test set; WizardArena aligns closely with the LMSYS Chatbot Arena rankings. Experimental results demonstrate that our WizardLM-β trained with Arena Learning exhibits significant performance improvements during the SFT, DPO, and PPO stages. This new fully AI-powered training and evaluation pipeline achieved a 40x efficiency improvement of the LLMs post-training data flywheel compared to LMSYS Chatbot Arena.
A paper on the core technique behind WizardLM 2. The basic idea is to simulate Chatbot Arena: pit the model against models like GPT-4, use Llama 3 70B as the judge to score the battles, and feed those judged results back in as training data.
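WizardArena ranks models by Elo computed from these simulated battles. The per-battle Elo update such a pipeline applies is the standard one (a generic sketch, not the paper's exact implementation):

```python
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """One Elo update after a judged battle.

    score_a is 1.0 if model A wins, 0.5 for a tie, 0.0 if B wins.
    Returns the updated (r_a, r_b); rating points are zero-sum.
    """
    expected_a = 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a_new, r_b_new

# Two equally rated models; A wins the judged battle.
print(elo_update(1000.0, 1000.0, 1.0))  # (1016.0, 984.0)
```

Running many such updates over the offline battle log yields a leaderboard without any human voting, which is where the claimed 40x efficiency gain over the human arena comes from.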
#alignment #evaluation