July 16, 2024
Q-Sparse: All Large Language Models can be Fully Sparsely-Activated
(Hongyu Wang, Shuming Ma, Ruiping Wang, Furu Wei)
We introduce Q-Sparse, a simple yet effective approach to training sparsely-activated large language models (LLMs). Q-Sparse enables full sparsity of activations in LLMs which can bring significant efficiency gains in inference. This is achieved by applying top-K sparsification to the activations and the straight-through-estimator to the training. The key results from this work are: (1) Q-Sparse can achieve results comparable to those of baseline LLMs while being much more efficient at inference time; (2) We present an inference-optimal scaling law for sparsely-activated LLMs; (3) Q-Sparse is effective in different settings, including training-from-scratch, continue-training of off-the-shelf LLMs, and finetuning; (4) Q-Sparse works for both full-precision and 1-bit LLMs (e.g., BitNet b1.58). Particularly, the synergy of BitNet b1.58 and Q-Sparse (can be equipped with MoE) provides the cornerstone and a clear path to revolutionize the efficiency, including cost and energy consumption, of future LLMs.
A method for making activations sparse via top-K, plus quantization. It feels a bit odd, though, that additional activation sparsity is being stacked on top of BitNet b1.58, which itself has not yet been thoroughly validated.
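A minimal sketch of the core mechanism described in the abstract: top-K sparsification of activations in the forward pass, with a straight-through estimator (STE) so gradients flow past the non-differentiable masking. This assumes a PyTorch setup and is an illustration of the general technique, not the paper's implementation.

```python
import torch

class TopKSparsify(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, k):
        # Keep only the k largest-magnitude activations along the last dim.
        _, indices = torch.topk(x.abs(), k, dim=-1)
        mask = torch.zeros_like(x).scatter_(-1, indices, 1.0)
        return x * mask

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through estimator: pass the gradient through unchanged,
        # ignoring the top-K masking applied in the forward pass.
        return grad_output, None

x = torch.randn(2, 8, requires_grad=True)
y = TopKSparsify.apply(x, 3)   # only 3 of 8 activations remain non-zero per row
y.sum().backward()             # x.grad is all ones due to the STE
```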
#quantization #sparsity
Qwen2 Technical Report
(An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei Li, Mingfeng Xue, Na Ni, Pei Zhang, Peng Wang, Ru Peng, Rui Men, Ruize Gao, Runji Lin, Shijie Wang, Shuai Bai, Sinan Tan, Tianhang Zhu, Tianhao Li, Tianyu Liu, Wenbin Ge, Xiaodong Deng, Xiaohuan Zhou, Xingzhang Ren, Xinyu Zhang, Xipin Wei, Xuancheng Ren, Yang Fan, Yang Yao, Yichang Zhang, Yu Wan, Yunfei Chu, Zeyu Cui, Zhenru Zhang, Zhihao Fan)
This report introduces the Qwen2 series, the latest addition to our large language models and large multimodal models. We release a comprehensive suite of foundational and instruction-tuned language models, encompassing a parameter range from 0.5 to 72 billion, featuring dense models and a Mixture-of-Experts model. Qwen2 surpasses most prior open-weight models, including its predecessor Qwen1.5, and exhibits competitive performance relative to proprietary models across diverse benchmarks on language understanding, generation, multilingual proficiency, coding, mathematics, and reasoning. The flagship model, Qwen2-72B, showcases remarkable performance: 84.2 on MMLU, 37.9 on GPQA, 64.6 on HumanEval, 89.5 on GSM8K, and 82.4 on BBH as a base language model. The instruction-tuned variant, Qwen2-72B-Instruct, attains 9.1 on MT-Bench, 48.1 on Arena-Hard, and 35.7 on LiveCodeBench. Moreover, Qwen2 demonstrates robust multilingual capabilities, proficient in approximately 30 languages, spanning English, Chinese, Spanish, French, German, Arabic, Russian, Korean, Japanese, Thai, Vietnamese, and more, underscoring its versatility and global reach. To foster community innovation and accessibility, we have made the Qwen2 model weights openly available on Hugging Face and ModelScope, and the supplementary materials including example code on GitHub. These platforms also include resources for quantization, fine-tuning, and deployment, facilitating a wide range of applications and research endeavors.
The Qwen 2 technical report. They also built a DeepSeek-style MoE model, which they had experimented with in Qwen 1.5.
They increased the training data to around 7T tokens. They also tried relaxing the quality filtering to expand it to about 12T, but report that this had little effect.
On the post-training side, they put a lot of work into reducing the need for human annotation. A study published a little while ago is (as that paper itself notes) part of the recipe. (https://arxiv.org/abs/2406.13542)
#llm
LLM Circuit Analyses Are Consistent Across Training and Scale
(Curt Tigges, Michael Hanna, Qinan Yu, Stella Biderman)
Most currently deployed large language models (LLMs) undergo continuous training or additional finetuning. By contrast, most research into LLMs' internal mechanisms focuses on models at one snapshot in time (the end of pre-training), raising the question of whether their results generalize to real-world settings. Existing studies of mechanisms over time focus on encoder-only or toy models, which differ significantly from most deployed models. In this study, we track how model mechanisms, operationalized as circuits, emerge and evolve across 300 billion tokens of training in decoder-only LLMs, in models ranging from 70 million to 2.8 billion parameters. We find that task abilities and the functional components that support them emerge consistently at similar token counts across scale. Moreover, although such components may be implemented by different attention heads over time, the overarching algorithm that they implement remains. Surprisingly, both these algorithms and the types of components involved therein can replicate across model scale. These results suggest that circuit analyses conducted on small models at the end of pre-training can provide insights that still apply after additional pre-training and over model scale.
A result showing that the point (in terms of training volume) at which a transformer LM acquires a particular ability is consistent regardless of model size. Additionally, while the roles of individual components such as attention heads shift over the course of training, the algorithm the model has learned stays the same throughout.
Interesting. It would be fun to think about this in connection with emergent abilities.
#transformer #llm #mechanistic-interpretation
Lean-STaR: Learning to Interleave Thinking and Proving
(Haohan Lin, Zhiqing Sun, Yiming Yang, Sean Welleck)
Traditional language model-based theorem proving assumes that by training on a sufficient amount of formal proof data, a model will learn to prove theorems. Our key observation is that a wealth of informal information that is not present in formal proofs can be useful for learning to prove theorems. For instance, humans think through steps of a proof, but this thought process is not visible in the resulting code. We present Lean-STaR, a framework for training language models to produce informal thoughts prior to each step of a proof, thereby boosting the model's theorem-proving capabilities. Lean-STaR uses retrospective ground-truth tactics to generate synthetic thoughts for training the language model. At inference time, the trained model directly generates the thoughts prior to the prediction of the tactics in each proof step. Building on the self-taught reasoner framework, we then apply expert iteration to further fine-tune the model on the correct proofs it samples and verifies using the Lean solver. Lean-STaR achieves state-of-the-art results on the miniF2F-test benchmark within the Lean theorem proving environment, significantly outperforming base models (43.4% → 46.3%, Pass@64). We also analyze the impact of the augmented thoughts on various aspects of the theorem proving process, providing insights into their effectiveness.
The idea is that when proving theorems in Lean, inserting a natural-language thinking step should work better than generating only the formal language. The problem then is how to generate those thoughts: producing a thought from scratch is hard, but it becomes much more tractable if you already know the next proof step.
They did rely on a stronger model (GPT-4) for this step. Still, the process itself seems related to the recently discussed idea of generating the reasoning "between the lines." Generating that implicit reasoning and using automated feedback are somewhat different ideas, but this looks like an example of how the two can be combined.
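A rough sketch of the retrospective thought-generation step: given a proof state and the ground-truth next tactic, ask a stronger model (GPT-4 via an OpenAI-style client) to write the informal thought that motivates it. The prompt wording and helper names are illustrative assumptions, not the paper's actual code.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def make_thought_prompt(proof_state: str, next_tactic: str) -> str:
    # The next tactic is known retrospectively, so the model only has to
    # rationalize it rather than invent the proof step itself.
    return (
        "You are assisting with a Lean proof.\n"
        f"Current proof state:\n{proof_state}\n\n"
        f"The tactic applied next was: {next_tactic}\n"
        "Write a brief informal thought explaining why this is a sensible next step."
    )

def generate_thought(proof_state: str, next_tactic: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": make_thought_prompt(proof_state, next_tactic)}],
    )
    return resp.choices[0].message.content

# Each training example then pairs the proof state with (thought, tactic),
# so the fine-tuned prover emits a thought before predicting each tactic.
```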
#math #prompt #search