November 22, 2024
Hymba: A Hybrid-head Architecture for Small Language Models
(Xin Dong, Yonggan Fu, Shizhe Diao, Wonmin Byeon, Zijia Chen, Ameya Sunil Mahabaleshwarkar, Shih-Yang Liu, Matthijs Van Keirsbilck, Min-Hung Chen, Yoshi Suhara, Yingyan Lin, Jan Kautz, Pavlo Molchanov)
We propose Hymba, a family of small language models featuring a hybrid-head parallel architecture that integrates transformer attention mechanisms with state space models (SSMs) for enhanced efficiency. Attention heads provide high-resolution recall, while SSM heads enable efficient context summarization. Additionally, we introduce learnable meta tokens that are prepended to prompts, storing critical information and alleviating the "forced-to-attend" burden associated with attention mechanisms. This model is further optimized by incorporating cross-layer key-value (KV) sharing and partial sliding window attention, resulting in a compact cache size. During development, we conducted a controlled study comparing various architectures under identical settings and observed significant advantages of our proposed architecture. Notably, Hymba achieves state-of-the-art results for small LMs: Our Hymba-1.5B-Base model surpasses all sub-2B public models in performance and even outperforms Llama-3.2-3B with 1.32% higher average accuracy, an 11.67x cache size reduction, and 3.49x throughput.
The authors run Mamba and self-attention in parallel. The idea is that this sidesteps the problem of having to decide whether Mamba or self-attention is better suited to the information processing that should happen at a given layer position. On top of that, they use attention sink/register tokens under the name "meta tokens," and add global/sliding-window attention plus KV cache sharing.
I think hybrid models like this are already quite promising.
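To make the parallel-head idea concrete, below is a minimal PyTorch sketch, not the authors' implementation: attention heads and a toy gated linear recurrence (standing in for the Mamba/SSM heads) read the same input, learnable meta tokens are prepended first, and the two branches are normalized and fused with learned weights. The recurrence, the fusion details, and all module names are my simplifications; causal masking, cross-layer KV sharing, and sliding-window attention are left out.

```python
# Minimal sketch of a parallel hybrid-head block (simplified assumption, not Hymba's code).
import torch
import torch.nn as nn

class HybridHeadBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, n_meta=16):
        super().__init__()
        # Learnable meta tokens prepended to every sequence.
        self.meta_tokens = nn.Parameter(torch.randn(1, n_meta, d_model) * 0.02)
        # Attention branch (causal mask omitted for brevity).
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Toy SSM branch: per-channel gated linear recurrence standing in for Mamba heads.
        self.ssm_in = nn.Linear(d_model, d_model)
        self.ssm_gate = nn.Linear(d_model, d_model)
        self.decay = nn.Parameter(torch.zeros(d_model))
        # Per-branch normalization and learned mixing weights for the fusion.
        self.norm_attn = nn.LayerNorm(d_model)
        self.norm_ssm = nn.LayerNorm(d_model)
        self.beta = nn.Parameter(torch.ones(2))
        self.out = nn.Linear(d_model, d_model)

    def ssm_branch(self, x):
        # h_t = a * h_{t-1} + (1 - a) * u_t, followed by an output gate.
        u, g = self.ssm_in(x), torch.sigmoid(self.ssm_gate(x))
        a = torch.sigmoid(self.decay)
        h, outs = torch.zeros_like(u[:, 0]), []
        for t in range(u.size(1)):
            h = a * h + (1 - a) * u[:, t]
            outs.append(h)
        return torch.stack(outs, dim=1) * g

    def forward(self, x):
        # Prepend meta tokens so attention always has a default place to attend.
        meta = self.meta_tokens.expand(x.size(0), -1, -1)
        z = torch.cat([meta, x], dim=1)
        attn_out, _ = self.attn(z, z, z, need_weights=False)
        ssm_out = self.ssm_branch(z)
        # Normalize each branch, then fuse with learned weights.
        fused = self.beta[0] * self.norm_attn(attn_out) + self.beta[1] * self.norm_ssm(ssm_out)
        return self.out(fused)[:, meta.size(1):]  # drop the meta positions on the way out

# Usage: y = HybridHeadBlock()(torch.randn(2, 10, 512))
```

The point of the parallel fusion is that a layer never has to commit to either high-resolution recall (attention) or cheap context summarization (SSM); per the paper, that is exactly the choice a sequential hybrid stack forces on each layer.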
#state-space-model #transformer
Multimodal Autoregressive Pre-training of Large Vision Encoders
(Enrico Fini, Mustafa Shukor, Xiujun Li, Philipp Dufter, Michal Klein, David Haldimann, Sai Aitharaju, Victor Guilherme Turrisi da Costa, Louis Béthune, Zhe Gan, Alexander T Toshev, Marcin Eichner, Moin Nabi, Yinfei Yang, Joshua M. Susskind, Alaaeldin El-Nouby)
We introduce a novel method for pre-training of large-scale vision encoders. Building on recent advancements in autoregressive pre-training of vision models, we extend this framework to a multimodal setting, i.e., images and text. In this paper, we present AIMV2, a family of generalist vision encoders characterized by a straightforward pre-training process, scalability, and remarkable performance across a range of downstream tasks. This is achieved by pairing the vision encoder with a multimodal decoder that autoregressively generates raw image patches and text tokens. Our encoders excel not only in multimodal evaluations but also in vision benchmarks such as localization, grounding, and classification. Notably, our AIMV2-3B encoder achieves 89.5% accuracy on ImageNet-1k with a frozen trunk. Furthermore, AIMV2 consistently outperforms state-of-the-art contrastive models (e.g., CLIP, SigLIP) in multimodal image understanding across diverse settings.
A vision-language model trained with a prefix LM objective that predicts pixels for image patches, and an autoregressive loss for captions.
In vision-language models, text feels like it mainly provides an input/output interface for the images. For the image side, I suspect there are methods that can pull more information out of the images themselves. How to connect that information to the text interface, though, is a separate problem. (Of course, once models get large enough, they may be able to align the two modalities on their own.)
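As a rough illustration of that objective, here is a sketch of what the combined loss could look like: pixel regression on next-patch predictions plus next-token cross-entropy on the caption. The `decoder` call signature, tensor shapes, and the L2 choice are assumptions for illustration rather than AIMV2's actual recipe, and the prefix (non-causal) attention over image patches is assumed to happen inside the model and is not shown.

```python
# Sketch of a combined multimodal autoregressive objective (illustrative assumptions).
import torch.nn.functional as F

def multimodal_ar_loss(decoder, patch_embeds, patch_pixels, text_ids):
    """patch_embeds: (B, P, D) vision-encoder outputs,
    patch_pixels:  (B, P, patch_dim) raw pixel values per patch (regression targets),
    text_ids:      (B, T) caption token ids.
    `decoder` is a hypothetical multimodal AR decoder returning patch predictions and text logits."""
    patch_preds, text_logits = decoder(patch_embeds, text_ids)  # assumed interface
    # Image loss: regress the raw pixels of the *next* patch (shift by one position).
    img_loss = F.mse_loss(patch_preds[:, :-1], patch_pixels[:, 1:])
    # Text loss: standard next-token prediction over the caption.
    txt_loss = F.cross_entropy(
        text_logits[:, :-1].reshape(-1, text_logits.size(-1)),
        text_ids[:, 1:].reshape(-1),
    )
    return img_loss + txt_loss
```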
#multimodal #autoregressive-model
TÜLU 3: Pushing Frontiers in Open Language Model Post-Training
(Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V. Miranda, Alisa Liu, Nouha Dziri, Xinxi Lyu, Yuling Gu, Saumya Malik, Victoria Graf, Jena D. Hwang, Jiangjiang Yang, Ronan Le Bras, Oyvind Tafjord, Chris Wilhelm, Luca Soldaini, Noah A. Smith, Yizhong Wang, Pradeep Dasigi, Hannaneh Hajishirzi)
Language model post-training is applied to refine behaviors and unlock new skills across a wide range of recent language models, but open recipes for applying these techniques lag behind proprietary ones. The underlying training data and recipes for post-training are simultaneously the most important pieces of the puzzle and the portion with the least transparency. To bridge this gap, we introduce TÜLU 3, a family of fully-open state-of-the-art post-trained models, alongside its data, code, and training recipes, serving as a comprehensive guide for modern post-training techniques. TÜLU 3, which builds on Llama 3.1 base models, achieves results surpassing the instruct versions of Llama 3.1, Qwen 2.5, Mistral, and even closed models such as GPT-4o-mini and Claude 3.5-Haiku. The training algorithms for our models include supervised finetuning (SFT), Direct Preference Optimization (DPO), and a novel method we call Reinforcement Learning with Verifiable Rewards (RLVR). With TÜLU 3, we introduce a multi-task evaluation scheme for post-training recipes with development and unseen evaluations, standard benchmark implementations, and substantial decontamination of existing open datasets on said benchmarks. We conclude with analysis and discussion of training methods that did not reliably improve performance. In addition to the TÜLU 3 model weights and demo, we release the complete recipe – including datasets for diverse core skills, a robust toolkit for data curation and evaluation, the training code and infrastructure, and, most importantly, a detailed report for reproducing and further adapting the TÜLU 3 approach to more domains.
TULU 3 is out. It's nothing new, but it reinforces the sense that post-training at this point has become a job of directly tackling a very wide range of aspects. The prompt curation, among other things, is interesting.
I'm not particularly fond of aligning a model with data generated by an already aligned model, but this kind of data construction method can still be useful when bootstrapping a model. In that sense, it seems worth keeping an eye on these approaches.
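One technical piece worth noting from the abstract is RLVR, where the reward is a programmatic check rather than a learned reward model. A toy sketch of what such a verifiable reward could look like (the answer-extraction heuristic is purely illustrative, not TÜLU 3's actual verifier):

```python
# Toy verifiable reward: score a completion by checking its final answer against a reference.
import re

def verifiable_reward(completion: str, reference_answer: str) -> float:
    """Return 1.0 if the last number in the completion matches the reference answer, else 0.0."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)  # naive answer extraction
    predicted = numbers[-1] if numbers else None
    return 1.0 if predicted == reference_answer.strip() else 0.0

# This scalar reward would then feed a standard policy-gradient update (e.g., PPO)
# in place of a reward-model score.
```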
#alignment