September 26, 2024
Llama 3.2: Revolutionizing edge AI and vision with open, customizable models
(Meta)
Multimodal Llama 3 and lightweight Llama 3. For the multimodal models, as previewed in the paper (https://arxiv.org/abs/2407.21783), they attach a CLIP-style vision encoder to the language model via cross attention.
The lightweight 1B and 3B models were built by pruning the 8B model and then training it with knowledge distillation. This is the kind of work NVIDIA has been doing a lot of lately (https://arxiv.org/abs/2408.11796) (though of course everyone is doing it these days).
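A rough sketch of the general idea only (image features injected into the decoder through cross-attention layers), not Meta's actual architecture; the dimensions, the zero-init gating (a Flamingo-style choice of mine), and the layer placement are all made up for illustration.

```python
import torch
import torch.nn as nn

class VisionCrossAttention(nn.Module):
    """Text hidden states attend to projected image features. The zero-init gate
    keeps the block an identity at the start of training (my own assumption,
    not necessarily what Meta did)."""
    def __init__(self, d_model=512, d_vision=256, n_heads=8):
        super().__init__()
        self.img_proj = nn.Linear(d_vision, d_model)   # map vision features into LM width
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_hidden, image_feats):
        kv = self.img_proj(image_feats)                        # [B, patches, d_model]
        attended, _ = self.attn(self.norm(text_hidden), kv, kv)
        return text_hidden + torch.tanh(self.gate) * attended  # residual back into the LM stack

# dummy shapes: batch 1, 16 text tokens, 64 image patches
block = VisionCrossAttention()
out = block(torch.randn(1, 16, 512), torch.randn(1, 64, 256))
print(out.shape)  # torch.Size([1, 16, 512])
```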
#multimodal #distillation
Programming Every Example: Lifting Pre-training Data Quality like Experts at Scale
(Fan Zhou, Zengzhi Wang, Qian Liu, Junlong Li, Pengfei Liu)
Large language model pre-training has traditionally relied on human experts to craft heuristics for improving the corpora quality, resulting in numerous rules developed to date. However, these rules lack the flexibility to address the unique characteristics of individual example effectively. Meanwhile, applying tailored rules to every example is impractical for human experts. In this paper, we demonstrate that even small language models, with as few as 0.3B parameters, can exhibit substantial data refining capabilities comparable to those of human experts. We introduce Programming Every Example (ProX), a novel framework that treats data refinement as a programming task, enabling models to refine corpora by generating and executing fine-grained operations, such as string normalization, for each individual example at scale. Experimental results show that models pre-trained on ProX-curated data outperform either original data or data filtered by other selection methods by more than 2% across various downstream benchmarks. Its effectiveness spans various model sizes and pre-training corpora, including C4, RedPajama-V2, and FineWeb. Furthermore, ProX exhibits significant potential in domain-specific continual pre-training: without domain specific design, models trained on OpenWebMath refined by ProX outperform human-crafted rule-based methods, improving average accuracy by 7.6% over Mistral-7B, with 14.6% for Llama-2-7B and 20.3% for CodeLlama-7B, all within 10B tokens to be comparable to models like Llemma-7B trained on 200B tokens. Further analysis highlights that ProX significantly saves training FLOPs, offering a promising path for efficient LLM pre-training. We are open-sourcing ProX with >100B corpus, models, and sharing all training and implementation details for reproducible research and future innovation. Code: https://github.com/GAIR-NLP/ProX
This is quite interesting. Preprocessing of the pretraining data is carried out in the form of code (deleting lines, deleting documents, and so on), and that code is generated with a small LM. It's an appealing approach in that it allows editing the data, not just filtering it.
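A minimal sketch of the idea, not the ProX code itself: a small LM emits a tiny "program" of refinement operations for one document, and that program is then executed. The operation names and signatures below are my own assumptions for illustration.

```python
import re

def remove_lines(doc: str, line_idx: list[int]) -> str:
    """Drop noisy lines (e.g. nav bars, share buttons) by index."""
    lines = doc.splitlines()
    return "\n".join(l for i, l in enumerate(lines) if i not in set(line_idx))

def normalize(doc: str) -> str:
    """Simple string normalization: collapse repeated spaces/tabs."""
    return re.sub(r"[ \t]+", " ", doc).strip()

def drop_doc(doc: str) -> str:
    """Discard the whole document."""
    return ""

OPS = {"remove_lines": remove_lines, "normalize": normalize, "drop_doc": drop_doc}

def refine(doc: str, program: list[tuple]) -> str:
    """Execute the LM-generated program, one operation at a time."""
    for op, *args in program:
        doc = OPS[op](doc, *args)
        if not doc:
            break
    return doc

# In ProX a program like this would be generated per document by a ~0.3B LM;
# here it is hard-coded purely for illustration.
doc = ("Home | Login | Share\n"
       "Gradient descent updates w by w -= lr * grad.\n\n\n"
       "Subscribe  to   our newsletter!")
program = [("remove_lines", [0, 4]), ("normalize",)]
print(refine(doc, program))
```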
#pretraining #corpus
Counterfactual Token Generation in Large Language Models
(Ivi Chatzi, Nina Corvelo Benz, Eleni Straitouri, Stratis Tsirtsis, Manuel Gomez-Rodriguez)
"Sure, I am happy to generate a story for you: Captain Lyra stood at the helm of her trusty ship, the Maelstrom's Fury, gazing out at the endless sea. [...] Lyra's eyes welled up with tears as she realized the bitter truth - she had sacrificed everything for fleeting riches, and lost the love of her crew, her family, and herself." Although this story, generated by a large language model, is captivating, one may wonder -- how would the story have unfolded if the model had chosen "Captain Maeve" as the protagonist instead? We cannot know. State-of-the-art large language models are stateless -- they maintain no internal memory or state. Given a prompt, they generate a sequence of tokens as an output using an autoregressive process. As a consequence, they cannot reason about counterfactual alternatives to tokens they have generated in the past. In this work, our goal is to enhance them with this functionality. To this end, we develop a causal model of token generation that builds upon the Gumbel-Max structural causal model. Our model allows any large language model to perform counterfactual token generation at almost no cost in comparison with vanilla token generation, it is embarrassingly simple to implement, and it does not require any fine-tuning nor prompt engineering. We implement our model on Llama 3 8B-instruct and conduct both qualitative and quantitative analyses of counterfactually generated text. We conclude with a demonstrative application of counterfactual token generation for bias detection, unveiling interesting insights about the model of the world constructed by large language models.
Uses the Gumbel-Max trick to simulate how the output would have changed if part of the prompt had been different. For example, you can simulate how the description of a character changes when you swap their gender.
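A small sketch of the mechanism, written by me rather than taken from the paper's code: sample each token by adding fixed Gumbel noise to the logits and taking the argmax, store the noise, then reuse the same noise with a modified prompt to get the counterfactual continuation. The paper uses Llama 3 8B-Instruct; GPT-2 stands in here only to keep the example lightweight.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

@torch.no_grad()
def generate_with_noise(prompt, gumbel_noise=None, steps=30):
    """argmax(logits + Gumbel noise) is an exact sample from softmax(logits),
    but the noise can be recorded and replayed for a counterfactual prompt."""
    ids = tok(prompt, return_tensors="pt").input_ids
    noises = []
    for t in range(steps):
        logits = model(ids).logits[0, -1]              # next-token logits
        if gumbel_noise is None:                       # factual pass: draw fresh noise
            g = -torch.log(-torch.log(torch.rand_like(logits)))
            noises.append(g)
        else:                                          # counterfactual pass: reuse noise
            g = gumbel_noise[t]
        next_id = torch.argmax(logits + g)             # Gumbel-Max sample
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)
    return tok.decode(ids[0]), noises

factual_text, noise = generate_with_noise("Captain Lyra stood at the helm")
counterfactual_text, _ = generate_with_noise("Captain Maeve stood at the helm",
                                             gumbel_noise=noise)
print(factual_text)
print(counterfactual_text)
```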
#language-generation
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models
(Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, Jiasen Lu, Taira Anderson, Erin Bransom, Kiana Ehsani, Huong Ngo, YenSung Chen, Ajay Patel, Mark Yatskar, Chris Callison-Burch, Andrew Head, Rose Hendrix, Favyen Bastani, Eli VanderBilt, Nathan Lambert, Yvonne Chou, Arnavi Chheda, Jenna Sparks, Sam Skjonsberg, Michael Schmitz, Aaron Sarnat, Byron Bischoff, Pete Walsh, Chris Newell, Piper Wolters, Tanmay Gupta, Kuo-Hao Zeng, Jon Borchardt, Dirk Groeneveld, Jen Dumas, Crystal Nam, Sophie Lebrecht, Caitlin Wittlif, Carissa Schoenick, Oscar Michel, Ranjay Krishna, Luca Weihs, Noah A. Smith, Hannaneh Hajishirzi, Ross Girshick, Ali Farhadi, Aniruddha Kembhavi)
Today's most advanced multimodal models remain proprietary. The strongest open-weight models rely heavily on synthetic data from proprietary VLMs to achieve good performance, effectively distilling these closed models into open ones. As a result, the community is still missing foundational knowledge about how to build performant VLMs from scratch. We present Molmo, a new family of VLMs that are state-of-the-art in their class of openness. Our key innovation is a novel, highly detailed image caption dataset collected entirely from human annotators using speech-based descriptions. To enable a wide array of user interactions, we also introduce a diverse dataset mixture for fine-tuning that includes in-the-wild Q&A and innovative 2D pointing data. The success of our approach relies on careful choices for the model architecture details, a well-tuned training pipeline, and, most critically, the quality of our newly collected datasets, all of which will be released. The best-in-class 72B model within the Molmo family not only outperforms others in the class of open weight and data models but also compares favorably against proprietary systems like GPT-4o, Claude 3.5, and Gemini 1.5 on both academic benchmarks and human evaluation. We will be releasing all of our model weights, captioning and fine-tuning data, and source code in the near future. Select model weights, inference code, and demo are available at
https://molmo.allenai.org/blog
Allen AI's multimodal model. They collected the caption data used for image-text alignment in an unusual way.
First, they tried to avoid generating it with existing multimodal models. But collecting captions directly from human annotators is not easy, so instead of having them write text, they had them describe the images out loud. The speech was then converted with ASR and the resulting text corrected with an LLM.
For SFT they added tasks such as clock reading and visual grounding. For evaluation they also collected 15K image queries and had the outputs rated by humans.
A project with a lot of care put into it. The idea of having annotators speak their descriptions instead of writing them is particularly interesting.
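A rough sketch of the spoken-caption idea as I understand it, not the Molmo pipeline: an annotator describes an image out loud, ASR transcribes the recording, and an LLM cleans up the transcript. The model choices below (Whisper "base", a small instruct model) are placeholders, not what the paper used.

```python
import whisper                     # openai-whisper
from transformers import pipeline

asr = whisper.load_model("base")
cleaner = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct")

def spoken_caption_to_text(audio_path: str) -> str:
    transcript = asr.transcribe(audio_path)["text"]          # speech -> raw text
    prompt = (
        "Rewrite the following spoken image description as a clean, "
        f"well-punctuated caption, without adding new details:\n{transcript}\nCaption:"
    )
    out = cleaner(prompt, max_new_tokens=200, return_full_text=False)
    return out[0]["generated_text"].strip()

# print(spoken_caption_to_text("annotator_recording.wav"))   # needs a real audio file
```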
#dataset #vision-language #captioning