November 8, 2024
LSHBloom: Memory-efficient, Extreme-scale Document Deduplication
(Arham Khan, Robert Underwood, Carlo Siebenschuh, Yadu Babuji, Aswathy Ajith, Kyle Hippe, Ozan Gokdemir, Alexander Brace, Kyle Chard, Ian Foster)
Deduplication is a major focus for assembling and curating training datasets for large language models (LLM) -- detecting and eliminating additional instances of the same content -- in large collections of technical documents. Unrestrained, duplicates in the training dataset increase training costs and lead to undesirable properties such as memorization in trained models or cheating on evaluation. Contemporary approaches to document-level deduplication are often extremely expensive in both runtime and memory. We propose LSHBloom, an extension to MinhashLSH, which replaces the expensive LSHIndex with lightweight Bloom filters. LSHBloom demonstrates the same deduplication performance as MinhashLSH with only a marginal increase in false positives (as low as 1e-5 in our experiments); demonstrates competitive runtime (270% faster than MinhashLSH on peS2o); and, crucially, uses just 0.6% of the disk space required by MinhashLSH to deduplicate peS2o. We demonstrate that this space advantage scales with increased dataset size -- at the extreme scale of several billion documents, LSHBloom promises a 250% speedup and a 54× space advantage over traditional MinHashLSH, scaling deduplication of text datasets to many billions of documents.
A method for efficient deduplication: it replaces the bands of MinHashLSH's signature matrix with Bloom filters. One might expect pretraining corpus pipelines to use the most accurate method available regardless of cost, but even projects like FineWeb and DCLM made efficiency trade-offs in their deduplication stages.
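Below is a minimal sketch of the band-to-Bloom-filter idea, assuming one Bloom filter per LSH band. It uses `datasketch`'s MinHash for signatures; the small Bloom filter class and all parameters are illustrative placeholders, not the authors' implementation.

```python
# Sketch: MinHash signatures split into bands, with membership of each band recorded in a
# per-band Bloom filter instead of an LSH index. Illustrative only, not the LSHBloom code.
import hashlib
import math
from datasketch import MinHash


class BloomFilter:
    def __init__(self, capacity, fp_rate=1e-5):
        self.m = math.ceil(-capacity * math.log(fp_rate) / (math.log(2) ** 2))  # bit count
        self.k = max(1, round(self.m / capacity * math.log(2)))                 # hash count
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item: bytes):
        for i in range(self.k):
            h = hashlib.sha256(i.to_bytes(2, "big") + item).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item: bytes):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: bytes):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))


def is_duplicate(text, band_filters, num_perm=128, bands=16):
    """Flag a document as a duplicate candidate if any band of its signature was seen before."""
    rows = num_perm // bands
    mh = MinHash(num_perm=num_perm)
    for token in text.split():
        mh.update(token.encode("utf-8"))
    sig = mh.hashvalues
    dup = False
    for b in range(bands):
        band_bytes = sig[b * rows:(b + 1) * rows].tobytes()
        if band_bytes in band_filters[b]:
            dup = True  # collision in at least one band
        else:
            band_filters[b].add(band_bytes)  # register unseen bands
    return dup


filters = [BloomFilter(capacity=1_000_000) for _ in range(16)]
print(is_duplicate("the quick brown fox", filters))  # False on first sight
print(is_duplicate("the quick brown fox", filters))  # True on repeat
```

The appeal of the trade-off is that a Bloom filter stores membership in a few bits per band rather than keeping the band keys themselves in an index, which is where the space savings over a full LSHIndex come from, at the cost of a small false-positive rate.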
#corpus
Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models
(Weixin Liang, Lili Yu, Liang Luo, Srinivasan Iyer, Ning Dong, Chunting Zhou, Gargi Ghosh, Mike Lewis, Wen-tau Yih, Luke Zettlemoyer, Xi Victoria Lin)
The development of large language models (LLMs) has expanded to multi-modal systems capable of processing text, images, and speech within a unified framework. Training these models demands significantly larger datasets and computational resources compared to text-only LLMs. To address the scaling challenges, we introduce Mixture-of-Transformers (MoT), a sparse multi-modal transformer architecture that significantly reduces pretraining computational costs. MoT decouples non-embedding parameters of the model by modality -- including feed-forward networks, attention matrices, and layer normalization -- enabling modality-specific processing with global self-attention over the full input sequence. We evaluate MoT across multiple settings and model scales. In the Chameleon 7B setting (autoregressive text-and-image generation), MoT matches the dense baseline's performance using only 55.8% of the FLOPs. When extended to include speech, MoT reaches speech performance comparable to the dense baseline with only 37.2% of the FLOPs. In the Transfusion setting, where text and image are trained with different objectives, a 7B MoT model matches the image modality performance of the dense baseline with one third of the FLOPs, and a 760M MoT model outperforms a 1.4B dense baseline across key image generation metrics. System profiling further highlights MoT's practical benefits, achieving dense baseline image quality in 47.2% of the wall-clock time and text quality in 75.6% of the wall-clock time (measured on AWS p4de.24xlarge instances with NVIDIA A100 GPUs).
Even when the same transformer layers are shared across all modalities, the embeddings for each modality move largely independently. This architecture therefore applies separate transformer parameters to each modality while adding global attention across modalities. It can be seen as a more extreme version of the modality-expert architecture.
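A rough sketch of what such a block could look like in PyTorch: feed-forward, QKV/output projections, and layer norms are duplicated per modality, while self-attention itself runs globally over the mixed-modality sequence. The module, sizes, and routing helper here are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoTBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, n_modalities=2):
        super().__init__()
        self.n_heads, self.d_model = n_heads, d_model
        # One copy of the non-embedding parameters per modality.
        self.qkv = nn.ModuleList([nn.Linear(d_model, 3 * d_model) for _ in range(n_modalities)])
        self.out = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(n_modalities)])
        self.ffn = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
            for _ in range(n_modalities)
        ])
        self.ln1 = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(n_modalities)])
        self.ln2 = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(n_modalities)])

    def _per_modality(self, modules, x, modality_ids):
        # Route each token through the module belonging to its modality.
        y = torch.zeros_like(modules[0](x))
        for m, mod in enumerate(modules):
            mask = modality_ids == m
            if mask.any():
                y[mask] = mod(x[mask])
        return y

    def forward(self, x, modality_ids):
        # x: (batch, seq, d_model); modality_ids: (batch, seq) integer tag per token.
        h = self._per_modality(self.ln1, x, modality_ids)
        q, k, v = self._per_modality(self.qkv, h, modality_ids).chunk(3, dim=-1)
        B, T, _ = q.shape
        shape = (B, T, self.n_heads, self.d_model // self.n_heads)
        q, k, v = (t.view(shape).transpose(1, 2) for t in (q, k, v))
        # Global self-attention over the entire mixed-modality sequence.
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        attn = attn.transpose(1, 2).reshape(B, T, self.d_model)
        x = x + self._per_modality(self.out, attn, modality_ids)
        x = x + self._per_modality(self.ffn, self._per_modality(self.ln2, x, modality_ids), modality_ids)
        return x


block = MoTBlock()
tokens = torch.randn(2, 10, 512)
modality_ids = torch.randint(0, 2, (2, 10))
print(block(tokens, modality_ids).shape)  # torch.Size([2, 10, 512])
```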
#moe #multimodal #transformer
FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI
(Elliot Glazer, Ege Erdil, Tamay Besiroglu, Diego Chicharro, Evan Chen, Alex Gunning, Caroline Falkman Olsson, Jean-Stanislas Denain, Anson Ho, Emily de Oliveira Santos, Olli Järviniemi, Matthew Barnett, Robert Sandler, Jaime Sevilla, Qiuyu Ren, Elizabeth Pratt, Lionel Levine, Grant Barkley, Natalie Stewart, Bogdan Grechuk, Tetiana Grechuk, Shreepranav Varma Enugandla)
We introduce FrontierMath, a benchmark of hundreds of original, exceptionally challenging mathematics problems crafted and vetted by expert mathematicians. The questions cover most major branches of modern mathematics -- from computationally intensive problems in number theory and real analysis to abstract questions in algebraic geometry and category theory. Solving a typical problem requires multiple hours of effort from a researcher in the relevant branch of mathematics, and for the upper end questions, multiple days. FrontierMath uses new, unpublished problems and automated verification to reliably evaluate models while minimizing risk of data contamination. Current state-of-the-art AI models solve under 2% of problems, revealing a vast gap between AI capabilities and the prowess of the mathematical community. As AI systems advance toward expert-level mathematical abilities, FrontierMath offers a rigorous testbed that quantifies their progress.
We are now seeing math benchmarks that explicitly target frontier-level difficulty.
#math #benchmark
Scaling Laws for Precision
(Tanishq Kumar, Zachary Ankner, Benjamin F. Spector, Blake Bordelon, Niklas Muennighoff, Mansheej Paul, Cengiz Pehlevan, Christopher Ré, Aditi Raghunathan)
Low precision training and inference affect both the quality and cost of language models, but current scaling laws do not account for this. In this work, we devise "precision-aware" scaling laws for both training and inference. We propose that training in lower precision reduces the model's "effective parameter count," allowing us to predict the additional loss incurred from training in low precision and post-train quantization. For inference, we find that the degradation introduced by post-training quantization increases as models are trained on more data, eventually making additional pretraining data actively harmful. For training, our scaling laws allow us to predict the loss of a model with different parts in different precisions, and suggest that training larger models in lower precision may be compute optimal. We unify the scaling laws for post and pretraining quantization to arrive at a single functional form that predicts degradation from training and inference in varied precisions. We fit on over 465 pretraining runs and validate our predictions on model sizes up to 1.7B parameters trained on up to 26B tokens.
Scaling laws for quantized training and post-training quantization. There are many interesting points:
For PTQ, the more a model is overtrained during pretraining, the larger the loss increase from reduced precision.
For quantized training, jointly tuning model size, data, and precision to find the compute-optimal point allows lower precision to be used; at FP4 and below, however, the loss increase becomes large.
If model size is held fixed while the data size grows, it may be optimal to increase precision as well.
A rough sketch of what such a precision-aware functional form could look like is below.
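This is a hedged toy version, assuming a Chinchilla-style loss with an "effective parameter count" that saturates with training precision. Every constant, the saturation rate gamma, and the PTQ penalty term are illustrative placeholders, not the paper's fitted values.

```python
# Toy precision-aware scaling law (placeholder constants, not the paper's fits).
import math


def effective_params(n_params, train_bits, gamma=2.5):
    """Training in P bits shrinks usable capacity: N_eff = N * (1 - exp(-P / gamma))."""
    return n_params * (1.0 - math.exp(-train_bits / gamma))


def pretrain_loss(n_params, n_tokens, train_bits,
                  A=400.0, alpha=0.34, B=410.0, beta=0.28, E=1.69):
    """Chinchilla-style loss with the effective parameter count plugged in."""
    n_eff = effective_params(n_params, train_bits)
    return A * n_eff ** (-alpha) + B * n_tokens ** (-beta) + E


def ptq_degradation(n_params, n_tokens, infer_bits, c=0.05, k=1.0):
    """Toy PTQ penalty: grows with tokens-per-parameter, shrinks with inference precision."""
    return c * (n_tokens / n_params) ** k * math.exp(-infer_bits / 2.0)


# Overtraining (more tokens per parameter) makes the same PTQ precision hurt more.
N = 1e9
for D in (20e9, 200e9):
    base = pretrain_loss(N, D, train_bits=16)
    print(f"D={D:.0e}  loss={base:.3f}  +PTQ@4bit={ptq_degradation(N, D, 4):.3f}")
```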
#quantization #scaling-law
Language Models are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-Rewarding
(Haolin Chen, Yihao Feng, Zuxin Liu, Weiran Yao, Akshara Prabhakar, Shelby Heinecke, Ricky Ho, Phil Mui, Silvio Savarese, Caiming Xiong, Huan Wang)
Large language models (LLMs) have shown impressive capabilities, but still struggle with complex reasoning tasks requiring multiple steps. While prompt-based methods like Chain-of-Thought (CoT) can improve LLM reasoning at inference time, optimizing reasoning capabilities during training remains challenging. We introduce LaTent Reasoning Optimization (LaTRO), a principled framework that formulates reasoning as sampling from a latent distribution and optimizes it via variational approaches. LaTRO enables LLMs to concurrently improve both their reasoning process and ability to evaluate reasoning quality, without requiring external feedback or reward models. We validate LaTRO through experiments on GSM8K and ARC-Challenge datasets using multiple model architectures. On GSM8K, LaTRO improves zero-shot accuracy by an average of 12.5% over base models and 9.6% over supervised fine-tuning across Phi-3.5-mini, Mistral-7B, and Llama-3.1-8B. Our findings suggest that pre-trained LLMs possess latent reasoning capabilities that can be unlocked and enhanced through our proposed optimization approach in a self-improvement manner. The code of LaTRO is available at https://github.com/SalesforceAIResearch/LaTRO.
The method samples multiple CoTs and then runs RLOO, using the change in the likelihood of the answer given each CoT as the reward. Using likelihood instead of correctness is appealing because it avoids writing a routine to check whether answers are right, which is particularly useful for informal or open-ended responses.
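A hedged sketch of the likelihood-based reward as I read it: score each sampled CoT by the log-likelihood the model assigns to the answer given the question and that CoT, then center with a leave-one-out (RLOO) baseline. The model name, prompt format, and helper functions below are placeholders, not the released LaTRO code.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the paper uses Phi-3.5-mini, Mistral-7B, Llama-3.1-8B
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()


@torch.no_grad()
def answer_logprob(prefix: str, answer: str) -> float:
    """Sum of log-probabilities of the answer tokens given the prefix."""
    prefix_ids = tokenizer(prefix, return_tensors="pt").input_ids
    answer_ids = tokenizer(answer, return_tensors="pt").input_ids
    input_ids = torch.cat([prefix_ids, answer_ids], dim=1)
    logits = model(input_ids).logits[:, :-1]            # logits at t predict token t+1
    logprobs = F.log_softmax(logits, dim=-1)
    targets = input_ids[:, 1:]
    token_lp = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    start = prefix_ids.shape[1] - 1                      # first target that is an answer token
    return token_lp[:, start:].sum().item()


def rloo_advantages(question: str, cots: list[str], answer: str):
    """Reward each sampled CoT by the answer likelihood it induces, then apply a
    leave-one-out baseline across the K samples."""
    rewards = [answer_logprob(question + cot + "\nAnswer: ", answer) for cot in cots]
    k = len(rewards)
    return [r - (sum(rewards) - r) / (k - 1) for r in rewards]


cots = ["Step 1: add the numbers. Step 2: check the sum.", "Count up from 2 by 3."]
print(rloo_advantages("Q: What is 2 + 3?\n", cots, "5"))
```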
#reasoning
OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models
(Siming Huang, Tianhao Cheng, Jason Klein Liu, Jiaran Hao, Liuyihan Song, Yang Xu, J. Yang, J.H. Liu, Chenchen Zhang, Linzheng Chai, Ruifeng Yuan, Zhaoxiang Zhang, Jie Fu, Qian Liu, Ge Zhang, Zili Wang, Yuan Qi, Yinghui Xu, Wei Chu)
Large language models (LLMs) for code have become indispensable in various domains, including code generation, reasoning tasks and agent systems. While open-access code LLMs are increasingly approaching the performance levels of proprietary models, high-quality code LLMs suitable for rigorous scientific investigation, particularly those with reproducible data processing pipelines and transparent training protocols, remain limited. The scarcity is due to various challenges, including resource constraints, ethical considerations, and the competitive advantages of keeping models advanced. To address the gap, we introduce OpenCoder, a top-tier code LLM that not only achieves performance comparable to leading models but also serves as an "open cookbook" for the research community. Unlike most prior efforts, we release not only model weights and inference code, but also the reproducible training data, complete data processing pipeline, rigorous experimental ablation results, and detailed training protocols for open scientific research. Through this comprehensive release, we identify the key ingredients for building a top-tier code LLM: (1) code optimized heuristic rules for data cleaning and methods for data deduplication, (2) recall of text corpus related to code and (3) high-quality synthetic data in both annealing and supervised fine-tuning stages. By offering this level of openness, we aim to broaden access to all aspects of a top-tier code LLM, with OpenCoder serving as both a powerful model and an open foundation to accelerate research, and enable reproducible advancements in code AI.
A code LLM. They rebuilt the GitHub-based code dataset from scratch, recalled code-related documents from the text corpus in the style of DeepSeekMath, and added synthetic data.
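A hedged sketch of what a DeepSeekMath-style recall step could look like: seed a fastText classifier with known code-related pages, score the general corpus, and iteratively grow the positive pool. The file names, thresholds, and loop structure are assumptions for illustration, not OpenCoder's released pipeline.

```python
# Iterative fastText-based recall of code-related documents (illustrative sketch).
import random
import fasttext


def write_training_file(path, positives, negatives):
    # fastText supervised format: "__label__<name> <text>", one document per line.
    with open(path, "w", encoding="utf-8") as f:
        for label, texts in (("__label__code", positives), ("__label__other", negatives)):
            for text in texts:
                clean = " ".join(text.split())
                f.write(f"{label} {clean}\n")


def recall_code_documents(seed_positives, general_corpus, rounds=2, threshold=0.9):
    """Train a classifier on the current positive pool, recall high-scoring documents
    from the general corpus, and repeat with the enlarged pool."""
    positives = list(seed_positives)
    recalled = []
    for _ in range(rounds):
        negatives = random.sample(general_corpus, k=min(len(positives), len(general_corpus)))
        write_training_file("recall_train.txt", positives, negatives)
        model = fasttext.train_supervised(input="recall_train.txt", epoch=5, wordNgrams=2)
        recalled = []
        for doc in general_corpus:
            labels, probs = model.predict(" ".join(doc.split()))
            if labels[0] == "__label__code" and probs[0] >= threshold:
                recalled.append(doc)
        positives = list(set(positives) | set(recalled))  # grow the positive pool each round
    return recalled
```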
#code #llm