October 29, 2024
Rephrasing natural text data with different languages and quality levels for Large Language Model pre-training
(Michael Pieler, Marco Bellagente, Hannah Teufel, Duy Phung, Nathan Cooper, Jonathan Tow, Paulo Rocha, Reshinth Adithyan, Zaid Alyafeai, Nikhil Pinnaparaju, Maksym Zhuravinskyi, Carlos Riquelme)
Recently published work on rephrasing natural text data for pre-training LLMs has shown promising results when combining the original dataset with the synthetically rephrased data. We build upon previous work by replicating existing results on C4 and extending them with our optimized rephrasing pipeline to the English, German, Italian, and Spanish Oscar subsets of CulturaX. Our pipeline leads to increased performance on standard evaluation benchmarks in both the mono- and multilingual setup. In addition, we provide a detailed study of our pipeline, investigating the choice of the base dataset and LLM for the rephrasing, as well as the relationship between the model size and the performance after pre-training. By exploring data with different perceived quality levels, we show that gains decrease with higher quality. Furthermore, we find the difference in performance between model families to be bigger than between different model sizes. This highlights the necessity for detailed tests before choosing an LLM to rephrase large amounts of data. Moreover, we investigate the effect of pre-training with synthetic data on supervised fine-tuning. Here, we find increasing but inconclusive results that highly depend on the used benchmark. These results (again) highlight the need for better benchmarking setups. In summary, we show that rephrasing multilingual and low-quality data is a very promising direction to extend LLM pre-training data.
An analysis of the effectiveness of rephrasing pre-training corpora with LLMs. The conclusions are that mixing the rephrased data with the original dataset is beneficial, and that the gains are largely limited to improving low-quality data.
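To make the setup concrete, a minimal rephrasing pipeline along these lines could look like the sketch below. The prompt wording, the `generate` interface, and the equal-weight mix are assumptions for illustration, not the paper's exact configuration.

```python
# Minimal sketch of an LLM-based rephrasing pipeline (illustrative only).
# The prompt, the `generate` callable, and the 1:1 mixing ratio are assumptions,
# not the configuration used in the paper.
import random

REPHRASE_PROMPT = (
    "Paraphrase the following text in clear, high-quality English, "
    "preserving all information:\n\n{document}"
)

def rephrase_corpus(documents, generate):
    """`generate` is any callable mapping a prompt string to LLM output."""
    return [generate(REPHRASE_PROMPT.format(document=doc)) for doc in documents]

def build_pretraining_mix(original, rephrased, seed=0):
    # The paper combines original and synthetic data; here they are simply
    # concatenated with equal weight and shuffled.
    mix = list(original) + list(rephrased)
    random.Random(seed).shuffle(mix)
    return mix
```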
#synthetic-data
Mixture of Parrots: Experts improve memorization more than reasoning
(Samy Jelassi, Clara Mohri, David Brandfonbrener, Alex Gu, Nikhil Vyas, Nikhil Anand, David Alvarez-Melis, Yuanzhi Li, Sham M. Kakade, Eran Malach)
The Mixture-of-Experts (MoE) architecture enables a significant increase in the total number of model parameters with minimal computational overhead. However, it is not clear what performance tradeoffs, if any, exist between MoEs and standard dense transformers. In this paper, we show that as we increase the number of experts (while fixing the number of active parameters), the memorization performance consistently increases while the reasoning capabilities saturate. We begin by analyzing the theoretical limitations of MoEs at reasoning. We prove that there exist graph problems that cannot be solved by any number of experts of a certain width; however, the same task can be easily solved by a dense model with a slightly larger width. On the other hand, we find that on memory-intensive tasks, MoEs can effectively leverage a small number of active parameters with a large number of experts to memorize the data. We empirically validate these findings on synthetic graph problems and memory-intensive closed book retrieval tasks. Lastly, we pre-train a series of MoEs and dense transformers and evaluate them on commonly used benchmarks in math and natural language. We find that increasing the number of experts helps solve knowledge-intensive tasks, but fails to yield the same benefits for reasoning tasks.
The results show that MoE is very efficient at memorizing knowledge (the added expert parameters are roughly as effective as adding parameters to a dense model), but much less so for reasoning ability. Improving reasoning apparently requires increasing the embedding dimension instead.
If we could build a sparse model whose effect is equivalent to increasing the embedding dimension or the number of attention heads, we might be able to push further.
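For reference, a bare-bones top-k MoE layer of the kind being compared here might look like the sketch below; the total parameter count grows with `n_experts` while the active parameters per token are fixed by `k` and the hidden sizes. This is a generic sketch, not the authors' implementation.

```python
# Generic top-k mixture-of-experts MLP layer (sketch, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model, d_hidden, n_experts, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                          # x: (tokens, d_model)
        scores = self.router(x)                    # (tokens, n_experts)
        topk = scores.topk(self.k, dim=-1)
        weights = F.softmax(topk.values, dim=-1)   # (tokens, k)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            idx = topk.indices[:, slot]            # expert chosen for this slot
            for e in idx.unique():
                mask = idx == e
                out[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[int(e)](x[mask])
        return out
```

Adding experts multiplies the expert MLP parameters while leaving `d_model` (and hence the embedding/attention width the paper ties to reasoning) unchanged.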
#moe
MarDini: Masked Autoregressive Diffusion for Video Generation at Scale
(Haozhe Liu, Shikun Liu, Zijian Zhou, Mengmeng Xu, Yanping Xie, Xiao Han, Juan C. Pérez, Ding Liu, Kumara Kahatapitiya, Menglin Jia, Jui-Chieh Wu, Sen He, Tao Xiang, Jürgen Schmidhuber, Juan-Manuel Pérez-Rúa)
We introduce MarDini, a new family of video diffusion models that integrate the advantages of masked auto-regression (MAR) into a unified diffusion model (DM) framework. Here, MAR handles temporal planning, while DM focuses on spatial generation in an asymmetric network design: i) a MAR-based planning model containing most of the parameters generates planning signals for each masked frame using low-resolution input; ii) a lightweight generation model uses these signals to produce high-resolution frames via diffusion de-noising. MarDini's MAR enables video generation conditioned on any number of masked frames at any frame positions: a single model can handle video interpolation (e.g., masking middle frames), image-to-video generation (e.g., masking from the second frame onward), and video expansion (e.g., masking half the frames). The efficient design allocates most of the computational resources to the low-resolution planning model, making computationally expensive but important spatio-temporal attention feasible at scale. MarDini sets a new state-of-the-art for video interpolation; meanwhile, within few inference steps, it efficiently generates videos on par with those of much more expensive advanced image-to-video models.
A video generation model that combines masked autoregression over low-resolution video sequences with denoising diffusion over high-resolution sequences.
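The abstract already specifies the interface fairly well; a rough sketch of the asymmetric two-model design might look as follows. The module signatures, shapes, and step count are assumptions for illustration only.

```python
# Rough sketch of the asymmetric planning/generation split described above.
# `planner` and `denoiser` stand in for the MAR planning model and the
# lightweight diffusion model; their interfaces here are assumptions.
import torch

def generate_video(planner, denoiser, low_res_frames, mask, hi_res_shape, steps=8):
    """mask: boolean per frame, True where the frame is masked (to be generated)."""
    # 1) Temporal planning on cheap low-resolution input (most of the parameters).
    planning_signals = planner(low_res_frames, mask)      # (n_frames, d_cond)

    # 2) Spatial generation: denoise each masked frame at high resolution.
    frames = []
    for t in range(low_res_frames.shape[0]):
        if not mask[t]:
            frames.append(None)                           # keep observed frames as-is
            continue
        x = torch.randn(hi_res_shape)
        for _ in range(steps):
            x = denoiser(x, planning_signals[t])          # one denoising update
        frames.append(x)
    return frames
```

Different choices of `mask` then correspond to interpolation, image-to-video, and video expansion, as described in the abstract.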
#video-generation #non-autoregressive #diffusion
Generator Matching: Generative modeling with arbitrary Markov processes
(Peter Holderrieth, Marton Havasi, Jason Yim, Neta Shaul, Itai Gat, Tommi Jaakkola, Brian Karrer, Ricky T. Q. Chen, Yaron Lipman)
We introduce generator matching, a modality-agnostic framework for generative modeling using arbitrary Markov processes. Generators characterize the infinitesimal evolution of a Markov process, which we leverage for generative modeling in a similar vein to flow matching: we construct conditional generators which generate single data points, then learn to approximate the marginal generator which generates the full data distribution. We show that generator matching unifies various generative modeling methods, including diffusion models, flow matching and discrete diffusion models. Furthermore, it provides the foundation to expand the design space to new and unexplored Markov processes such as jump processes. Finally, generator matching enables the construction of superpositions of Markov generative processes and enables the construction of multimodal models in a rigorous manner. We empirically validate our method on protein and image structure generation, showing that superposition with a jump process improves image generation.
A generative modeling framework for arbitrary Markov processes. As such, it unifies objectives like flow matching and diffusion. And since arbitrary Markov processes are allowed, it can also incorporate stochastic processes such as jump processes.
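As a reminder of the central object (this is the standard textbook definition, not necessarily the paper's notation), the generator describes the infinitesimal evolution of a Markov process, and the familiar model families correspond to particular generators:

```latex
% Standard definition of the generator of a Markov process (X_t); the paper
% learns an approximation to the marginal generator via conditional ones,
% in analogy to flow matching.
\[
  (\mathcal{L}_t f)(x) \;=\; \lim_{h \to 0^+}
  \frac{\mathbb{E}\!\left[f(X_{t+h}) \mid X_t = x\right] - f(x)}{h}.
\]
% For a deterministic flow \dot{x} = u_t(x) this reduces to
\[
  (\mathcal{L}_t f)(x) = \nabla f(x)^{\top} u_t(x),
\]
% while a pure jump process with rate kernel Q_t(x, dy) gives
\[
  (\mathcal{L}_t f)(x) = \int \big(f(y) - f(x)\big)\, Q_t(x, \mathrm{d}y).
\]
```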
#generative-model #diffusion
ThunderKittens: Simple, Fast, and Adorable AI Kernels
(Benjamin F. Spector, Simran Arora, Aaryan Singhal, Daniel Y. Fu, Christopher Ré)
The challenge of mapping AI architectures to GPU hardware is creating a critical bottleneck in AI progress. Despite substantial efforts, hand-written custom kernels fail to meet their theoretical performance thresholds, even on well-established operations like linear attention. The diverse hardware capabilities of GPUs might suggest that we need a wide variety of techniques to achieve high performance. However, our work explores whether a small number of key abstractions can drastically simplify the process. We present ThunderKittens (TK), a framework for writing performant AI kernels while remaining easy to use and maintain. Our abstractions map to the three levels of the GPU hierarchy: (1) at the warp-level, we provide 16x16 matrix tiles as basic data structures and PyTorch-like parallel compute operations over tiles, (2) at the thread-block level, we provide a template for overlapping asynchronous operations across parallel warps, and (3) at the grid-level, we provide support to help hide the block launch and tear-down, and memory costs. We show the value of TK by providing kernels that match or outperform prior kernels for a range of AI operations. We match CuBLAS and FlashAttention-3 on GEMM and attention inference performance and outperform the strongest baselines by 10-40% on attention backwards, 8× on state space models, and 14× on linear attention.
The ThunderKittens paper is out (https://hazyresearch.stanford.edu/blog/2024-05-12-tk). I'm curious how it feels to actually use it in practice.
#efficiency
Arithmetic Without Algorithms: Language Models Solve Math With a Bag of Heuristics
(Yaniv Nikankin, Anja Reusch, Aaron Mueller, Yonatan Belinkov)
Do large language models (LLMs) solve reasoning tasks by learning robust generalizable algorithms, or do they memorize training data? To investigate this question, we use arithmetic reasoning as a representative task. Using causal analysis, we identify a subset of the model (a circuit) that explains most of the model's behavior for basic arithmetic logic and examine its functionality. By zooming in on the level of individual circuit neurons, we discover a sparse set of important neurons that implement simple heuristics. Each heuristic identifies a numerical input pattern and outputs corresponding answers. We hypothesize that the combination of these heuristic neurons is the mechanism used to produce correct arithmetic answers. To test this, we categorize each neuron into several heuristic types, such as neurons that activate when an operand falls within a certain range, and find that the unordered combination of these heuristic types is the mechanism that explains most of the model's accuracy on arithmetic prompts. Finally, we demonstrate that this mechanism appears as the main source of arithmetic accuracy early in training. Overall, our experimental results across several LLMs show that LLMs perform arithmetic using neither robust algorithms nor memorization; rather, they rely on a "bag of heuristics".
An analysis suggesting that arithmetic in LLMs is carried out by a combination of heuristics, such as checks on the range of an operand.
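As a concrete (and assumed) illustration of what an operand-range heuristic could mean, one way to flag such a neuron is to check how well its firing is explained by an operand interval; the firing criterion and threshold below are my own choices, not the paper's procedure.

```python
# Illustrative check for an "operand range" heuristic neuron (assumed rule,
# not the paper's exact methodology).
import numpy as np

def is_operand_range_neuron(activations, operands, lo, hi, precision_threshold=0.8):
    """activations: (n_prompts,) neuron activations on arithmetic prompts.
    operands: (n_prompts,) value of one operand in each prompt."""
    in_range = (operands >= lo) & (operands <= hi)
    fired = activations > activations.mean() + activations.std()
    if fired.sum() == 0:
        return False
    # Fraction of firing events explained by the operand lying in [lo, hi].
    precision = (fired & in_range).sum() / fired.sum()
    return bool(precision >= precision_threshold)
```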
#transformer #mechanistic-interpretation
Autoformalize Mathematical Statements by Symbolic Equivalence and Semantic Consistency
(Zenan Li, Yifan Wu, Zhaoyu Li, Xinming Wei, Xian Zhang, Fan Yang, Xiaoxing Ma)
Autoformalization, the task of automatically translating natural language descriptions into a formal language, poses a significant challenge across various domains, especially in mathematics. Recent advancements in large language models (LLMs) have unveiled their promising capabilities to formalize even competition-level math problems. However, we observe a considerable discrepancy between pass@1 and pass@k accuracies in LLM-generated formalizations. To address this gap, we introduce a novel framework that scores and selects the best result from k autoformalization candidates based on two complementary self-consistency methods: symbolic equivalence and semantic consistency. Elaborately, symbolic equivalence identifies the logical homogeneity among autoformalization candidates using automated theorem provers, and semantic consistency evaluates the preservation of the original meaning by informalizing the candidates and computing the similarity between the embeddings of the original and informalized texts. Our extensive experiments on the MATH and miniF2F datasets demonstrate that our approach significantly enhances autoformalization accuracy, achieving up to 0.22-1.35x relative improvements across various LLMs and baseline methods.
A study of self-consistency methods for autoformalization: check whether candidates are logically equivalent using a proof assistant, and whether they stay semantically faithful to the original text when converted back into informal language.
Autoformalization keeps coming up alongside attempts to use proof assistants as a reward signal, but it is worth asking whether this approach is really the direction to move in.
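A minimal sketch of the selection step might look like the following; the prover, informalizer, and embedder are passed in as placeholder callables, and the way the two signals are combined (a simple weighted sum) is an assumption rather than the paper's exact scoring rule.

```python
# Sketch of scoring k formalization candidates by symbolic equivalence
# (agreement with other candidates via an ATP) plus semantic consistency
# (embedding similarity after informalizing back to natural language).
# `prove_equivalent`, `informalize`, and `embed` are placeholder callables.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_formalization(statement, candidates, prove_equivalent,
                         informalize, embed, alpha=0.5):
    ref = embed(statement)
    scores = []
    for i, cand in enumerate(candidates):
        # Symbolic equivalence: how many other candidates the prover deems equivalent.
        sym = sum(prove_equivalent(cand, other)
                  for j, other in enumerate(candidates) if j != i)
        sym /= max(len(candidates) - 1, 1)
        # Semantic consistency: round-trip back to natural language and compare.
        sem = cosine(ref, embed(informalize(cand)))
        scores.append(alpha * sym + (1 - alpha) * sem)
    return candidates[int(np.argmax(scores))]
```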
#reasoning #math
HoPE: A Novel Positional Encoding Without Long-Term Decay for Enhanced Context Awareness and Extrapolation
(Yuhan Chen, Ang Lv, Jian Luan, Bin Wang, Wei Liu)
Many positional encodings (PEs) are designed to exhibit long-term decay, based on an entrenched and long-standing inductive opinion: tokens farther away from the current position carry less relevant information. We argue that long-term decay is outdated in the era of LLMs, as LLMs are now applied to tasks demanding precise retrieval of in-context information from arbitrary positions. Firstly, we present empirical analyses on various PEs, demonstrating that models inherently learn attention with only a local-decay pattern while forming a U-shape pattern globally, contradicting the principle of long-term decay. Furthermore, we conduct a detailed analysis of rotary position encoding (RoPE, a prevalent relative positional encoding in LLMs), and find that the U-shape attention is caused by some learned components, which are also the key factor limiting RoPE's expressiveness and extrapolation. Inspired by these insights, we propose High-frequency rotary Position Encoding (HoPE). HoPE replaces the specific components in RoPE with position-independent ones, retaining only high-frequency signals, which also breaks the principle of long-term decay in theory. HoPE achieves two major advantages: (1) Without constraints imposed by long-term decay, contradictory factors that limit spontaneous attention optimization and model extrapolation performance are removed. (2) Components representing positions and semantics are optimized. This enhances the model's context awareness and extrapolation, as validated by extensive experiments.
A study suggesting that removing the low-frequency components from RoPE gives better results. It lines up with a recent paper (https://arxiv.org/abs/2410.06205). The key point seems to be that long-term decay in attention may not be necessary.
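To make the idea tangible, here is a sketch of RoPE in which only the highest-frequency rotary pairs are rotated and the low-frequency ones are left position-independent. The cutoff ratio and the "leave unrotated" treatment are assumptions about the spirit of the method, not the paper's exact formulation.

```python
# RoPE-style rotation keeping only high-frequency components (illustrative).
import torch

def high_freq_only_rope(x, positions, base=10000.0, keep_ratio=0.5):
    """x: (seq, d) with d even; positions: (seq,) integer positions.
    Rotates only the highest-frequency pairs; the rest stay position-independent."""
    seq, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    n_keep = int(half * keep_ratio)                         # highest-frequency pairs
    angles = positions[:, None].float() * freqs[None, :]    # (seq, half)
    angles[:, n_keep:] = 0.0                                 # low-frequency pairs: no rotation
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```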
#positional-encoding #long-context
On Inductive Biases That Enable Generalization of Diffusion Transformers
(Jie An, De Wang, Pengsheng Guo, Jiebo Luo, Alexander Schwing)
Recent work studying the generalization of diffusion models with UNet-based denoisers reveals inductive biases that can be expressed via geometry-adaptive harmonic bases. However, in practice, more recent denoising networks are often based on transformers, e.g., the diffusion transformer (DiT). This raises the question: do transformer-based denoising networks exhibit inductive biases that can also be expressed via geometry-adaptive harmonic bases? To our surprise, we find that this is not the case. This discrepancy motivates our search for the inductive bias that can lead to good generalization in DiT models. Investigating the pivotal attention modules of a DiT, we find that the locality of attention maps is closely associated with generalization. To verify this finding, we modify the generalization of a DiT by restricting its attention windows. We inject local attention windows into a DiT and observe an improvement in generalization. Furthermore, we empirically find that both the placement and the effective attention size of these local attention windows are crucial factors. Experimental results on the CelebA, ImageNet, and LSUN datasets show that strengthening the inductive bias of a DiT can improve both generalization and generation quality when less training data is available. Source code will be released publicly upon paper publication. Project page: dit-generalization.github.io/.
An analysis showing that the frequency-ordered geometric patterns arising from UNet's inductive biases do not really appear in DiT. Accordingly, injecting an inductive bias through local attention improves DiT's generalization. This is the kind of work that was common in the early days of ViT, although nowadays most people would probably be reluctant to inject inductive biases explicitly.
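For concreteness, injecting locality could be as simple as masking attention over patch tokens to a spatial window, as in the sketch below; the window size and where to apply it are illustrative choices, not the paper's exact recipe.

```python
# Local (windowed) attention mask over an h x w grid of image tokens (sketch).
import torch

def local_attention_mask(h, w, window=4):
    """Returns an (h*w, h*w) boolean mask, True where attention is allowed."""
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=-1)    # (h*w, 2)
    dist = (coords[:, None, :] - coords[None, :, :]).abs()        # (N, N, 2)
    return (dist <= window // 2).all(dim=-1)

# Usage: convert disallowed positions to -inf and add to the attention logits
# of the DiT blocks where the locality bias should be injected.
```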
#diffusion #transformer #inductive-bias
Looking Beyond The Top-1: Transformers Determine Top Tokens In Order
(Daria Lioubashevski, Tomer Schlank, Gabriel Stanovsky, Ariel Goldstein)
Understanding the inner workings of Transformers is crucial for achieving more accurate and efficient predictions. In this work, we analyze the computation performed by Transformers in the layers after the top-1 prediction has become fixed, which has been previously referred to as the "saturation event". We expand the concept of saturation events for top-k tokens, demonstrating that similar saturation events occur across language, vision, and speech models. We find that these saturation events happen in order of the corresponding tokens' ranking, i.e., the model first decides on the top ranking token, then the second highest ranking token, and so on. This phenomenon seems intrinsic to the Transformer architecture, occurring across different architectural variants (decoder-only, encoder-only, and to a lesser extent full-Transformer), and even in untrained Transformers. We propose an underlying mechanism of task transition for this sequential saturation, where task k corresponds to predicting the k-th most probable token, and the saturation events are in fact discrete transitions between the tasks. In support of this we show that it is possible to predict the current task from hidden layer embedding. Furthermore, using an intervention method we demonstrate that we can cause the model to switch from one task to the next. Finally, leveraging our findings, we introduce a novel token-level early-exit strategy, which surpasses existing methods in balancing performance and efficiency.
An analysis showing that inside a transformer there is a point after which the top-1 token is fixed and no longer changes, then the top-2 token is fixed, and so on, with token ranks being decided in order. The model appears to work through these as sequential tasks: once a signal indicates that the top-1 decision task is complete, it moves on to deciding the top-2 token, and so forth.
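A simple way to picture a saturation event is a logit-lens-style readout per layer: the saturation layer for rank k is the earliest layer after which the rank-k token no longer changes. Using the unembedding matrix as the per-layer readout is an assumption here, not necessarily the paper's exact probe.

```python
# Sketch: find the layer where the rank-k token saturates under a
# logit-lens-style readout (assumed probe, for illustration).
import torch

def saturation_layer(hidden_states, unembed, k=1):
    """hidden_states: list of (d_model,) residual states at one position, per layer.
    unembed: (vocab, d_model) readout matrix."""
    ranks = []
    for h in hidden_states:
        logits = unembed @ h
        ranks.append(int(logits.topk(k).indices[-1]))   # token at rank k
    final = ranks[-1]
    for layer in range(len(ranks)):
        if all(r == final for r in ranks[layer:]):
            return layer
    return len(ranks) - 1
```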
#transformer