October 24, 2024
ALTA: Compiler-Based Analysis of Transformers
(Peter Shaw, James Cohan, Jacob Eisenstein, Kenton Lee, Jonathan Berant, Kristina Toutanova)
We propose a new programming language called ALTA and a compiler that can map ALTA programs to Transformer weights. ALTA is inspired by RASP, a language proposed by Weiss et al. (2021), and Tracr (Lindner et al., 2023), a compiler from RASP programs to Transformer weights. ALTA complements and extends this prior work, offering the ability to express loops and to compile programs to Universal Transformers, among other advantages. ALTA allows us to constructively show how Transformers can represent length-invariant algorithms for computing parity and addition, as well as a solution to the SCAN benchmark of compositional generalization tasks, without requiring intermediate scratchpad decoding steps. We also propose tools to analyze cases where the expressibility of an algorithm is established, but end-to-end training on a given training set fails to induce behavior consistent with the desired algorithm. To this end, we explore training from ALTA execution traces as a more fine-grained supervision signal. This enables additional experiments and theoretical analyses relating the learnability of various algorithms to data availability and modeling decisions, such as positional encodings. We make the ALTA framework -- language specification, symbolic interpreter, and weight compiler -- available to the community to enable further applications and insights.
A language that can be compiled to Transformer weights. The main difference from RASP is that it targets Universal Transformers and can therefore support loops. Now that the Transformer is nearly the only architecture in use, the limits of its expressive power are themselves important information, quite apart from whether a given program is learnable.
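The point about loops is easiest to see with a toy example. Below is a plain-Python sketch, not ALTA code and not necessarily the construction used in the paper, of a length-invariant prefix-parity algorithm in which one update rule is reused across iterations; that reuse of a single step, applied a number of times that grows with the input, is the pattern that compiling to a Universal Transformer makes expressible.

```python
def prefix_parity_by_shared_update(bits):
    """Plain-Python illustration (not ALTA code, and not necessarily the
    paper's construction) of a length-invariant parity algorithm: every
    iteration applies the same update rule to all positions, and only the
    number of iterations grows with input length."""
    state = list(bits)
    offset = 1
    while offset < len(state):
        # each position folds in the state `offset` steps to its left;
        # only `offset` changes between iterations (in a weight-shared model
        # it would have to be tracked as part of the per-position state)
        state = [state[i] ^ state[i - offset] if i >= offset else state[i]
                 for i in range(len(state))]
        offset *= 2
    return state  # state[i] == parity of bits[0..i]


assert prefix_parity_by_shared_update([1, 0, 1, 1]) == [1, 1, 0, 1]
```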
#transformer
Stick-breaking Attention
(Shawn Tan, Yikang Shen, Songlin Yang, Aaron Courville, Rameswar Panda)
The self-attention mechanism traditionally relies on the softmax operator, necessitating positional embeddings like RoPE, or position biases to account for token order. But current methods still face length generalisation challenges. We propose an alternative attention mechanism based on the stick-breaking process: for each token before the current one, we determine a break point β_{i,j}, which represents the proportion of the remaining stick to allocate to the current token. We repeat the process until the stick is fully allocated, resulting in a sequence of attention weights. This process naturally incorporates recency bias, which has linguistic motivations for grammar parsing (Shen et al., 2017). We study the implications of replacing the conventional softmax-based attention mechanism with stick-breaking attention. We then discuss the implementation of numerically stable stick-breaking attention and adapt Flash Attention to accommodate this mechanism. When used as a drop-in replacement for current softmax+RoPE attention systems, we find that stick-breaking attention performs competitively with current methods on length generalisation and downstream tasks. Stick-breaking also performs well at length generalisation, allowing a model trained with a 2^11 context window to perform well at 2^14 with perplexity improvements.
An efficient implementation of stick-breaking attention (https://arxiv.org/abs/2306.04640) has appeared. Stick-breaking attention has the effect that, if some token receives a high attention weight, the attention weights of the tokens before it are suppressed. My concern is that tokens far from the current position might end up being suppressed too strongly.
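As a concrete picture of the mechanism (and of where the recency bias comes from), here is a naive sketch of the attention-weight computation as I read it from the abstract; it is not the paper's numerically stable or Flash-Attention implementation, and details such as logit scaling and excluding the current token are my assumptions.

```python
import numpy as np

def stick_breaking_attention(q, k, v):
    """Naive reference sketch of stick-breaking attention for one head,
    based on my reading of the abstract. q, k, v: [T, d] float arrays.
    Scaling the logits by sqrt(d) is my assumption."""
    T, d = q.shape
    beta = 1.0 / (1.0 + np.exp(-(q @ k.T) / np.sqrt(d)))  # break points in (0, 1)
    out = np.zeros_like(v)
    for i in range(T):
        remaining = 1.0                      # the stick left to allocate
        weights = np.zeros(T)
        for j in range(i - 1, -1, -1):       # walk from the most recent token backwards
            weights[j] = beta[i, j] * remaining
            remaining *= 1.0 - beta[i, j]    # each break shrinks what older tokens can get
        out[i] = weights @ v                 # weights need not sum to 1 (no softmax)
    return out
```

The running product of (1 - β) terms is exactly where the concern above comes from: any strong break point between a distant token and the current one shrinks whatever weight that distant token can still receive.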
#attention #positional-encoding
Process Supervision-Guided Policy Optimization for Code Generation
(Ning Dai, Zheng Wu, Renjie Zheng, Ziyun Wei, Wenlei Shi, Xing Jin, Guanlin Liu, Chen Dun, Liang Huang, Lin Yan)
Reinforcement Learning (RL) with unit test feedback has enhanced large language model (LLM) code generation, but relies on sparse rewards provided only after complete code evaluation, limiting learning efficiency and incremental improvements. When generated code fails all unit tests, no learning signal is received, hindering progress on complex tasks. To address this, we propose a Process Reward Model (PRM) that delivers dense, line-level feedback on code correctness during generation, mimicking human code refinement and providing immediate guidance. We explore various strategies for training PRMs and integrating them into the RL framework, finding that using PRMs both as dense rewards and for value function initialization significantly boosts performance. Our approach increases our in-house LLM's pass rate from 28.2% to 29.8% on LiveCodeBench and from 31.8% to 35.8% on our internal benchmark. Our experimental results highlight the effectiveness of PRMs in enhancing RL-driven code generation, especially for long-horizon scenarios.
A Process Reward Model for code. The method cuts the code at an intermediate point, samples completions of the remainder, and rewards the preceding part of the code according to whether any of the sampled completions pass the unit tests.
That said (and the same goes for math), a reward for each code prefix seems inherently hard to assign from the code alone. The value of each line of code lies in whether it fits the strategy and plan for solving the problem. Of course, those strategies and plans could also be evaluated, as could the correctness of individual lines.
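My reading of the labeling scheme, as a hedged sketch: `sample_completion` and `passes_unit_tests` are hypothetical placeholders for an LLM sampler and a sandboxed test runner, and the paper's exact truncation and sampling procedure may differ.

```python
def label_code_prefixes(problem, solution_lines, sample_completion,
                        passes_unit_tests, num_samples=8):
    """Toy sketch of producing line-level PRM training labels, as I
    understand the idea (not the paper's exact recipe)."""
    labels = []
    for n in range(1, len(solution_lines) + 1):
        prefix = "\n".join(solution_lines[:n])
        # a prefix is labeled positive if any sampled continuation passes the tests
        positive = any(
            passes_unit_tests(problem, prefix + "\n" + sample_completion(problem, prefix))
            for _ in range(num_samples)
        )
        labels.append((prefix, 1.0 if positive else 0.0))
    return labels
```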
#reward-model #code