October 8, 2024
DAPE V2: Process Attention Score as Feature Map for Length Extrapolation
(Chuanyang Zheng, Yihang Gao, Han Shi, Jing Xiong, Jiankai Sun, Jingyao Li, Minbin Huang, Xiaozhe Ren, Michael Ng, Xin Jiang, Zhenguo Li, Yu Li)
The attention mechanism is a fundamental component of the Transformer model, contributing to interactions among distinct tokens, in contrast to earlier feed-forward neural networks. In general, the attention scores are determined simply by the key-query products. However, an exploratory experiment in this work (combining DAPE and NoPE), which adds MLPs on top of the attention scores without position encoding, indicates that the classical key-query multiplication may limit the performance of Transformers. In this work, we conceptualize attention as a feature map and apply the convolution operator (for neighboring attention scores across different heads) to mimic the processing methods in computer vision. Specifically, the main contribution of this paper is identifying and interpreting the Transformer length extrapolation problem as a result of the limited expressiveness of the naive query and key dot product, and we successfully translate the length extrapolation issue into a well-understood feature map processing problem. The novel insight, which can be adapted to various attention-related models, reveals that the current Transformer architecture has the potential for further evolution. Extensive experiments demonstrate that treating attention as a feature map and applying convolution as a processing method significantly enhances Transformer performance.
This is quite surprising. The method applies a convolution to the attention scores and uses the result as an attention bias. Remarkably, models trained with 128/512 sequence lengths extrapolate successfully to 8K. I haven't seen improvements of this magnitude from positional encoding before, so I keep double-checking whether the results hold up.
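Below is a minimal PyTorch sketch of the idea as I read it from the abstract: the pre-softmax attention scores (a heads × query × key tensor) are treated as a feature map, a small convolution mixes neighboring scores across heads, and the output is added back as an attention bias. The `ConvScoreBias` module name and the masking details are my own assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConvScoreBias(nn.Module):
    """Hypothetical sketch: derive an additive attention bias by convolving the
    (heads x q_len x k_len) attention-score map across heads and neighbors."""

    def __init__(self, num_heads: int, kernel_size: int = 3):
        super().__init__()
        # Mixes neighboring (query, key) scores and information across heads.
        self.conv = nn.Conv2d(num_heads, num_heads, kernel_size,
                              padding=kernel_size // 2)

    def forward(self, scores: torch.Tensor, causal_mask: torch.Tensor) -> torch.Tensor:
        # scores: (batch, heads, q_len, k_len); causal_mask: True where masked.
        # Zero masked positions before convolving so no future scores leak in
        # (the actual paper presumably handles causality more carefully).
        visible = scores.masked_fill(causal_mask, 0.0)
        return scores + self.conv(visible)


def attention_with_conv_bias(q, k, v, bias_module: ConvScoreBias):
    # q, k, v: (batch, heads, seq_len, head_dim)
    seq_len = q.shape[-2]
    mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool,
                                 device=q.device), diagonal=1)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    scores = bias_module(scores, mask)                 # conv output as attention bias
    scores = scores.masked_fill(mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v
```

Since the bias depends only on local score patterns rather than absolute positions, that is presumably where the extrapolation behavior comes from.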
#long-context #attention #positional-encoding
Differential Transformer
(Tianzhu Ye, Li Dong, Yuqing Xia, Yutao Sun, Yi Zhu, Gao Huang, Furu Wei)
Transformer tends to overallocate attention to irrelevant context. In this work, we introduce Diff Transformer, which amplifies attention to the relevant context while canceling noise. Specifically, the differential attention mechanism calculates attention scores as the difference between two separate softmax attention maps. The subtraction cancels noise, promoting the emergence of sparse attention patterns. Experimental results on language modeling show that Diff Transformer outperforms Transformer in various settings of scaling up model size and training tokens. More intriguingly, it offers notable advantages in practical applications, such as long-context modeling, key information retrieval, hallucination mitigation, in-context learning, and reduction of activation outliers. By being less distracted by irrelevant context, Diff Transformer can mitigate hallucination in question answering and text summarization. For in-context learning, Diff Transformer not only enhances accuracy but is also more robust to order permutation, which was considered a chronic robustness issue. The results position Diff Transformer as a highly effective and promising architecture to advance large language models.
The idea here is that attention weight being allocated to irrelevant tokens is a problem, and the proposed remedy is to use the difference between two attention maps. The issue of attention entropy has been raised before (https://arxiv.org/abs/2308.16137) and is still being discussed (https://arxiv.org/abs/2410.01104). A clear solution hadn't emerged, but now one approach has.
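A minimal sketch of the mechanism as described in the abstract: two softmax attention maps are computed from separate query/key projections, and their weighted difference attends over the values. The scalar `lam` here is only a placeholder for the paper's learnable, reparameterized λ.

```python
import torch.nn.functional as F


def differential_attention(q1, k1, q2, k2, v, lam: float = 0.5):
    # q1, k1, q2, k2: (batch, heads, seq_len, head_dim); v: (batch, heads, seq_len, v_dim)
    d = q1.shape[-1]
    a1 = F.softmax(q1 @ k1.transpose(-2, -1) / d ** 0.5, dim=-1)
    a2 = F.softmax(q2 @ k2.transpose(-2, -1) / d ** 0.5, dim=-1)
    # Subtracting the second map cancels attention mass that both maps put on
    # irrelevant context, leaving a sparser, more peaked attention pattern.
    return (a1 - lam * a2) @ v
```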
#transformer #long-context #attention
Understanding Warmup-Stable-Decay Learning Rates: A River Valley Loss Landscape Perspective
(Kaiyue Wen, Zhiyuan Li, Jason Wang, David Hall, Percy Liang, Tengyu Ma)
Training language models currently requires pre-determining a fixed compute budget because the typical cosine learning rate schedule depends on the total number of steps. In contrast, the Warmup-Stable-Decay (WSD) schedule uses a constant learning rate to produce a main branch of iterates that can in principle continue indefinitely without a pre-specified compute budget. Then, given any compute budget, one can branch out from the main branch at a proper time with a rapidly decaying learning rate to produce a strong model. Empirically, WSD generates a non-traditional loss curve: the loss remains elevated during the stable phase but sharply declines during the decay phase. Towards explaining this phenomenon, we conjecture that pretraining loss exhibits a river valley landscape, which resembles a deep valley with a river at its bottom. Under this assumption, we show that during the stable phase, the iterate undergoes large oscillations due to the high learning rate, yet it progresses swiftly along the river. During the decay phase, the rapidly dropping learning rate minimizes the iterate's oscillations, moving it closer to the river and revealing true optimization progress. Therefore, the sustained high learning rate phase and fast decaying phase are responsible for progress in the river and the mountain directions respectively, and are both critical. Our analysis predicts phenomena consistent with empirical observations and shows that this landscape can emerge from pretraining on a simple bi-gram dataset. Inspired by the theory, we introduce WSD-S, a variant of WSD that reuses previous checkpoints' decay phases and keeps only one main branch, where we resume from a decayed checkpoint. WSD-S empirically outperforms WSD and Cyclic-Cosine in obtaining multiple language model checkpoints across various compute budgets in a single run for parameters scaling from 0.1B to 1.2B.
An analysis of the Warmup-Stable-Decay LR schedule. The picture is that during the stable phase, the high learning rate keeps the iterate from descending into the valley, leaving it up on the mountainside, yet it still makes forward progress; in the decay phase, as the LR drops, it descends into the valley from that advanced position. Based on this idea, they show that for continual pretraining with WSD there is no need to branch from a stable-phase checkpoint: you can take a decayed checkpoint and resume training directly at the stable LR.
They also connect the origin of this mountain-and-valley loss landscape to token uncertainty: the mountain directions correspond to low-uncertainty tokens, while the valley (river) direction corresponds to high-uncertainty tokens. From that perspective, one could imagine that the choice of LR schedule changes what the model actually learns, although whatever that difference is, it does not show up clearly in the metrics.
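A toy sketch of the schedule shapes under discussion, with a linear decay standing in for whatever decay function is actually used; the WSD-S idea then amounts to loading a decayed checkpoint and returning the LR straight to the stable value.

```python
def wsd_lr(step: int, peak_lr: float, warmup: int, stable: int, decay: int,
           min_lr: float = 0.0) -> float:
    """Warmup-Stable-Decay: linear warmup, constant plateau, rapid decay branch."""
    if step < warmup:
        return peak_lr * (step + 1) / warmup
    if step < warmup + stable:
        return peak_lr                         # high LR: fast progress along the river
    t = min(step - warmup - stable, decay) / decay
    return peak_lr + (min_lr - peak_lr) * t    # drop down the mountain toward the river


# WSD-S-style continual pretraining (sketch): rather than branching from a
# stable-phase checkpoint, resume from the *decayed* checkpoint and set the
# LR directly back to peak_lr for the next stable phase.
```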
#optimization