2024년 5월 16일
CLIP with Quality Captions: A Strong Pretraining for Vision Tasks
(Pavan Kumar Anasosalu Vasu, Hadi Pouransari, Fartash Faghri, Oncel Tuzel)
CLIP models perform remarkably well on zero-shot classification and retrieval tasks. But recent studies have shown that learnt representations in CLIP are not well suited for dense prediction tasks like object detection, semantic segmentation or depth estimation. More recently, multi-stage training methods for CLIP models were introduced to mitigate the weak performance of CLIP on downstream tasks. In this work, we find that simply improving the quality of captions in image-text datasets improves the quality of CLIP's visual representations, resulting in significant improvement on downstream dense prediction vision tasks. In fact, we find that CLIP pretraining with good quality captions can surpass recent supervised, self-supervised and weakly supervised pretraining methods. We show that when a CLIP model with ViT-B/16 as image encoder is trained on well-aligned image-text pairs it obtains 12.1% higher mIoU and 11.5% lower RMSE on semantic segmentation and depth estimation tasks over recent state-of-the-art Masked Image Modeling (MIM) pretraining methods like Masked Autoencoder (MAE). We find that mobile architectures also benefit significantly from CLIP pretraining. A recent mobile vision architecture, MCi2, with CLIP pretraining obtains similar performance as Swin-L, pretrained on ImageNet-22k, for the semantic segmentation task while being 6.1$\times$ smaller. Moreover, we show that improving caption quality results in 10$\times$ data efficiency when finetuning for dense prediction tasks.
A result showing that training CLIP on a recaptioned dataset improves performance on dense prediction tasks such as Detection/Segmentation/Depth Estimation. Considering that far more detailed captions can be generated these days, it seems performance could be pushed even further. A minimal sketch of the recaptioning idea follows below.
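As a rough illustration of the recaptioning step (not the authors' exact pipeline), one could replace noisy alt-text with captions from a pretrained captioner before running standard CLIP contrastive pretraining. The BLIP checkpoint and the dataset handling below are assumptions for the sketch:

```python
# Sketch of the recaptioning idea: replace noisy alt-text captions with
# captions generated by a pretrained image captioner before CLIP training.
# The captioner choice and data format are illustrative assumptions.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base"
)

def recaption(image_paths):
    """Generate a synthetic caption for each image."""
    captions = []
    for path in image_paths:
        image = Image.open(path).convert("RGB")
        inputs = processor(images=image, return_tensors="pt")
        out = captioner.generate(**inputs, max_new_tokens=40)
        captions.append(processor.decode(out[0], skip_special_tokens=True))
    return captions

# The recaptioned (image, caption) pairs then replace the original
# alt-text pairs in the CLIP contrastive pretraining corpus.
```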
#clip #captioning
ALPINE: Unveiling the Planning Capability of Autoregressive Learning in Language Models
(Siwei Wang, Yifei Shen, Shi Feng, Haoran Sun, Shang-Hua Teng, Wei Chen)
In this paper, we present the findings of our Project ALPINE, which stands for "Autoregressive Learning for Planning In NEtworks." Project ALPINE initiates a theoretical investigation into the development of planning capabilities in Transformer-based language models through their autoregressive learning mechanisms, aiming to identify any potential limitations in their planning abilities. We abstract planning as a network path-finding task where the objective is to generate a valid path from a specified source node to a designated target node. In terms of expressiveness, we show that the Transformer is capable of executing path-finding by embedding the adjacency and reachability matrices within its weights. Our theoretical analysis of the gradient-based learning dynamics of the Transformer reveals that the Transformer is capable of learning both the adjacency matrix and a limited form of the reachability matrix. These theoretical insights are then validated through experiments, which demonstrate that the Transformer indeed learns the adjacency matrix and an incomplete reachability matrix, which aligns with the predictions made in our theoretical analysis. Additionally, when applying our methodology to a real-world planning benchmark, called Blocksworld, our observations remain consistent. Our theoretical and empirical analyses further unveil a potential limitation of the Transformer in path-finding: it cannot identify reachability relationships through transitivity, and thus would fail when path concatenation is needed to generate a path. In summary, our findings shed new light on how the internal mechanisms of autoregressive learning enable planning in networks. This study may contribute to our understanding of the general planning capabilities in other related domains.
An analysis of the planning ability of autoregressive Transformers. Here, whether the model can generate a path through a graph given a source node and a target node is used as a surrogate for planning ability; a toy version of this task format is sketched below. The setting is reminiscent of a recent study that pointed out problems with teacher forcing (https://arxiv.org/abs/2403.06963). (The graph structures differ, so the required algorithms are not the same.)
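As a toy illustration of the setup, training sequences can be built by sampling valid paths over a small directed graph. The exact token format used in the paper is an assumption here:

```python
# Toy version of the path-finding task used as a planning surrogate:
# each training example looks like "source target : node node ... target".
import random

edges = [("A", "B"), ("B", "C"), ("A", "D"), ("D", "C"), ("C", "E")]
adj = {}
for u, v in edges:
    adj.setdefault(u, []).append(v)

def sample_path(source, target, max_len=10):
    """Random walk from source; returns a valid path to target or None."""
    path = [source]
    while len(path) <= max_len:
        if path[-1] == target:
            return path
        successors = adj.get(path[-1])
        if not successors:
            return None  # dead end
        path.append(random.choice(successors))
    return None

# Build training sequences from successful walks.
random.seed(0)
sequences = []
while len(sequences) < 5:
    path = sample_path("A", "E")
    if path:
        sequences.append(f"A E : {' '.join(path)}")
print(sequences[0])  # e.g. "A E : A B C E"
```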
In this work, one of the pieces of information needed for path-finding is a matrix recording which nodes can reach which other nodes (the reachability matrix). The claim is that the high-order part is hard to learn, i.e., inferring that A reaches C from the facts that A connects to B and B connects to C; a small sketch of this gap follows below. This might connect to the idea in the study above that it is difficult to learn an algorithm that traces a path backward from the target and then generates the path from the starting point.
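A minimal sketch of that limitation, under the assumption that the model only picks up reachability for node pairs that co-occur in observed training paths, whereas true reachability requires a transitive closure (path concatenation):

```python
# Observed (limited) reachability from training paths vs. true reachability
# obtained by transitive closure of the adjacency matrix.
import numpy as np

nodes = ["A", "B", "C", "D"]
idx = {n: i for i, n in enumerate(nodes)}

# Adjacency: A->B, B->C, C->D
adjacency = np.zeros((4, 4), dtype=bool)
for u, v in [("A", "B"), ("B", "C"), ("C", "D")]:
    adjacency[idx[u], idx[v]] = True

# "Limited" reachability: pairs that co-occur in observed training paths.
# If training only contains the paths A->B->C and B->C->D, the pair (A, D)
# is never observed, even though A can reach D.
observed = np.zeros((4, 4), dtype=bool)
for path in [["A", "B", "C"], ["B", "C", "D"]]:
    for i in range(len(path)):
        for j in range(i + 1, len(path)):
            observed[idx[path[i]], idx[path[j]]] = True

# True reachability: transitive closure (Floyd-Warshall over booleans).
true_reach = adjacency.copy()
for k in range(4):
    true_reach |= true_reach[:, [k]] & true_reach[[k], :]

print(observed[idx["A"], idx["D"]])    # False: never seen in training
print(true_reach[idx["A"], idx["D"]])  # True: requires path concatenation
```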
#autoregressive-model