May 28, 2024
Automatic Data Curation for Self-Supervised Learning: A Clustering-Based Approach
(Huy V. Vo, Vasil Khalidov, Timothée Darcet, Théo Moutakanni, Nikita Smetanin, Marc Szafraniec, Hugo Touvron, Camille Couprie, Maxime Oquab, Armand Joulin, Hervé Jégou, Patrick Labatut, Piotr Bojanowski)
Self-supervised features are the cornerstone of modern machine learning systems. They are typically pre-trained on data collections whose construction and curation typically require extensive human effort. This manual process has some limitations similar to those encountered in supervised learning, e.g., the crowd-sourced selection of data is costly and time-consuming, preventing scaling the dataset size. In this work, we consider the problem of automatic curation of high-quality datasets for self-supervised pre-training. We posit that such datasets should be large, diverse and balanced, and propose a clustering-based approach for building ones satisfying all these criteria. Our method involves successive and hierarchical applications of k-means on a large and diverse data repository to obtain clusters that distribute uniformly among data concepts, followed by a hierarchical, balanced sampling step from these clusters. Extensive experiments on three different data domains including web-based images, satellite images and text show that features trained on our automatically curated datasets outperform those trained on uncurated data while being on par or better than ones trained on manually curated data.
The premise is that a dataset for self-supervised learning should be large, diverse, and balanced, i.e., no single concept should take up a disproportionate share of it; the question is then how to achieve that balance.
The obvious approach is to cluster the data and draw a fixed number of samples from each cluster, but on an imbalanced dataset k-means tends to produce several clusters for the same concept. So the authors apply k-means repeatedly to build a hierarchy of clusters and then sample in a balanced way through that hierarchy.
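A minimal sketch of that idea (not the authors' code): a two-level k-means hierarchy over precomputed embeddings, followed by an equal per-leaf sampling budget. The number of levels, the cluster counts, and the function names are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def hierarchical_kmeans(features, ks=(64, 8)):
    """Cluster at two levels: level 0 clusters the points, level 1 re-clusters
    the level-0 centroids so duplicated concepts merge into one branch."""
    level0 = KMeans(n_clusters=ks[0], n_init=10).fit(features)
    level1 = KMeans(n_clusters=ks[1], n_init=10).fit(level0.cluster_centers_)
    leaf = level0.labels_                 # leaf cluster of each point
    top = level1.labels_[leaf]            # top-level cluster of each point
    return top, leaf

def balanced_sample(top, leaf, per_leaf=10, seed=0):
    """Take the same number of points from every leaf, visiting leaves grouped
    by their top-level cluster so concepts are covered evenly."""
    rng = np.random.default_rng(seed)
    picked = []
    for t in np.unique(top):
        for l in np.unique(leaf[top == t]):
            idx = np.where(leaf == l)[0]
            n = min(per_leaf, len(idx))
            picked.extend(rng.choice(idx, size=n, replace=False))
    return np.asarray(picked)

# Usage on random stand-in embeddings.
X = np.random.randn(5000, 128).astype(np.float32)
top, leaf = hierarchical_kmeans(X)
subset = balanced_sample(top, leaf, per_leaf=5)   # indices of the curated subset
print(subset.shape)
```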
You could call it a more systematic version of Semantic Deduplication.
#dataset
Greedy Growing Enables High-Resolution Pixel-Based Diffusion Models
(Cristina N. Vasconcelos, Abdullah Rashwan, Austin Waters, Trevor Walker, Keyang Xu, Jimmy Yan, Rui Qian, Shixin Luo, Zarana Parekh, Andrew Bunner, Hongliang Fei, Roopal Garg, Mandy Guo, Ivana Kajic, Yeqing Li, Henna Nandwani, Jordi Pont-Tuset, Yasumasa Onoe, Sarah Rosston, Su Wang, Wenlei Zhou, Kevin Swersky, David J. Fleet, Jason M. Baldridge, Oliver Wang)
We address the long-standing problem of how to learn effective pixel-based image diffusion models at scale, introducing a remarkably simple greedy growing method for stable training of large-scale, high-resolution models, without the need for cascaded super-resolution components. The key insight stems from careful pre-training of core components, namely, those responsible for text-to-image alignment vs. high-resolution rendering. We first demonstrate the benefits of scaling a Shallow UNet, with no down(up)-sampling enc(dec)oder. Scaling its deep core layers is shown to improve alignment, object structure, and composition. Building on this core model, we propose a greedy algorithm that grows the architecture into high-resolution end-to-end models, while preserving the integrity of the pre-trained representation, stabilizing training, and reducing the need for large high-resolution datasets. This enables a single-stage model capable of generating high-resolution images without the need for a super-resolution cascade. Our key results rely on public datasets and show that we are able to train non-cascaded models up to 8B parameters with no further regularization schemes. Vermeer, our full pipeline model trained with internal datasets to produce 1024x1024 images, without cascades, is preferred by human evaluators over SDXL, 44.0% vs. 21.4%.
Google's pixel-space diffusion. They set up Transformer layers operating on a 16x16 grid as the core module and attach an encoder/decoder before and after it. Training starts at low resolution; when moving to higher resolution, the core module is kept and the encoder/decoder are swapped out.
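A rough sketch of that growing scheme as I read it (not Google's implementation): the core always sees a 16x16 token grid, and only the surrounding encoder/decoder are rebuilt when the resolution increases. Module sizes and names are illustrative, and diffusion-specific pieces such as timestep and text conditioning are omitted.

```python
import torch
import torch.nn as nn

class Core(nn.Module):
    """Transformer layers operating on a fixed 16x16 token grid (256 tokens)."""
    def __init__(self, dim=256, depth=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, tokens):                    # tokens: (B, 256, dim)
        return self.blocks(tokens)

class PixelModel(nn.Module):
    """Encoder maps an image of any resolution to the 16x16 grid, decoder maps
    it back; only these two parts change when the resolution grows."""
    def __init__(self, core, resolution, dim=256):
        super().__init__()
        patch = resolution // 16                  # patch size keeps the grid 16x16
        self.encode = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.decode = nn.ConvTranspose2d(dim, 3, kernel_size=patch, stride=patch)
        self.core = core

    def forward(self, x):
        z = self.encode(x)                        # (B, dim, 16, 16)
        t = z.flatten(2).transpose(1, 2)          # (B, 256, dim)
        t = self.core(t)
        z = t.transpose(1, 2).reshape_as(z)
        return self.decode(z)

core = Core()
low = PixelModel(core, resolution=64)     # pre-train alignment at low resolution
# "Grow": reuse the trained core, attach a fresh encoder/decoder for 256x256.
high = PixelModel(core, resolution=256)
print(high(torch.randn(1, 3, 256, 256)).shape)    # torch.Size([1, 3, 256, 256])
```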
#diffusion
MoEUT: Mixture-of-Experts Universal Transformers
(Róbert Csordás, Kazuki Irie, Jürgen Schmidhuber, Christopher Potts, Christopher D. Manning)
Previous work on Universal Transformers (UTs) has demonstrated the importance of parameter sharing across layers. By allowing recurrence in depth, UTs have advantages over standard Transformers in learning compositional generalizations, but layer-sharing comes with a practical limitation of parameter-compute ratio: it drastically reduces the parameter count compared to the non-shared model with the same dimensionality. Naively scaling up the layer size to compensate for the loss of parameters makes its computational resource requirements prohibitive. In practice, no previous work has succeeded in proposing a shared-layer Transformer design that is competitive in parameter count-dominated tasks such as language modeling. Here we propose MoEUT (pronounced "moot"), an effective mixture-of-experts (MoE)-based shared-layer Transformer architecture, which combines several recent advances in MoEs for both feedforward and attention layers of standard Transformers together with novel layer-normalization and grouping schemes that are specific and crucial to UTs. The resulting UT model, for the first time, slightly outperforms standard Transformers on language modeling tasks such as BLiMP and PIQA, while using significantly less compute and memory.
A Universal Transformer built on σ-MoE + SwitchHead. Similar to Sparse Universal Transformer (https://arxiv.org/abs/2310.07096), but the details differ.
The main point is that it is competitive with a baseline Transformer on language modeling. Conversely, there are no results on compositional generalization, and I am curious what behavior it would show there.
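A hedged sketch of the shared-layer-plus-MoE idea, not the actual σ-MoE/SwitchHead design: a single block whose weights are reused at every depth step (the Universal Transformer part), with a generic top-k routed expert feedforward standing in for σ-MoE to recover parameter count. MoE attention, the special layer normalization, and layer grouping are omitted; all sizes are illustrative.

```python
import torch
import torch.nn as nn

class MoEFeedForward(nn.Module):
    """Generic top-k routed expert FFN (stand-in for σ-MoE)."""
    def __init__(self, dim=256, n_experts=8, k=2, hidden=512):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.w_in = nn.Parameter(torch.randn(n_experts, dim, hidden) * dim ** -0.5)
        self.w_out = nn.Parameter(torch.randn(n_experts, hidden, dim) * hidden ** -0.5)
        self.k = k

    def forward(self, x):                         # x: (B, T, dim)
        scores = self.router(x).softmax(-1)       # (B, T, E)
        topv, topi = scores.topk(self.k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):                # send each token to its top-k experts
            idx, gate = topi[..., slot], topv[..., slot:slot + 1]
            h = torch.einsum("btd,btdh->bth", x, self.w_in[idx]).relu()
            out = out + gate * torch.einsum("bth,bthd->btd", h, self.w_out[idx])
        return out

class SharedBlock(nn.Module):
    """One block whose parameters are reused at every depth step."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.moe = MoEFeedForward(dim)
        self.n1, self.n2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x):
        h = self.n1(x)
        a, _ = self.attn(h, h, h, need_weights=False)
        x = x + a
        return x + self.moe(self.n2(x))

block = SharedBlock()
x = torch.randn(2, 16, 256)
for _ in range(6):        # "depth" = number of times the same block is applied
    x = block(x)
print(x.shape)            # torch.Size([2, 16, 256])
```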
#moe #transformer