2025년 2월 28일
3FS
(DeepSeek)
DeepSeek's latest open-source release is 3FS, the distributed filesystem they have occasionally mentioned in their papers, along with a framework for data processing (https://github.com/deepseek-ai/smallpond). I didn't expect them to release this.
#efficiency
Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts
(Shulai Zhang, Ningxin Zheng, Haibin Lin, Ziheng Jiang, Wenlei Bao, Chengquan Jiang, Qi Hou, Weihao Cui, Size Zheng, Li-Wen Chang, Quan Chen, Xin Liu)
Mixture-of-experts (MoE) has been extensively employed to scale large language models to trillion-plus parameters while maintaining a fixed computational cost. The development of large MoE models in the distributed scenario encounters the problem of large communication overhead. The inter-device communication of a MoE layer can occupy 47% time of the entire model execution with popular models and frameworks. Therefore, existing methods suggest the communication in a MoE layer to be pipelined with the computation for overlapping. However, these coarse grained overlapping schemes introduce a notable impairment of computational efficiency and the latency concealing is sub-optimal. To this end, we present COMET, an optimized MoE system with fine-grained communication-computation overlapping. Leveraging data dependency analysis and task rescheduling, COMET achieves precise fine-grained overlapping of communication and computation. Through adaptive workload assignment, COMET effectively eliminates fine-grained communication bottlenecks and enhances its adaptability across various scenarios. Our evaluation shows that COMET accelerates the execution of a single MoE layer by 1.96× and for end-to-end execution, COMET delivers a 1.71× speedup on average. COMET has been adopted in the production environment of clusters with ten-thousand-scale of GPUs, achieving savings of millions of GPU hours.
An improvement on overlapping MoE computation and communication by splitting the inputs. They attempt finer-grained partitioning to increase efficiency while resolving the difficulties that arise from it. Very interesting.
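As a rough illustration of the baseline idea the comment refers to (chunk the tokens so dispatch communication and expert compute can overlap), here is a minimal PyTorch/NCCL-style sketch. `expert_fn`, the equal-split dispatch, and the omission of routing are simplifying assumptions; COMET itself overlaps at a much finer, sub-kernel granularity.

```python
import torch
import torch.distributed as dist

def chunked_moe_forward(tokens, expert_fn, num_chunks=4, group=None):
    # Split the token batch so the all-to-all dispatch of later chunks can run
    # while the experts compute on earlier chunks (coarse-grained pipelining).
    pending = []
    for chunk in tokens.chunk(num_chunks, dim=0):
        recv = torch.empty_like(chunk)  # equal splits assumed; real routing is uneven
        work = dist.all_to_all_single(recv, chunk, group=group, async_op=True)
        pending.append((work, recv))

    outputs = []
    for work, recv in pending:
        work.wait()                       # only this chunk's dispatch must finish
        outputs.append(expert_fn(recv))   # expert FFN overlaps with later dispatches
    return torch.cat(outputs, dim=0)
```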
#moe #efficiency
Beyond Next-Token: Next-X Prediction for Autoregressive Visual Generation
(Sucheng Ren, Qihang Yu, Ju He, Xiaohui Shen, Alan Yuille, Liang-Chieh Chen)
Autoregressive (AR) modeling, known for its next-token prediction paradigm, underpins state-of-the-art language and visual generative models. Traditionally, a ``token'' is treated as the smallest prediction unit, often a discrete symbol in language or a quantized patch in vision. However, the optimal token definition for 2D image structures remains an open question. Moreover, AR models suffer from exposure bias, where teacher forcing during training leads to error accumulation at inference. In this paper, we propose xAR, a generalized AR framework that extends the notion of a token to an entity X, which can represent an individual patch token, a cell (a k×k grouping of neighboring patches), a subsample (a non-local grouping of distant patches), a scale (coarse-to-fine resolution), or even a whole image. Additionally, we reformulate discrete token classification as continuous entity regression, leveraging flow-matching methods at each AR step. This approach conditions training on noisy entities instead of ground truth tokens, leading to Noisy Context Learning, which effectively alleviates exposure bias. As a result, xAR offers two key advantages: (1) it enables flexible prediction units that capture different contextual granularity and spatial structures, and (2) it mitigates exposure bias by avoiding reliance on teacher forcing. On ImageNet-256 generation benchmark, our base model, xAR-B (172M), outperforms DiT-XL/SiT-XL (675M) while achieving 20× faster inference. Meanwhile, xAR-H sets a new state-of-the-art with an FID of 1.24, running 2.2× faster than the previous best-performing model without relying on vision foundation modules (e.g., DINOv2) or advanced guidance interval sampling.
A method that combines flow matching with various forms of autoregression. Here they settled on autoregression over units of N patches.
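A minimal sketch of what per-entity flow matching with Noisy Context Learning could look like in a single training step, assuming continuous entities of shape [B, N, D] and an illustrative `model(x_t, t)` interface (not the authors' code):

```python
import torch
import torch.nn.functional as F

def xar_training_step(model, entities):
    # entities: [B, N, D] continuous prediction units (patches, cells, ...).
    noise = torch.randn_like(entities)
    t = torch.rand(entities.size(0), entities.size(1), 1, device=entities.device)
    x_t = (1.0 - t) * noise + t * entities       # linear flow-matching path
    target_velocity = entities - noise           # velocity the model regresses
    # Noisy Context Learning: the (assumed causal) model is conditioned on the
    # noisy entities x_t rather than ground-truth entities, so training never
    # relies on teacher forcing with a clean context.
    pred_velocity = model(x_t, t)
    return F.mse_loss(pred_velocity, target_velocity)
```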
#autoregressive-model #diffusion
FlexVAR: Flexible Visual Autoregressive Modeling without Residual Prediction
(Siyu Jiao, Gengwei Zhang, Yinlong Qian, Jiancheng Huang, Yao Zhao, Humphrey Shi, Lin Ma, Yunchao Wei, Zequn Jie)
This work challenges the residual prediction paradigm in visual autoregressive modeling and presents FlexVAR, a new Flexible Visual AutoRegressive image generation paradigm. FlexVAR facilitates autoregressive learning with ground-truth prediction, enabling each step to independently produce plausible images. This simple, intuitive approach swiftly learns visual distributions and makes the generation process more flexible and adaptable. Trained solely on low-resolution images (≤ 256px), FlexVAR can: (1) Generate images of various resolutions and aspect ratios, even exceeding the resolution of the training images. (2) Support various image-to-image tasks, including image refinement, in/out-painting, and image expansion. (3) Adapt to various autoregressive steps, allowing for faster inference with fewer steps or enhancing image quality with more steps. Our 1.0B model outperforms its VAR counterpart on the ImageNet 256×256 benchmark. Moreover, when zero-shot transfer the image generation process with 13 steps, the performance further improves to 2.08 FID, outperforming state-of-the-art autoregressive models AiM/VAR by 0.25/0.28 FID and popular diffusion models LDM/DiT by 1.52/0.19 FID, respectively. When transferring our 1.0B model to the ImageNet 512×512 benchmark in a zero-shot manner, FlexVAR achieves competitive results compared to the VAR 2.3B model, which is a fully supervised model trained at 512×512 resolution.
In visual autoregression, instead of residual quantization they have the model predict the ground truth directly, and they use interpolated position encodings to generalize to higher resolutions.
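The resolution generalization mentioned here usually amounts to resampling the positional grid; below is a minimal sketch of interpolating a learned 2D position embedding, assuming a square H×W layout (FlexVAR's exact scheme may differ):

```python
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed, new_h, new_w):
    # pos_embed: [1, H*W, D] learned embeddings laid out on a square grid.
    _, length, dim = pos_embed.shape
    side = int(length ** 0.5)
    grid = pos_embed.reshape(1, side, side, dim).permute(0, 3, 1, 2)   # [1, D, H, W]
    grid = F.interpolate(grid, size=(new_h, new_w), mode="bicubic", align_corners=False)
    return grid.permute(0, 2, 3, 1).reshape(1, new_h * new_w, dim)
```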
#autoregressive-model #vq
UniTok: A Unified Tokenizer for Visual Generation and Understanding
(Chuofan Ma, Yi Jiang, Junfeng Wu, Jihan Yang, Xin Yu, Zehuan Yuan, Bingyue Peng, Xiaojuan Qi)
The representation disparity between visual generation and understanding imposes a critical gap in integrating these capabilities into a single framework. To bridge this gap, we introduce UniTok, a discrete visual tokenizer that encodes fine-grained details for generation while also capturing high-level semantics for understanding. Despite recent studies have shown that these objectives could induce loss conflicts in training, we reveal that the underlying bottleneck stems from limited representational capacity of discrete tokens. We address this by introducing multi-codebook quantization, which divides vector quantization with several independent sub-codebooks to expand the latent feature space, while avoiding training instability caused by overlarge codebooks. Our method significantly raises the upper limit of unified discrete tokenizers to match or even surpass domain-specific continuous tokenizers. For instance, UniTok achieves a remarkable rFID of 0.38 (versus 0.87 for SD-VAE) and a zero-shot accuracy of 78.6% (versus 76.2% for CLIP) on ImageNet. Our code is available at https://github.com/FoundationVision/UniTok.
They apply a CLIP loss to the VQ tokenizer and address its limited representational capacity by using multiple codebooks.
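A minimal sketch of the multi-codebook quantization described in the abstract, assuming the latent is split evenly across independent sub-codebooks with a straight-through estimator; the hyperparameters are illustrative and the CLIP-loss branch is omitted:

```python
import torch
import torch.nn as nn

class MultiCodebookQuantizer(nn.Module):
    def __init__(self, dim=256, num_books=8, codebook_size=4096):
        super().__init__()
        assert dim % num_books == 0
        self.sub_dim = dim // num_books
        self.codebooks = nn.ModuleList(
            nn.Embedding(codebook_size, self.sub_dim) for _ in range(num_books)
        )

    def forward(self, z):  # z: [B, L, dim] encoder features
        quantized, indices = [], []
        for i, book in enumerate(self.codebooks):
            z_i = z[..., i * self.sub_dim:(i + 1) * self.sub_dim]
            # Nearest codeword within this sub-codebook.
            dists = (z_i.unsqueeze(2) - book.weight).pow(2).sum(-1)   # [B, L, K]
            idx = dists.argmin(dim=-1)                                # [B, L]
            q_i = book(idx)                                           # [B, L, sub_dim]
            quantized.append(z_i + (q_i - z_i).detach())              # straight-through
            indices.append(idx)
        return torch.cat(quantized, dim=-1), torch.stack(indices, dim=-1)
```

Splitting the vector across several small codebooks enlarges the effective latent space (codebook_size to the power of num_books combinations) without the training instability of one very large codebook.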
#vq #clip