November 5, 2024
Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent
(Xingwu Sun, Yanfeng Chen, Yiqing Huang, Ruobing Xie, Jiaqi Zhu, Kai Zhang, Shuaipeng Li, Zhen Yang, Jonny Han, Xiaobo Shu, Jiahao Bu, Zhongzhi Chen, Xuemeng Huang, Fengzong Lian, Saiyong Yang, Jianfeng Yan, Yuyuan Zeng, Xiaoqin Ren, Chao Yu, Lulu Wu, Yue Mao, Tao Yang, Suncong Zheng, Kan Wu, Dian Jiao, Jinbao Xue, Xipeng Zhang, Decheng Wu, Kai Liu, Dengpeng Wu, Guanghui Xu, Shaohua Chen, Shuang Chen, Xiao Feng, Yigeng Hong, Junqiang Zheng, Chengcheng Xu, Zongwei Li, Xiong Kuang, Jianglu Hu, Yiqi Chen, Yuchi Deng, Guiyang Li, Ao Liu, Chenchen Zhang, Shihui Hu, Zilong Zhao, Zifan Wu, Yao Ding, Weichao Wang, Han Liu, Roberts Wang, Hao Fei, Peijie She, Ze Zhao, Xun Cao, Hai Wang, Fusheng Xiang, Mengyuan Huang, Zhiyuan Xiong, Bin Hu, Xuebin Hou, Lei Jiang, Jiajia Wu, Yaping Deng, Yi Shen, Qian Wang, Weijie Liu, Jie Liu, Meng Chen, Liang Dong, Weiwen Jia, Hu Chen, Feifei Liu, Rui Yuan, Huilin Xu, Zhenxiang Yan, Tengfei Cao, Zhichao Hu, Xinhua Feng, Dong Du, Tinghao She, Yangyu Tao, Feng Zhang, Jianchen Zhu, Chengzhong Xu, Xirui Li, Chong Zha, Wen Ouyang, Yinben Xia, Xiang Li, Zekun He, Rongpeng Chen, Jiawei Song, Ruibin Chen, Fan Jiang, Chongqing Zhao, Bo Wang, Hao Gong, Rong Gan, Winston Hu, Zhanhui Kang, Yong Yang, Yuhong Liu, Di Wang, Jie Jiang)
In this paper, we introduce Hunyuan-Large, which is currently the largest open-source Transformer-based mixture of experts model, with a total of 389 billion parameters and 52 billion activation parameters, capable of handling up to 256K tokens. We conduct a thorough evaluation of Hunyuan-Large's superior performance across various benchmarks including language understanding and generation, logical reasoning, mathematical problem-solving, coding, long-context, and aggregated tasks, where it outperforms Llama3.1-70B and exhibits comparable performance to the significantly larger Llama3.1-405B model. Key practices of Hunyuan-Large include large-scale synthetic data that is orders of magnitude larger than in previous literature, a mixed expert routing strategy, a key-value cache compression technique, and an expert-specific learning rate strategy. Additionally, we investigate the scaling laws and learning rate schedules of mixture of experts models, providing valuable insights and guidance for future model development and optimization. The code and checkpoints of Hunyuan-Large are released to facilitate future innovations and applications. Code: https://github.com/Tencent/Hunyuan-Large Models: https://huggingface.co/tencent/Tencent-Hunyuan-Large
Tencent's MoE model with 52B activated / 389B total parameters. They used 1.5T tokens of synthetic data built by converting web data into instructions. The model uses KV cache sharing, 1 shared expert, and top-1 routing over 16 specialized experts. Instead of token dropping, overflow tokens are rerouted to experts with remaining capacity. They additionally estimate scaling laws for MoE models; the 52B/7T training setup itself is close to the Chinchilla optimum (58B, 5.6T).
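A rough sketch of the mixed routing idea (one shared expert plus top-1 of 16 specialized experts). Module and parameter names are illustrative, not Hunyuan-Large's; the actual layer additionally uses capacity-aware recycle routing and expert-specific learning rates.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedExpertRouting(nn.Module):
    """Sketch of a shared-expert + top-1 routed MoE FFN (illustrative names)."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int = 16):
        super().__init__()
        # One shared expert that processes every token.
        self.shared_expert = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
        # 16 specialized experts; each token is routed to exactly one of them.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts))
        self.router = nn.Linear(d_model, num_experts, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model), batch and sequence dims already flattened.
        probs = F.softmax(self.router(x), dim=-1)
        top1_prob, top1_idx = probs.max(dim=-1)      # top-1 of the specialized experts
        routed = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top1_idx == e
            if mask.any():
                routed[mask] = top1_prob[mask, None] * expert(x[mask])
        return self.shared_expert(x) + routed        # shared path + routed path
```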
#llm #synthetic-data
Adaptive Length Image Tokenization via Recurrent Allocation
(Shivam Duggal, Phillip Isola, Antonio Torralba, William T. Freeman)
Current vision systems typically assign fixed-length representations to images, regardless of the information content. This contrasts with human intelligence - and even large language models - which allocate varying representational capacities based on entropy, context and familiarity. Inspired by this, we propose an approach to learn variable-length token representations for 2D images. Our encoder-decoder architecture recursively processes 2D image tokens, distilling them into 1D latent tokens over multiple iterations of recurrent rollouts. Each iteration refines the 2D tokens, updates the existing 1D latent tokens, and adaptively increases representational capacity by adding new tokens. This enables compression of images into a variable number of tokens, ranging from 32 to 256. We validate our tokenizer using reconstruction loss and FID metrics, demonstrating that token count aligns with image entropy, familiarity and downstream task requirements. Recurrent token processing with increasing representational capacity in each iteration shows signs of token specialization, revealing potential for object / part discovery.
Variable-length image tokenization. It uses an iterative approach where the encoder takes the 2D codes extracted by VQ plus additional 1D codes as input and repeatedly outputs new codes.
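A rough sketch of the recurrent rollout as I read the abstract: each step appends a fresh block of 1D latent tokens and re-encodes them jointly with the 2D image tokens. Module names, sizes, and the 32-tokens-per-step schedule are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class RecurrentTokenAllocator(nn.Module):
    """Sketch: grow a 1D latent token set over recurrent rollouts (assumed design)."""

    def __init__(self, d_model: int = 256, tokens_per_step: int = 32, max_steps: int = 8):
        super().__init__()
        # Learned initializations for the latent tokens added at each step.
        self.new_latents = nn.Parameter(torch.randn(max_steps, tokens_per_step, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, image_tokens: torch.Tensor, num_steps: int) -> torch.Tensor:
        # image_tokens: (B, N_2d, d_model), e.g. embedded 2D VQ codes of the image.
        B, n_2d, d = image_tokens.shape
        latents = image_tokens.new_zeros(B, 0, d)
        for step in range(num_steps):
            fresh = self.new_latents[step].expand(B, -1, -1)
            latents = torch.cat([latents, fresh], dim=1)     # grow capacity by 32 tokens
            joint = self.encoder(torch.cat([image_tokens, latents], dim=1))
            latents = joint[:, n_2d:]                        # refined 1D latent tokens
        return latents  # 32..256 tokens for num_steps in 1..8
```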
#vq
Seq-VCR: Preventing Collapse in Intermediate Transformer Representations for Enhanced Reasoning
(Md Rifat Arefin, Gopeshh Subbaraj, Nicolas Gontier, Yann LeCun, Irina Rish, Ravid Shwartz-Ziv, Christopher Pal)
Decoder-only Transformers often struggle with complex reasoning tasks, particularly arithmetic reasoning requiring multiple sequential operations. In this work, we identify representation collapse in the model's intermediate layers as a key factor limiting their reasoning capabilities. To address this, we propose Sequential Variance-Covariance Regularization (Seq-VCR), which enhances the entropy of intermediate representations and prevents collapse. Combined with dummy pause tokens as substitutes for chain-of-thought (CoT) tokens, our method significantly improves performance in arithmetic reasoning problems. In the challenging 5×5 integer multiplication task, our approach achieves 99.5% exact match accuracy, outperforming models of the same size (which yield 0% accuracy) and GPT-4 with five-shot CoT prompting (44%). We also demonstrate superior results on arithmetic expression and longest increasing subsequence (LIS) datasets. Our findings highlight the importance of preventing intermediate layer representation collapse to enhance the reasoning capabilities of Transformers and show that Seq-VCR offers an effective solution without requiring explicit CoT supervision.
The paper argues that transformers struggle to learn arithmetic tasks because the diversity of their intermediate representations tends to collapse. They address this by applying a regularization that maintains representation diversity. Why does representation collapse occur in the first place? It seems like an interesting question.
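A minimal sketch of what a variance-covariance regularizer on intermediate hidden states could look like, in the spirit of Seq-VCR (VICReg-style). The weights and normalization here are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def seq_vcr_loss(hidden: torch.Tensor, var_target: float = 1.0,
                 var_weight: float = 1.0, cov_weight: float = 0.04) -> torch.Tensor:
    """Variance-covariance regularizer on intermediate representations (sketch).

    hidden: (batch * seq_len, d) hidden states from an intermediate layer.
    """
    h = hidden - hidden.mean(dim=0, keepdim=True)
    std = torch.sqrt(h.var(dim=0) + 1e-4)
    # Variance term: hinge loss pushing each feature's std up toward var_target.
    var_loss = torch.relu(var_target - std).mean()
    # Covariance term: decorrelate features by penalizing off-diagonal covariance.
    n, d = h.shape
    cov = (h.T @ h) / (n - 1)
    off_diag = cov - torch.diag(torch.diag(cov))
    cov_loss = (off_diag ** 2).sum() / d
    return var_weight * var_loss + cov_weight * cov_loss
```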
#regularization #reasoning #transformer
Addressing Representation Collapse in Vector Quantized Models with One Linear Layer
(Yongxin Zhu, Bocheng Li, Yifei Xin, Linli Xu)
Vector Quantization (VQ) is a widely used method for converting continuous representations into discrete codes, which has become fundamental in unsupervised representation learning and latent generative models. However, VQ models are often hindered by the problem of representation collapse in the latent space, which leads to low codebook utilization and limits the scalability of the codebook for large-scale training. Existing methods designed to mitigate representation collapse typically reduce the dimensionality of latent space at the expense of model capacity, which does not fully resolve the core issue. In this study, we conduct a theoretical analysis of representation collapse in VQ models and identify its primary cause as the disjoint optimization of the codebook, where only a small subset of code vectors are updated through gradient descent. To address this issue, we propose SimVQ, a novel method which reparameterizes the code vectors through a linear transformation layer based on a learnable latent basis. This transformation optimizes the entire linear space spanned by the codebook, rather than merely updating the code vector selected by the nearest-neighbor search in vanilla VQ models. Although it is commonly understood that the multiplication of two linear matrices is equivalent to applying a single linear layer, our approach works surprisingly well in resolving the collapse issue in VQ models with just one linear layer. We validate the efficacy of SimVQ through extensive experiments across various modalities, including image and audio data with different model architectures. Our code is available at https://github.com/youngsheen/SimVQ.
The idea is that the problem with VQ training lies in only a subset of codebook entries being updated. The fix is not just to learn the codebook C, but to introduce a new matrix W and learn CW instead. Simple but powerful.
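A rough sketch of the reparameterization: quantize against CW rather than C, so the gradient through the single linear layer W moves the whole span of the codebook instead of only the selected entries. Names and loss coefficients are illustrative; see the official repo for the actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimVQSketch(nn.Module):
    """Sketch of the SimVQ-style codebook reparameterization (illustrative names)."""

    def __init__(self, num_codes: int, dim: int):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)   # latent basis C
        self.proj = nn.Linear(dim, dim, bias=False)    # the single linear layer W

    def forward(self, z: torch.Tensor):
        # z: (N, dim) encoder outputs.
        codes = self.proj(self.codebook.weight)        # effective codebook CW
        dist = torch.cdist(z, codes)                   # nearest-neighbor search
        idx = dist.argmin(dim=-1)
        z_q = codes[idx]
        # Codebook loss + commitment loss, then straight-through estimator.
        loss = F.mse_loss(z_q, z.detach()) + 0.25 * F.mse_loss(z, z_q.detach())
        z_q = z + (z_q - z).detach()
        return z_q, idx, loss
```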
#vq