May 22, 2025
Hunyuan-TurboS: Advancing Large Language Models through Mamba-Transformer Synergy and Adaptive Chain-of-Thought
(Tencent Hunyuan Team)
As Large Language Models (LLMs) rapidly advance, we introduce Hunyuan-TurboS, a novel large hybrid Transformer-Mamba Mixture of Experts (MoE) model. It synergistically combines Mamba's long-sequence processing efficiency with Transformer's superior contextual understanding. Hunyuan-TurboS features an adaptive long-short chain-of-thought (CoT) mechanism, dynamically switching between rapid responses for simple queries and deep "thinking" modes for complex problems, optimizing computational resources. Architecturally, this 56B activated (560B total) parameter model employs 128 layers (Mamba2, Attention, FFN) with an innovative AMF/MF block pattern. Faster Mamba2 ensures linear complexity, Grouped-Query Attention minimizes KV cache, and FFNs use an MoE structure. Pre-trained on 16T high-quality tokens, it supports a 256K context length and is the first industry-deployed large-scale Mamba model. Our comprehensive post-training strategy enhances capabilities via Supervised Fine-Tuning (3M instructions), a novel Adaptive Long-short CoT Fusion method, Multi-round Deliberation Learning for iterative improvement, and a two-stage Large-scale Reinforcement Learning process targeting STEM and general instruction-following. Evaluations show strong performance: overall top 7 rank on LMSYS Chatbot Arena with a score of 1356, outperforming leading models like Gemini-2.0-Flash-001 (1352) and o4-mini-2025-04-16 (1345). TurboS also achieves an average of 77.9% across 23 automated benchmarks. Hunyuan-TurboS balances high performance and efficiency, offering substantial capabilities at lower inference costs than many reasoning models, establishing a new paradigm for efficient large-scale pre-trained models.
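As a rough illustration of what an interleaved Mamba2/Attention/FFN stack with an AMF/MF-style block pattern could look like, here is a minimal PyTorch sketch. The block implementations, layer counts, and widths are placeholders of my own, not Tencent's released architecture.

```python
# Hypothetical sketch of an AMF/MF-style hybrid stack
# (A = attention, M = Mamba2, F = FFN); all blocks are simplified stand-ins.
import torch.nn as nn

class Mamba2Block(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.mix = nn.Linear(d, d)  # placeholder for the real linear-complexity SSM mixer
    def forward(self, x):
        return x + self.mix(x)

class GQABlock(nn.Module):
    def __init__(self, d, n_heads=8):
        super().__init__()
        # stand-in; real grouped-query attention shares K/V heads across query groups
        self.attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
    def forward(self, x):
        out, _ = self.attn(x, x, x, need_weights=False)
        return x + out

class MoEFFN(nn.Module):
    def __init__(self, d):
        super().__init__()
        # stand-in; the real block routes tokens to a sparse subset of expert FFNs
        self.ff = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
    def forward(self, x):
        return x + self.ff(x)

def build_stack(pattern: str, repeats: int, d_model: int) -> nn.Sequential:
    """Expand a block pattern such as 'AMF' or 'MF' into a stack of layers."""
    table = {"A": GQABlock, "M": Mamba2Block, "F": MoEFFN}
    return nn.Sequential(*[table[c](d_model) for _ in range(repeats) for c in pattern])

# Toy stack interleaving the two patterns; the real 128-layer layout differs.
model = nn.Sequential(build_stack("AMF", 4, 512), build_stack("MF", 8, 512))
```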
Tencent's LLM, a Mamba2 hybrid. They use a generative RM with reference answers. A head-combined hybrid model also appeared recently (https://falcon-lm.github.io/blog/falcon-h1/); it seems everyone is already moving towards hybrid models or sparse attention.
#llm #state-space-model #reasoning #rl
MMaDA: Multimodal Large Diffusion Language Models
(Ling Yang, Ye Tian, Bowen Li, Xinchen Zhang, Ke Shen, Yunhai Tong, Mengdi Wang)
We introduce MMaDA, a novel class of multimodal diffusion foundation models designed to achieve superior performance across diverse domains such as textual reasoning, multimodal understanding, and text-to-image generation. The approach is distinguished by three key innovations: (i) MMaDA adopts a unified diffusion architecture with a shared probabilistic formulation and a modality-agnostic design, eliminating the need for modality-specific components. This architecture ensures seamless integration and processing across different data types. (ii) We implement a mixed long chain-of-thought (CoT) fine-tuning strategy that curates a unified CoT format across modalities. By aligning reasoning processes between textual and visual domains, this strategy facilitates cold-start training for the final reinforcement learning (RL) stage, thereby enhancing the model's ability to handle complex tasks from the outset. (iii) We propose UniGRPO, a unified policy-gradient-based RL algorithm specifically tailored for diffusion foundation models. Utilizing diversified reward modeling, UniGRPO unifies post-training across both reasoning and generation tasks, ensuring consistent performance improvements. Experimental results demonstrate that MMaDA-8B exhibits strong generalization capabilities as a unified multimodal foundation model. It surpasses powerful models like LLaMA-3-7B and Qwen2-7B in textual reasoning, outperforms Show-o and SEED-X in multimodal understanding, and excels over SDXL and Janus in text-to-image generation. These achievements highlight MMaDA's effectiveness in bridging the gap between pretraining and post-training within unified diffusion architectures, providing a comprehensive framework for future research and development. We open-source our code and trained models at: https://github.com/Gen-Verse/MMaDA
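For reference, a minimal sketch of the group-relative advantage at the core of GRPO-style training, which UniGRPO presumably adapts to diffusion-based reasoning and generation; the rewards, shapes, and function name below are illustrative assumptions, not the paper's implementation.

```python
# Group-relative advantage: standardize rewards within each group of samples,
# so no learned value network is needed.
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: [num_prompts, group_size] scores for sampled completions per prompt."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# e.g. 2 prompts, 4 sampled generations each, scored by diversified reward models
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.2, 0.9, 0.4, 0.7]])
advantages = group_relative_advantages(rewards)   # weights the policy-gradient loss
```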
A multimodal image understanding and generation model based on masked diffusion. They also trained it for reasoning with GRPO. Perhaps we are in the middle of a paradigm shift for LLMs.
#diffusion #multimodal #reasoning #rl #image-generation
dKV-Cache: The Cache for Diffusion Language Models
(Xinyin Ma, Runpeng Yu, Gongfan Fang, Xinchao Wang)
Diffusion Language Models (DLMs) have been seen as a promising competitor for autoregressive language models. However, diffusion language models have long been constrained by slow inference. A core challenge is that their non-autoregressive architecture and bidirectional attention preclude the key-value cache that accelerates decoding. We address this bottleneck by proposing a KV-cache-like mechanism, delayed KV-Cache, for the denoising process of DLMs. Our approach is motivated by the observation that different tokens have distinct representation dynamics throughout the diffusion process. Accordingly, we propose a delayed and conditioned caching strategy for key and value states. We design two complementary variants to cache key and value step-by-step: (1) dKV-Cache-Decode, which provides almost lossless acceleration, and even improves performance on long sequences, suggesting that existing DLMs may under-utilise contextual information during inference. (2) dKV-Cache-Greedy, which has aggressive caching with reduced lifespan, achieving higher speed-ups with quadratic time complexity at the cost of some performance degradation. dKV-Cache, in final, achieves from 2-10x speedup in inference, largely narrowing the gap between ARs and DLMs. We evaluate our dKV-Cache on several benchmarks, delivering acceleration across general language understanding, mathematical, and code-generation benchmarks. Experiments demonstrate that cache can also be used in DLMs, even in a training-free manner from current DLMs.
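A rough sketch of how such a delayed cache could sit inside a denoising loop, assuming a hypothetical model API (`compute_kv`, `denoise_step`); this is my reading of the idea, not the authors' implementation.

```python
# Delayed KV cache for a diffusion-LM denoising loop, based on the observation
# that a token's K/V stop changing once it has been decoded.
import torch

def denoise_with_dkv_cache(model, tokens, masked, num_steps):
    """tokens: [seq] partially masked sequence; masked: bool [seq], True where masked."""
    seq_len = tokens.shape[0]
    K = V = None
    frozen = torch.zeros(seq_len, dtype=torch.bool)        # positions whose K/V we reuse
    decoded_prev = torch.zeros(seq_len, dtype=torch.bool)  # decoded in the previous step

    for _ in range(num_steps):
        k_new, v_new = model.compute_kv(tokens)  # a real implementation would skip
        if K is None:                            # frozen positions to save compute
            K, V = k_new, v_new
        else:
            K = torch.where(frozen.unsqueeze(-1), K, k_new)
            V = torch.where(frozen.unsqueeze(-1), V, v_new)

        # Bidirectional attention over the full (frozen + fresh) K/V.
        tokens, newly_decoded = model.denoise_step(tokens, masked, K, V)
        masked &= ~newly_decoded

        # Delayed caching: freeze a position's K/V one step after it is decoded,
        # once its representation has settled.
        frozen |= decoded_prev
        decoded_prev = newly_decoded

    return tokens
```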
An attempt at a KV cache for diffusion language models, exploiting the observation that KV embeddings change little once a token has been decoded. I wonder whether a training objective that inherently permits KV caching, as in causal LMs, is possible.
#diffusion #lm #efficiency
Scaling Diffusion Transformers Efficiently via μP
(Chenyu Zheng, Xinyu Zhang, Rongzhen Wang, Wei Huang, Zhi Tian, Weilin Huang, Jun Zhu, Chongxuan Li)
Diffusion Transformers have emerged as the foundation for vision generative models, but their scalability is limited by the high cost of hyperparameter (HP) tuning at large scales. Recently, Maximal Update Parametrization (μP) was proposed for vanilla Transformers, which enables stable HP transfer from small to large language models, and dramatically reduces tuning costs. However, it remains unclear whether μP of vanilla Transformers extends to diffusion Transformers, which differ architecturally and objectively. In this work, we generalize standard μP to diffusion Transformers and validate its effectiveness through large-scale experiments. First, we rigorously prove that μP of mainstream diffusion Transformers, including DiT, U-ViT, PixArt-α, and MMDiT, aligns with that of the vanilla Transformer, enabling the direct application of existing μP methodologies. Leveraging this result, we systematically demonstrate that DiT-μP enjoys robust HP transferability. Notably, DiT-XL-2-μP with transferred learning rate achieves 2.9 times faster convergence than the original DiT-XL-2. Finally, we validate the effectiveness of μP on text-to-image generation by scaling PixArt-α from 0.04B to 0.61B and MMDiT from 0.18B to 18B. In both cases, models under μP outperform their respective baselines while requiring small tuning cost, only 5.5% of one training run for PixArt-α and 3% of consumption by human experts for MMDiT-18B. These results establish μP as a principled and efficient framework for scaling diffusion Transformers.
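For context, a minimal sketch of the μP-style learning-rate transfer that makes this possible: tune the base LR on a narrow proxy model, then scale the LR of hidden ("matrix-like") parameters by base_width / width at the target size. The parameter grouping and helper below are simplifications of my own; see the μP papers and the mup library for the full rules.

```python
# μP-style LR transfer (simplified): hidden 2-D weights get LR scaled by
# base_width / width; biases, norms, and embeddings keep the base LR.
import torch
import torch.nn as nn

def mup_param_groups(model, base_lr, width, base_width):
    matrix_like, vector_like = [], []
    for name, p in model.named_parameters():
        if p.ndim >= 2 and "embed" not in name:
            matrix_like.append(p)
        else:
            vector_like.append(p)
    return [
        {"params": vector_like, "lr": base_lr},
        {"params": matrix_like, "lr": base_lr * base_width / width},
    ]

# e.g. an LR found on a width-256 proxy reused at width 2048 (toy stand-in for a DiT)
big_dit = nn.Sequential(nn.Linear(2048, 2048), nn.GELU(), nn.Linear(2048, 2048))
optimizer = torch.optim.AdamW(mup_param_groups(big_dit, base_lr=3e-3, width=2048, base_width=256))
```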
ByteDance's experiments applying μP to diffusion models. ByteDance runs a lot of experiments on training techniques and, notably, publishes the results.
#hyperparameter #diffusion
Text Generation Beyond Discrete Token Sampling
(Yufan Zhuang, Liyuan Liu, Chandan Singh, Jingbo Shang, Jianfeng Gao)
In standard autoregressive generation, an LLM predicts the next-token distribution, samples a discrete token, and then discards the distribution, passing only the sampled token as new input. To preserve this distribution's rich information, we propose Mixture of Inputs (MoI), a training-free method for autoregressive generation. After generating a token following the standard paradigm, we construct a new input that blends the generated discrete token with the previously discarded token distribution. Specifically, we employ a Bayesian estimation method that treats the token distribution as the prior, the sampled token as the observation, and replaces the conventional one-hot vector with the continuous posterior expectation as the new model input. MoI allows the model to maintain a richer internal representation throughout the generation process, resulting in improved text quality and reasoning capabilities. On mathematical reasoning, code generation, and PhD-level QA tasks, MoI consistently improves performance across multiple models including QwQ-32B, Nemotron-Super-49B, Gemma-3-27B, and DAPO-Qwen-32B, with no additional training and negligible computational overhead.
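A minimal sketch of the mixed-input construction as I understand it: blend the predicted distribution (prior) with the sampled token (observation) and feed the resulting expected embedding instead of the one-hot lookup. The simple convex mixture and `beta` below are stand-ins for the paper's Bayesian estimator, not the exact formula.

```python
# Mixture of Inputs (sketch): feed an expectation over the embedding table
# under a posterior mixing the predicted distribution and the sampled token.
import torch
import torch.nn.functional as F

def mixture_of_inputs_embedding(logits, sampled_id, embedding_weight, beta=0.5):
    """logits: [vocab]; sampled_id: int; embedding_weight: [vocab, d] -> mixed input [d]."""
    prior = F.softmax(logits, dim=-1)                    # next-token distribution
    obs = F.one_hot(torch.tensor(sampled_id), logits.shape[-1]).float()
    posterior = beta * obs + (1.0 - beta) * prior        # replaces the one-hot vector
    return posterior @ embedding_weight                  # expected embedding

# In a generation loop this mixed vector is fed to the next step (training-free),
# in place of embedding_weight[sampled_id].
```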
Continuous thought. Without any additional training, they feed reasoning models a weighted combination of embeddings based on the output token distribution. It might be related to weight merging. A similar approach appeared at the same time (https://arxiv.org/abs/2505.15778).
#reasoning