December 10, 2024
Training Large Language Models to Reason in a Continuous Latent Space
(Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, Yuandong Tian)
Large language models (LLMs) are restricted to reason in the "language space", where they typically express the reasoning process with a chain-of-thought (CoT) to solve a complex reasoning problem. However, we argue that language space may not always be optimal for reasoning. For example, most word tokens are primarily for textual coherence and not essential for reasoning, while some critical tokens require complex planning and pose huge challenges to LLMs. To explore the potential of LLM reasoning in an unrestricted latent space instead of using natural language, we introduce a new paradigm Coconut (Chain of Continuous Thought). We utilize the last hidden state of the LLM as a representation of the reasoning state (termed "continuous thought"). Rather than decoding this into a word token, we feed it back to the LLM as the subsequent input embedding directly in the continuous space. Experiments show that Coconut can effectively augment the LLM on several reasoning tasks. This novel latent reasoning paradigm leads to emergent advanced reasoning patterns: the continuous thought can encode multiple alternative next reasoning steps, allowing the model to perform a breadth-first search (BFS) to solve the problem, rather than prematurely committing to a single deterministic path like CoT. Coconut outperforms CoT in certain logical reasoning tasks that require substantial backtracking during planning, with fewer thinking tokens during inference. These findings demonstrate the promise of latent reasoning and offer valuable insights for future research.
The idea: instead of decoding the transformer's output embedding into a discrete token, feed it back directly as the next step's input, so the chain of thought runs over continuous tokens. A continuous token can implicitly carry several candidate reasoning paths at once rather than being locked into a single chain. It is worth noting, though, that this only works when the model is trained with chain-of-thought supervision and the CoT tokens are gradually swapped out for continuous ones.
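Below is a minimal inference-time sketch of the idea with a GPT-2 backbone (my own illustration, not the authors' code): the final hidden state at the last position is appended back into the input embedding sequence for a few latent steps before switching to ordinary token decoding. The training curriculum that gradually replaces CoT tokens with continuous thoughts is omitted.

```python
# Minimal sketch of Coconut-style latent reasoning with a GPT-2 backbone.
# Assumes the final hidden state has the same dimension as the input
# embeddings, which holds for GPT-2 (768).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tok = GPT2Tokenizer.from_pretrained("gpt2")

question = "Q: If Alice has 3 apples and buys 2 more, how many apples? A:"
ids = tok(question, return_tensors="pt").input_ids
embeds = model.transformer.wte(ids)              # token embeddings of the prompt

n_latent_steps = 4                               # number of continuous thoughts
with torch.no_grad():
    for _ in range(n_latent_steps):
        out = model(inputs_embeds=embeds, output_hidden_states=True)
        last_hidden = out.hidden_states[-1][:, -1:, :]   # the "continuous thought"
        # Feed the hidden state back as the next input embedding,
        # skipping the decode-to-token step entirely.
        embeds = torch.cat([embeds, last_hidden], dim=1)

    # After the latent phase, switch back to ordinary token decoding.
    for _ in range(10):
        out = model(inputs_embeds=embeds)
        next_id = out.logits[:, -1].argmax(-1, keepdim=True)
        embeds = torch.cat([embeds, model.transformer.wte(next_id)], dim=1)
        print(tok.decode(next_id[0]), end="")
```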
#reasoning
Language-Guided Image Tokenization for Generation
(Kaiwen Zha, Lijun Yu, Alireza Fathi, David A. Ross, Cordelia Schmid, Dina Katabi, Xiuye Gu)
Image tokenization, the process of transforming raw image pixels into a compact low-dimensional latent representation, has proven crucial for scalable and efficient image generation. However, mainstream image tokenization methods generally have limited compression rates, making high-resolution image generation computationally expensive. To address this challenge, we propose to leverage language for efficient image tokenization, and we call our method Text-Conditioned Image Tokenization (TexTok). TexTok is a simple yet effective tokenization framework that leverages language to provide high-level semantics. By conditioning the tokenization process on descriptive text captions, TexTok allows the tokenization process to focus on encoding fine-grained visual details into latent tokens, leading to enhanced reconstruction quality and higher compression rates. Compared to the conventional tokenizer without text conditioning, TexTok achieves average reconstruction FID improvements of 29.2% and 48.1% on ImageNet-256 and -512 benchmarks respectively, across varying numbers of tokens. These tokenization improvements consistently translate to 16.3% and 34.3% average improvements in generation FID. By simply replacing the tokenizer in Diffusion Transformer (DiT) with TexTok, our system can achieve a 93.5x inference speedup while still outperforming the original DiT using only 32 tokens on ImageNet-512. TexTok with a vanilla DiT generator achieves state-of-the-art FID scores of 1.46 and 1.62 on ImageNet-256 and -512 respectively. Furthermore, we demonstrate TexTok's superiority on the text-to-image generation task, effectively utilizing the off-the-shelf text captions in tokenization.
A method that enhances image tokens by combining them with text embeddings. It feels like an extended version of semantic tokens (https://arxiv.org/abs/2406.07550).
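A rough sketch of what text-conditioned tokenization could look like: learnable latent queries attend jointly to image patches and projected caption embeddings. The module names and sizes are my own guesses for illustration, not the paper's actual architecture.

```python
# Rough sketch of text-conditioned image tokenization in the spirit of TexTok.
import torch
import torch.nn as nn

class TextConditionedTokenizer(nn.Module):
    def __init__(self, dim=512, n_image_tokens=32, patch=16, text_dim=768):
        super().__init__()
        self.patchify = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.text_proj = nn.Linear(text_dim, dim)     # project frozen caption embeddings
        self.latent_queries = nn.Parameter(torch.randn(1, n_image_tokens, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)

    def forward(self, image, caption_emb):
        # image: (B, 3, H, W); caption_emb: (B, T_text, text_dim) from a text encoder.
        patches = self.patchify(image).flatten(2).transpose(1, 2)   # (B, N_patch, dim)
        text = self.text_proj(caption_emb)                          # (B, T_text, dim)
        queries = self.latent_queries.expand(image.size(0), -1, -1)
        # The text tokens carry the high-level semantics, so the latent queries
        # can spend their capacity on fine-grained visual detail.
        x = self.encoder(torch.cat([queries, patches, text], dim=1))
        return x[:, : queries.size(1)]                              # (B, n_image_tokens, dim)

tokens = TextConditionedTokenizer()(torch.randn(2, 3, 256, 256), torch.randn(2, 16, 768))
print(tokens.shape)  # torch.Size([2, 32, 512])
```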
#tokenizer #image-generation
Visual Lexicon: Rich Image Features in Language Space
(XuDong Wang, Xingyi Zhou, Alireza Fathi, Trevor Darrell, Cordelia Schmid)
We present Visual Lexicon, a novel visual language that encodes rich image information into the text space of vocabulary tokens while retaining intricate visual details that are often challenging to convey in natural language. Unlike traditional methods that prioritize either high-level semantics (e.g., CLIP) or pixel-level reconstruction (e.g., VAE), ViLex simultaneously captures rich semantic content and fine visual details, enabling high-quality image generation and comprehensive visual scene understanding. Through a self-supervised learning pipeline, ViLex generates tokens optimized for reconstructing input images using a frozen text-to-image (T2I) diffusion model, preserving the detailed information necessary for high-fidelity semantic-level reconstruction. As an image embedding in the language space, ViLex tokens leverage the compositionality of natural languages, allowing them to be used independently as "text tokens" or combined with natural language tokens to prompt pretrained T2I models with both visual and textual inputs, mirroring how we interact with vision-language models (VLMs). Experiments demonstrate that ViLex achieves higher fidelity in image reconstruction compared to text embeddings--even with a single ViLex token. Moreover, ViLex successfully performs various DreamBooth tasks in a zero-shot, unsupervised manner without fine-tuning T2I models. Additionally, ViLex serves as a powerful vision encoder, consistently improving vision-language model performance across 15 benchmarks relative to a strong SigLIP baseline.
A ViT is trained with a reconstruction loss by feeding its output tokens to a text-to-image diffusion model as the text input. This one, too, can be seen as a relative of semantic tokens.
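A hedged sketch of the training loop as I understand it: a trainable ViT produces tokens that stand in for the text embeddings of a frozen text-to-image diffusion model, and the diffusion (reconstruction) loss is backpropagated only into the ViT. The choice of Stable Diffusion 1.5 and a timm ViT here is my own assumption for illustration.

```python
# Sketch of ViLex-style training with a frozen T2I diffusion model.
import torch
import torch.nn.functional as F
import timm
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
unet, vae, scheduler = pipe.unet, pipe.vae, pipe.scheduler
unet.requires_grad_(False)                 # frozen T2I model
vae.requires_grad_(False)

vit = timm.create_model("vit_base_patch16_224", pretrained=True)   # trainable encoder
proj = torch.nn.Linear(768, 768)           # map ViT tokens into the text-token space

def training_step(images):                 # images: (B, 3, 512, 512) in [-1, 1]
    vilex_tokens = proj(vit.forward_features(F.interpolate(images, 224)))  # (B, 197, 768)
    with torch.no_grad():
        latents = vae.encode(images).latent_dist.sample() * vae.config.scaling_factor
    noise = torch.randn_like(latents)
    t = torch.randint(0, scheduler.config.num_train_timesteps, (images.size(0),))
    noisy = scheduler.add_noise(latents, noise, t)
    # The ViLex tokens play the role of the text prompt for the frozen UNet.
    pred = unet(noisy, t, encoder_hidden_states=vilex_tokens).sample
    return F.mse_loss(pred, noise)          # gradient flows only into vit and proj

loss = training_step(torch.randn(2, 3, 512, 512))
loss.backward()
```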
#tokenizer #image-generation
Mixture of Hidden-Dimensions Transformer
(Yilong Chen, Junyuan Shang, Zhengyu Zhang, Jiawei Sheng, Tingwen Liu, Shuohuan Wang, Yu Sun, Hua Wu, Haifeng Wang)
Transformer models encounter challenges in scaling hidden dimensions efficiently, as uniformly increasing them inflates computational and memory costs while failing to emphasize the most relevant features for each token. For further understanding, we study hidden dimension sparsity and observe that trained Transformers utilize only a small fraction of token dimensions, revealing an "activation flow" pattern. Notably, there are shared sub-dimensions with sustained activation across multiple consecutive tokens and specialized sub-dimensions uniquely activated for each token. To better model token-relevant sub-dimensions, we propose MoHD (Mixture of Hidden Dimensions), a sparse conditional activation architecture. Particularly, MoHD employs shared sub-dimensions for common token features and a routing mechanism to dynamically activate specialized sub-dimensions. To mitigate potential information loss from sparsity, we design activation scaling and group fusion mechanisms to preserve activation flow. In this way, MoHD expands hidden dimensions with negligible increases in computation or parameters, efficient training and inference while maintaining performance. Evaluations across 10 NLP tasks show that MoHD surpasses Vanilla Transformers in parameter efficiency and task performance. It achieves 1.7% higher performance with 50% fewer activation parameters and 3.7% higher performance with a 3x parameter expansion at constant activation cost. MOHD offers a new perspective for scaling the model, showcasing the potential of hidden dimension sparsity to boost efficiency
A model that activates only a subset of the embedding's hidden dimensions at each layer. Much like DeepSeekMoE, it separates shared dimensions used for every token from specialized dimensions that are routed per token.
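A toy sketch of the shared/specialized split over hidden dimensions, based only on the abstract; the actual MoHD activation scaling and group fusion are simplified to a single rescaling here.

```python
# Toy mixture-of-hidden-dimensions gate: shared groups are always on,
# specialized groups are chosen per token by a router.
import torch
import torch.nn as nn

class MixtureOfHiddenDims(nn.Module):
    def __init__(self, dim=768, n_groups=12, n_shared=4, top_k=2):
        super().__init__()
        assert dim % n_groups == 0
        self.group_size, self.n_shared, self.top_k = dim // n_groups, n_shared, top_k
        self.router = nn.Linear(dim, n_groups - n_shared)   # scores specialized groups

    def forward(self, x):                                   # x: (B, T, dim)
        B, T, D = x.shape
        groups = x.view(B, T, -1, self.group_size)          # (B, T, n_groups, group_size)
        scores = self.router(x)                             # (B, T, n_specialized)
        topk = scores.topk(self.top_k, dim=-1).indices
        mask = torch.zeros_like(scores).scatter_(-1, topk, 1.0)
        # Shared groups stay on for every token; specialized groups are gated.
        gate = torch.cat([torch.ones(B, T, self.n_shared, device=x.device), mask], dim=-1)
        # Rescale so the overall activation magnitude is roughly preserved.
        gate = gate * (gate.size(-1) / gate.sum(-1, keepdim=True))
        return (groups * gate.unsqueeze(-1)).view(B, T, D)

y = MixtureOfHiddenDims()(torch.randn(2, 10, 768))
print(y.shape)  # torch.Size([2, 10, 768])
```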
#moe #sparsity
Normalizing Flows are Capable Generative Models
(Shuangfei Zhai, Ruixiang Zhang, Preetum Nakkiran, David Berthelot, Jiatao Gu, Huangjie Zheng, Tianrong Chen, Miguel Angel Bautista, Navdeep Jaitly, Josh Susskind)
Normalizing Flows (NFs) are likelihood-based models for continuous inputs. They have demonstrated promising results on both density estimation and generative modeling tasks, but have received relatively little attention in recent years. In this work, we demonstrate that NFs are more powerful than previously believed. We present TarFlow: a simple and scalable architecture that enables highly performant NF models. TarFlow can be thought of as a Transformer-based variant of Masked Autoregressive Flows (MAFs): it consists of a stack of autoregressive Transformer blocks on image patches, alternating the autoregression direction between layers. TarFlow is straightforward to train end-to-end, and capable of directly modeling and generating pixels. We also propose three key techniques to improve sample quality: Gaussian noise augmentation during training, a post training denoising procedure, and an effective guidance method for both class-conditional and unconditional settings. Putting these together, TarFlow sets new state-of-the-art results on likelihood estimation for images, beating the previous best methods by a large margin, and generates samples with quality and diversity comparable to diffusion models, for the first time with a stand-alone NF model. We make our code available at https://github.com/apple/ml-tarflow.
One more study taking a shot at normalizing flows. They model images with an autoregressive flow.
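A minimal sketch of one autoregressive-flow block in this spirit: a causal Transformer predicts an affine transform for each image patch from the preceding patches, MAF-style. This is my own simplification; the paper stacks such blocks and alternates the autoregression direction between them.

```python
# One affine autoregressive flow block over image patches.
import torch
import torch.nn as nn

class AutoregressiveFlowBlock(nn.Module):
    def __init__(self, patch_dim=48, dim=256, n_layers=4):
        super().__init__()
        self.inp = nn.Linear(patch_dim, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.body = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.out = nn.Linear(dim, 2 * patch_dim)        # predicts (mu, log_scale) per patch

    def forward(self, x):                                # x: (B, T, patch_dim)
        B, T, _ = x.shape
        # Shift right so patch t is transformed using only patches < t.
        h = self.inp(torch.cat([torch.zeros_like(x[:, :1]), x[:, :-1]], dim=1))
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.body(h, mask=causal)
        mu, log_s = self.out(h).chunk(2, dim=-1)
        z = (x - mu) * torch.exp(-log_s)                 # invertible affine transform
        log_det = -log_s.sum(dim=(1, 2))                 # contribution to the log-likelihood
        return z, log_det

z, log_det = AutoregressiveFlowBlock()(torch.randn(2, 64, 48))   # 64 patches of 4x4x3
print(z.shape, log_det.shape)
```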
#flow
ProcessBench: Identifying Process Errors in Mathematical Reasoning
(Chujie Zheng, Zhenru Zhang, Beichen Zhang, Runji Lin, Keming Lu, Bowen Yu, Dayiheng Liu, Jingren Zhou, Junyang Lin)
As language models regularly make mistakes when solving math problems, automated identification of errors in the reasoning process becomes increasingly significant for their scalable oversight. In this paper, we introduce ProcessBench for measuring the ability to identify erroneous steps in mathematical reasoning. It consists of 3,400 test cases, primarily focused on competition- and Olympiad-level math problems. Each test case contains a step-by-step solution with error location annotated by human experts. Models are required to identify the earliest step that contains an error, or conclude that all steps are correct. We conduct extensive evaluation on ProcessBench, involving two types of models: process reward models (PRMs) and critic models, where for the latter we prompt general language models to critique each solution step by step. We draw two main observations: (1) Existing PRMs typically fail to generalize to more challenging math problems beyond GSM8K and MATH. They underperform both critic models (i.e., prompted general language models) and our own trained PRM that is straightforwardly fine-tuned on the PRM800K dataset. (2) The best open-source model, QwQ-32B-Preview, has demonstrated the critique capability competitive with the proprietary model GPT-4o, despite that it still lags behind the reasoning-specialized o1-mini. We hope ProcessBench can foster future research in reasoning process assessment, paving the way toward scalable oversight of language models.
A benchmark for process reward models and critic models. It was built by generating solutions with LLMs for problems from GSM8K, MATH, OlympiadBench, and Omni-MATH, and having human experts annotate where the errors occur.
PRMs struggle here, both because of the distribution shift from their training data and because of the inherent limits of labeling step correctness by final-answer correctness. They even fall behind prompted critic models; a PRM straightforwardly trained on PRM800K turned out to be the best of the bunch.
Critic models, and reasoning models in particular, reached better performance by working through a reasoning process.
Which makes me wonder how QwQ was built.
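For concreteness, here is how a ProcessBench-style score could be computed, assuming the task format from the abstract (predict the earliest erroneous step, or -1 for a fully correct solution). The harmonic-mean aggregation follows my reading of the paper's F1 metric and may differ in detail.

```python
# Sketch of scoring earliest-error predictions on a ProcessBench-style benchmark.
from dataclasses import dataclass

@dataclass
class Case:
    label: int        # index of the earliest wrong step, -1 if all steps are correct
    prediction: int   # model's predicted index, -1 for "no error"

def processbench_score(cases: list[Case]) -> float:
    erroneous = [c for c in cases if c.label != -1]
    correct = [c for c in cases if c.label == -1]
    acc_err = sum(c.prediction == c.label for c in erroneous) / max(len(erroneous), 1)
    acc_ok = sum(c.prediction == -1 for c in correct) / max(len(correct), 1)
    if acc_err + acc_ok == 0:
        return 0.0
    return 2 * acc_err * acc_ok / (acc_err + acc_ok)   # harmonic mean of the two accuracies

print(processbench_score([Case(2, 2), Case(-1, -1), Case(4, 1), Case(-1, 3)]))  # 0.5
```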
#reward-model #reasoning #math #benchmark
Does RLHF Scale? Exploring the Impacts From Data, Model, and Method
(Zhenyu Hou, Pengfan Du, Yilin Niu, Zhengxiao Du, Aohan Zeng, Xiao Liu, Minlie Huang, Hongning Wang, Jie Tang, Yuxiao Dong)
This study explores the scaling properties of Reinforcement Learning from Human Feedback (RLHF) in Large Language Models (LLMs). Although RLHF is considered an important step in post-training of LLMs, its scaling potential is still largely unknown. We systematically analyze key components in the RLHF framework--model size, data composition, and inference budget--and their impacts on performance. Our findings show that increasing data diversity and volume improves reward model performance, helping process-supervision models scale better. For policy training, more response samples per prompt boost performance initially but quickly plateau. And larger reward models offer modest gains in policy training. In addition, larger policy models benefit less from RLHF with a fixed reward model. Overall, RLHF scales less efficiently than pretraining, with diminishing returns from additional computational resources. Based on these observations, we propose strategies to optimize RLHF performance within computational limits.
A study of how RLHF scales with the number of prompts, the number of samples per prompt, and the sizes of the policy and reward models. Here too they find that process reward models do not work well outside their training domain.
#rlhf #reward-model
Sloth: scaling laws for LLM skills to predict multi-benchmark performance across families
(Felipe Maia Polo, Seamus Somerstep, Leshem Choshen, Yuekai Sun, Mikhail Yurochkin)
Scaling laws for large language models (LLMs) predict model performance based on parameters like size and training data. However, differences in training configurations and data processing across model families lead to significant variations in benchmark performance, making it difficult for a single scaling law to generalize across all LLMs. On the other hand, training family-specific scaling laws requires training models of varying sizes for every family. In this work, we propose Skills Scaling Laws (SSLaws, pronounced as Sloth), a novel scaling law that leverages publicly available benchmark data and assumes LLM performance is driven by low-dimensional latent skills, such as reasoning and instruction following. These latent skills are influenced by computational resources like model size and training tokens but with varying efficiencies across model families. Sloth exploits correlations across benchmarks to provide more accurate and interpretable predictions while alleviating the need to train multiple LLMs per family. We present both theoretical results on parameter identification and empirical evaluations on 12 prominent benchmarks, from Open LLM Leaderboard v1/v2, demonstrating that Sloth predicts LLM performance efficiently and offers insights into scaling behaviors for downstream tasks such as coding and emotional intelligence applications.
Like the observational scaling law (https://arxiv.org/abs/2405.10938), this estimates task scaling laws through latent skills. Here they also try to interpret the extracted latent skills as reasoning, knowledge, and instruction following. There are some interesting observations, e.g., that reasoning ability depends more on model size than on the number of training tokens, and that the effect of instruction tuning is not clear-cut.
Once latent skills like these are extracted, you can estimate a new task's factor weights and predict how well a larger model would do on it. I think this is an interesting direction.
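A toy illustration of that idea on synthetic data (not Sloth's actual parameterization): factor a models-by-benchmarks score matrix into a few latent skills, estimate a new task's skill weights from a subset of models, and predict the held-out models' scores on it.

```python
# Latent-skill factorization and prediction of a new task, on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
n_models, n_benchmarks, n_skills = 30, 12, 3

# Synthetic data: benchmark scores are driven by a few latent skills per model.
true_skills = rng.normal(size=(n_models, n_skills))
scores = true_skills @ rng.normal(size=(n_skills, n_benchmarks)) \
         + 0.1 * rng.normal(size=(n_models, n_benchmarks))

# Recover latent skills from the observed score matrix with a truncated SVD.
U, S, _ = np.linalg.svd(scores, full_matrices=False)
skills_hat = U[:, :n_skills] * S[:n_skills]                  # per-model skill estimates

# A "new" downstream task, observed only for the first 25 models: estimate its
# skill weights there, then predict the held-out models' scores on it.
new_task = true_skills @ rng.normal(size=n_skills) + 0.1 * rng.normal(size=n_models)
w = np.linalg.lstsq(skills_hat[:25], new_task[:25], rcond=None)[0]
print(np.round(skills_hat[25:] @ w, 2))                      # predicted scores
print(np.round(new_task[25:], 2))                            # actual scores
```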
#scaling-law