January 20, 2025
Evolving Deeper LLM Thinking
(Kuang-Huei Lee, Ian Fischer, Yueh-Hua Wu, Dave Marwood, Shumeet Baluja, Dale Schuurmans, Xinyun Chen)
We explore an evolutionary search strategy for scaling inference time compute in Large Language Models. The proposed approach, Mind Evolution, uses a language model to generate, recombine and refine candidate responses. This approach avoids the need to formalize the underlying inference problem whenever a solution evaluator is available. Controlling for inference cost, we find that Mind Evolution significantly outperforms other inference strategies such as Best-of-N and Sequential Revision in natural language planning tasks. In the TravelPlanner and Natural Plan benchmarks, Mind Evolution solves more than 98% of the problem instances using Gemini 1.5 Pro without the use of a formal solver.
Inference-time scaling using an evolutionary algorithm. It incorporates a stage that refines the generated text using a critic. There have been earlier attempts at using evolutionary algorithms for things like prompt optimization, and this feels like an extension of that line of work.
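A rough sketch of that loop, assuming three hypothetical helpers (`llm_propose`, `llm_recombine`, `evaluate`) standing in for an LLM sampling call, an LLM crossover/refinement prompt, and the task-specific solution evaluator — illustrative names, not the paper's actual API:

```python
import random

# evaluate(task, candidate) is assumed to return (score in [0, 1], textual critique).

def evolve(task, llm_propose, llm_recombine, evaluate,
           pop_size=8, generations=5):
    # Initial population: independent samples from the LLM.
    population = [llm_propose(task) for _ in range(pop_size)]
    for _ in range(generations):
        # Score every candidate; each entry is (candidate, score, critique).
        scored = sorted(
            ((cand, *evaluate(task, cand)) for cand in population),
            key=lambda t: t[1],
            reverse=True,
        )
        if scored[0][1] >= 1.0:  # the evaluator accepts the best candidate
            break
        # Keep the better half; refill by recombining random parent pairs,
        # feeding the critic's textual feedback back into the LLM.
        survivors = scored[: pop_size // 2]
        population = [cand for cand, _, _ in survivors]
        while len(population) < pop_size:
            a, b = random.sample(survivors, 2)
            population.append(llm_recombine(task, [a[0], b[0]], [a[2], b[2]]))
    return max(population, key=lambda cand: evaluate(task, cand)[0])
```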
#evolutionary-algorithm #inference-time-scaling
HiMix: Reducing Computational Complexity in Large Vision-Language Models
(Xuange Zhang, Dengjie Li, Bo Liu, Zenghao Bao, Yao Zhou, Baisong Yang, Zhongying Liu, Yujie Zhong, Zheng Zhao, Tongtong Yuan)
Benefiting from recent advancements in large language models and modality alignment techniques, existing Large Vision-Language Models (LVLMs) have achieved prominent performance across a wide range of scenarios. However, the excessive computational complexity limits the widespread use of these models in practical applications. We argue that one main bottleneck in computational complexity is caused by the involvement of redundant vision sequences in model computation. This is inspired by a reassessment of the efficiency of vision and language information transmission in the language decoder of LVLMs. We then propose a novel hierarchical vision-language interaction mechanism called Hierarchical Vision injection for Mixture Attention (HiMix). In HiMix, only the language sequence undergoes full forward propagation, while the vision sequence interacts with the language at specific stages within each language decoder layer. It is striking that our approach significantly reduces computational complexity with minimal performance loss. Specifically, HiMix achieves a 10x reduction in the computational cost of the language decoder across multiple LVLM models while maintaining comparable performance. This highlights the advantages of our method, and we hope our research brings new perspectives to the field of vision-language understanding. Project Page: https://xuange923.github.io/HiMix
Image-text fusion by adding image tokens to the keys and values of self-attention. In effect, this combines self-attention and cross-attention into a single operation.
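A minimal PyTorch sketch of that attention pattern, under one reading of the abstract (the single-head projections and shapes here are assumptions, not the paper's code):

```python
import torch
import torch.nn.functional as F

def mixture_attention(lang_h, vis_h, w_q, w_k, w_v):
    # lang_h: (B, L, D) language hidden states; vis_h: (B, V, D) vision features.
    # Queries come from the language sequence only, so the vision tokens are
    # never propagated through the decoder themselves; they are only injected
    # as extra keys/values. (Causal masking over the language part is omitted
    # here for brevity.)
    q = lang_h @ w_q                                # (B, L, D)
    mixed = torch.cat([vis_h, lang_h], dim=1)       # (B, V + L, D)
    k, v = mixed @ w_k, mixed @ w_v
    return F.scaled_dot_product_attention(q, k, v)  # (B, L, D)

# Toy usage with made-up sizes:
B, L, V, D = 2, 16, 64, 256
w_q, w_k, w_v = (torch.randn(D, D) / D**0.5 for _ in range(3))
out = mixture_attention(torch.randn(B, L, D), torch.randn(B, V, D), w_q, w_k, w_v)
```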
#multimodal #vision-language
One-D-Piece: Image Tokenizer Meets Quality-Controllable Compression
(Keita Miwa, Kento Sasaki, Hidehisa Arai, Tsubasa Takahashi, Yu Yamaguchi)
Current image tokenization methods require a large number of tokens to capture the information contained within images. Although the amount of information varies across images, most image tokenizers only support fixed-length tokenization, leading to inefficiency in token allocation. In this study, we introduce One-D-Piece, a discrete image tokenizer designed for variable-length tokenization, achieving a quality-controllable mechanism. To enable a variable compression rate, we introduce a simple but effective regularization mechanism named "Tail Token Drop" into discrete one-dimensional image tokenizers. This method encourages critical information to concentrate at the head of the token sequence, enabling support of variable-length tokenization, while preserving state-of-the-art reconstruction quality. We evaluate our tokenizer across multiple reconstruction quality metrics and find that it delivers significantly better perceptual quality than existing quality-controllable compression methods, including JPEG and WebP, at smaller byte sizes. Furthermore, we assess our tokenizer on various downstream computer vision tasks, including image classification, object detection, semantic segmentation, and depth estimation, confirming its adaptability to numerous applications compared to other variable-rate methods. Our approach demonstrates the versatility of variable-length discrete image tokenization, establishing a new paradigm in both compression efficiency and reconstruction performance. Finally, we validate the effectiveness of Tail Token Drop via detailed analysis of tokenizers.
A variable-length image tokenizer. On top of a semantic tokenizer, it drops some of the tokens during training. It can be thought of along the lines of Matryoshka Embedding.
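A minimal sketch of the Tail Token Drop idea as I read it (function name and shapes are assumptions, not the paper's implementation): randomly truncating the tail of the 1D token sequence during training forces the decoder to reconstruct from the head alone, which is what pushes information toward the front.

```python
import torch

def tail_token_drop(tokens: torch.Tensor, min_keep: int = 32) -> torch.Tensor:
    # tokens: (B, N) discrete latent tokens from a 1D image tokenizer;
    # min_keep must be <= N. One keep-length is sampled per batch and
    # everything after it is dropped, so the reconstruction loss only ever
    # sees the surviving head tokens, Matryoshka-style.
    n = tokens.shape[1]
    keep = int(torch.randint(min_keep, n + 1, (1,)))
    return tokens[:, :keep]

# During training (hypothetical encoder/decoder):
#   recon = decoder(tail_token_drop(encoder(images)))
# At inference, the keep-length becomes a quality/byte-size knob.
```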
#tokenizer #image-generation