2026-01-06

Jan 06, 2026

NextFlow: Unified Sequential Modeling Activates Multimodal Understanding and Generation

(Huichao Zhang, Liao Qu, Yiheng Liu, Hang Chen, Yangyang Song, Yongsheng Dong, Shikun Sun, Xian Li, Xu Wang, Yi Jiang, Hu Ye, Bo Chen, Yiming Gao, Peng Liu, Akide Liu, Zhipeng Yang, Qili Deng, Linjie Xing, Jiyang Liu, Zhao Wang, Yang Zhou, Mingcong Liu, Yi Zhang, Qian He, Xiwei Hu, Zhongqi Qi, Jie Shao, Zhiye Fu, Shuai Wang, Fangmin Chen, Xuezhi Chai, Zhihua Wu, Yitong Wang, Zehuan Yuan, Daniel K. Du, Xinglong Wu)

We present NextFlow, a unified decoder-only autoregressive transformer trained on 6 trillion interleaved text-image discrete tokens. By leveraging a unified vision representation within a unified autoregressive architecture, NextFlow natively activates multimodal understanding and generation capabilities, unlocking abilities of image editing, interleaved content and video generation. Motivated by the distinct nature of modalities - where text is strictly sequential and images are inherently hierarchical - we retain next-token prediction for text but adopt next-scale prediction for visual generation. This departs from traditional raster-scan methods, enabling the generation of 1024x1024 images in just 5 seconds - orders of magnitude faster than comparable AR models. We address the instabilities of multi-scale generation through a robust training recipe. Furthermore, we introduce a prefix-tuning strategy for reinforcement learning. Experiments demonstrate that NextFlow achieves state-of-the-art performance among unified models and rivals specialized diffusion baselines in visual quality.

Visual Autoregression을 사용해 이미지 이해와 생성을 통합. (다만 논문의 초점은 생성 과제이긴 하다.)

Unifying image understanding and generation through Visual Autoregression. (Though the main focus of this work is generative tasks.)

#image-generation #image-understanding #autoregressive-model

Solar Open Technical Report

(Upstage Solar Team)

We introduce Solar Open, a 102B-parameter bilingual Mixture-of-Experts language model for underserved languages. Solar Open demonstrates a systematic methodology for building competitive LLMs by addressing three interconnected challenges. First, to train effectively despite data scarcity for underserved languages, we synthesize 4.5T tokens of high-quality, domain-specific, and RL-oriented data. Second, we coordinate this data through a progressive curriculum jointly optimizing composition, quality thresholds, and domain coverage across 20 trillion tokens. Third, to enable reasoning capabilities through scalable RL, we apply our proposed framework SnapPO for eﬀicient optimization. Across benchmarks in English and Korean, Solar Open achieves competitive performance, demonstrating the effectiveness of this methodology for underserved language AI development.

업스테이지의 MoE 모델. 사용한 데이터와 인프라에 대해 비교적 투명. (달성한 MFU를 계산해보고 싶음.)

MoE from Upstage. They are relatively transparent about the data and the infrastructure they have used. (I want to calculate the MFU they have achieved later.)

#llm #synthetic-data

CaveAgent: Transforming LLMs into Stateful Runtime Operators

(Maohao Ran, Zhenglin Wan, Cooper Lin, Yanting Zhang, Hongyu Xin, Hongwei Fan, Yibo Xu, Beier Luo, Yaxin Zhou, Wangbo Zhao, Lijie Yang, Lang Feng, Fuchao Yang, Jingxuan Wu, Yiqiao Huang, Chendong Ma, Dailing Jiang, Jianbo Deng, Sihui Han, Bo An, Yike Guo, Jun Song)

LLM-based agents are increasingly capable of complex task execution, yet current agentic systems remain constrained by text-centric paradigms. Traditional approaches rely on procedural JSON-based function calling, which often struggles with long-horizon tasks due to fragile multi-turn dependencies and context drift. In this paper, we present CaveAgent, a framework that transforms the paradigm from “LLM-as-Text-Generator” to “LLM-as-Runtime-Operator.” We introduce a Dual-stream Context Architecture that decouples state management into a lightweight semantic stream for reasoning and a persistent, deterministic Python Runtime stream for execution. In addition to leveraging code generation to efficiently resolve interdependent sub-tasks (e.g., loops, conditionals) in a single step, we introduce \textit{Stateful Runtime Management} in CaveAgent. Distinct from existing code-based approaches that remain text-bound and lack the support for external object injection and retrieval, CaveAgent injects, manipulates, and retrieves complex Python objects (e.g., DataFrames, database connections) that persist across turns. This persistence mechanism acts as a high-fidelity external memory to eliminate context drift, avoid catastrophic forgetting, while ensuring that processed data flows losslessly to downstream applications. Comprehensive evaluations on Tau$^2$-bench, BFCL and various case studies across representative SOTA LLMs demonstrate CaveAgent’s superiority. Specifically, our framework achieves a 10.5% success rate improvement on retail tasks and reduces total token consumption by 28.4% in multi-turn scenarios. On data-intensive tasks, direct variable storage and retrieval reduces token consumption by 59%, allowing CaveAgent to handle large-scale data that causes context overflow failures in both JSON-based and Code-based agents.

LLM에 상태를 가진 환경을 제공. Jupyter Notebook을 모델에 도구로 제공하는 것과 비슷한데 단 이 환경에 다른 에이전트가 접근할 수 있음. (커널 재시작 함수를 제공하고 싶어질 듯.)

Providing a stateful environment to LLMs. Basically it is similar to giving a Jupyter notebook as a tool but it can be accessed by other agents. (Definitely you would want to give them the restart kernel function.)

#agent

From Failure to Mastery: Generating Hard Samples for Tool-use Agents

(Bingguang Hao, Zengzhuang Xu, Yuntao Wen, Xinyi Xu, Yang Liu, Tong Zhao, Maolin Wang, Long Chen, Dong Wang, Yicheng Chen, Cunyin Peng, Xiangyu Zhao, Chenyi Zhuang, Ji Zhang)

The advancement of LLM agents with tool-use capabilities requires diverse and complex training corpora. Existing data generation methods, which predominantly follow a paradigm of random sampling and shallow generation, often yield simple and homogeneous trajectories that fail to capture complex, implicit logical dependencies. To bridge this gap, we introduce HardGen, an automatic agentic pipeline designed to generate hard tool-use training samples with verifiable reasoning. Firstly, HardGen establishes a dynamic API Graph built upon agent failure cases, from which it samples to synthesize hard traces. Secondly, these traces serve as conditional priors to guide the instantiation of modular, abstract advanced tools, which are subsequently leveraged to formulate hard queries. Finally, the advanced tools and hard queries enable the generation of verifiable complex Chain-of-Thought (CoT), with a closed-loop evaluation feedback steering the continuous refinement of the process. Extensive evaluations demonstrate that a 4B parameter model trained with our curated dataset achieves superior performance compared to several leading open-source and closed-source competitors (e.g., GPT-5.2, Gemini-3-Pro and Claude-Opus-4.5). Our code, models, and dataset will be open-sourced to facilitate future research.

도구 사용 궤적 합성. 실패 사례에서 사용이 어려운 도구를 찾아낸 다음 API 그래프를 사용해 어려운 쿼리와 고급 도구를 생성.

Synthesizing trajectories for tool use. Finding out challenging tools from the failure cases, and building hard queries and advanced tools using the API graph.

#agent

Discussion about this post

Ready for more?