Skywork-R1V4: Toward Agentic Multimodal Intelligence through Interleaved Thinking with Images and DeepResearch

📅 2025-12-02
📈 Citations: 0 · Influential citations: 0
🤖 AI Summary
Existing multimodal agent methods decouple image manipulation from web search, rely on costly reinforcement learning, and lack planning grounded in real tool-execution trajectories. This work proposes a dynamic interleaved reasoning mechanism that, for the first time, unifies multimodal planning, visual manipulation, and deep search, enabling long-horizon agentic behavior under a purely supervised fine-tuning framework, without reinforcement learning. The approach is trained on a high-quality dataset of fewer than 30,000 planning-execution-aligned trajectories and uses stepwise consistency filtering to ensure reliable reasoning. On MMSearch and FVQA it scores 66.1 and 67.2, respectively, surpassing Gemini 2.5 Flash on all 11 benchmark metrics, and it solves complex cross-modal tasks requiring more than ten sequential tool invocations.
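The mechanism described above is, at its core, an agent loop that freely alternates image operations with retrieval. As a rough illustration only, here is a minimal Python sketch of such an interleaved loop; the tool names (`crop_image`, `web_search`), the `Step`/`Trajectory` containers, and the `model.next_action` interface are hypothetical placeholders, not the Skywork-R1V4 API.

```python
# Minimal sketch of an interleaved "thinking with images" + search loop.
# All names here (Step, Trajectory, model.next_action, the tool names) are
# illustrative assumptions, not the paper's actual API.

from dataclasses import dataclass, field

@dataclass
class Step:
    tool: str          # e.g. "crop_image" or "web_search" (hypothetical)
    args: dict
    observation: str   # tool output appended to the reasoning context

@dataclass
class Trajectory:
    question: str
    steps: list = field(default_factory=list)
    answer: str = ""

def run_agent(model, tools, question, max_calls=15):
    """Alternate between visual operations and external retrieval until the
    model emits a final answer or the tool-call budget runs out."""
    traj = Trajectory(question)
    for _ in range(max_calls):
        action = model.next_action(traj)            # plan the next step
        if action.kind == "answer":                 # model decides to stop
            traj.answer = action.text
            break
        result = tools[action.tool](**action.args)  # image op or search
        traj.steps.append(Step(action.tool, action.args, str(result)))
    return traj
```

Because the model, not a fixed pipeline, chooses each next action, the same loop can string together the ten-plus sequential tool calls the summary reports.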

📝 Abstract
Despite recent progress in multimodal agentic systems, existing approaches often treat image manipulation and web search as disjoint capabilities, rely heavily on costly reinforcement learning, and lack planning grounded in real tool-execution traces. To address these limitations, we present Skywork-R1V4, a 30B (A3B) parameter multimodal agentic model that unifies multimodal planning, active image manipulation ("thinking with images"), deep multimodal search, and, most critically, interleaved reasoning that dynamically alternates between visual operations and external knowledge retrieval. Trained solely via supervised fine-tuning on fewer than 30,000 high-quality, planning-execution-consistent trajectories and validated through stepwise consistency filtering, Skywork-R1V4 achieves state-of-the-art results across perception and multimodal search benchmarks: it scores 66.1 on MMSearch and 67.2 on FVQA, surpassing Gemini 2.5 Flash on all 11 metrics. Skywork-R1V4 exhibits emergent long-horizon reasoning at inference time, successfully orchestrating more than 10 tool calls to solve complex, multi-step tasks. Our results demonstrate that sophisticated agentic multimodal intelligence can be achieved through carefully curated supervised learning alone, without any reliance on reinforcement learning.
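The abstract credits stepwise consistency filtering for the quality of the training trajectories. Below is a minimal sketch of what such a filter could look like, assuming each trajectory pairs a plan with its executed steps and a scoring function (e.g., an LLM judge) rates how faithfully each execution realizes its planned step; the paper's concrete criteria are not specified on this page.

```python
# Minimal sketch of stepwise consistency filtering. Assumptions: each
# trajectory is a dict with "plan" and "steps" lists of equal intent, and
# `judge(plan_step, exec_step)` returns a faithfulness score in [0, 1].
# The actual judge and threshold used by the paper are not given here.

def stepwise_consistency_filter(trajectories, judge, threshold=0.9):
    """Keep only trajectories whose every executed step matches its plan."""
    kept = []
    for traj in trajectories:
        plan, steps = traj["plan"], traj["steps"]
        if len(plan) != len(steps):
            continue  # plan and execution diverged in length: discard
        if all(judge(p, s) >= threshold for p, s in zip(plan, steps)):
            kept.append(traj)
    return kept
```

Filtering at the step level rather than only on final answers would help explain how fewer than 30,000 trajectories suffice for supervised fine-tuning, though that reading is an inference from the abstract.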
Problem

Research questions and friction points this paper is trying to address.

Image manipulation and web search are treated as disjoint capabilities in existing multimodal agents.
Prior approaches rely heavily on costly reinforcement learning.
Planning is not grounded in real tool-execution trajectories, which prevents interleaved reasoning between visual operations and knowledge retrieval.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unifies multimodal planning with active image manipulation.
Integrates deep multimodal search and interleaved visual-knowledge reasoning.
Achieves state-of-the-art performance via supervised fine-tuning only.
👥 Authors
Yifan Zhang
Multimodality Team, Skywork AI
Liang Hu
Multimodality Team, Skywork AI
Haofeng Sun
Multimodality Team, Skywork AI
Peiyu Wang
Multimodality Team, Skywork AI
Yichen Wei
SHUKUN Technology
deep learning, computer vision, medical image analysis
Shukang Yin
University of Science and Technology of China
Computer Vision, Multimodal Learning
Jiangbo Pei
Multimodality Team, Skywork AI
Wei Shen
Multimodality Team, Skywork AI
Peng Xia
PhD student, Department of Computer Science, UNC Chapel Hill
Multimodal Agent, Healthcare
Yi Peng
Bytedance
Machine Learning, Image Processing, Visualization
Tianyidan Xie
Multimodality Team, Skywork AI
Eric Li
Multimodality Team, Skywork AI
Yang Liu
Multimodality Team, Skywork AI
Xuchen Song
CTO @ Mureka.ai | Head of Multimodality & Spatial AI @ Skywork.ai
Music Generation, Multimodality, Multimodal Understanding, Multimodal Generation
Yahui Zhou
Multimodality Team, Skywork AI