🤖 AI Summary
Existing unified multimodal models suffer from a "single-tool" paradigm in interleaved text-image generation, which limits their applicability to tasks that demand factual accuracy, procedural precision, or realistic image editing. To address this, we propose a tool-augmented interleaved generation framework that employs a large language model (LLM) as an intelligent orchestrator, dynamically invoking heterogeneous visual tools, including diffusion models, code executors, image editors, and web retrieval modules. Methodologically, we design a reinforcement learning-driven hybrid reward mechanism that integrates rule-based constraints with LLM-based evaluation feedback, augmented with test-time scaling for improved generalization. Our approach achieves significant gains over state-of-the-art methods across four established benchmarks and demonstrates robust performance on a newly constructed dataset. The core contribution lies in reframing multimodal generation as a composable, scalable tool-calling problem, enabling flexible, controllable, and factually consistent cross-modal content creation.
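The hybrid reward described above can be sketched as a weighted blend of cheap rule-based checks and a judge-model score. This is a minimal illustration, not the paper's actual reward: the rule set, the weight `alpha`, and the `judge_score` stub are all hypothetical assumptions.

```python
# Hedged sketch of a hybrid reward: rule-based constraints plus an
# LLM/MLLM-judge signal. All specifics here are illustrative assumptions.
from typing import List

def rule_reward(text: str, images: List[str]) -> float:
    """Deterministic, rule-based constraints (e.g. format and tool-usage checks)."""
    score = 0.0
    if images:                  # at least one image was produced
        score += 0.5
    if len(text.split()) >= 5:  # text response is not degenerate
        score += 0.5
    return score

def judge_score(text: str, images: List[str]) -> float:
    """Stand-in for an LLM/MLLM evaluator; a real system would query a model
    to rate coherence and image-text alignment."""
    return 0.8  # fixed value for illustration

def hybrid_reward(text: str, images: List[str], alpha: float = 0.5) -> float:
    """Blend the two signals; alpha is a hypothetical mixing weight."""
    return alpha * rule_reward(text, images) + (1 - alpha) * judge_score(text, images)

print(hybrid_reward("A short caption describing the generated image.", ["<img>"]))
```

In an RL loop, this scalar would score each sampled interleaved response; the rule term keeps outputs well-formed cheaply, while the judge term supplies the softer quality signal.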
📝 Abstract
We propose LLM-Interleaved (LLM-I), a flexible and dynamic framework that reframes interleaved image-text generation as a tool-use problem. LLM-I is designed to overcome the "one-tool" bottleneck of current unified models, which are limited to synthetic imagery and struggle with tasks requiring factual grounding or programmatic precision. Our framework empowers a central LLM or MLLM agent to intelligently orchestrate a diverse toolkit of specialized visual tools, including online image search, diffusion-based generation, code execution, and image editing. The agent is trained to select and apply these tools proficiently via a Reinforcement Learning (RL) framework that features a hybrid reward system combining rule-based logic with judgments from LLM and MLLM evaluators. Trained on a diverse new dataset using four different model backbones, LLM-I demonstrates state-of-the-art performance, outperforming existing methods by a large margin across four benchmarks. We also introduce a novel test-time scaling strategy that provides further performance gains. Project Page: https://github.com/ByteDance-BandAI/LLM-I.
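The tool-use framing above can be sketched as a central agent that emits tool calls dispatched to a registry of specialized visual tools. This is a toy sketch of the idea, not LLM-I's implementation: the tool names, the hard-coded `plan` stub (which a trained LLM agent would produce), and the string placeholders for images are all hypothetical.

```python
# Minimal sketch of interleaved generation as tool use: an agent-produced
# plan is dispatched to heterogeneous visual tools. All names are illustrative.
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class ToolCall:
    name: str      # which tool the agent chose
    argument: str  # free-form argument (search query, prompt, code, edit instruction)

# Registry of specialized tools; real versions would return actual images.
TOOLS: Dict[str, Callable[[str], str]] = {
    "image_search": lambda q: f"<retrieved image for '{q}'>",
    "diffusion":    lambda p: f"<generated image from prompt '{p}'>",
    "code_exec":    lambda c: f"<figure rendered by executing: {c}>",
    "image_edit":   lambda e: f"<edited image per instruction '{e}'>",
}

def plan(prompt: str) -> List[ToolCall]:
    """Stand-in for the LLM planner; in LLM-I the RL-trained agent decides
    which tools to invoke. Hard-coded here for illustration."""
    return [
        ToolCall("image_search", "Eiffel Tower at night"),
        ToolCall("diffusion", "a watercolor sketch of a Paris cafe"),
    ]

def generate_interleaved(prompt: str) -> List[str]:
    """Interleave a text segment with the outputs of the chosen tools."""
    segments = [f"Text about: {prompt}"]
    for call in plan(prompt):
        segments.append(TOOLS[call.name](call.argument))
    return segments

print(generate_interleaved("a travel post about Paris"))
```

The design choice this illustrates is composability: factual tasks route to retrieval, charts to code execution, and creative imagery to diffusion, instead of forcing one generator to handle everything.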