ImAgent: A Unified Multimodal Agent Framework for Test-Time Scalable Image Generation

📅 2025-11-14
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current text-to-image (T2I) models often produce semantically inconsistent or visually distorted images under ambiguous prompts. Mainstream mitigation strategies, such as prompt rewriting, best-of-N sampling, and self-refinement, rely on auxiliary modules and operate independently, resulting in poor test-time scalability and high computational overhead. This paper introduces ImAgent, a training-free, test-time scalable multimodal agent framework that unifies reasoning, generation, and self-assessment into a single, dynamically coordinated pipeline. A policy controller orchestrates multiple generative actions and establishes a closed-loop feedback mechanism without external models or fine-tuning, improving both image fidelity and semantic alignment. Experiments on diverse generation and editing tasks show consistent gains over the backbone, with ImAgent even surpassing strong baselines in cases where the backbone fails, and robust behavior under complex or ambiguous prompts.

📝 Abstract
Recent text-to-image (T2I) models have made remarkable progress in generating visually realistic and semantically coherent images. However, they still suffer from randomness and inconsistency with the given prompts, particularly when textual descriptions are vague or underspecified. Existing approaches, such as prompt rewriting, best-of-N sampling, and self-refinement, can mitigate these issues but usually require additional modules and operate independently, hindering test-time scaling efficiency and increasing computational overhead. In this paper, we introduce ImAgent, a training-free unified multimodal agent that integrates reasoning, generation, and self-evaluation within a single framework for efficient test-time scaling. Guided by a policy controller, multiple generation actions dynamically interact and self-organize to enhance image fidelity and semantic alignment without relying on external models. Extensive experiments on image generation and editing tasks demonstrate that ImAgent consistently improves over the backbone and even surpasses other strong baselines where the backbone model fails, highlighting the potential of unified multimodal agents for adaptive and efficient image generation under test-time scaling.
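To make the described mechanism concrete, below is a minimal Python sketch of a policy-controlled, closed-loop generation agent of the kind the abstract outlines: rewrite, generate, self-evaluate, and decide within a fixed test-time budget. This is an illustrative assumption, not ImAgent's actual interface; the names ImageAgent, generate, evaluate, rewrite, budget, and accept_threshold are hypothetical, and the three callables stand in for the single unified multimodal backbone, so no external models are involved.

```python
# Hypothetical sketch of a training-free, closed-loop test-time scaling agent
# in the spirit of ImAgent. All names are illustrative, not the paper's API.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    prompt: str     # prompt actually sent to the generator
    image: object   # backbone-specific image handle
    score: float    # self-evaluation score in [0, 1]


class ImageAgent:
    """Policy-controlled loop: rewrite -> generate -> self-evaluate -> decide."""

    def __init__(self,
                 generate: Callable[[str], object],
                 evaluate: Callable[[str, object], float],
                 rewrite: Callable[[str, List[Candidate]], str],
                 budget: int = 4,
                 accept_threshold: float = 0.85):
        # All three callables are assumed to be served by the same unified
        # multimodal backbone, so no external models are required.
        self.generate = generate
        self.evaluate = evaluate
        self.rewrite = rewrite
        self.budget = budget
        self.accept_threshold = accept_threshold

    def run(self, user_prompt: str) -> Candidate:
        history: List[Candidate] = []
        prompt = user_prompt
        for _ in range(self.budget):
            image = self.generate(prompt)
            # Always score against the ORIGINAL prompt to preserve user intent.
            score = self.evaluate(user_prompt, image)
            cand = Candidate(prompt=prompt, image=image, score=score)
            history.append(cand)

            # Policy decision: stop early if the candidate is good enough,
            # otherwise spend remaining budget on a refined prompt.
            if score >= self.accept_threshold:
                return cand
            prompt = self.rewrite(user_prompt, history)

        # Budget exhausted: fall back to the best candidate seen (best-of-N).
        return max(history, key=lambda c: c.score)


if __name__ == "__main__":
    # Toy stubs so the sketch runs end-to-end; replace with backbone calls.
    import random

    agent = ImageAgent(
        generate=lambda p: f"<image for: {p}>",
        evaluate=lambda p, img: random.uniform(0.5, 1.0),
        rewrite=lambda p, hist: p + " (add missing detail)",
    )
    best = agent.run("a small red boat on a foggy lake")
    print(best.score, best.prompt)
```

Under this reading, best-of-N sampling, prompt rewriting, and self-refinement become interchangeable actions scheduled by one controller rather than separate modules, which is what enables efficient scaling of compute at test time.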
Problem

Research questions and friction points this paper is trying to address.

Addresses randomness and inconsistency in text-to-image generation models
Mitigates semantic misalignment when textual prompts are vague
Reduces computational overhead from independent modules during test-time scaling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified multimodal agent integrates reasoning, generation, and self-evaluation
Policy controller enables dynamic self-organization without external models
Training-free framework enhances image fidelity and semantic alignment
🔎 Similar Papers
No similar papers found.