🤖 AI Summary
This work addresses the limitations of current text-to-image generation models, which typically rely on static decoders and struggle to interpret implicit user intent, perform complex knowledge-based reasoning, or adapt to dynamic real-world changes. To overcome these challenges, we propose Mind-Brush, a unified agentic framework that reimagines image generation as a dynamic, knowledge-driven workflow, emulating the human cognitive process of “think–retrieve–create.” By actively retrieving multimodal evidence and invoking external reasoning tools, our approach explicitly resolves implicit visual constraints and grounds concepts that lie outside the model's training distribution. This study pioneers the integration of cognitive search and reasoning mechanisms into the generative pipeline, marking a paradigm shift from static synthesis to dynamic, knowledge-guided creation. We also introduce Mind-Bench, the first comprehensive evaluation benchmark encompassing real-time news, emerging concepts, and knowledge-reasoning domains. Experiments demonstrate that our method enables the Qwen-Image baseline to achieve a zero-to-one capability leap on Mind-Bench while establishing state-of-the-art performance on standard benchmarks such as WISE and RISE.
📝 Abstract
While text-to-image generation has achieved unprecedented fidelity, the vast majority of existing models function fundamentally as static text-to-pixel decoders and consequently often fail to grasp implicit user intentions. Although emerging unified understanding-generation models have improved intent comprehension, they still struggle with tasks that require complex knowledge reasoning within a single model. Moreover, constrained by static internal priors, these models remain unable to adapt to the evolving dynamics of the real world. To bridge these gaps, we introduce Mind-Brush, a unified agentic framework that transforms generation into a dynamic, knowledge-driven workflow. Simulating a human-like “think–retrieve–create” paradigm, Mind-Brush actively retrieves multimodal evidence to ground out-of-distribution concepts and employs reasoning tools to resolve implicit visual constraints. To rigorously evaluate these capabilities, we propose Mind-Bench, a comprehensive benchmark of 500 distinct samples spanning real-time news, emerging concepts, and knowledge-reasoning domains such as mathematics and Geo-Reasoning. Extensive experiments demonstrate that Mind-Brush significantly enhances the capabilities of unified models, realizing a zero-to-one capability leap for the Qwen-Image baseline on Mind-Bench while achieving superior results on established benchmarks such as WISE and RISE.
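To make the “think–retrieve–create” workflow concrete, the following is a minimal, hypothetical sketch of such an agentic loop. Every name in it (`plan_constraints`, `retrieve_evidence`, `compose_grounded_prompt`, `generate`) is an illustrative placeholder rather than Mind-Brush's actual API, and both the planner and the retrieval step are stubbed out; the point is only the control flow: decompose the prompt, ground it with external evidence, then hand a fully resolved description to a static text-to-image backbone.

```python
# Hypothetical sketch of a "think-retrieve-create" agentic loop.
# All names are illustrative placeholders, not the paper's actual API.

from dataclasses import dataclass, field


@dataclass
class Evidence:
    query: str
    snippets: list[str] = field(default_factory=list)


def plan_constraints(prompt: str) -> list[str]:
    """THINK: decompose the prompt into implicit visual constraints that a
    static decoder cannot resolve from its internal priors (e.g. 'current
    appearance of X' or 'the result of this equation')."""
    # Placeholder: a real agent would call an LLM planner here.
    return [prompt]


def retrieve_evidence(constraint: str) -> Evidence:
    """RETRIEVE: ground out-of-distribution concepts with external, possibly
    multimodal, evidence (web search, image search, reasoning tools)."""
    # Placeholder: swap in a real search engine, calculator, or solver.
    return Evidence(query=constraint, snippets=[f"evidence for: {constraint}"])


def compose_grounded_prompt(prompt: str, evidence: list[Evidence]) -> str:
    """Rewrite the user prompt into an explicit, evidence-grounded description
    that a static text-to-pixel decoder can follow directly."""
    facts = "; ".join(s for e in evidence for s in e.snippets)
    return f"{prompt}. Grounded facts: {facts}"


def generate(prompt: str) -> str:
    """CREATE: pass the resolved prompt to any text-to-image backbone
    (e.g. a Qwen-Image-style model); a string stands in for the image here."""
    grounded = compose_grounded_prompt(
        prompt, [retrieve_evidence(c) for c in plan_constraints(prompt)]
    )
    return f"image generated from: {grounded!r}"  # stand-in for the decoder call


if __name__ == "__main__":
    print(generate("the mascot unveiled at this week's expo, in watercolor"))
```

The design choice the sketch illustrates is that the generator itself stays unchanged: all dynamic knowledge enters through the prompt-rewriting stage, which is why such a framework can lift a fixed baseline on knowledge-dependent benchmarks without retraining it.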