Image-POSER: Reflective RL for Multi-Expert Image Generation and Editing

📅 2025-11-14
🤖 AI Summary
Current text-to-image models struggle to reliably follow long, compositional prompts. To address this, the paper proposes Image-POSER, a multi-expert framework that uses reflective reinforcement learning, formulated as a Markov decision process, to perform dynamic task decomposition and expert scheduling end-to-end. The framework orchestrates a registry of pretrained text-to-image and image-to-image experts, while a vision-language model acts as a structured critic that gives feedback on alignment at each step. The key contribution is a learned expert-invocation policy that combines the strengths of different models rather than relying on any single one. On both industry-standard and custom benchmarks, the method outperforms baselines, including frontier models, in alignment, fidelity, and aesthetics, and it is consistently preferred in human evaluations.

📝 Abstract
Recent advances in text-to-image generation have produced strong single-shot models, yet no individual system reliably executes the long, compositional prompts typical of creative workflows. We introduce Image-POSER, a reflective reinforcement learning framework that (i) orchestrates a diverse registry of pretrained text-to-image and image-to-image experts, (ii) handles long-form prompts end-to-end through dynamic task decomposition, and (iii) supervises alignment at each step via structured feedback from a vision-language model critic. By casting image synthesis and editing as a Markov Decision Process, we learn non-trivial expert pipelines that adaptively combine strengths across models. Experiments show that Image-POSER outperforms baselines, including frontier models, across industry-standard and custom benchmarks in alignment, fidelity, and aesthetics, and is consistently preferred in human evaluations. These results highlight that reinforcement learning can endow AI systems with the capacity to autonomously decompose, reorder, and combine visual models, moving towards general-purpose visual assistants.
Problem

Research questions and friction points this paper is trying to address.

No single text-to-image system reliably executes long, compositional prompts
How to orchestrate multiple pretrained experts through dynamic task decomposition
How to improve image alignment and fidelity using reinforcement learning with structured feedback
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reflective RL orchestrates diverse pretrained image experts
Dynamic task decomposition handles long-form prompts end-to-end
Vision-language model supervises alignment via structured feedback
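
The three ideas above (an expert registry, step-wise critic feedback, and an MDP formulation) can be illustrated with a toy example. The following is a minimal sketch under loud assumptions, not the paper's implementation: the expert names, the `critic` function, the reward shape, and the use of tabular Q-learning are all hypothetical stand-ins. Real experts would be image models and the real critic would be a vision-language model; here both are stubs that operate on a set of satisfied prompt requirements.

```python
import random
from collections import defaultdict

# Hypothetical registry of stub experts (names and fields are illustrative only).
EXPERTS = {
    "t2i_base":   {"needs_image": False, "adds": {"scene"}},
    "i2i_object": {"needs_image": True,  "adds": {"object"}},
    "i2i_style":  {"needs_image": True,  "adds": {"style"}},
}
GOALS = frozenset({"scene", "object", "style"})  # assumed decomposition of a prompt
STEP_COST = 0.05  # assumed small penalty per expert call
MAX_STEPS = 6


def critic(state):
    """Stand-in for a VLM critic: fraction of prompt requirements satisfied."""
    return len(state & GOALS) / len(GOALS)


def step(state, action):
    """Apply one expert; the state is the set of requirements met so far."""
    spec = EXPERTS[action]
    if not spec["needs_image"]:
        next_state = frozenset(spec["adds"])  # fresh generation discards prior edits
    elif state:
        next_state = state | spec["adds"]     # edit the existing image
    else:
        next_state = state                    # nothing to edit yet
    reward = critic(next_state) - critic(state) - STEP_COST
    return next_state, reward


def train(episodes=2000, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning over (state, expert) pairs with epsilon-greedy exploration."""
    rng = random.Random(seed)
    q = defaultdict(float)
    actions = list(EXPERTS)
    for _ in range(episodes):
        state = frozenset()
        for _ in range(MAX_STEPS):
            if rng.random() < epsilon:
                action = rng.choice(actions)
            else:
                action = max(actions, key=lambda a: q[(state, a)])
            next_state, reward = step(state, action)
            done = next_state == GOALS
            best_next = 0.0 if done else max(q[(next_state, a)] for a in actions)
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
            if done:
                break
    return q


def greedy_pipeline(q):
    """Roll out the learned policy greedily and return the expert sequence."""
    state, plan = frozenset(), []
    for _ in range(MAX_STEPS):
        action = max(EXPERTS, key=lambda a: q[(state, a)])
        plan.append(action)
        state, _ = step(state, action)
        if state == GOALS:
            break
    return plan, state


if __name__ == "__main__":
    plan, final_state = greedy_pipeline(train())
    print(plan, sorted(final_state))
```

In this toy setting the reward is the change in the critic's score minus a per-call cost, so the learned policy favors short pipelines that start with generation and follow with edits. A tabular method is used only because the state space is tiny; it is swapped in for illustration and says nothing about the learning algorithm the authors use.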
Hossein Mohebbi
University of Waterloo
Mohammed Abdulrahman
University of Waterloo
Yanting Miao
University of Waterloo
Pascal Poupart
University of Waterloo
Artificial Intelligence, Machine Learning, Reinforcement Learning, Federated Learning, NLP
Suraj Kothawade
Google
Machine Learning and Computer Vision