AI Summary
Multimodal text-to-image generation suffers from weak cross-domain semantic alignment and low detail fidelity. To address these challenges, we propose a domain-specialized multi-agent reinforcement learning framework that establishes collaborative mechanisms among agents specialized for distinct domains, such as architecture, portraiture, and landscape. Our method introduces a novel bidirectional cross-modal alignment module and a composite reward function, integrating Proximal Policy Optimization (PPO), contrastive learning, and a bidirectional-attention-enhanced Transformer architecture, coupled with a multi-round iterative feedback system. Experiments demonstrate substantial improvements: a 1614% increase in the word count of generated text, a 69.7% reduction in ROUGE-1 score consistent with the enriched, more diverse output, and an overall evaluation score of 0.521. This work establishes a new paradigm for fine-grained, controllable, multi-domain text-to-image generation.
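The composite reward is described only at a high level. As a minimal sketch, assuming the three components (semantic similarity, quality, diversity) are each normalized to [0, 1] and combined by a weighted sum, the per-sample reward fed to PPO might look like the following; the weight values and scorer inputs are illustrative placeholders, not taken from the paper.

```python
# Illustrative composite reward for PPO training.
# The source states only that semantic similarity, linguistic/visual
# quality, and content diversity are balanced; the weights below are
# hypothetical and would be tuned per domain in practice.
from dataclasses import dataclass


@dataclass
class RewardWeights:
    semantic: float = 0.5   # cross-modal semantic similarity
    quality: float = 0.3    # linguistic / visual quality
    diversity: float = 0.2  # content diversity


def composite_reward(semantic_sim: float, quality: float, diversity: float,
                     w: RewardWeights = RewardWeights()) -> float:
    """Weighted sum of component scores, each assumed to lie in [0, 1]."""
    return (w.semantic * semantic_sim
            + w.quality * quality
            + w.diversity * diversity)
```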
Abstract
Multimodal text-to-image generation remains constrained by the difficulty of maintaining semantic alignment and professional-level detail across diverse visual domains. We propose a multi-agent reinforcement learning framework that coordinates domain-specialized agents (e.g., for architecture, portraiture, and landscape imagery) within two coupled subsystems: a text enhancement module and an image generation module, each augmented with multimodal integration components. Agents are trained using Proximal Policy Optimization (PPO) under a composite reward function that balances semantic similarity, linguistic and visual quality, and content diversity. Cross-modal alignment is enforced through contrastive learning, bidirectional attention, and iterative feedback between the text and image subsystems. Across six experimental settings, our system significantly enriches generated content (word count increased by 1614%) while reducing ROUGE-1 scores by 69.7%. Among fusion methods, Transformer-based strategies achieve the highest composite score (0.521), despite occasional stability issues. Multimodal ensembles yield moderate consistency (scores ranging from 0.444 to 0.481), reflecting the persistent challenge of cross-modal semantic grounding. These findings underscore the promise of collaborative, specialization-driven architectures for advancing reliable multimodal generative systems.
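The abstract does not spell out the contrastive alignment objective. A common realization, offered here purely as an assumption about how contrastive cross-modal alignment is typically implemented, is a CLIP-style symmetric InfoNCE loss over paired text and image embeddings:

```python
# Sketch of a CLIP-style symmetric contrastive loss (an assumption;
# the paper's exact alignment objective is not specified).
import torch
import torch.nn.functional as F


def contrastive_alignment_loss(text_emb: torch.Tensor,
                               image_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired embeddings.

    text_emb, image_emb: (batch, dim) tensors; row i of each is a
    matched text/image pair (the positive); all other rows are negatives.
    """
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    logits = text_emb @ image_emb.t() / temperature   # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_t2i = F.cross_entropy(logits, targets)        # text -> image direction
    loss_i2t = F.cross_entropy(logits.t(), targets)    # image -> text direction
    return (loss_t2i + loss_i2t) / 2


if __name__ == "__main__":
    # Toy usage: random 8-pair batch of 512-d embeddings.
    t, i = torch.randn(8, 512), torch.randn(8, 512)
    print(contrastive_alignment_loss(t, i).item())
```

The symmetric form penalizes misalignment in both retrieval directions, which matches the paper's emphasis on bidirectional text-image feedback.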