🤖 AI Summary
Existing diffusion models struggle to simultaneously ensure image diversity and scene-attribute fidelity—particularly regarding object interactions and spatial relations—under multimodal scene-aware generation tasks (e.g., visual question answering, human-object interaction reasoning) conditioned on reference images and text queries. To address this, we propose the first scene-faithful multimodal contextual alignment diffusion framework: (1) a differentiable multimodal consistency reward integrating CLIP and HOI-aware features; (2) a context evaluator jointly optimizing global semantic and fine-grained structural alignment; and (3) a contrastive enhancement sampling strategy to improve diversity. Furthermore, we introduce the first benchmark enabling joint evaluation of scene fidelity and diversity. On MME Perception and Bongard HOI, our method achieves a 23.6% improvement in scene-attribute fidelity over state-of-the-art approaches, demonstrating significant gains in both faithfulness and generative quality.
📝 Abstract
While diffusion models are powerful in generating high-quality, diverse synthetic data for object-centric tasks, existing methods struggle with scene-aware tasks such as Visual Question Answering (VQA) and Human-Object Interaction (HOI) Reasoning, where it is critical to preserve scene attributes in generated images consistent with a multimodal context, i.e. a reference image with accompanying text guidance query. To address this, we introduce Hummingbird, the first diffusion-based image generator which, given a multimodal context, generates highly diverse images w.r.t. the reference image while ensuring high fidelity by accurately preserving scene attributes, such as object interactions and spatial relationships from the text guidance. Hummingbird employs a novel Multimodal Context Evaluator that simultaneously optimizes our formulated Global Semantic and Fine-grained Consistency Rewards to ensure generated images preserve the scene attributes of reference images in relation to the text guidance while maintaining diversity. As the first model to address the task of maintaining both diversity and fidelity given a multimodal context, we introduce a new benchmark formulation incorporating MME Perception and Bongard HOI datasets. Benchmark experiments show Hummingbird outperforms all existing methods by achieving superior fidelity while maintaining diversity, validating Hummingbird's potential as a robust multimodal context-aligned image generator in complex visual tasks.