Hummingbird: High Fidelity Image Generation via Multimodal Context Alignment

📅 2025-02-07

📈 Citations: 0

✨ Influential: 0

career value

197K/year

🤖 AI Summary

Existing diffusion models struggle to simultaneously ensure image diversity and scene-attribute fidelity—particularly regarding object interactions and spatial relations—under multimodal scene-aware generation tasks (e.g., visual question answering, human-object interaction reasoning) conditioned on reference images and text queries. To address this, we propose the first scene-faithful multimodal contextual alignment diffusion framework: (1) a differentiable multimodal consistency reward integrating CLIP and HOI-aware features; (2) a context evaluator jointly optimizing global semantic and fine-grained structural alignment; and (3) a contrastive enhancement sampling strategy to improve diversity. Furthermore, we introduce the first benchmark enabling joint evaluation of scene fidelity and diversity. On MME Perception and Bongard HOI, our method achieves a 23.6% improvement in scene-attribute fidelity over state-of-the-art approaches, demonstrating significant gains in both faithfulness and generative quality.

Technology Category

Application Category

📝 Abstract

While diffusion models are powerful in generating high-quality, diverse synthetic data for object-centric tasks, existing methods struggle with scene-aware tasks such as Visual Question Answering (VQA) and Human-Object Interaction (HOI) Reasoning, where it is critical to preserve scene attributes in generated images consistent with a multimodal context, i.e. a reference image with accompanying text guidance query. To address this, we introduce Hummingbird, the first diffusion-based image generator which, given a multimodal context, generates highly diverse images w.r.t. the reference image while ensuring high fidelity by accurately preserving scene attributes, such as object interactions and spatial relationships from the text guidance. Hummingbird employs a novel Multimodal Context Evaluator that simultaneously optimizes our formulated Global Semantic and Fine-grained Consistency Rewards to ensure generated images preserve the scene attributes of reference images in relation to the text guidance while maintaining diversity. As the first model to address the task of maintaining both diversity and fidelity given a multimodal context, we introduce a new benchmark formulation incorporating MME Perception and Bongard HOI datasets. Benchmark experiments show Hummingbird outperforms all existing methods by achieving superior fidelity while maintaining diversity, validating Hummingbird's potential as a robust multimodal context-aligned image generator in complex visual tasks.

Problem

Research questions and friction points this paper is trying to address.

Generates diverse images with high fidelity

Preserves scene attributes in multimodal context

Outperforms existing methods in complex visual tasks

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Context Evaluator optimizes semantic rewards

Diffusion-based generator ensures high fidelity images

New benchmark incorporates MME and Bongard datasets

🔎 Similar Papers

No similar papers found.

Bosch Group

Renningen, BW, DE

PhD - Effiziente Neuronale Repräsentation von Datensätzen

Bosch Group

Renningen, BW, DE

Research Scientist Intern, Multimodal Generative AI and Robotics (PhD)