Synergistic Dual Spatial-aware Generation of Image-to-Text and Text-to-Image

📅 2024-10-20
🏛️ Neural Information Processing Systems
📈 Citations: 2
Influential: 0
🤖 AI Summary
Visual spatial understanding (VSU) suffers from limited performance in scene-image-to-text (SI2T) and scene-text-to-image (ST2I) tasks due to insufficient 3D spatial modeling. Method: This paper proposes a bidirectional collaborative modeling paradigm, featuring: (1) the first use of a 3D scene graph (3DSG) as a shared cross-task spatial representation; (2) a Spatial Dual Discrete Diffusion (SD³) framework that leverages intermediate features from the easier 3D→X direction to guide the harder X→3D direction, enabling mutual feature enhancement; and (3) integration of dual learning, discrete diffusion, cross-modal knowledge distillation, and joint optimization. Results: On the VSD benchmark, the method significantly outperforms state-of-the-art I2T and T2I models. Ablation studies confirm that bidirectional task collaboration substantially improves 3D spatial understanding capability.
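The core SD³ idea, that intermediate features from the easier 3D→X direction guide the harder X→3D direction, can be illustrated with a minimal sketch. This is not the paper's code: the `blend_features` function, the linear blending rule, and the `alpha` weight are all illustrative assumptions standing in for whatever guidance mechanism the actual framework uses.

```python
# Conceptual sketch (assumed, not the paper's implementation) of the
# SD^3 guidance idea: intermediate features from the easier 3D->X
# process are blended into the harder X->3D process during dual
# learning. The blending rule and alpha are illustrative choices.

def blend_features(hard_feat, easy_feat, alpha=0.3):
    """Guide the hard X->3D features with the easy 3D->X features
    via a simple convex combination (an assumed guidance rule)."""
    assert len(hard_feat) == len(easy_feat)
    return [(1 - alpha) * h + alpha * e for h, e in zip(hard_feat, easy_feat)]

# SI2T side: image -> 3DSG (hard) is guided by 3DSG -> text (easy);
# the symmetric blending happens on the ST2I side. Toy values only.
hard = [0.2, 0.8, 0.5]   # intermediate X->3D features
easy = [1.0, 0.0, 0.5]   # intermediate 3D->X features
guided = blend_features(hard, easy)
print(guided)  # approximately [0.44, 0.56, 0.5]
```

In the paper's setting both directions run as discrete diffusion processes and the guidance is learned jointly; the fixed linear blend above only conveys the direction of information flow.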

📝 Abstract
In the visual spatial understanding (VSU) area, spatial image-to-text (SI2T) and spatial text-to-image (ST2I) are two fundamental tasks that appear in dual form. Existing methods for standalone SI2T or ST2I perform imperfectly in spatial understanding, due to the difficulty of 3D-wise spatial feature modeling. In this work, we model SI2T and ST2I together under a dual learning framework. Within this dual framework, we propose to represent the 3D spatial scene features with a novel 3D scene graph (3DSG) representation that can be shared by and benefit both tasks. Further, inspired by the intuition that the easier 3D→image and 3D→text processes also exist symmetrically in ST2I and SI2T, respectively, we propose the Spatial Dual Discrete Diffusion (SD³) framework, which utilizes the intermediate features of the 3D→X processes to guide the hard X→3D processes, such that the overall ST2I and SI2T benefit each other. On the visual spatial understanding dataset VSD, our system outperforms the mainstream T2I and I2T methods significantly. Further in-depth analysis reveals how our dual learning strategy advances.
Problem

Research questions and friction points this paper is trying to address.

Modeling 3D spatial features for image-text tasks
Improving spatial understanding in dual image-text generation
Enhancing both SI2T and ST2I through shared representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual learning framework for SI2T and ST2I tasks
3D scene graph representation for spatial features
Spatial Dual Discrete Diffusion guiding X-to-3D processes
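The shared 3DSG representation above can be pictured as a small graph of objects with 3D positions and spatial relations between them. The exact schema below (field names, the `(x, y, z)` tuple, the relation label) is a hypothetical sketch, not the paper's specification:

```python
# Illustrative sketch of a 3D scene graph (3DSG): objects as nodes
# carrying 3D positions, spatial relations as directed edges.
# The schema and field names here are assumptions for illustration.

scene_graph = {
    "nodes": [
        {"id": 0, "label": "table", "position": (0.0, 0.0, 0.0)},
        {"id": 1, "label": "lamp",  "position": (0.1, 0.0, 0.8)},
    ],
    "edges": [
        # (subject, relation, object): the lamp is on top of the table
        (1, "on_top_of", 0),
    ],
}

def relations_of(graph, node_id):
    """Return (relation, object) pairs where node_id is the subject."""
    return [(r, o) for s, r, o in graph["edges"] if s == node_id]

print(relations_of(scene_graph, 1))  # [('on_top_of', 0)]
```

A structure like this can serve both directions: SI2T reads spatial relations off the graph to generate text, while ST2I uses the same nodes and edges to lay out the generated image.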