🤖 AI Summary
Visual spatial understanding (VSU) suffers from limited performance on the spatial image-to-text (SI2T) and spatial text-to-image (ST2I) tasks because 3D spatial features are hard to model. Method: This paper proposes a bidirectional collaborative modeling paradigm, featuring: (1) a novel 3D scene graph (3DSG) that serves as a spatial representation shared across both tasks; (2) a Spatial Dual Discrete Diffusion (SD³) framework that uses intermediate features from the easier 3D→X direction to guide the harder X→3D direction, so that the two tasks enhance each other; and (3) integration of dual learning, discrete diffusion, cross-modal knowledge distillation, and joint optimization. Results: On the VSD dataset, the method significantly outperforms mainstream I2T and T2I models. Further analysis indicates that collaboration between the two tasks improves 3D spatial understanding.
📝 Abstract
In visual spatial understanding (VSU), spatial image-to-text (SI2T) and spatial text-to-image (ST2I) are two fundamental tasks that form a dual pair. Existing methods that tackle SI2T or ST2I in isolation fall short in spatial understanding because 3D spatial features are difficult to model. In this work, we model SI2T and ST2I jointly under a dual learning framework. Within this framework, we represent 3D spatial scene features with a novel 3D scene graph (3DSG) representation that is shared by, and benefits, both tasks. Further, motivated by the intuition that the easier 3D$\to$image and 3D$\to$text processes also exist symmetrically in ST2I and SI2T, respectively, we propose the Spatial Dual Discrete Diffusion (SD$^3$) framework, which uses the intermediate features of the 3D$\to$X processes to guide the harder X$\to$3D processes, so that ST2I and SI2T benefit each other. On the visual spatial understanding dataset VSD, our system significantly outperforms mainstream T2I and I2T methods. Further in-depth analysis shows how the dual learning strategy contributes to these gains.