🤖 AI Summary
This work addresses the limited geometric reasoning of existing vision-language models (VLMs), which struggle to jointly produce a spatially consistent 3D camera trajectory and a natural language description from a collection of images of an existing 3D space. The proposed method, HouseTour, is an end-to-end pipeline that (1) uses a diffusion model to generate a smooth 3D camera trajectory constrained by known camera poses, (2) renders novel views along that trajectory with 3D Gaussian splatting, and (3) feeds the trajectory information into a VLM to produce 3D-grounded text. The core idea is to condition text generation on the generated trajectory rather than handle the two tasks independently; the authors also introduce a new joint metric for end-to-end evaluation. Experiments on a dataset of over 1,200 house-tour videos with camera poses, 3D reconstructions, and real estate descriptions show improved performance over baselines that treat trajectory and text generation separately.
📝 Abstract
We introduce HouseTour, a method for spatially-aware 3D camera trajectory and natural language summary generation from a collection of images depicting an existing 3D space. Unlike existing vision-language models (VLMs), which struggle with geometric reasoning, our approach generates smooth video trajectories via a diffusion process constrained by known camera poses and integrates this information into the VLM for 3D-grounded descriptions. We synthesize the final video using 3D Gaussian splatting to render novel views along the trajectory. To support this task, we present the HouseTour dataset, which includes over 1,200 house-tour videos with camera poses, 3D reconstructions, and real estate descriptions. Experiments demonstrate that incorporating 3D camera trajectories into the text generation process improves performance over methods handling each task independently. We evaluate both individual and end-to-end performance, introducing a new joint metric. Our work enables automated, professional-quality video creation for real estate and touristic applications without requiring specialized expertise or equipment.
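The abstract describes a diffusion process "constrained by known camera poses" without giving the mechanism. One generic way to impose such constraints in an iterative sampler is replacement-style conditioning: after every denoising step, the waypoints with known poses are overwritten with their given values. The toy sketch below illustrates only that idea and is not the HouseTour implementation. The function names (`smooth`, `clamp_known`, `sample_constrained_path`), the neighbour-averaging stand-in for a learned denoiser, the linear noise schedule, and the restriction to 3D positions without orientations are all assumptions made for illustration.

```python
import random


def smooth(path):
    """Stand-in 'denoiser' (assumption): average each point with its neighbours."""
    n = len(path)
    out = []
    for i, p in enumerate(path):
        prev_pt = path[max(i - 1, 0)]
        next_pt = path[min(i + 1, n - 1)]
        out.append(tuple((a + b + c) / 3.0 for a, b, c in zip(prev_pt, p, next_pt)))
    return out


def clamp_known(path, known):
    """Overwrite the waypoints whose camera positions are given."""
    return [known.get(i, p) for i, p in enumerate(path)]


def sample_constrained_path(known, length, steps=200, seed=0):
    """Toy reverse process: start from noise, denoise, re-impose known poses."""
    rng = random.Random(seed)
    # Start from pure Gaussian noise, as a diffusion sampler would.
    path = [tuple(rng.gauss(0.0, 1.0) for _ in range(3)) for _ in range(length)]
    path = clamp_known(path, known)
    for t in range(steps, 0, -1):
        sigma = 0.5 * (t - 1) / steps  # hypothetical schedule; zero at the last step
        path = smooth(path)
        path = [tuple(c + rng.gauss(0.0, sigma) for c in p) for p in path]
        path = clamp_known(path, known)  # constraint is re-applied every step
    return path


# Hypothetical known camera positions at waypoints 0, 5, and 9 of a 10-point path.
known_poses = {0: (0.0, 0.0, 1.5), 5: (2.0, 1.0, 1.5), 9: (4.0, 0.0, 1.5)}
trajectory = sample_constrained_path(known_poses, length=10)
```

Because the known waypoints are clamped after the final step, the returned path passes through them exactly, while the free waypoints are pulled toward a smooth interpolation between them. A real system would replace `smooth` with a trained noise-prediction network and would also have to handle camera orientation.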