🤖 AI Summary
This work proposes Uni-RS, the first unified multimodal model tailored for remote sensing. It targets the spatial asymmetry between image-text understanding and text-to-image generation: models that correctly recognize and describe object locations often produce inaccurate spatial layouts when generating images. To mitigate this, Uni-RS decouples geometric structure from visual content generation through explicit spatial-layout planning, spatially aware query supervision, and image-caption spatial layout variation. Together, these components alleviate the spatial reversal between comprehension and generation and substantially improve spatial fidelity in text-to-image synthesis across multiple remote sensing benchmarks, while Uni-RS maintains strong performance on a range of multimodal understanding tasks, including image captioning, visual grounding, and visual question answering.
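To make the decoupling concrete, here is a minimal sketch of the two-stage idea: first plan an explicit spatial layout from the instruction, then synthesize pixels conditioned on that fixed layout. All names (`LayoutBox`, `plan_layout`, `synthesize`) and the hard-coded boxes are hypothetical illustrations, not the paper's actual API.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class LayoutBox:
    """One planned object: a category plus a normalized bounding box."""
    category: str
    box: Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in [0, 1]

def plan_layout(instruction: str) -> List[LayoutBox]:
    """Stage 1 (spatial-layout planning): map the textual instruction to an
    explicit layout plan. A real planner would predict these boxes; here one
    example is hard-coded purely for illustration."""
    # e.g. "an airplane to the left of a storage tank"
    return [
        LayoutBox("airplane", (0.05, 0.30, 0.40, 0.70)),
        LayoutBox("storage tank", (0.60, 0.30, 0.95, 0.70)),
    ]

def synthesize(instruction: str, layout: List[LayoutBox]):
    """Stage 2: generate visual content conditioned on the frozen layout, so
    geometry is decided before any pixels are produced. Stands in for the
    generative backbone, which this sketch does not implement."""
    raise NotImplementedError

if __name__ == "__main__":
    for obj in plan_layout("an airplane to the left of a storage tank"):
        print(obj.category, obj.box)
```

Because geometry is committed in stage 1, spatial errors can be caught or supervised at the plan level rather than only in the rendered image.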
📝 Abstract
Unified remote sensing multimodal models exhibit a pronounced spatial reversal curse: although they can accurately recognize and describe object locations in images, they often fail to faithfully execute the same spatial relations during text-to-image generation, relations that constitute core semantic information in remote sensing. Motivated by this observation, we propose Uni-RS, the first unified multimodal model tailored for remote sensing, to explicitly address the spatial asymmetry between understanding and generation. Specifically, we first introduce explicit Spatial-Layout Planning, which transforms textual instructions into spatial layout plans and decouples geometric planning from visual synthesis. We then impose Spatial-Aware Query Supervision to bias learnable queries toward the spatial relations explicitly specified in the instruction. Finally, we develop Image-Caption Spatial Layout Variation to expose the model to systematic, geometry-consistent spatial transformations of paired images and captions. Extensive experiments across multiple benchmarks show that our approach substantially improves spatial faithfulness in text-to-image generation while maintaining strong performance on multimodal understanding tasks such as image captioning, visual grounding, and visual question answering (VQA).
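As one illustration of a geometry-consistent image-caption transformation in the spirit of Image-Caption Spatial Layout Variation, the sketch below flips an image horizontally and rewrites the caption's left/right relations to match, so the pair stays geometrically consistent. The relation vocabulary, function names, and pairing logic are assumptions for illustration, not the paper's actual augmentation pipeline.

```python
import re
from PIL import Image

# Spatial relations whose meaning inverts under a horizontal flip.
# Illustrative and deliberately small; a real vocabulary would be larger.
H_FLIP_SWAPS = {
    "left of": "right of",
    "right of": "left of",
    "on the left": "on the right",
    "on the right": "on the left",
}
_PATTERN = re.compile("|".join(re.escape(k) for k in H_FLIP_SWAPS))

def flip_caption(caption: str) -> str:
    """Swap left/right relations so the text stays consistent with a
    horizontally mirrored image."""
    return _PATTERN.sub(lambda m: H_FLIP_SWAPS[m.group(0)], caption)

def flip_pair(image: Image.Image, caption: str):
    """Produce one augmented training pair whose geometry and text agree."""
    return image.transpose(Image.Transpose.FLIP_LEFT_RIGHT), flip_caption(caption)

print(flip_caption("a bridge on the left of the river"))
# -> "a bridge on the right of the river"
```

Training on such pairs exposes the generator to the same scene under systematically varied layouts while the caption always describes the geometry it actually sees.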