🤖 AI Summary
This work proposes Uni-RS, the first unified multimodal model tailored for remote sensing. It targets the spatial asymmetry between image-text understanding and text-to-image generation: models that correctly recognize and describe object locations often produce inaccurate spatial layouts when generating images. To mitigate this, Uni-RS decouples geometric structure from visual content generation through explicit spatial-layout planning, spatially aware query supervision, and image-caption spatial layout variation. Together, these components alleviate the spatial reversal between comprehension and generation and substantially improve spatial fidelity in text-to-image synthesis across multiple remote sensing benchmarks, while Uni-RS maintains strong performance on a range of multimodal understanding tasks, including image captioning, visual grounding, and visual question answering.
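To make the decoupling concrete, here is a minimal sketch of the two-stage idea: first plan an explicit spatial layout from the instruction, then synthesize pixels conditioned on that fixed layout. All names (`LayoutBox`, `plan_layout`, `synthesize`) and the hard-coded boxes are hypothetical illustrations, not the paper's actual API.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class LayoutBox:
    """One planned object: a category plus a normalized bounding box."""
    category: str
    box: Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in [0, 1]

def plan_layout(instruction: str) -> List[LayoutBox]:
    """Stage 1 (spatial-layout planning): map the textual instruction to an
    explicit layout plan. A real planner would predict these boxes; here one
    example is hard-coded purely for illustration."""
    # e.g. "an airplane to the left of a storage tank"
    return [
        LayoutBox("airplane", (0.05, 0.30, 0.40, 0.70)),
        LayoutBox("storage tank", (0.60, 0.30, 0.95, 0.70)),
    ]

def synthesize(instruction: str, layout: List[LayoutBox]):
    """Stage 2: generate visual content conditioned on the frozen layout, so
    geometry is decided before any pixels are produced. Stands in for the
    generative backbone, which this sketch does not implement."""
    raise NotImplementedError

if __name__ == "__main__":
    for obj in plan_layout("an airplane to the left of a storage tank"):
        print(obj.category, obj.box)
```

Because geometry is committed in stage 1, spatial errors can be caught or supervised at the plan level rather than only in the rendered image.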
📝 Abstract
Unified remote sensing multimodal models exhibit a pronounced spatial reversal curse: although they can accurately recognize and describe object locations in images, they often fail to faithfully execute the same spatial relations during text-to-image generation, relations that constitute core semantic information in remote sensing. Motivated by this observation, we propose Uni-RS, the first unified multimodal model tailored for remote sensing, to explicitly address the spatial asymmetry between understanding and generation. Specifically, we first introduce explicit Spatial-Layout Planning, which transforms textual instructions into spatial layout plans and decouples geometric planning from visual synthesis. We then impose Spatial-Aware Query Supervision to bias learnable queries toward the spatial relations explicitly specified in the instruction. Finally, we develop Image-Caption Spatial Layout Variation to expose the model to systematic, geometry-consistent spatial transformations of paired images and captions. Extensive experiments across multiple benchmarks show that our approach substantially improves spatial faithfulness in text-to-image generation while maintaining strong performance on multimodal understanding tasks such as image captioning, visual grounding, and visual question answering (VQA).
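As one illustration of a geometry-consistent image-caption transformation in the spirit of Image-Caption Spatial Layout Variation, the sketch below flips an image horizontally and rewrites the caption's left/right relations to match, so the pair stays geometrically consistent. The relation vocabulary, function names, and pairing logic are assumptions for illustration, not the paper's actual augmentation pipeline.

```python
import re
from PIL import Image

# Spatial relations whose meaning inverts under a horizontal flip.
# Illustrative and deliberately small; a real vocabulary would be larger.
H_FLIP_SWAPS = {
    "left of": "right of",
    "right of": "left of",
    "on the left": "on the right",
    "on the right": "on the left",
}
_PATTERN = re.compile("|".join(re.escape(k) for k in H_FLIP_SWAPS))

def flip_caption(caption: str) -> str:
    """Swap left/right relations so the text stays consistent with a
    horizontally mirrored image."""
    return _PATTERN.sub(lambda m: H_FLIP_SWAPS[m.group(0)], caption)

def flip_pair(image: Image.Image, caption: str):
    """Produce one augmented training pair whose geometry and text agree."""
    return image.transpose(Image.Transpose.FLIP_LEFT_RIGHT), flip_caption(caption)

print(flip_caption("a bridge on the left of the river"))
# -> "a bridge on the right of the river"
```

Training on such pairs exposes the generator to the same scene under systematically varied layouts while the caption always describes the geometry it actually sees.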