Any2Any: Unified Arbitrary Modality Translation for Remote Sensing

📅 2026-03-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses missing observations in multimodal remote sensing, where existing cross-modal translation approaches model each modality pair independently, leading to quadratic complexity and poor generalization to unseen modality combinations. To overcome this, the authors propose a unified any-to-any modality translation framework that formulates translation as inference over a shared latent representation of scene semantics. Heterogeneous modalities are mapped through a geometrically aligned latent space and a shared backbone network, complemented by lightweight residual adapters that correct systematic biases. The study also introduces RST-1M, the first million-scale, five-modality paired remote sensing dataset, which enables training under sparse but connected supervision. The proposed method consistently outperforms pairwise approaches across 14 cross-modal translation tasks and demonstrates strong zero-shot generalization to unseen modality pairs.
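
The summary above describes a shared backbone operating over a geometrically aligned latent space, with lightweight target-specific residual adapters per output modality. The PyTorch sketch below is a minimal illustration of that structure, not the authors' implementation; all module names, layer sizes, and channel counts are assumptions.

```python
# Minimal sketch: per-modality encoders project inputs into one shared latent
# space, a single shared backbone maps source latents toward target semantics,
# and a lightweight residual adapter per target modality corrects systematic
# biases. All names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class ResidualAdapter(nn.Module):
    """Lightweight per-target adapter: adds a learned residual correction."""
    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return z + self.net(z)  # identity path keeps inference cost essentially unchanged

class Any2AnySketch(nn.Module):
    def __init__(self, modalities, in_channels, latent_dim: int = 256):
        super().__init__()
        # One small encoder per modality, all projecting into the same latent space.
        self.encoders = nn.ModuleDict({
            m: nn.Sequential(nn.Conv2d(in_channels[m], 64, 3, padding=1), nn.GELU(),
                             nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, latent_dim))
            for m in modalities
        })
        # A single shared backbone reused for every source-to-target pair.
        self.backbone = nn.Sequential(nn.Linear(latent_dim, latent_dim), nn.GELU(),
                                      nn.Linear(latent_dim, latent_dim))
        # One lightweight residual adapter per target modality.
        self.adapters = nn.ModuleDict({m: ResidualAdapter(latent_dim) for m in modalities})

    def forward(self, x: torch.Tensor, src: str, tgt: str) -> torch.Tensor:
        z_src = self.encoders[src](x)        # modality-specific projection
        z_shared = self.backbone(z_src)      # shared semantic mapping
        return self.adapters[tgt](z_shared)  # target-specific bias correction

# Usage with dummy data (modality names and channel counts are assumptions):
model = Any2AnySketch(["optical", "sar"], {"optical": 3, "sar": 1})
z_target = model(torch.randn(2, 3, 64, 64), src="optical", tgt="sar")
```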

📝 Abstract
Multi-modal remote sensing imagery provides complementary observations of the same geographic scene, yet such observations are frequently incomplete in practice. Existing cross-modal translation methods treat each modality pair as an independent task, resulting in quadratic complexity and limited generalization to unseen modality combinations. We formulate Any-to-Any translation as inference over a shared latent representation of the scene, where different modalities correspond to partial observations of the same underlying semantics. Based on this formulation, we propose Any2Any, a unified latent diffusion framework that projects heterogeneous inputs into a geometrically aligned latent space. Within this aligned space, the framework performs anchored latent regression with a shared backbone, decoupling modality-specific representation learning from semantic mapping. In addition, lightweight target-specific residual adapters correct systematic latent mismatches without increasing inference complexity. To support learning under sparse but connected supervision, we introduce RST-1M, the first million-scale remote sensing dataset with paired observations across five sensing modalities, providing supervision anchors for any-to-any translation. Experiments across 14 translation tasks show that Any2Any consistently outperforms pairwise translation methods and exhibits strong zero-shot generalization to unseen modality pairs. Code and models will be available at https://github.com/MiliLab/Any2Any.
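
As a rough illustration of the latent-diffusion formulation in the abstract, the sketch below runs a simplified DDPM-style reverse process in a shared latent space, conditioning the denoiser on the source-modality latent. The denoiser architecture, noise schedule, and step count are assumptions made for illustration, not the paper's configuration.

```python
# Simplified conditional latent-diffusion sampling: the source latent acts as
# conditioning while a denoising network iteratively refines a noisy target
# latent. Architecture, schedule, and step count are illustrative assumptions.
import torch
import torch.nn as nn

class LatentDenoiser(nn.Module):
    """Predicts the noise in a noisy target latent, conditioned on the source latent."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim * 2 + 1, 512), nn.GELU(), nn.Linear(512, dim))

    def forward(self, z_t, z_src, t):
        t_feat = t.float().view(-1, 1) / 1000.0  # crude scalar timestep embedding
        return self.net(torch.cat([z_t, z_src, t_feat], dim=-1))

@torch.no_grad()
def translate(denoiser, z_src, steps: int = 50):
    """DDPM-style ancestral sampling in the shared latent space (simplified)."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    z = torch.randn_like(z_src)  # start from pure noise in the target latent
    for t in reversed(range(steps)):
        t_batch = torch.full((z.shape[0],), t)
        eps = denoiser(z, z_src, t_batch)  # predicted noise at step t
        # Posterior mean of the reverse step (noise term skipped at the final step).
        z = (z - betas[t] / torch.sqrt(1.0 - alpha_bar[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:
            z = z + torch.sqrt(betas[t]) * torch.randn_like(z)
    return z

# Usage: translate a batch of source latents into target latents.
denoiser = LatentDenoiser(dim=256)
z_target = translate(denoiser, torch.randn(4, 256))
```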
Problem

Research questions and friction points this paper is trying to address.

remote sensing
multi-modal translation
incomplete observations
modality generalization
cross-modal translation
Innovation

Methods, ideas, or system contributions that make the work stand out.

latent diffusion
modality translation
remote sensing
zero-shot generalization
shared latent space