🤖 AI Summary
To address the scarcity of high-quality Martian video data and the severe domain gap between Earth and Mars imagery, this paper proposes the Multimodal Mars Synthesis (M3arsSynth) data curation pipeline and the MarsGen conditional video generation model. Methodologically, the pipeline reconstructs geometrically consistent 3D Martian environments from NASA Planetary Data System (PDS) stereo navigation images, and MarsGen, a diffusion-based generator fine-tuned on the rendered data, synthesizes videos controllable by an initial frame, camera trajectories, and text prompts. The key contributions are: (1) the first video generation framework built on authentic Mars stereo data, producing high-fidelity, replayable videos with metric-scale, geometrically consistent 3D structure; and (2) superior performance over models trained on terrestrial data, maintaining both visual realism and geometric accuracy across diverse Martian terrains and acquisition conditions, which establishes a reliable visual foundation for mission rehearsal and robotic simulation.
📝 Abstract
Synthesizing realistic Martian landscape videos is crucial for mission rehearsal and robotic simulation. However, this task poses unique challenges due to the scarcity of high-quality Martian data and the significant domain gap between Martian and terrestrial imagery. To address these challenges, we propose a holistic solution composed of two key components: 1) a data curation pipeline, Multimodal Mars Synthesis (M3arsSynth), which reconstructs 3D Martian environments from real stereo navigation images sourced from NASA's Planetary Data System (PDS) and renders high-fidelity multiview 3D video sequences; and 2) a Martian terrain video generator, MarsGen, which synthesizes novel videos that are visually realistic and geometrically consistent with the 3D structure encoded in the data. Our M3arsSynth engine spans a wide range of Martian terrains and acquisition dates, enabling the generation of physically accurate 3D surface models at metric-scale resolution. MarsGen, fine-tuned on M3arsSynth data, synthesizes videos conditioned on an initial image frame and, optionally, camera trajectories or textual prompts, allowing video generation in novel environments. Experimental results show that our approach outperforms video synthesis models trained on terrestrial datasets, achieving superior visual fidelity and 3D structural consistency.
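The geometric core of a stereo-based reconstruction pipeline like M3arsSynth is recovering metric depth from the disparity between left and right navigation images. The paper does not specify its reconstruction method, so the following is only a minimal illustrative sketch of the standard depth-from-disparity relation; the focal length and stereo baseline used below are hypothetical camera parameters, not values from the paper.

```python
def disparity_to_depth(disparities_px, focal_px, baseline_m):
    """Convert per-pixel stereo disparities (pixels) to metric depth (meters).

    Uses the standard pinhole-stereo relation depth = focal * baseline / disparity.
    Zero or negative disparities are invalid and map to infinity.
    """
    return [
        focal_px * baseline_m / d if d > 0 else float("inf")
        for d in disparities_px
    ]


# Hypothetical rover stereo camera: 1000 px focal length, 0.42 m baseline.
# A 10 px disparity then corresponds to 1000 * 0.42 / 10 = 42 m depth.
depths = disparity_to_depth([10.0, 20.0, 0.0, 5.0],
                            focal_px=1000.0, baseline_m=0.42)
# -> [42.0, 21.0, inf, 84.0]
```

Because depth scales with the physical baseline, calibrated stereo rigs yield the metric-scale surface models the abstract describes, rather than the scale-ambiguous geometry of monocular reconstruction.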