LLaVA-LE: Large Language-and-Vision Assistant for Lunar Exploration

📅 2026-03-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a key limitation of existing vision-language models in lunar exploration: the absence of large-scale, planetary science–specific multimodal datasets. To bridge this gap, the authors introduce LUCID, the first multimodal dataset tailored for lunar missions, comprising 96k image–caption pairs and 81k question–answer pairs. Building on the LLaVA architecture, they propose a two-stage fine-tuning strategy: first aligning remote sensing imagery with scientific text through concept alignment, then applying instruction tuning to enable multi-level reasoning. Under GPT- and Gemini-based evaluations, the resulting model, LLaVA-LE, achieves a 3.3× overall performance improvement over the base LLaVA model and a reasoning score of 1.070, exceeding the judge's own reference score, substantially enhancing domain-specific reasoning capabilities for planetary science tasks.

📝 Abstract
Recent advances in multimodal vision-language models (VLMs) have enabled joint reasoning over visual and textual information, yet their application to planetary science remains largely unexplored. A key hindrance is the absence of large-scale datasets that pair real planetary imagery with detailed scientific descriptions. In this work, we introduce LLaVA-LE (Large Language-and-Vision Assistant for Lunar Exploration), a vision-language model specialized for lunar surface and subsurface characterization. To enable this capability, we curate a new large-scale multimodal lunar dataset, LUCID (LUnar Caption Image Dataset), consisting of 96k high-resolution panchromatic images paired with detailed captions describing lunar terrain characteristics, and 81k question–answer (QA) pairs derived from approximately 20k of those images. Leveraging this dataset, we fine-tune LLaVA using a two-stage training curriculum: (1) concept alignment for domain-specific terrain description, and (2) instruction-tuned visual question answering. We further design evaluation benchmarks spanning multiple levels of reasoning complexity relevant to lunar terrain analysis. Evaluated against GPT and Gemini judges, LLaVA-LE achieves a 3.3x overall performance gain over Base LLaVA and 2.1x over our Stage 1 model, with a reasoning score of 1.070 that exceeds the judge's own reference score. These results highlight the effectiveness of domain-specific multimodal data and instruction tuning for advancing VLMs in planetary exploration. Code is available at https://github.com/OSUPCVLab/LLaVA-LE.
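The headline reasoning score of 1.070 is a judge-normalized ratio: values above 1.0 mean the judge rated the model's answers above its own reference answers. A minimal sketch of how such a ratio could be computed is below; the per-question rating scale and the mean-of-means aggregation are assumptions for illustration, not the paper's exact evaluation protocol.

```python
# Hedged sketch: judge-normalized scoring in the style of LLM-as-a-judge
# evaluation. The 1-10 scale and simple averaging are assumptions, not
# the paper's exact protocol.

def relative_score(model_scores, reference_scores):
    """Average judge rating of model answers divided by the average
    rating of the judge's own reference answers. A value > 1.0 means
    the model's answers were rated above the references."""
    if len(model_scores) != len(reference_scores):
        raise ValueError("score lists must be the same length")
    avg_model = sum(model_scores) / len(model_scores)
    avg_ref = sum(reference_scores) / len(reference_scores)
    return avg_model / avg_ref

# Hypothetical per-question judge ratings on an assumed 1-10 scale.
model = [8.2, 7.9, 8.5, 8.0]
reference = [7.6, 7.8, 7.9, 7.5]
print(round(relative_score(model, reference), 3))  # > 1.0: model beats reference
```

With this framing, the reported 1.070 simply says the aggregate judge rating of LLaVA-LE's answers was 7% higher than the judge's own reference answers on the reasoning benchmark.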
Problem

Research questions and friction points this paper is trying to address.

multimodal vision-language models
planetary science
lunar exploration
dataset scarcity
scientific image captioning
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal vision-language model
lunar exploration
domain-specific fine-tuning
LUCID dataset
instruction-tuned VQA