ReSearch: A Multi-Stage Machine Learning Framework for Earth Science Data Discovery

📅 2026-01-20
🤖 AI Summary
This work addresses a growing challenge in Earth science: rapid data growth makes it hard to match research objectives to heterogeneous datasets, and existing retrieval systems struggle to interpret high-level scientific intent. To bridge this gap, the authors propose a multi-stage, reasoning-enhanced data discovery framework that maps scientific questions to relevant datasets through a three-phase pipeline: intent parsing, high-recall retrieval, and context-aware reranking. The approach combines intent understanding, multi-stage retrieval, and large language model-based reranking, and explicitly decouples the recall and precision objectives. The authors also construct a literature-grounded evaluation benchmark derived from real scientific studies. Experiments show consistent improvements over baseline approaches in both recall and ranking performance, particularly for task-oriented queries that express abstract scientific goals.

📝 Abstract
The rapid expansion of Earth Science data from satellite observations, reanalysis products, and numerical simulations has created a critical bottleneck in scientific discovery, namely identifying relevant datasets for a given research objective. Existing discovery systems are primarily retrieval-centric and struggle to bridge the gap between high-level scientific intent and heterogeneous metadata at scale. We introduce ReSearch, a multi-stage, reasoning-enhanced search framework that formulates Earth Science data discovery as an iterative process of intent interpretation, high-recall retrieval, and context-aware ranking. ReSearch integrates lexical search, semantic embeddings, abbreviation expansion, and large language model reranking within a unified architecture that explicitly separates recall and precision objectives. To enable realistic evaluation, we construct a literature-grounded benchmark by aligning natural language intent with datasets cited in peer-reviewed Earth Science studies. Experiments demonstrate that ReSearch consistently improves recall and ranking performance over baseline methods, particularly for task-based queries expressing abstract scientific goals. These results underscore the importance of intent-aware, multi-stage search as a foundational capability for reproducible and scalable Earth Science research.
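
The staged design described in the abstract (abbreviation expansion, high-recall retrieval, then precision-oriented reranking) can be illustrated with a minimal sketch. This is not the authors' implementation: the catalog, the abbreviation table, every function name, and the bag-of-words cosine used in place of learned semantic embeddings and an LLM reranker are all assumptions made purely for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical abbreviation table and dataset catalog (not from the paper).
ABBREVIATIONS = {
    "sst": "sea surface temperature",
    "ndvi": "normalized difference vegetation index",
}
CATALOG = {
    "ds_sst": "Daily sea surface temperature analysis from satellite observations",
    "ds_precip": "Global precipitation reanalysis product, monthly means",
    "ds_veg": "Normalized difference vegetation index from satellite imagery",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def expand_abbreviations(query):
    """Intent stage stand-in: append expansions of known abbreviations."""
    extra = [ABBREVIATIONS[t] for t in tokenize(query) if t in ABBREVIATIONS]
    return " ".join([query] + extra)


def lexical_score(query_tokens, doc_tokens):
    """Fraction of distinct query tokens that appear in the document."""
    unique = set(query_tokens)
    if not unique:
        return 0.0
    doc = set(doc_tokens)
    return sum(t in doc for t in unique) / len(unique)


def cosine_score(query_tokens, doc_tokens):
    """Bag-of-words cosine; a placeholder for embedding similarity."""
    q, d = Counter(query_tokens), Counter(doc_tokens)
    dot = sum(q[t] * d[t] for t in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(
        sum(v * v for v in d.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, k=10):
    """Recall stage: keep any dataset that either scorer matches at all."""
    q = tokenize(query)
    scores = {}
    for ds_id, description in CATALOG.items():
        d = tokenize(description)
        score = max(lexical_score(q, d), cosine_score(q, d))
        if score > 0:
            scores[ds_id] = score
    return sorted(scores, key=scores.get, reverse=True)[:k]


def rerank(query, candidate_ids, scorer):
    """Precision stage: reorder candidates with a separate scorer.

    The paper uses an LLM here; this sketch accepts any callable
    scorer(query, description) -> float.
    """
    return sorted(
        candidate_ids, key=lambda i: scorer(query, CATALOG[i]), reverse=True
    )


def discover(query, scorer, k=10):
    """Run the three stages in order and return ranked dataset ids."""
    expanded = expand_abbreviations(query)
    return rerank(expanded, retrieve(expanded, k), scorer)


def overlap_scorer(query, description):
    """Toy reranking scorer standing in for an LLM relevance judgment."""
    return cosine_score(tokenize(query), tokenize(description))
```

For example, `discover("trends in SST", overlap_scorer)` expands "SST" before retrieval, so the sea surface temperature dataset is found even though the raw query shares no tokens with its description. Keeping `retrieve` permissive and delegating ordering to `rerank` mirrors the separation of recall and precision objectives that the abstract describes.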
Problem

Research questions and friction points this paper is trying to address.

Earth Science data discovery
scientific intent
heterogeneous metadata
data retrieval bottleneck
dataset identification
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-stage search
intent-aware retrieval
semantic embedding
large language model reranking
Earth Science data discovery
Youran Sun
Department of Mathematics, University of Maryland, College Park, MD, USA
Yixin Wen
Department of Geography, University of Florida, Gainesville, FL, USA
Haizhao Yang
Department of Mathematics, Department of Computer Science, University of Maryland College Park
Data science, machine learning, high-performance computing, numerical linear algebra, applied and