GeoGPT.RAG Technical Report

📅 2025-08-18
🤖 AI Summary
Large language models (LLMs) lack sufficient domain expertise in Earth sciences and struggle to integrate user-specific literature. To address this, we propose GeoGPT, an open-source retrieval-augmented generation (RAG) system tailored for geoscience research. Methodologically, we construct a specialized geoscience knowledge base and introduce a domain-adaptive retrieval framework comprising two novel components: GeoEmbedding, a lightweight, open-source geoscience-specific embedding model, and GeoReranker, a fine-grained re-ranking model. Both are fine-tuned on geoscience data to enhance semantic alignment and retrieval accuracy. The system further supports dynamic construction of personalized knowledge bases from user-uploaded documents. Experiments demonstrate significant improvements in both answer accuracy and factual consistency for geoscience question answering. Core modules are publicly released under open-source licenses, fostering transparency, reproducibility, and collaborative advancement of AI tools in Earth sciences.

📝 Abstract
GeoGPT is an open large language model system built to advance research in the geosciences. To enhance its domain-specific capabilities, we integrated Retrieval-Augmented Generation (RAG), which augments model outputs with relevant information retrieved from an external knowledge source. GeoGPT uses RAG to draw from the GeoGPT Library, a specialized corpus curated for geoscientific content, enabling it to generate accurate, context-specific answers. Users can also create personalized knowledge bases by uploading their own publication lists, allowing GeoGPT to retrieve and respond using user-provided materials. To further improve retrieval quality and domain alignment, we fine-tuned both the embedding model and a ranking model that scores retrieved passages by relevance to the query. These enhancements optimize RAG for geoscience applications and significantly improve the system's ability to deliver precise and trustworthy outputs. GeoGPT reflects a strong commitment to open science through its emphasis on collaboration, transparency, and community-driven development. As part of this commitment, we have open-sourced two core RAG components, GeoEmbedding and GeoReranker, to support geoscientists, researchers, and professionals worldwide with powerful, accessible AI tools.
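
The retrieve-then-rerank flow described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the bag-of-words `embed` function is a toy stand-in for an embedding model such as GeoEmbedding, the term-overlap `rerank_score` is a toy stand-in for a reranker such as GeoReranker, and the function names, sample passages, and prompt wording are hypothetical. It is not the GeoGPT implementation or its API.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def embed(text):
    """Toy stand-in for an embedding model: sparse term-count vector."""
    return Counter(tokenize(text))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, corpus, k):
    """Stage 1: return the k passages most similar to the query."""
    q = embed(query)
    return sorted(corpus, key=lambda p: cosine(q, embed(p)), reverse=True)[:k]


def rerank_score(query, passage):
    """Toy stand-in for a reranker: share of distinct query terms in the passage."""
    q_terms = set(tokenize(query))
    p_terms = set(tokenize(passage))
    return len(q_terms & p_terms) / len(q_terms) if q_terms else 0.0


def rerank(query, passages, top_n):
    """Stage 2: rescore the retrieved candidates and keep the top_n."""
    return sorted(passages, key=lambda p: rerank_score(query, p), reverse=True)[:top_n]


def build_prompt(query, passages):
    """Assemble the retrieved context and the question into one LLM prompt."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"


# Hypothetical shared library plus a user-uploaded document.
library = [
    "Basalt is a fine-grained volcanic rock formed from rapidly cooled lava.",
    "Granite is a coarse-grained intrusive igneous rock rich in quartz and feldspar.",
    "Sedimentary basins accumulate sediment over millions of years.",
]
user_docs = ["Our field survey found basalt flows overlying older sedimentary layers."]

query = "How is basalt formed from lava?"
candidates = retrieve(query, library + user_docs, k=3)
best = rerank(query, candidates, top_n=2)
print(build_prompt(query, best))
```

In this sketch the user's documents are simply appended to the shared corpus before retrieval; a real system would more likely keep separate indexes and use dense vectors with an approximate nearest-neighbor search.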
Problem

Research questions and friction points this paper is trying to address.

Advancing geoscience research with an open large language model
Enhancing domain-specific capabilities using Retrieval-Augmented Generation
Improving retrieval quality through fine-tuned embedding and ranking models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrated Retrieval-Augmented Generation for geosciences
Fine-tuned embedding and ranking models for relevance
Open-sourced GeoEmbedding and GeoReranker components
Fei Huang
GeoGPT Team Zhejiang Lab
Fan Wu
GeoGPT Team Zhejiang Lab
Zeqing Zhang
The University of Hong Kong
robotic manipulation, multi-agent system, collision detection
Qihao Wang
GeoGPT Team Zhejiang Lab
Long Zhang
GeoGPT Team Zhejiang Lab
Grant M. Boquet
GeoGPT Team Zhejiang Lab
Hongyang Chen
Sun Yat-sen University
SDN, Cloud Computing, Microservice, AIOps