🤖 AI Summary
Scientific reasoning is often hindered by the difficulty of jointly using structured experimental data and unstructured scientific literature. This work introduces SpectraQuery, a hybrid question-answering framework that integrates a relational Raman spectroscopy database with a vector-indexed literature corpus. Using semantic parsing, the framework translates natural-language questions into coordinated SQL queries and literature retrieval operations, enabling joint reasoning over quantitative evidence and mechanistic explanations. The design is inspired by Structured and Unstructured Query Language (SUQL) and uses retrieval-augmented generation (RAG). Approximately 80% of generated SQL queries are fully correct, and synthesized answers reach 93–97% groundedness with 10–15 retrieved passages. In expert evaluation, battery scientists rate responses 4.1–4.6 out of 5 across accuracy, relevance, grounding, and clarity.
📝 Abstract
Scientific reasoning increasingly requires linking structured experimental data with the unstructured literature that explains it, yet most large language model (LLM) assistants cannot reason jointly across these modalities. We introduce SpectraQuery, a hybrid natural-language query framework that integrates a relational Raman spectroscopy database with a vector-indexed scientific literature corpus using a Structured and Unstructured Query Language (SUQL)-inspired design. By combining semantic parsing with retrieval-augmented generation, SpectraQuery translates open-ended questions into coordinated SQL and literature retrieval operations, producing cited answers that unify numerical evidence with mechanistic explanation. Across SQL correctness, answer groundedness, retrieval effectiveness, and expert evaluation, SpectraQuery demonstrates strong performance: approximately 80 percent of generated SQL queries are fully correct, synthesized answers reach 93-97 percent groundedness with 10-15 retrieved passages, and battery scientists rate responses highly across accuracy, relevance, grounding, and clarity (4.1-4.6/5). These results show that hybrid retrieval architectures can meaningfully support scientific workflows by bridging data and discourse for high-volume experimental datasets.
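To make the "coordinated SQL and literature retrieval" idea concrete, the following is a minimal sketch, not SpectraQuery's implementation. The table schema, the sample rows, the passage texts, and the function names (`build_db`, `retrieve`, `answer`) are all hypothetical placeholders. The SQL is written by hand here, standing in for the semantic-parsing step, and a bag-of-words cosine similarity stands in for a real vector index.

```python
import math
import sqlite3
from collections import Counter


def build_db():
    """Create a toy in-memory spectral table (hypothetical schema and values)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE peaks (sample TEXT, cycle INTEGER, shift_cm1 REAL, intensity REAL)"
    )
    conn.executemany(
        "INSERT INTO peaks VALUES (?, ?, ?, ?)",
        [
            ("cell_A", 1, 1350.0, 0.40),
            ("cell_A", 50, 1350.0, 0.65),
            ("cell_A", 1, 1580.0, 1.00),
            ("cell_A", 50, 1580.0, 0.90),
        ],
    )
    return conn


# Placeholder passages for illustration only; not drawn from the paper's corpus.
PASSAGES = [
    {"id": "P1", "text": "the d band intensity increases with disorder in cycled carbon electrodes"},
    {"id": "P2", "text": "electrolyte decomposition forms a surface layer on the anode"},
]


def bow(text):
    """Bag-of-words vector; a stand-in for a learned text embedding."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query, passages, k=1):
    """Return the k passages most similar to the query."""
    q = bow(query)
    ranked = sorted(passages, key=lambda p: cosine(q, bow(p["text"])), reverse=True)
    return ranked[:k]


def answer(conn, sql, literature_query, k=1):
    """Run the structured query and the literature lookup side by side.

    A full system would pass both results to an LLM to write a cited answer;
    this sketch only returns the raw evidence.
    """
    rows = conn.execute(sql).fetchall()
    cited = retrieve(literature_query, PASSAGES, k)
    return {"rows": rows, "citations": [p["id"] for p in cited]}
```

A call such as `answer(build_db(), "SELECT cycle, intensity FROM peaks WHERE shift_cm1 = 1350.0 ORDER BY cycle", "why does the d band intensity increase")` returns the numeric rows together with the identifier of the best-matching passage, which is the pairing of quantitative evidence and explanatory text that the abstract describes.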