ScheMatiQ: From Research Question to Structured Data through Interactive Schema Discovery

📅 2026-04-10
📈 Citations: 0
Influential: 0

🤖 AI Summary
Traditional approaches rely on manually designed annotation schemas and exhaustive document labeling, which are costly and difficult to scale. This work proposes an end-to-end framework that uses large language models to transform a natural-language research question and raw text into a structured database, with an interactive interface for user-guided refinement. The method establishes a closed-loop pipeline from research questions to structured evidence, integrating expert feedback and domain-adaptation mechanisms. In collaboration with domain experts in law and computational biology, the authors show that its outputs support real-world analysis. The system is released as open source, together with a public web interface.

📝 Abstract
Many disciplines pose natural-language research questions over large document collections. Answering them typically requires structured evidence, traditionally obtained by manually designing an annotation schema and exhaustively labeling the corpus, a slow and error-prone process. We introduce ScheMatiQ, which uses calls to a backbone LLM to turn a question and a corpus into a schema and a grounded database, with a web interface that lets users steer and revise the extraction. In collaboration with domain experts, we show that ScheMatiQ yields outputs that support real-world analysis in law and computational biology. We release ScheMatiQ as open source with a public web interface, and invite experts across disciplines to use it with their own data. All resources, including the website, source code, and demonstration video, are available at: www.ScheMatiQ-ai.com
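
The question-to-database flow described in the abstract can be illustrated with a minimal sketch. This is not ScheMatiQ's actual code or API: every name here (`discover_schema`, `extract_row`, `build_database`, `fake_llm`, the prompt wording, and the JSON layout) is hypothetical. The sketch assumes the backbone LLM is a function from a prompt string to JSON text, and it treats a value as "grounded" only when its supporting quote appears verbatim in the source document. The optional `edit_schema` callback stands in for the interactive steering step.

```python
import json
from typing import Callable, Dict, List, Optional

# Hypothetical interface: a backbone LLM is any function from prompt to JSON text.
LLM = Callable[[str], str]


def discover_schema(question: str, corpus: List[str], llm: LLM) -> List[str]:
    """Ask the LLM for column names that would help answer the question."""
    prompt = (
        "Propose column names as a JSON list of strings for the question:\n"
        f"{question}\nSample document:\n{corpus[0]}"
    )
    return json.loads(llm(prompt))


def extract_row(doc: str, schema: List[str], llm: LLM) -> Dict[str, dict]:
    """Fill every column for one document, keeping only grounded values."""
    prompt = (
        "Fill the columns below as a JSON object mapping each column to an "
        "object with a value and a verbatim evidence quote.\n"
        f"Columns: {json.dumps(schema)}\nDocument:\n{doc}"
    )
    raw = json.loads(llm(prompt))
    row = {}
    for col in schema:
        cell = raw.get(col)
        # Grounding check: accept a value only if its quote occurs in the document.
        if isinstance(cell, dict) and cell.get("evidence") and cell["evidence"] in doc:
            row[col] = {"value": cell.get("value"), "evidence": cell["evidence"]}
        else:
            row[col] = {"value": None, "evidence": None}
    return row


def build_database(
    question: str,
    corpus: List[str],
    llm: LLM,
    edit_schema: Optional[Callable[[List[str]], List[str]]] = None,
):
    """Question + corpus -> (schema, rows); edit_schema models user steering."""
    schema = discover_schema(question, corpus, llm)
    if edit_schema is not None:
        schema = edit_schema(schema)
    rows = [extract_row(doc, schema, llm) for doc in corpus]
    return schema, rows


def fake_llm(prompt: str) -> str:
    """Canned stand-in for a real model so the sketch runs offline."""
    if prompt.startswith("Propose column names"):
        return json.dumps(["court", "outcome"])
    return json.dumps({
        "court": {"value": "Supreme Court", "evidence": "Supreme Court"},
        # This quote is not in the demo document, so the cell will be rejected.
        "outcome": {"value": "reversed", "evidence": "the judgment was overturned"},
    })


if __name__ == "__main__":
    docs = ["The Supreme Court reversed the ruling."]
    schema, rows = build_database("Which court decided, and how?", docs, fake_llm)
    print(schema)
    print(rows)
```

In this sketch a user revision is just a function over the column list, for example `edit_schema=lambda cols: cols + ["year"]`, after which extraction is rerun against the revised schema. A real system would presumably sample more than one document when proposing columns and would handle malformed model output; both are omitted here for brevity.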

Problem

Research questions and friction points this paper is trying to address.

structured data
annotation schema
document collections
research question
evidence extraction

Innovation

Methods, ideas, or system contributions that make the work stand out.

interactive schema discovery
large language models
structured data extraction
domain expert collaboration
open-source framework