ScheMatiQ: From Research Question to Structured Data through Interactive Schema Discovery

📅 2026-04-10

🤖 AI Summary

Traditional approaches rely on manually designed annotation schemas and exhaustive document labeling, which are costly and difficult to scale. This work proposes an end-to-end framework that uses large language models to automatically transform natural language research questions and raw text into structured databases, supported by an interactive interface that enables user-guided refinement. The method establishes, for the first time, a closed-loop pipeline from research questions to structured evidence, integrating expert feedback and domain-adaptation mechanisms. In collaboration with domain experts in law and computational biology, the authors show that its outputs support real-world analysis. The system has been released as open source, along with a public web interface.

📝 Abstract

Many disciplines pose natural-language research questions over large document collections. Answering them typically requires structured evidence, traditionally obtained by manually designing an annotation schema and exhaustively labeling the corpus, a slow and error-prone process. We introduce ScheMatiQ, which uses calls to a backbone LLM to turn a question and a corpus into a schema and a grounded database, with a web interface that lets users steer and revise the extraction. In collaboration with domain experts, we show that ScheMatiQ yields outputs that support real-world analysis in law and computational biology. We release ScheMatiQ as open source with a public web interface, and invite experts across disciplines to use it with their own data. All resources, including the website, source code, and demonstration video, are available at: www.ScheMatiQ-ai.com
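
To make the question-to-database flow described in the abstract concrete, here is a minimal Python sketch. It is not ScheMatiQ's actual code or API: every name (`Cell`, `discover_schema`, `extract_row`, `build_database`, the `drop` parameter) is hypothetical, the backbone LLM is replaced by two deterministic stub functions, and the toy corpus is made up. It only illustrates one plausible reading of "schema plus grounded database": columns are proposed from the question, each extracted value must carry a verbatim evidence span from its source document, and a user can steer the result by dropping columns.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Cell:
    value: str     # extracted answer for one column
    evidence: str  # verbatim span from the source document supporting the value


# Hypothetical signatures for the two model calls (stand-ins for a backbone LLM).
Proposer = Callable[[str], list]
Extractor = Callable[[str, str], Optional[Tuple[str, str]]]


def discover_schema(question: str, propose: Proposer, drop: frozenset = frozenset()) -> list:
    """Propose column names for the question, de-duplicated in order.

    `drop` models user steering: columns the user has rejected are removed.
    """
    columns = dict.fromkeys(propose(question))  # preserves first-seen order
    return [c for c in columns if c not in drop]


def extract_row(doc: str, schema: list, extract: Extractor) -> dict:
    """Fill one row; a cell is kept only if its evidence appears verbatim in doc."""
    row = {}
    for column in schema:
        result = extract(doc, column)
        if result is not None and result[1] in doc:
            row[column] = Cell(value=result[0], evidence=result[1])
        else:
            row[column] = None  # missing or ungrounded value
    return row


def build_database(question: str, corpus: list, propose: Proposer,
                   extract: Extractor, drop: frozenset = frozenset()):
    """Run schema discovery, then extract one grounded row per document."""
    schema = discover_schema(question, propose, drop)
    rows = [extract_row(doc, schema, extract) for doc in corpus]
    return schema, rows


# --- Deterministic stubs standing in for LLM calls (assumptions, toy data) ---

def stub_propose(question: str) -> list:
    # A real system would prompt a model; here the columns are hard-coded.
    return ["court", "outcome", "court"]


def stub_extract(doc: str, column: str) -> Optional[Tuple[str, str]]:
    # Parses toy "Key: value." sentences; returns (value, evidence sentence).
    for sentence in doc.split(". "):
        key, sep, value = sentence.partition(": ")
        if sep and key.lower() == column:
            return value.rstrip("."), sentence
    return None


toy_corpus = ["Court: Court A. Outcome: reversed.", "Court: Court B."]
schema, rows = build_database("Which courts reversed?", toy_corpus,
                              stub_propose, stub_extract)
```

Passing `drop=frozenset({"outcome"})` to `build_database` mimics a user revising the schema before re-running extraction. The verbatim-substring grounding check is a deliberately simple choice for this sketch; the source does not say how ScheMatiQ grounds its outputs.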

Problem

Research questions and friction points this paper is trying to address.

structured data

annotation schema

document collections

research question

evidence extraction

Innovation

Methods, ideas, or system contributions that make the work stand out.

interactive schema discovery

large language models

structured data extraction