🤖 AI Summary
Traditional approaches rely on manually designed annotation schemas and exhaustive document labeling, which are costly and difficult to scale. This work proposes an end-to-end framework that uses large language models to automatically transform natural-language research questions and raw text into structured databases, supported by an interactive interface that enables user-guided refinement. The method establishes, for the first time, a closed-loop pipeline from research questions to structured evidence, integrating expert feedback and domain-adaptation mechanisms. Evaluated with domain experts in law and computational biology, it yields outputs that support real-world analysis in both fields. The system has been open-sourced, along with a public web interface.
📝 Abstract
Many disciplines pose natural-language research questions over large document collections, and answering them typically requires structured evidence. Traditionally, this evidence is obtained by manually designing an annotation schema and exhaustively labeling the corpus, a slow and error-prone process. We introduce ScheMatiQ, which uses calls to a backbone LLM to turn a question and a corpus into a schema and a grounded database, with a web interface that lets users steer and revise the extraction. In collaboration with domain experts, we show that ScheMatiQ yields outputs that support real-world analysis in law and computational biology. We release ScheMatiQ as open source with a public web interface, and invite experts across disciplines to use it with their own data. All resources, including the website, source code, and demonstration video, are available at: www.ScheMatiQ-ai.com