🤖 AI Summary
This study addresses the low efficiency and poor scalability of manually extracting concept definitions from academic literature in media bias research. It proposes TaxoMatic, the first LLM-driven, end-to-end framework for domain-specific knowledge structuring that focuses on automatic definition extraction. TaxoMatic is a three-stage pipeline of data acquisition, relevance classification, and definition extraction, and it relies on rule-augmented prompt engineering and fine-tuning of large language models (e.g., Claude-3-Sonnet) on human-annotated data. Evaluated on a corpus of 2,398 manually annotated scholarly articles, TaxoMatic significantly outperforms baseline methods in both relevance classification and definition extraction, which indicates that systematic application of LLMs to academic definition extraction is feasible and effective. Its core contributions are a domain-adapted definition extraction paradigm and a reproducible technical pathway for constructing high-quality, domain-specific terminology knowledge bases.
📝 Abstract
This paper introduces TaxoMatic, a framework that leverages large language models to automate definition extraction from academic literature. Focusing on the media bias domain, the framework encompasses data collection, LLM-based relevance classification, and extraction of conceptual definitions. Evaluated on a dataset of 2,398 manually rated articles, the study demonstrates the framework's effectiveness, with Claude-3-Sonnet achieving the best results in both relevance classification and definition extraction. Future directions include expanding the dataset and applying TaxoMatic to additional domains.