🤖 AI Summary
This study addresses the automatic extraction of key metadata (meeting ID, date, location, attendees, and start/end times) from municipal council minutes, whose formats are heterogeneous and non-standardized. The authors propose a two-stage pipeline: first, a question-answering model locates the opening and closing text segments that contain the metadata; second, Transformer-based models (BERTimbau and XLM-RoBERTa, each with and without a Conditional Random Field (CRF) layer) perform fine-grained entity recognition, with delexicalization applied to improve generalization. Presented as the first benchmark for metadata extraction from municipal minutes, the approach outperforms general-purpose large language models (Phi, Gemini) in in-domain settings. Cross-municipality experiments show reduced generalization, which the authors attribute to the variability and linguistic complexity of municipal records.
📝 Abstract
Municipal meeting minutes are official documents of local governance, exhibiting heterogeneous formats and writing styles. Effective information retrieval (IR) requires identifying metadata such as meeting number, date, location, participants, and start/end times, elements that are rarely standardized or easy to extract automatically. Existing named entity recognition (NER) models are ill-suited to this task, as they are not adapted to such domain-specific categories. In this paper, we propose a two-stage pipeline for metadata extraction from municipal minutes. First, a question answering (QA) model identifies the opening and closing text segments containing metadata. Transformer-based models (BERTimbau and XLM-RoBERTa, with and without a CRF layer) are then applied for fine-grained entity extraction and enhanced through delexicalization. To evaluate our proposed pipeline, we benchmark it against both open-weight (Phi) and closed-weight (Gemini) LLMs, assessing predictive performance, inference cost, and carbon footprint. Our results demonstrate strong in-domain performance, exceeding that of larger general-purpose LLMs. However, cross-municipality evaluation reveals reduced generalization, reflecting the variability and linguistic complexity of municipal records. This work establishes the first benchmark for metadata extraction from municipal meeting minutes, providing a solid foundation for future research in this domain.
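The abstract does not say how delexicalization is implemented. As a rough illustration only, one common reading is to replace memorizable surface forms (numbers, dates, times) with generic placeholders so that a tagger leans on context rather than on specific strings. The sketch below assumes that reading; the regular expressions, the placeholder names, and the `delexicalize` function are hypothetical and are not taken from the paper.

```python
import re

# Hypothetical patterns: the paper does not specify its delexicalization rules.
# Order matters: more specific shapes (time, date) are tried before bare numbers.
TOKEN_RE = re.compile(
    r"(?P<TIME>\b\d{1,2}[:h]\d{2}\b)"
    r"|(?P<DATE>\b\d{1,2}/\d{1,2}/\d{4}\b)"
    r"|(?P<NUM>\b\d+\b)"
)


def delexicalize(text):
    """Replace number-like tokens with placeholders.

    Returns the rewritten text and the list of (placeholder, original)
    pairs in reading order, so the original values can be restored later.
    """
    originals = []

    def _swap(match):
        label = match.lastgroup  # name of the alternative that matched
        originals.append((label, match.group()))
        return f"<{label}>"

    return TOKEN_RE.sub(_swap, text), originals


if __name__ == "__main__":
    sample = "Ata 12 de 03/05/2021 às 14h30"
    print(delexicalize(sample))
```

Keeping the replaced values alongside the rewritten text is a design choice of this sketch: the extraction target is the metadata itself, so the original strings must remain recoverable after tagging.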