PublicHearingBR: A Brazilian Portuguese Dataset of Public Hearing Transcripts for Summarization of Long Documents

📅 2024-10-10
🏛️ arXiv.org
📈 Citations: 0
Influential: 0

🤖 AI Summary
Problem: A lack of high-quality benchmark datasets hinders long-document summarization research for Brazilian Portuguese. Method: The paper introduces a Brazilian Portuguese long-document summarization dataset of 1,200+ instances, built from publicly available transcripts of Brazilian Chamber of Deputies hearings. Each transcript is paired with a news article and a structured summary that lists the participants and their statements or opinions. A hybrid summarization system provides a baseline. To address hallucination, the paper discusses LLM-based evaluation of generated summaries and releases annotated data that can be used for natural language inference (NLI) in Portuguese. Contributions: (1) an openly available benchmark for Portuguese long-document summarization; (2) a baseline system for future studies; and (3) a discussion of evaluation metrics for LLM-generated summaries, with a focus on hallucination.

📝 Abstract
This paper introduces PublicHearingBR, a Brazilian Portuguese dataset designed for summarizing long documents. The dataset consists of transcripts of public hearings held by the Brazilian Chamber of Deputies, paired with news articles and structured summaries containing the individuals participating in the hearing and their statements or opinions. The dataset supports the development and evaluation of long document summarization systems in Portuguese. Our contributions include the dataset, a hybrid summarization system to establish a baseline for future studies, and a discussion on evaluation metrics for summarization involving large language models, addressing the challenge of hallucination in the generated summaries. As a result of this discussion, the dataset also provides annotated data that can be used in Natural Language Inference tasks in Portuguese.
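As a rough illustration of the sample layout the abstract describes (a transcript paired with a news article and a structured summary of participants and their statements), one record could be modeled as below. This is only a sketch: the class and field names (`HearingRecord`, `Statement`, `speaker`, `opinions`) are assumptions made for illustration and are not the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Statement:
    """One participant and the statements or opinions attributed to them."""
    speaker: str  # hypothetical field name
    opinions: List[str] = field(default_factory=list)  # hypothetical field name


@dataclass
class HearingRecord:
    """One sample: transcript, paired news article, structured summary."""
    transcript: str    # full text of the public hearing
    news_article: str  # news article paired with the hearing
    structured_summary: List[Statement] = field(default_factory=list)

    def speakers(self) -> List[str]:
        """Return the participants named in the structured summary, in order."""
        return [s.speaker for s in self.structured_summary]


# Placeholder content only; not taken from the dataset.
record = HearingRecord(
    transcript="(long hearing transcript)",
    news_article="(news article text)",
    structured_summary=[
        Statement("Participant A", ["Supports the proposal."]),
        Statement("Participant B", ["Asks for more data.", "Opposes the deadline."]),
    ],
)
```

Keeping the structured summary as a list of per-speaker entries makes it straightforward to compare a generated summary against the reference one participant at a time.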

Problem

Research questions and friction points this paper is trying to address.

Summarizing long Brazilian Portuguese public hearing transcripts
Developing evaluation metrics for summarization with large language models
Addressing hallucination challenges in generated summaries

Innovation

Methods, ideas, or system contributions that make the work stand out.

Brazilian Portuguese dataset for long document summarization
Hybrid summarization system establishing baseline performance
Annotated data usable for natural language inference tasks in Portuguese
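One way NLI-style annotations can feed a hallucination measure is to label each statement in a generated summary against the source transcript and report the share that is not entailed. The sketch below assumes the standard three-way NLI label set (entailment, neutral, contradiction) and a hypothetical helper `unsupported_rate`; it is not the evaluation protocol used in the paper, and the dataset's own label scheme may differ.

```python
from typing import Iterable

# Standard three-way NLI labels (an assumption; the dataset may use others).
ENTAILMENT = "entailment"
NEUTRAL = "neutral"
CONTRADICTION = "contradiction"


def unsupported_rate(labels: Iterable[str]) -> float:
    """Fraction of summary statements that the source does not entail."""
    labels = list(labels)
    if not labels:
        return 0.0  # no statements, so nothing is unsupported
    unsupported = sum(1 for label in labels if label != ENTAILMENT)
    return unsupported / len(labels)


# Illustrative labels for four statements of one generated summary.
example_labels = [ENTAILMENT, ENTAILMENT, NEUTRAL, CONTRADICTION]
example_rate = unsupported_rate(example_labels)
```

Counting both neutral and contradicted statements as unsupported is a deliberately strict choice; a looser variant could count only contradictions.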