Transcription and Recognition of Italian Parliamentary Speeches Using Vision-Language Models

📅 2026-03-30

🤖 AI Summary

This work addresses the limitations of traditional OCR pipelines, which produce transcription errors and limited semantic annotation when applied to scanned historical documents such as Italian parliamentary speeches. The paper proposes a pipeline based on Vision-Language Models that reasons jointly over visual layout and textual content to perform transcription, semantic segmentation, and speaker identification. A specialised OCR model first extracts text while preserving reading order; a large-scale Vision-Language Model then refines the transcription, classifies elements, and identifies speakers; finally, the extracted speakers are linked to the Chamber of Deputies knowledge base through SPARQL queries and multi-strategy fuzzy matching. Evaluation against an established benchmark shows substantial improvements in both transcription quality and speaker tagging.

📝 Abstract

Parliamentary proceedings represent a rich yet challenging resource for computational analysis, particularly when preserved only as scanned historical documents. Existing efforts to transcribe Italian parliamentary speeches have relied on traditional Optical Character Recognition pipelines, resulting in transcription errors and limited semantic annotation. In this paper, we propose a pipeline based on Vision-Language Models for the automatic transcription, semantic segmentation, and entity linking of Italian parliamentary speeches. The pipeline employs a specialised OCR model to extract text while preserving reading order, followed by a large-scale Vision-Language Model that performs transcription refinement, element classification, and speaker identification by jointly reasoning over visual layout and textual content. Extracted speakers are then linked to the Chamber of Deputies knowledge base through SPARQL queries and a multi-strategy fuzzy matching procedure. Evaluation against an established benchmark demonstrates substantial improvements both in transcription quality and speaker tagging.
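
The abstract does not spell out which strategies the fuzzy matching procedure combines, so the following is only a minimal sketch of what a multi-strategy name matcher could look like, not the authors' implementation. It assumes three strategies tried in order (exact match on a normalised name, unambiguous surname match, and string similarity above a threshold). The `KB` records, the `link_speaker` function, and the 0.8 threshold are all hypothetical; in the actual pipeline the candidate records would come from SPARQL queries against the Chamber of Deputies knowledge base.

```python
import unicodedata
from difflib import SequenceMatcher

# Hypothetical candidate records standing in for SPARQL query results.
KB = [
    {"id": "d1", "name": "Giovanni Giolitti"},
    {"id": "d2", "name": "Francesco Crispi"},
    {"id": "d3", "name": "Agostino Depretis"},
]


def normalize(name):
    """Strip accents and punctuation, lowercase, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    cleaned = "".join(c if c.isalnum() or c.isspace() else " " for c in no_accents)
    return " ".join(cleaned.lower().split())


def similarity(a, b):
    """Similarity ratio in [0, 1] between two normalised strings."""
    return SequenceMatcher(None, a, b).ratio()


def link_speaker(mention, kb, threshold=0.8):
    """Return (record_id, strategy) for the best match, or None if nothing fits.

    The strategy order and the threshold are illustrative assumptions.
    """
    target = normalize(mention)
    if not target or not kb:
        return None

    # Strategy 1: exact match on the normalised full name.
    for record in kb:
        if normalize(record["name"]) == target:
            return record["id"], "exact"

    # Strategy 2: surname match, accepted only when exactly one record shares it.
    surname = target.split()[-1]
    hits = [r for r in kb if normalize(r["name"]).split()[-1] == surname]
    if len(hits) == 1:
        return hits[0]["id"], "surname"

    # Strategy 3: fuzzy similarity, which tolerates OCR-style character errors.
    best = max(kb, key=lambda r: similarity(target, normalize(r["name"])))
    if similarity(target, normalize(best["name"])) >= threshold:
        return best["id"], "fuzzy"

    return None
```

Under these assumptions, `link_speaker("GIOLITTI", KB)` resolves through the surname strategy, while a mention with a character-level error such as `"Agostino Depretls"` falls through to the fuzzy strategy. A role label such as `"Presidente"` matches nothing and returns `None`, leaving it to be handled by other means.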

Problem

Research questions and friction points this paper is trying to address.

parliamentary speeches

transcription errors

semantic annotation

historical documents

Italian

Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language Models

OCR refinement

semantic segmentation