Is Semi-Automatic Transcription Useful in Corpus Creation? Preliminary Considerations on the KIParla Corpus

📅 2026-03-17
🤖 AI Summary
This study examines the trade-off between efficiency and accuracy when semi-automatic transcription is used to build a spoken language corpus. In a two-phase experiment, 11 expert and novice transcribers produced manual and ASR-assisted transcriptions of the same audio segments across three types of Italian conversation. The analysis combines word-level alignment, annotation-based metrics, and statistical modeling to monitor transcription behavior across annotators and workflows. Results show that ASR assistance can increase transcription speed but does not consistently improve accuracy; its effect depends on workflow configuration, conversation type, and annotator experience. The authors conclude that, despite limitations, ASR-assisted transcription, potentially supported by task-specific fine-tuning, could be integrated into the KIParla workflow to accelerate corpus creation without compromising transcription quality.

📝 Abstract
This paper analyses the implementation of Automatic Speech Recognition (ASR) into the transcription workflow of the KIParla corpus, a resource of spoken Italian. Through a two-phase experiment, 11 expert and novice transcribers produced both manual and ASR-assisted transcriptions of identical audio segments across three different types of conversation, which were subsequently analyzed through a combination of statistical modeling, word-level alignment and a series of annotation-based metrics. Results show that ASR-assisted workflows can increase transcription speed but do not consistently improve overall accuracy, with effects depending on multiple factors such as workflow configuration, conversation type and annotator experience. Analyses combining alignment-based metrics, descriptive statistics and statistical modeling provide a systematic framework to monitor transcription behavior across annotators and workflows. Despite limitations, ASR-assisted transcription, potentially supported by task-specific fine-tuning, could be integrated into the KIParla transcription workflow to accelerate corpus creation without compromising transcription quality.
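The abstract mentions word-level alignment between transcriptions but does not say how it was computed. As a rough illustration only, the following is a minimal sketch of one common approach: a Levenshtein alignment over whitespace-separated words, from which a word error rate (WER) can be derived. The function names `align_words` and `word_error_rate`, the whitespace tokenization, and the example sentences are assumptions made for this sketch, not the authors' method or data.

```python
def align_words(reference, hypothesis):
    """Align two transcriptions word by word (assumed: whitespace tokenization).

    Returns a list of (operation, reference_word, hypothesis_word) tuples,
    where operation is "match", "substitute", "delete", or "insert".
    """
    ref = reference.split()
    hyp = hypothesis.split()
    n, m = len(ref), len(hyp)

    # dist[i][j] = minimum number of edits turning ref[:i] into hyp[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j - 1] + cost,  # match or substitution
                dist[i - 1][j] + 1,         # deletion from the reference
                dist[i][j - 1] + 1,         # insertion into the hypothesis
            )

    # Walk back from the bottom-right cell to recover one optimal alignment.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                label = "match" if cost == 0 else "substitute"
                ops.append((label, ref[i - 1], hyp[j - 1]))
                i -= 1
                j -= 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("delete", ref[i - 1], None))
            i -= 1
        else:
            ops.append(("insert", None, hyp[j - 1]))
            j -= 1
    ops.reverse()
    return ops


def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length."""
    ops = align_words(reference, hypothesis)
    ref_len = sum(1 for op, _, _ in ops if op != "insert")
    if ref_len == 0:
        raise ValueError("reference transcription is empty")
    errors = sum(1 for op, _, _ in ops if op != "match")
    return errors / ref_len


if __name__ == "__main__":
    # Invented example sentences, not taken from the KIParla corpus.
    manual = "allora ci vediamo domani mattina"
    assisted = "allora vediamo domani mattina presto"
    for op in align_words(manual, assisted):
        print(op)
    print(word_error_rate(manual, assisted))  # 0.4 (one deletion, one insertion, five reference words)
```

A real pipeline for conversational data would likely need to normalize transcription conventions (overlaps, pauses, prosodic marks) before alignment; this sketch treats every whitespace-delimited token as a word.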
Problem

Research questions and friction points this paper is trying to address.

Automatic Speech Recognition
corpus creation
transcription workflow
spoken Italian
ASR-assisted transcription
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automatic Speech Recognition
corpus creation
transcription workflow
word-level alignment
statistical modeling
Martina Simonotti
University of Bologna, Bologna - Italy
Ludovica Pannitto
NLP Lab Manager
Computational Linguistics, Semantics, Distributional Semantics
Eleonora Zucchini
Masaryk University, Brno - Czech Republic
Silvia Ballarè
University of Bologna, Bologna - Italy
Caterina Mauri
University of Bologna
linguistic typology, cognitive linguistics, pragmatics, grammaticalization, categorization