🤖 AI Summary
This study addresses the low quality and poor real-time performance of Estonian-language intralingual television subtitles. We propose an end-to-end subtitle generation framework that integrates a fine-tuned Whisper model, iterative semi-supervised pseudo-labeling for data expansion, and test-time post-editing with large language models (LLMs). Crucially, we show empirically that applying LLMs only at inference yields substantial accuracy gains, whereas incorporating them into training confers no additional benefit. Evaluated on authentic TV corpora, the system achieves word error rate (WER) and readability metrics approaching human-level performance and meets practical deployment requirements. We also release the first high-quality, speech-text aligned Estonian dataset for public research use. Our results confirm that combining pseudo-labeling with LLM-based post-editing effectively closes the quality gap between automatic and manual subtitles, establishing a scalable path to real-time intralingual subtitling in low-resource languages.
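The division of labor described above (ASR at training time, LLM only at inference) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the prompt wording, function names, and the stand-in `toy_llm` are all assumptions, and a real deployment would call a fine-tuned Whisper model and an actual LLM API in their place.

```python
# Sketch of test-time LLM post-editing. The ASR model (fine-tuned Whisper in
# the paper) produces a raw transcript; an LLM is then asked to correct it at
# inference time, without the LLM ever entering the training loop.

def build_postedit_prompt(transcript: str) -> str:
    """Wrap the raw ASR output in a correction instruction for the LLM."""
    return (
        "Correct the spelling, grammar, and punctuation of this Estonian "
        "subtitle without changing its meaning:\n" + transcript
    )

def postedit(transcript: str, llm) -> str:
    """Apply the LLM only at inference time; ASR training is untouched."""
    return llm(build_postedit_prompt(transcript))

# Stand-in for a real LLM call, so the sketch runs without API access:
# it just capitalizes the line and ensures terminal punctuation.
def toy_llm(prompt: str) -> str:
    raw = prompt.rsplit("\n", 1)[-1]
    return raw.capitalize().rstrip(".") + "."

print(postedit("tere tulemast saatesse", toy_llm))  # Tere tulemast saatesse.
```

Keeping the LLM out of training means the ASR model can be updated or swapped independently of the post-editor, which matches the paper's finding that LLM involvement in training adds nothing.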
📝 Abstract
This paper presents an approach for generating high-quality, same-language subtitles for Estonian TV content. We fine-tune the Whisper model on human-generated Estonian subtitles and enhance it with iterative pseudo-labeling and large language model (LLM) based post-editing. Our experiments demonstrate a notable improvement in subtitle quality from pseudo-labeling with an unlabeled dataset. We find that applying LLM-based editing at test time enhances subtitle accuracy, while its use during training does not yield further gains. This approach holds promise for producing subtitles whose quality approaches the human standard and could be extended to real-time applications.
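The iterative pseudo-labeling step can be sketched as a loop: the current model transcribes unlabeled audio, confident hypotheses are added to the training set, and the model is fine-tuned again. All names, the confidence threshold, and the round count below are illustrative assumptions, not the paper's exact recipe; `transcribe` and `fine_tune` stand in for real Whisper inference and fine-tuning.

```python
# Minimal sketch of iterative semi-supervised pseudo-labeling.
def pseudo_label_rounds(labeled, unlabeled, transcribe, fine_tune,
                        rounds=3, min_conf=0.9):
    train_set = list(labeled)
    model = fine_tune(None, train_set)           # initial fine-tuning
    remaining = list(unlabeled)
    for _ in range(rounds):
        kept, still = [], []
        for clip in remaining:
            text, conf = transcribe(model, clip)
            if conf >= min_conf:                 # keep only confident labels
                kept.append((clip, text))
            else:
                still.append(clip)               # retry in a later round
        remaining = still
        train_set.extend(kept)
        model = fine_tune(model, train_set)      # retrain on expanded set
    return model, train_set

# Toy stand-ins so the loop runs without audio or GPUs.
def toy_transcribe(model, clip):
    conf = 0.95 if clip.endswith("clean") else 0.5
    return f"transcript-of-{clip}", conf

def toy_fine_tune(model, train_set):
    return {"n_examples": len(train_set)}

model, data = pseudo_label_rounds(
    [("a", "ref-a")], ["b-clean", "c-noisy"], toy_transcribe, toy_fine_tune)
```

In this toy run, only the "clean" clip clears the threshold, so the training set grows from one to two examples; the noisy clip stays unlabeled, mirroring how confidence filtering protects the training set from low-quality pseudo-labels.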