CzechTopic: A Benchmark for Zero-Shot Topic Localization in Historical Czech Documents

📅 2026-03-04
🤖 AI Summary
This work addresses zero-shot topic localization in historical Czech documents: given a topic defined by a name and description, identify the text spans that express it. To support the task, the authors build a manually annotated benchmark that allows evaluation at both the document and word levels, and they score systems relative to inter-annotator agreement rather than against a single reference annotation. Experiments compare large language models (LLMs) with BERT-based models fine-tuned on a distilled development dataset. The strongest models approach human agreement, but LLM performance varies widely, from near-human topic detection to pronounced failures in span localization, while the smaller BERT-based models remain competitive. The dataset and evaluation framework are publicly released.

📝 Abstract
Topic localization aims to identify spans of text that express a given topic defined by a name and description. To study this task, we introduce a human-annotated benchmark based on Czech historical documents, containing human-defined topics together with manually annotated spans and supporting evaluation at both document and word levels. Evaluation is performed relative to human agreement rather than a single reference annotation. We evaluate a diverse range of large language models alongside BERT-based models fine-tuned on a distilled development dataset. Results reveal substantial variability among LLMs, with performance ranging from near-human topic detection to pronounced failures in span localization. While the strongest models approach human agreement, the distilled token embedding models remain competitive despite their smaller scale. The dataset and evaluation framework are publicly available at: https://github.com/dcgm/czechtopic.
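
The abstract does not spell out the metric, so the following is only a minimal sketch of what evaluation "relative to human agreement" could look like at the word and document levels. It assumes each annotation is a set of word indices marked as expressing the topic, uses word-level F1 as the overlap measure, and compares the model's average score against the average score between human annotators. All function names here are hypothetical and are not taken from the released evaluation framework.

```python
from itertools import combinations


def word_f1(pred: set[int], ref: set[int]) -> float:
    """F1 over word indices marked as expressing the topic (assumed measure)."""
    if not pred and not ref:
        return 1.0  # both annotations agree that the topic is absent
    overlap = len(pred & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def human_agreement(annotations: list[set[int]]) -> float:
    """Mean pairwise word-level F1 between human annotators."""
    pairs = list(combinations(annotations, 2))
    return sum(word_f1(a, b) for a, b in pairs) / len(pairs)


def model_score(pred: set[int], annotations: list[set[int]]) -> float:
    """Mean word-level F1 of the model against each human annotator."""
    return sum(word_f1(pred, ref) for ref in annotations) / len(annotations)


def topic_detected(annotation: set[int]) -> bool:
    """Document-level view: does the annotation mark the topic at all?"""
    return len(annotation) > 0


# Toy example: two annotators and one model prediction for a single document.
humans = [{3, 4, 5, 6}, {4, 5, 6, 7}]
model = {4, 5, 6}

agreement = human_agreement(humans)      # human-to-human baseline
score = model_score(model, humans)       # model-to-human score
relative = score / agreement             # values near 1.0 mean near-human agreement
same_detection = all(topic_detected(model) == topic_detected(h) for h in humans)
```

Dividing by the human baseline is one simple way to read a model score against annotator disagreement; the benchmark's own framework may normalize or aggregate differently.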

Problem

Research questions and friction points this paper is trying to address.

topic localization
zero-shot
historical documents
Czech language
text span identification

Innovation

Methods, ideas, or system contributions that make the work stand out.

zero-shot topic localization
historical Czech documents
human-annotated benchmark
span-level evaluation
distilled token embedding
Martin Kostelník
Faculty of Information Technology, Brno University of Technology, Brno, Czech Republic
Michal Hradiš
Brno University of Technology
Computer Vision, Pattern Recognition
Martin Dočekal
Faculty of Information Technology, Brno University of Technology, Brno, Czech Republic