CzechTopic: A Benchmark for Zero-Shot Topic Localization in Historical Czech Documents

📅 2026-03-04

🤖 AI Summary

This work addresses zero-shot topic localization in historical Czech documents: identifying the text spans that express a given topic, where each topic is defined by a name and a description. The authors construct a human-annotated benchmark with human-defined topics and manually annotated spans, supporting evaluation at both the document and word levels, and they measure model performance relative to inter-annotator agreement rather than against a single reference annotation. Experiments compare a range of large language models (LLMs) with BERT-based models fine-tuned on a distilled development dataset. LLM performance varies substantially, from near-human topic detection to pronounced failures in span localization. The strongest models approach human agreement, and the smaller distilled models remain competitive. The dataset and evaluation framework are publicly released.

📝 Abstract

Topic localization aims to identify spans of text that express a given topic defined by a name and description. To study this task, we introduce a human-annotated benchmark based on Czech historical documents, containing human-defined topics together with manually annotated spans and supporting evaluation at both document and word levels. Evaluation is performed relative to human agreement rather than a single reference annotation. We evaluate a diverse range of large language models alongside BERT-based models fine-tuned on a distilled development dataset. Results reveal substantial variability among LLMs, with performance ranging from near-human topic detection to pronounced failures in span localization. While the strongest models approach human agreement, the distilled token embedding models remain competitive despite their smaller scale. The dataset and evaluation framework are publicly available at: https://github.com/dcgm/czechtopic.
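To make the idea of evaluating "relative to human agreement" concrete, the following is a minimal sketch and not the benchmark's actual scoring code. It assumes that each annotation is a set of word indices marked as expressing the topic, that agreement is measured with word-level F1, and that a model is compared against the mean pairwise agreement among annotators. The function names (`word_f1`, `mean_pairwise_agreement`, `model_vs_humans`, `topic_present`) and the example data are hypothetical; the paper may define its metrics differently.

```python
from itertools import combinations


def word_f1(pred, ref):
    """Word-level F1 between two sets of word indices (assumed metric)."""
    pred, ref = set(pred), set(ref)
    if not pred and not ref:
        return 1.0  # both sides agree that the topic is absent
    overlap = len(pred & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def mean_pairwise_agreement(annotations):
    """Average F1 over all unordered pairs of human annotations."""
    pairs = list(combinations(annotations, 2))
    return sum(word_f1(a, b) for a, b in pairs) / len(pairs)


def model_vs_humans(pred, annotations):
    """Average F1 of a model prediction against each human annotation."""
    return sum(word_f1(pred, a) for a in annotations) / len(annotations)


def topic_present(indices):
    """Document-level view: the topic is detected if any word is marked."""
    return len(indices) > 0


# Hypothetical example: three annotators and one model on a single document.
humans = [{3, 4, 5, 6}, {4, 5, 6, 7}, {4, 5, 6}]
model = {4, 5, 6, 7, 8}

human_score = mean_pairwise_agreement(humans)
model_score = model_vs_humans(model, humans)
print(f"human agreement: {human_score:.3f}")
print(f"model vs humans: {model_score:.3f}")
print(f"relative to human agreement: {model_score / human_score:.3f}")
```

Under these assumptions, a ratio close to 1 would indicate that the model disagrees with the annotators about as much as they disagree with one another, which is the sense in which a model can "approach human agreement".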

Problem

Research questions and friction points this paper is trying to address.

topic localization

zero-shot

historical documents

Czech language

text span identification

Innovation

Methods, ideas, or system contributions that make the work stand out.

zero-shot topic localization

historical Czech documents

human-annotated benchmark