🤖 AI Summary
This study aims to support librarians in subject indexing by proposing a two-stage method for assigning subject tags. First, a bi-encoder efficiently retrieves a coarse set of candidate tags from a large, structured subject taxonomy. Second, a cross-encoder re-ranks the candidates at a finer semantic granularity. The approach formulates subject indexing as a cascaded information retrieval task, combining pretrained language models with the hierarchical structure of subject taxonomies to balance retrieval efficiency against recall of long-tail tags. Evaluated on SemEval-2025 Task 5, the method achieves significantly higher recall than single-stage baselines and is competitive in the qualitative evaluation. The results indicate that it is effective and practical for domain-specific indexing tasks.
📝 Abstract
We present our submission to Task 5 of SemEval-2025, which aims to help librarians assign subject tags to library records by producing a list of likely relevant tags for a given document. We frame the task as an information retrieval problem in which the document content is used to retrieve subject tags from a large subject taxonomy. We use two types of encoder models to build a two-stage retrieval system: a bi-encoder for coarse-grained candidate extraction in the first stage, and a cross-encoder for fine-grained re-ranking in the second. This approach proved effective, yielding significant improvements in recall over single-stage methods and competitive results in the qualitative evaluation.
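The retrieve-then-rerank cascade described above can be illustrated with a minimal sketch. It is not the submitted system: the function names (`embed`, `joint_score`, `retrieve_then_rerank`), the tag list, and the cut-offs are hypothetical, and two toy scorers stand in for the pretrained encoders. A bag-of-words cosine similarity plays the role of the bi-encoder, because it embeds document and tag independently so that tag vectors can be precomputed. A token-overlap score plays the role of the cross-encoder, because it scores each (document, tag) pair jointly and is therefore applied only to the short candidate list.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a bi-encoder (assumption): a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def joint_score(document, tag):
    """Toy stand-in for a cross-encoder (assumption): scores the pair jointly
    as the fraction of tag tokens that occur in the document."""
    doc_tokens = set(document.lower().split())
    tag_tokens = tag.lower().split()
    return sum(t in doc_tokens for t in tag_tokens) / len(tag_tokens)


def retrieve_then_rerank(document, tags, tag_vectors, k_candidates=3, k_final=2):
    """Two-stage cascade: cheap retrieval over all tags, then pairwise re-ranking."""
    doc_vec = embed(document)
    # Stage 1: rank every tag by similarity to the document, keep the top candidates.
    candidates = sorted(
        tags, key=lambda t: cosine(doc_vec, tag_vectors[t]), reverse=True
    )[:k_candidates]
    # Stage 2: re-score only the candidates with the joint (pairwise) scorer.
    return sorted(
        candidates, key=lambda t: joint_score(document, t), reverse=True
    )[:k_final]


# Hypothetical miniature "taxonomy"; tag vectors are computed once, offline.
TAGS = [
    "machine learning",
    "information retrieval",
    "library science",
    "organic chemistry",
    "medieval history",
]
TAG_VECTORS = {tag: embed(tag) for tag in TAGS}

doc = "a retrieval system that helps library staff find information in large catalogues"
result = retrieve_then_rerank(doc, TAGS, TAG_VECTORS)
print(result)
```

The split reflects the usual cost trade-off in such cascades: the first stage needs one document embedding plus a similarity lookup per tag, while the second stage runs a more expensive pairwise scorer on only `k_candidates` pairs.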