🤖 AI Summary
This study examines Retrieval-Augmented Generation (RAG) for question answering over Taiwanese historical archives in Traditional Chinese, using two datasets: Fort Zeelandia and the Taiwan Provincial Council Gazette. Historical texts pose particular difficulties for RAG, including strong temporal dependency, multi-hop queries, and heavy reliance on metadata. To address these, we evaluate an early metadata fusion strategy that injects structured metadata (e.g., time, institutions, persons) into both query and document representations before retrieval. Experiments show improvements in retrieval accuracy (+12.3% MRR) and answer generation quality (+9.7% F1). Hallucination during generation and difficulty with temporal or multi-hop queries nevertheless remain persistent challenges, as do highly abstract or cross-temporal queries that require implicit logical inference. Our work offers an initial, metadata-driven RAG baseline for question answering over Chinese historical documents.
📝 Abstract
Retrieval-Augmented Generation (RAG) has emerged as a promising approach for knowledge-intensive tasks. However, few studies have examined RAG for Taiwanese Historical Archives. In this paper, we present an initial study of a RAG pipeline applied to two historical Traditional Chinese datasets, Fort Zeelandia and the Taiwan Provincial Council Gazette, along with their corresponding open-ended query sets. We systematically investigate the effects of query characteristics and metadata integration strategies on retrieval quality, answer generation, and the performance of the overall system. The results show that early-stage metadata integration enhances both retrieval and answer accuracy while also revealing persistent challenges for RAG systems, including hallucinations during generation and difficulties in handling temporal or multi-hop historical queries.
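To make the idea of early-stage metadata integration concrete, the following is a minimal sketch and not the pipeline used in the paper. It assumes that "early" means serializing metadata fields into the text of both the query and each document before they are scored. The bag-of-words cosine scorer stands in for a real embedding model, and the records, field names, and helper functions (`fuse`, `rank`) are hypothetical toy examples. The regex tokenizer splits on word boundaries, so it would need to be replaced by a proper segmenter or embedding model for Traditional Chinese text.

```python
import math
import re
from collections import Counter


def fuse(text, metadata):
    """Prepend metadata as 'key: value' text so it is scored together with the content."""
    prefix = " ".join(f"{key}: {value}" for key, value in sorted(metadata.items()))
    return f"{prefix} {text}".strip()


def tokenize(text):
    # Simplification: word-level tokens; not suitable for unsegmented Chinese.
    return re.findall(r"\w+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rank(query, docs, use_metadata):
    """Return document ids ordered by similarity to the query (ties broken by id)."""
    query_vec = Counter(tokenize(query))
    scored = []
    for doc in docs:
        text = fuse(doc["text"], doc["meta"]) if use_metadata else doc["text"]
        scored.append((cosine(query_vec, Counter(tokenize(text))), doc["id"]))
    return [doc_id for _, doc_id in sorted(scored, key=lambda s: (-s[0], s[1]))]


# Hypothetical toy records: identical body text, distinguishable only by metadata.
docs = [
    {"id": "a", "text": "The council discussed the irrigation budget.",
     "meta": {"year": "1952", "institution": "provincial council"}},
    {"id": "b", "text": "The council discussed the irrigation budget.",
     "meta": {"year": "1960", "institution": "provincial council"}},
]

plain_ranking = rank("irrigation budget", docs, use_metadata=False)
fused_query = fuse("irrigation budget", {"year": "1960"})
fused_ranking = rank(fused_query, docs, use_metadata=True)
```

In this toy setup the two records cannot be told apart from their body text alone, so the plain ranking falls back to the id tie-break. Once the year is fused into both the query and the documents, the record from the matching year ranks first, which illustrates why temporal queries benefit from metadata being visible at retrieval time.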