A Preliminary Study of RAG for Taiwanese Historical Archives

📅 2025-11-04
🤖 AI Summary
This study investigates the effectiveness of Retrieval-Augmented Generation (RAG) for question answering over two Traditional Chinese historical archives from Taiwan: the *Diary of Fort Zeelandia* and the *Taiwan Provincial Council Gazette*. Historical texts pose particular challenges, including strong temporal dependency, multi-hop queries, and heavy reliance on metadata. To address them, the authors propose an early metadata fusion strategy: structured metadata (e.g., time, institutions, persons) is injected into both query and document representations before retrieval. Experiments show improvements in retrieval accuracy (+12.3% MRR) and answer generation quality (+9.7% F1). Persistent challenges remain, including hallucination during generation and difficulty with temporal, multi-hop, or highly abstract queries that require implicit logical inference. The work offers an initial, metadata-driven RAG baseline for question answering over Chinese historical documents.
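The early metadata fusion idea described above can be illustrated with a minimal sketch. The summary does not specify how the metadata is serialized or which encoder is used, so everything here is an assumption: the `key: value` prefix format, the toy bag-of-words `embed` function (a stand-in for a real dense encoder), and the two made-up records.

```python
import math
from collections import Counter

def fuse(text, metadata):
    """Prepend metadata as plain 'key: value' tokens (hypothetical format)."""
    prefix = " ".join(f"{k}: {v}" for k, v in sorted(metadata.items()))
    return f"{prefix} {text}".strip()

def embed(text):
    """Toy bag-of-words vector; a real pipeline would use a dense encoder."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, query_meta, docs, k=1):
    """Rank documents after fusing metadata into BOTH query and documents."""
    q_vec = embed(fuse(query, query_meta))
    scored = [
        (cosine(q_vec, embed(fuse(d["text"], d["meta"]))), d["id"])
        for d in docs
    ]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)[:k]]

# Two illustrative records with identical text that differ only in metadata.
docs = [
    {"id": "a", "text": "budget debate on irrigation",
     "meta": {"year": "1952", "source": "gazette"}},
    {"id": "b", "text": "budget debate on irrigation",
     "meta": {"year": "1961", "source": "gazette"}},
]

top = retrieve("irrigation budget debate", {"year": "1952"}, docs)
print(top)
```

Because the two records share the same body text, only the fused `year` token separates them, which is the kind of temporal disambiguation that motivates fusing metadata before retrieval rather than filtering afterwards.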

📝 Abstract
Retrieval-Augmented Generation (RAG) has emerged as a promising approach for knowledge-intensive tasks. However, few studies have examined RAG for Taiwanese Historical Archives. In this paper, we present an initial study of a RAG pipeline applied to two historical Traditional Chinese datasets, Fort Zeelandia and the Taiwan Provincial Council Gazette, along with their corresponding open-ended query sets. We systematically investigate the effects of query characteristics and metadata integration strategies on retrieval quality, answer generation, and the performance of the overall system. The results show that early-stage metadata integration enhances both retrieval and answer accuracy while also revealing persistent challenges for RAG systems, including hallucinations during generation and difficulties in handling temporal or multi-hop historical queries.
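The summary above reports retrieval gains in terms of MRR (mean reciprocal rank). As a reminder of what that metric measures, here is a minimal sketch. It assumes exactly one relevant document per query, which may not match the paper's evaluation setup, and the document IDs are invented for illustration.

```python
def mean_reciprocal_rank(ranked_lists, relevant_ids):
    """Average of 1/rank of the relevant document over all queries.

    A query whose relevant document is missing from its ranking
    contributes 0 to the average.
    """
    if not ranked_lists:
        return 0.0
    total = 0.0
    for ranking, relevant in zip(ranked_lists, relevant_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            if doc_id == relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

# Three hypothetical queries: hit at rank 1, hit at rank 3, and a miss.
rankings = [["d1", "d2"], ["d3", "d4", "d5"], ["d6"]]
relevant = ["d1", "d5", "d9"]
mrr = mean_reciprocal_rank(rankings, relevant)
print(round(mrr, 4))
```

Here the three queries contribute 1, 1/3, and 0, so the score is their average.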
Problem

Research questions and friction points this paper is trying to address.

Investigating RAG for Taiwanese historical archives retrieval
Analyzing metadata integration effects on retrieval accuracy
Examining hallucinations in answer generation for historical queries
Innovation

Methods, ideas, or system contributions that make the work stand out.

RAG pipeline for Taiwanese historical archives
Early-stage metadata integration improves retrieval and answer accuracy
Identifies persistent hallucination and temporal or multi-hop query challenges
Claire Lin
Department of Information Management, National Taiwan University
Bo-Han Feng
Department of Computer Science and Information Engineering, National Taiwan University
Xuan-Bo Chen
Graduate Institute of Communication Engineering, National Taiwan University
Te-Lun Yang
Graduate Institute of Networking and Multimedia, National Taiwan University
Hung-yi Lee
National Taiwan University
deep learning, spoken language understanding, speech processing
Jyh-Shing Roger Jang
Professor, Department of Computer Science and Information Engineering, National Taiwan University
Machine Learning, Music Analysis & Retrieval, Speech Recognition & Scoring, Healthcare and Medical Analytics, FinTech