A Preliminary Study of RAG for Taiwanese Historical Archives

📅 2025-11-04

🤖 AI Summary

This study examines Retrieval-Augmented Generation (RAG) for question answering over two Traditional Chinese historical archives from Taiwan: the *Diary of Fort Zeelandia* and the *Taiwan Provincial Council Gazette*. Historical texts pose particular difficulties for RAG, including strong temporal dependency, multi-hop queries, and heavy reliance on metadata. The study evaluates an early metadata fusion strategy that injects structured metadata (e.g., time, institutions, persons) into both query and document representations before retrieval. Experiments show improvements in retrieval accuracy (+12.3% MRR) and answer generation quality (+9.7% F1). Hallucination during generation remains a persistent problem, as do temporal, multi-hop, and highly abstract queries that require implicit logical inference. The work offers an initial, metadata-driven RAG baseline for question answering over Chinese-language historical documents.
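To make the idea of early metadata fusion concrete, here is a minimal sketch, not the authors' implementation. It assumes metadata is fused by prepending `key:value` tags to the text of both query and document before scoring, and it substitutes a toy bag-of-words cosine similarity for a real dense encoder. The function names (`fuse`, `embed`, `retrieve`) and the two records are hypothetical and are not taken from either archive.

```python
from collections import Counter
from math import sqrt


def fuse(text, metadata):
    """Early fusion (assumed form): prepend metadata as plain-text tags."""
    tags = " ".join(f"{key}:{value}" for key, value in sorted(metadata.items()))
    return f"{tags} {text}".strip()


def embed(text):
    """Toy bag-of-words vector; a real pipeline would use a dense encoder."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, query_meta, docs, k=1):
    """Rank documents by similarity between fused query and fused document text."""
    q = embed(fuse(query, query_meta))
    scored = [(cosine(q, embed(fuse(d["text"], d["meta"]))), d["id"]) for d in docs]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)[:k]]


# Made-up toy records: identical body text, distinguishable only by metadata.
docs = [
    {"id": "d1", "text": "the council discussed the rice budget",
     "meta": {"year": "1962", "body": "council"}},
    {"id": "d2", "text": "the council discussed the rice budget",
     "meta": {"year": "1951", "body": "council"}},
]

top = retrieve("what did the council say about the rice budget", {"year": "1962"}, docs)
```

In this toy setup the two documents tie on body text alone, so only the shared `year:1962` tag separates them. That is the effect early fusion is meant to have on time-dependent queries.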

📝 Abstract

Retrieval-Augmented Generation (RAG) has emerged as a promising approach for knowledge-intensive tasks. However, few studies have examined RAG for Taiwanese historical archives. In this paper, we present an initial study of a RAG pipeline applied to two historical Traditional Chinese datasets, Fort Zeelandia and the Taiwan Provincial Council Gazette, along with their corresponding open-ended query sets. We systematically investigate the effects of query characteristics and metadata integration strategies on retrieval quality, answer generation, and overall system performance. The results show that early-stage metadata integration enhances both retrieval and answer accuracy, while also revealing persistent challenges for RAG systems, including hallucinations during generation and difficulties in handling temporal or multi-hop historical queries.
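Retrieval quality of this kind is commonly summarized with mean reciprocal rank (MRR), the metric quoted in the summary above. The sketch below shows how MRR is computed, assuming one gold document per query. The function name and the toy rankings are illustrative only and do not reproduce the paper's evaluation code or data.

```python
def mean_reciprocal_rank(ranked_lists, gold_ids):
    """Average of 1/rank of the gold document per query (0 if it is not retrieved)."""
    total = 0.0
    for ranking, gold in zip(ranked_lists, gold_ids):
        if gold in ranking:
            total += 1.0 / (ranking.index(gold) + 1)  # ranks are 1-based
    return total / len(gold_ids) if gold_ids else 0.0


# Made-up rankings for three queries.
rankings = [["d3", "d1", "d2"], ["d2", "d1"], ["d5", "d4"]]
gold = ["d1", "d2", "d9"]

# Gold found at rank 2, at rank 1, and not at all: (1/2 + 1 + 0) / 3 = 0.5
score = mean_reciprocal_rank(rankings, gold)
```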

Problem

Research questions and friction points this paper is trying to address.

Investigating RAG for Taiwanese historical archives retrieval

Analyzing metadata integration effects on retrieval accuracy

Characterizing hallucinations in answer generation for historical queries

Innovation

Methods, ideas, or system contributions that make the work stand out.

RAG pipeline for Taiwanese historical archives

Early-stage metadata integration improves retrieval quality and answer accuracy

Identifies persistent hallucination and temporal or multi-hop query challenges

🔎 Similar Papers

Chronicling Germany: An Annotated Historical Newspaper Dataset