Chunk Knowledge Generation Model for Enhanced Information Retrieval: A Multi-task Learning Approach

📅 2025-09-19
🤖 AI Summary
Query expansion, the traditional remedy for lexical mismatch between queries and documents in information retrieval, is context-sensitive and can degrade performance, while document expansion approaches such as Doc2Query incur high preprocessing costs, larger indexes, and unreliable generated content. This paper proposes a more structured alternative: documents are divided into chunks, and a T5-based multi-task model with a single encoder and two decoders generates a title and candidate questions for each chunk while extracting keywords from user queries. The generated data is used as additional information in the retrieval system. In a GPT-based evaluation on 305 query-document pairs, retrieval with the proposed model reached 95.41% accuracy at Top@10, outperforming retrieval over document chunks alone.

📝 Abstract
Traditional query expansion techniques for addressing vocabulary mismatch problems in information retrieval are context-sensitive and may lead to performance degradation. As an alternative, document expansion research has gained attention, but existing methods such as Doc2Query have limitations including excessive preprocessing costs, increased index size, and reliability issues with generated content. To mitigate these problems and seek more structured and efficient alternatives, this study proposes a method that divides documents into chunk units and generates textual data for each chunk to simultaneously improve retrieval efficiency and accuracy. The proposed "Chunk Knowledge Generation Model" adopts a T5-based multi-task learning structure that simultaneously generates titles and candidate questions from each document chunk while extracting keywords from user queries. This approach maximizes computational efficiency by generating and extracting three types of semantic information in parallel through a single encoding and two decoding processes. The generated data is utilized as additional information in the retrieval system. GPT-based evaluation on 305 query-document pairs showed that retrieval using the proposed model achieved 95.41% accuracy at Top@10, demonstrating superior performance compared to document chunk-level retrieval. This study contributes by proposing an approach that simultaneously generates titles and candidate questions from document chunks for application in retrieval pipelines, and provides empirical evidence applicable to large-scale information retrieval systems by demonstrating improved retrieval accuracy through qualitative evaluation.
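
The Top@10 figure in the abstract can be read as a hit ratio over the 305 evaluation pairs. The sketch below is a minimal illustration under that assumption, not the paper's evaluation code: the abstract only says the evaluation was GPT-based, so the function name `top_k_accuracy` and the idea of a single known relevant chunk id per query are hypothetical. If the metric is a plain hit ratio, 291 hits out of 305 is the count that rounds to 95.41%.

```python
def top_k_accuracy(ranked_ids_per_query, relevant_id_per_query, k=10):
    """Fraction of queries whose relevant id appears among the top-k results.

    Assumption: each query has exactly one relevant id. The paper's
    GPT-based judging procedure is not described in the abstract.
    """
    if len(ranked_ids_per_query) != len(relevant_id_per_query):
        raise ValueError("need exactly one relevant id per ranked list")
    if not ranked_ids_per_query:
        raise ValueError("at least one query is required")
    hits = sum(
        1
        for ranked, relevant in zip(ranked_ids_per_query, relevant_id_per_query)
        if relevant in ranked[:k]
    )
    return hits / len(relevant_id_per_query)


# Synthetic data only: 305 queries, of which the first 291 find their target.
rankings = [["target", "other"]] * 291 + [["other", "another"]] * 14
targets = ["target"] * 305
accuracy_percent = round(100 * top_k_accuracy(rankings, targets, k=10), 2)
```
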
Problem

Research questions and friction points this paper is trying to address.

Addresses vocabulary mismatch in information retrieval systems
Overcomes document expansion limitations like preprocessing costs
Generates structured semantic data from document chunks
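
A toy example may help show why generated titles and questions can bridge a vocabulary gap. This is a sketch under stated assumptions: the token-overlap scorer stands in for a real retriever, the titles and questions are hand-written stand-ins for model output, and concatenating them with the chunk text is only one possible way to use them as additional information. None of the names below come from the paper.

```python
import re


def tokens(text):
    """Lowercased word set; a deliberately crude tokenizer."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(query, text):
    """Number of distinct query words that also occur in the text."""
    return len(tokens(query) & tokens(text))


def enrich(chunk):
    """Assumed enrichment: title and questions prepended to the chunk text."""
    return "\n".join([chunk["title"], *chunk["questions"], chunk["text"]])


def rank(query, chunks, use_metadata):
    """Chunk ids ordered by descending overlap with the query."""
    def score(chunk):
        text = enrich(chunk) if use_metadata else chunk["text"]
        return overlap_score(query, text)
    return [c["id"] for c in sorted(chunks, key=score, reverse=True)]


# Hand-written metadata standing in for generated titles and questions.
chunks = [
    {
        "id": "leave",
        "text": "Employees may carry unused vacation days into the next calendar year.",
        "title": "Annual leave rollover policy",
        "questions": ["Can unused annual leave roll over?"],
    },
    {
        "id": "training",
        "text": "The policy requires annual security training for all staff.",
        "title": "Security training requirements",
        "questions": ["How often is security training required?"],
    },
]

query = "annual leave rollover policy"
plain_ranking = rank(query, chunks, use_metadata=False)
enriched_ranking = rank(query, chunks, use_metadata=True)
```

On the raw text the relevant chunk shares no words with the query and ranks below the distractor; with the metadata attached it ranks first.
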
Innovation

Methods, ideas, or system contributions that make the work stand out.

Chunk-based document division for retrieval
T5 multi-task learning for parallel generation
Single encoding with dual decoding efficiency
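
The control flow behind "single encoding with dual decoding" can be sketched without any model weights. The code below is a structural illustration only: the encoder and both decoders are toy string functions rather than T5 components, and all names are hypothetical. It shows the chunk being encoded once and the result being shared by a title head and a question head. How keyword extraction from user queries maps onto these processes is not spelled out in the abstract, so it is left out.

```python
from typing import Callable, Dict, List

Encoding = List[str]  # toy stand-in for encoder hidden states


class SharedEncoderModel:
    """Runs the encoder once per input and feeds the result to every decoder."""

    def __init__(
        self,
        encoder: Callable[[str], Encoding],
        decoders: Dict[str, Callable[[Encoding], str]],
    ) -> None:
        self.encoder = encoder
        self.decoders = decoders
        self.encoder_calls = 0  # counts encoder passes for illustration

    def generate(self, chunk_text: str) -> Dict[str, str]:
        encoding = self.encoder(chunk_text)  # the single encoding pass
        self.encoder_calls += 1
        # Each task-specific decoder reads the same shared encoding.
        return {task: decode(encoding) for task, decode in self.decoders.items()}


def toy_encoder(text: str) -> Encoding:
    return text.lower().rstrip(".").split()


def toy_title_decoder(encoding: Encoding) -> str:
    return " ".join(encoding[:3]).capitalize()


def toy_question_decoder(encoding: Encoding) -> str:
    return "What about " + " ".join(encoding[:3]) + "?"


model = SharedEncoderModel(
    toy_encoder,
    {"title": toy_title_decoder, "question": toy_question_decoder},
)
outputs = model.generate("Unused vacation days carry over.")
```

The efficiency argument is that the encoder pass is paid once per chunk however many outputs are decoded from it, which the `encoder_calls` counter makes visible.
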
Authors

Jisu Kim
HANCOM / Seongnam, South Korea
Jinhee Park
HANCOM / Seongnam, South Korea
Changhyun Jeon
HANCOM / Seongnam, South Korea
Jungwoo Choi
HANCOM / Seongnam, South Korea
Keonwoo Kim
NAVER Cloud
Natural Language Processing
Minji Hong
HANCOM / Seongnam, South Korea
Sehyun Kim
HANCOM / Seongnam, South Korea