Towards Natural Language-Based Document Image Retrieval: New Dataset and Benchmark

📅 2025-12-23

🤖 AI Summary

Existing document image retrieval methods rely primarily on image queries, which limits their usefulness when users supply fine-grained natural language queries. This paper introduces Natural Language-based Document Image Retrieval (NL-DIR), a task that requires semantic alignment between free-form text queries and document images. The authors build a dedicated benchmark of 41K real-world document images, each paired with five fine-grained queries that are generated and evaluated with large language models and then manually verified, and they propose corresponding evaluation metrics. Zero-shot and fine-tuned evaluation of contrastive vision-language models (e.g., CLIP) and OCR-free visual document understanding models reveals bottlenecks in fine-grained cross-modal alignment. A two-stage retrieval framework, combining OCR-free coarse retrieval with semantic re-ranking, is investigated to improve retrieval accuracy while remaining efficient in both time and space. The dataset and code are to be made publicly available to support future research.
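The coarse-retrieval-then-re-ranking pattern mentioned in the summary can be sketched in pure Python with toy vectors. This is an illustration of the general pattern only, not the authors' implementation: the cosine-based first stage, the MaxSim-style late-interaction re-ranker, and every function name, document ID, and vector below are assumptions made for the example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def coarse_retrieve(query_vec, gallery_vecs, top_k):
    """Stage 1 (assumed): rank the whole gallery by one global embedding per image."""
    scored = [(cosine(query_vec, v), doc_id) for doc_id, v in gallery_vecs.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:top_k]]

def maxsim(query_tokens, doc_patches):
    """Finer score (assumed): each query token takes its best-matching image patch."""
    return sum(max(cosine(q, p) for p in doc_patches) for q in query_tokens)

def rerank(query_tokens, candidates, gallery_patches):
    """Stage 2 (assumed): re-score only the short candidate list."""
    scored = [(maxsim(query_tokens, gallery_patches[d]), d) for d in candidates]
    scored.sort(reverse=True)
    return [d for _, d in scored]

def two_stage_search(query_vec, query_tokens, gallery_vecs, gallery_patches, top_k=2):
    candidates = coarse_retrieve(query_vec, gallery_vecs, top_k)
    return rerank(query_tokens, candidates, gallery_patches)

# Toy gallery: hypothetical global embeddings and per-patch embeddings.
gallery_vecs = {"doc_a": [1.0, 0.0], "doc_b": [0.8, 0.6], "doc_c": [0.0, 1.0]}
gallery_patches = {
    "doc_a": [[1.0, 0.0]],
    "doc_b": [[1.0, 0.0], [0.0, 1.0]],
    "doc_c": [[0.0, 1.0]],
}
query_vec = [1.0, 0.1]                   # global query embedding
query_tokens = [[1.0, 0.0], [0.0, 1.0]]  # per-token query embeddings

print(coarse_retrieve(query_vec, gallery_vecs, 2))  # ['doc_a', 'doc_b']
print(two_stage_search(query_vec, query_tokens, gallery_vecs, gallery_patches))  # ['doc_b', 'doc_a']
```

The cheap first stage touches every gallery item but stores only one vector per image, while the costlier second stage runs on just `top_k` candidates. That split is one plausible way to obtain the time and space savings the summary refers to.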

📝 Abstract

Document image retrieval (DIR) aims to retrieve document images from a gallery according to a given query. Existing DIR methods are primarily based on image queries that retrieve documents within the same coarse semantic category, e.g., newspapers or receipts. However, these methods struggle to effectively retrieve document images in real-world scenarios where textual queries with fine-grained semantics are usually provided. To bridge this gap, we introduce a new Natural Language-based Document Image Retrieval (NL-DIR) benchmark with corresponding evaluation metrics. In this work, natural language descriptions serve as semantically rich queries for the DIR task. The NL-DIR dataset contains 41K authentic document images, each paired with five high-quality, fine-grained semantic queries generated and evaluated through large language models in conjunction with manual verification. We perform zero-shot and fine-tuning evaluations of existing mainstream contrastive vision-language models and OCR-free visual document understanding (VDU) models. A two-stage retrieval method is further investigated for performance improvement while achieving both time and space efficiency. We hope the proposed NL-DIR benchmark can bring new opportunities and facilitate research for the VDU community. The dataset and code will be publicly available at huggingface.co/datasets/nianbing/NL-DIR.
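The abstract does not spell out its evaluation metrics. As a rough illustration of how text-to-image retrieval is commonly scored, the sketch below computes Recall@K and mean reciprocal rank (MRR), assuming each query has exactly one ground-truth document image. The metric choice, the function names, and the toy rankings are all assumptions and may differ from the metrics the paper proposes.

```python
def recall_at_k(ranked_ids, target_id, k):
    """1.0 if the ground-truth image appears in the top-k results, else 0.0."""
    return 1.0 if target_id in ranked_ids[:k] else 0.0

def reciprocal_rank(ranked_ids, target_id):
    """1 / rank of the ground-truth image, or 0.0 if it was not retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == target_id:
            return 1.0 / rank
    return 0.0

def evaluate(results, k):
    """results: list of (ranked_ids, target_id) pairs, one pair per text query."""
    n = len(results)
    recall = sum(recall_at_k(r, t, k) for r, t in results) / n
    mrr = sum(reciprocal_rank(r, t) for r, t in results) / n
    return recall, mrr

# Four hypothetical queries that all target the same image "d1".
results = [
    (["d1", "d2", "d3"], "d1"),  # hit at rank 1
    (["d2", "d1", "d3"], "d1"),  # hit at rank 2
    (["d3", "d2", "d1"], "d1"),  # hit at rank 3
    (["d2", "d3", "d4"], "d1"),  # miss
]
recall_1, mrr = evaluate(results, k=1)
print(recall_1)        # 0.25
print(round(mrr, 4))   # 0.4583
```

Because each image in the dataset is paired with five queries, a per-image average over its five query results could be computed the same way by grouping `results` by target ID.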

Problem

Research questions and friction points this paper is trying to address.

Develops a natural language-based benchmark for document image retrieval.

Addresses the difficulty that existing image-query methods have with fine-grained natural language queries.

Evaluates models on a dataset of 41K document images, each paired with five natural language queries.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces natural language queries for document image retrieval.

Creates a dataset of 41K document images with five fine-grained semantic queries per image.

Evaluates contrastive vision-language models and OCR-free visual document understanding models.

🔎 Similar Papers

ColPali: Efficient Document Retrieval with Vision Language Models