Class-Agnostic Region-of-Interest Matching in Document Images

📅 2025-06-26

🤖 AI Summary

Existing document analysis methods are constrained by predefined categories and fixed granularity, which limits flexible matching of user-specified regions. This paper introduces "Class-Agnostic Region-of-Interest Matching" (RoI-Matching), a new task that enables open-set, multi-granularity, visual-prompt-driven matching of regions across documents. To support systematic evaluation, the authors construct RoI-Matching-Bench, a benchmark with three difficulty levels, and propose macro and micro evaluation metrics. Methodologically, the proposed framework, RoI-Matcher, uses a siamese network to extract multi-level visual features from the reference and target documents, and cross-attention layers to integrate and align similar semantics between them. Experiments show that the method is effective on the proposed benchmark despite its simple procedure, and that it serves as a baseline for further research.

📝 Abstract

Document understanding and analysis have received considerable attention due to their widespread applications. However, existing document analysis solutions, such as document layout analysis and key information extraction, only work with fixed category definitions and granularities, and cannot support flexible, user-customized applications. This paper therefore defines a new task named "Class-Agnostic Region-of-Interest Matching" ("RoI-Matching" for short), which aims to match customized regions in a flexible, efficient, multi-granularity, and open-set manner. The model takes as input the reference document image with its visual prompt and the target document images, and outputs the corresponding bounding boxes in the target document images. To support this task, we construct a benchmark, RoI-Matching-Bench, which sets three levels of difficulty following real-world conditions, and propose macro and micro metrics for evaluation. We also propose a new framework, RoI-Matcher, which employs a siamese network to extract multi-level features in both the reference and target domains, and cross-attention layers to integrate and align similar semantics across the two domains. Experiments show that our method, despite its simple procedure, is effective on RoI-Matching-Bench and serves as a baseline for further research. The code is available at https://github.com/pd162/RoI-Matching.
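The abstract mentions macro and micro metrics without defining them. As a rough illustration only, the sketch below assumes the common convention: a macro score averages a per-document score, while a micro score pools counts over all documents before computing the score. The IoU threshold of 0.5, the greedy matching rule, the use of F1, and all function names are assumptions made for this example, not the paper's definitions.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def count_hits(pred_boxes, gt_boxes, thresh=0.5):
    """Greedy one-to-one matching (assumed rule); returns the true-positive count."""
    unmatched = list(gt_boxes)
    hits = 0
    for p in pred_boxes:
        best = max(unmatched, key=lambda g: iou(p, g), default=None)
        if best is not None and iou(p, best) >= thresh:
            unmatched.remove(best)
            hits += 1
    return hits


def f1(tp, n_pred, n_gt):
    """F1 from true positives, number of predictions, and number of ground truths."""
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gt if n_gt else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0


def macro_micro_f1(samples, thresh=0.5):
    """samples: list of (pred_boxes, gt_boxes) pairs, one per target document."""
    per_doc, tp_sum, pred_sum, gt_sum = [], 0, 0, 0
    for preds, gts in samples:
        tp = count_hits(preds, gts, thresh)
        per_doc.append(f1(tp, len(preds), len(gts)))  # macro: score each document
        tp_sum += tp                                  # micro: pool the raw counts
        pred_sum += len(preds)
        gt_sum += len(gts)
    macro = sum(per_doc) / len(per_doc) if per_doc else 0.0
    micro = f1(tp_sum, pred_sum, gt_sum)
    return macro, micro


# Toy data: document 1 is matched perfectly; document 2 finds 1 of 4 regions
# and adds one false positive.
samples = [
    ([(0, 0, 10, 10)], [(0, 0, 10, 10)]),
    ([(0, 0, 10, 10), (50, 50, 60, 60)],
     [(0, 0, 10, 10), (20, 20, 30, 30), (40, 40, 45, 45), (70, 70, 80, 80)]),
]
macro, micro = macro_micro_f1(samples)
print(round(macro, 3), round(micro, 3))  # 0.667 0.5
```

Under these assumptions the two numbers diverge because the macro score weights every document equally, whereas the micro score is dominated by documents that contain more regions.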

Problem

Research questions and friction points this paper is trying to address.

Matching user-customized regions in document images flexibly

Handling multi-granularity and open-set document analysis tasks

Aligning similar semantics across reference and target documents

Innovation

Methods, ideas, or system contributions that make the work stand out.

Class-agnostic flexible RoI matching

Siamese network for multi-level features

Cross-attention for semantic alignment
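The two architectural ideas listed above can be sketched in a few lines. This is a minimal pure-Python illustration, not the RoI-Matcher implementation: a single shared linear projection stands in for the siamese backbone, only one feature level is used, target tokens are assumed to act as queries over the reference (prompt) tokens, and no bounding boxes are predicted. The function names `linear`, `cross_attention`, and `siamese_match` are hypothetical.

```python
import math
import random


def linear(tokens, weight):
    """Project each token (list of floats) with a d_in x d_out weight matrix."""
    return [[sum(t[i] * weight[i][j] for i in range(len(t)))
             for j in range(len(weight[0]))] for t in tokens]


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all key tokens."""
    scale = math.sqrt(len(queries[0]))
    out, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        w = softmax(scores)
        weights.append(w)
        out.append([sum(wi * v[j] for wi, v in zip(w, values))
                    for j in range(len(values[0]))])
    return out, weights


def siamese_match(ref_tokens, tgt_tokens, shared_weight):
    # Siamese step: the SAME weights encode both inputs.
    ref_feat = linear(ref_tokens, shared_weight)
    tgt_feat = linear(tgt_tokens, shared_weight)
    # Cross-attention step (assumed direction): target tokens query the reference.
    return cross_attention(tgt_feat, ref_feat, ref_feat)


random.seed(0)
d_in, d_out = 4, 3
W = [[random.uniform(-1, 1) for _ in range(d_out)] for _ in range(d_in)]
ref = [[random.random() for _ in range(d_in)] for _ in range(2)]  # 2 prompt tokens
tgt = [[random.random() for _ in range(d_in)] for _ in range(5)]  # 5 target tokens
fused, attn = siamese_match(ref, tgt, W)
print(len(fused), len(fused[0]))  # 5 3
```

Each row of `attn` is a distribution over the reference tokens, so it can be read as how strongly a given target location resembles each part of the prompted region. A full model would repeat this at several feature levels and add a head that regresses boxes from the fused features.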
