🤖 AI Summary
In visual financial document retrieval, single-vector aggregation is prone to being dominated by global texture patterns, thereby obscuring critical numerical discrepancies and erroneously assigning high similarity to semantically distinct documents. This work introduces the first diagnostic benchmark for financial documents to systematically evaluate the prevalence of this issue across diverse visual encoders, embedding strategies, and aggregation methods. Experimental results demonstrate that patch-level representations effectively capture subtle semantic variations, whereas single-vector aggregation poses significant risks. The study exposes the fragility of current vision-based retrieval-augmented generation (RAG) pipelines and provides empirical evidence and actionable directions for improving document-level visual retrieval in financial contexts.
📝 Abstract
Visual RAG offers an alternative to traditional RAG: it treats documents as images and uses vision encoders to obtain patch tokens. However, the hundreds of patch tokens per document create retrieval and storage challenges in a vector database, so practical deployment requires aggregating them into a single vector. This raises a critical question: does single-vector aggregation lose key information in financial documents? We develop a diagnostic benchmark using financial documents, where a change in a single digit can cause a significant semantic shift. Our experiments show that single-vector aggregation collapses different documents into nearly identical vectors. Patch-level metrics detect these semantic changes and confirm that aggregation obscures them. We identify global texture dominance as the root cause. Our findings are consistent across model scales, retrieval-optimized embeddings, and multiple mitigation strategies, highlighting significant risks for single-vector visual document retrieval in financial applications.
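The collapse described in the abstract can be illustrated with a small synthetic sketch. It does not use the paper's benchmark, encoders, or metrics. It assumes that every patch embedding shares a strong common component (standing in for "global texture"), that aggregation is plain mean pooling, and that a single-digit edit changes exactly one patch. All names (`texture`, `mean_pool`, `min_patch_similarity`) are hypothetical.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def mean_pool(patches):
    """Aggregate (n_patches, dim) patch embeddings into one vector (assumed: mean pooling)."""
    return patches.mean(axis=0)


def min_patch_similarity(p, q):
    """Lowest cosine similarity over position-aligned patch pairs."""
    num = np.sum(p * q, axis=1)
    den = np.linalg.norm(p, axis=1) * np.linalg.norm(q, axis=1)
    return float(np.min(num / den))


rng = np.random.default_rng(0)
n_patches, dim = 256, 64

# Assumption: each patch = shared "global texture" component + patch-specific content.
texture = rng.normal(size=dim)
doc_a = texture + rng.normal(size=(n_patches, dim))

# Document B is identical to A except for one patch (e.g. a single changed digit).
doc_b = doc_a.copy()
doc_b[100] = texture + rng.normal(size=dim)

pooled_sim = cosine(mean_pool(doc_a), mean_pool(doc_b))
patch_sim = min_patch_similarity(doc_a, doc_b)

print(f"single-vector similarity: {pooled_sim:.6f}")
print(f"lowest patch similarity:  {patch_sim:.6f}")
```

Under these assumptions the pooled vectors are almost indistinguishable, because the one changed patch contributes only 1/256 of the mean, while the aligned patch-level comparison exposes the edit clearly. Real encoders and aggregation schemes will behave differently in detail; the sketch only shows why averaging many texture-dominated patches can hide a localized change.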