🤖 AI Summary
This work addresses a limitation of current scientific document retrieval methods: they rely predominantly on document-as-image representations and struggle to use critical evidence distributed across structured content such as text, tables, and figures. The authors introduce ArXivDoc, a benchmark constructed from the LaTeX sources of scientific papers, which enables controlled query generation grounded in specific evidence types and a systematic comparison of text-only, image-based, and multimodal representations for retrieval. The experiments show that text-based representations are the most effective, even for figure-based queries; that interleaved text+image representations outperform document-as-image approaches without requiring specialized training; and that document-as-image representations are consistently suboptimal, especially as document length increases. The study thus argues that the image-centric paradigm is poorly suited to text-rich scientific documents.
📝 Abstract
Many recent document embedding models are trained on document-as-image representations, embedding rendered pages as images rather than the underlying source. Meanwhile, existing benchmarks for scientific document retrieval, such as ArXivQA and ViDoRe, treat documents as images of pages, implicitly favoring such representations. In this work, we argue that this paradigm is not well-suited for text-rich multimodal scientific documents, where critical evidence is distributed across structured sources, including text, tables, and figures. To study this setting, we introduce ArXivDoc, a new benchmark constructed from the underlying LaTeX sources of scientific papers. Unlike PDF or image-based representations, LaTeX provides direct access to structured elements (e.g., sections, tables, figures, equations), enabling controlled query construction grounded in specific evidence types. We systematically compare text-only, image-based, and multimodal representations across both single-vector and multi-vector retrieval models. Our results show that: (1) document-as-image representations are consistently suboptimal, especially as document length increases; (2) text-based representations are most effective, even for figure-based queries, by leveraging captions and surrounding context; and (3) interleaved text+image representations outperform document-as-image approaches without requiring specialized training.
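To make the point about LaTeX exposing structured elements concrete, here is a minimal sketch of pulling tables, figures, and equations out of a LaTeX source so that a query could be tied to one evidence type. It is an illustration only, not the benchmark's actual pipeline: the function name `extract_elements`, the regular expressions, and the toy sample are all assumptions, and regex matching will miss nested or custom environments that a real LaTeX parser would handle.

```python
import re
from collections import defaultdict

# Hypothetical patterns: the environment names are illustrative choices,
# not the extraction rules used to build ArXivDoc.
ENV_PATTERN = re.compile(
    r"\\begin\{(table|figure|equation)\*?\}(.*?)\\end\{\1\*?\}",
    re.DOTALL,
)
# Matches captions without nested braces only.
CAPTION_PATTERN = re.compile(r"\\caption\{([^{}]*)\}")


def extract_elements(latex_source):
    """Group the bodies of table, figure, and equation environments by type."""
    elements = defaultdict(list)
    for match in ENV_PATTERN.finditer(latex_source):
        env_type, body = match.group(1), match.group(2).strip()
        caption = CAPTION_PATTERN.search(body)
        elements[env_type].append(
            {"body": body, "caption": caption.group(1) if caption else None}
        )
    return dict(elements)


# Toy input, invented for this example.
sample = r"""
\section{Results}
\begin{table}
\caption{A toy results table.}
\begin{tabular}{lc} Row & Value \\ \end{tabular}
\end{table}
\begin{figure}
\includegraphics{plot.pdf}
\caption{A toy plot.}
\end{figure}
\begin{equation}
s(q, d) = q^\top d
\end{equation}
"""

elements = extract_elements(sample)
for env_type, items in elements.items():
    print(env_type, [item["caption"] for item in items])
```

Each extracted element keeps its caption alongside its body, which mirrors the abstract's observation that text-based retrieval can answer figure-based queries through captions and surrounding context.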