🤖 AI Summary
This work addresses the high latency and strong GPU dependency of existing visual document retrieval methods that rely on large-parameter multimodal encoders for processing plain-text queries. To overcome these limitations, the authors propose an asymmetric knowledge distillation framework that transfers the document indexing capability of a 2-billion-parameter vision-language teacher model to a lightweight, text-only student model based on DistilBERT with only 69 million parameters. This enables decoupling of offline indexing from online inference. By leveraging a pointwise cosine alignment distillation objective and machine translation–augmented cross-lingual data, the approach significantly enhances multilingual retrieval performance. The resulting model, NanoVDR-S-Multi, achieves 95.1% of the teacher’s performance across 22 ViDoRe datasets while reducing model size by 32×, cutting CPU query latency by 50×, and requiring less than 13 GPU hours for training.
📝 Abstract
Vision-Language Model (VLM) based retrievers have advanced visual document retrieval (VDR) to impressive quality. Yet they require the same multi-billion-parameter encoder for both document indexing and query encoding, incurring high latency and GPU dependence even for plain-text queries. We observe that this design is unnecessarily symmetric: documents are visually complex and demand strong visual understanding, whereas queries are just short text strings. NanoVDR exploits this query--document asymmetry by decoupling the two encoding paths: a frozen 2B VLM teacher indexes documents offline, while a distilled text-only student as small as 69M parameters encodes queries at inference. The key design choice is the distillation objective. Through systematic comparison of six objectives across three backbones and 22 ViDoRe benchmark datasets, we find that pointwise cosine alignment on query text consistently outperforms ranking-based and contrastive alternatives, while requiring only pre-cached teacher query embeddings and no document processing during training. Furthermore, we identify cross-lingual transfer as the primary performance bottleneck, and resolve it cheaply by augmenting training data with machine-translated queries. The resulting NanoVDR-S-Multi (DistilBERT, 69M) retains 95.1\% of teacher quality and outperforms DSE-Qwen2 (2B) on ViDoRe v2 and v3 with 32$\times$ fewer parameters and 50$\times$ lower CPU query latency, at a total training cost under 13 GPU-hours.
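The pointwise cosine-alignment objective described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes teacher query embeddings are pre-cached as vectors and computes the per-query loss as one minus the cosine similarity between student and teacher embeddings, averaged over the batch. No documents are touched during training, which is what keeps the distillation cheap.

```python
import numpy as np

def cosine_alignment_loss(student_emb, teacher_emb, eps=1e-8):
    """Pointwise cosine-alignment distillation loss (illustrative sketch).

    student_emb: (batch, dim) embeddings from the small text-only student.
    teacher_emb: (batch, dim) pre-cached embeddings from the frozen VLM teacher
                 for the same query texts.
    Returns the mean of 1 - cos(student, teacher) over the batch.
    """
    s = student_emb / (np.linalg.norm(student_emb, axis=-1, keepdims=True) + eps)
    t = teacher_emb / (np.linalg.norm(teacher_emb, axis=-1, keepdims=True) + eps)
    cos = np.sum(s * t, axis=-1)          # per-query cosine similarity
    return float(np.mean(1.0 - cos))      # 0 when directions match exactly

# Toy check: identical embeddings give (near-)zero loss,
# opposite directions give the maximum of 2.
rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))
print(cosine_alignment_loss(emb, emb))    # ~0.0
print(cosine_alignment_loss(emb, -emb))   # ~2.0
```

Because the loss depends only on embedding direction, the student is free to choose any norm, which matches retrieval pipelines that score with cosine similarity at query time.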