🤖 AI Summary
This work addresses the performance degradation of contrastive learning–based image-text retrieval methods, such as CLIP, on compositional or fine-grained textual queries. The authors present NeedleDB, a system that uses generative AI to synthesize guide images representing the text query in the visual domain, thereby reformulating text-to-image retrieval as image-to-image search. Nearest-neighbor results from multiple vision embedders are combined with a weighted rank-fusion strategy grounded in a Monte Carlo estimator with provable error bounds. On challenging benchmarks, the approach improves mean average precision by up to 93% over the strongest baseline while maintaining sub-second query latency. The system is built as a modular microservice architecture backed by PostgreSQL and Milvus, and ships with a command-line interface (needlectl) and a browser-based Web UI.
📝 Abstract
We demonstrate NeedleDB, an open-source, deployment-ready database system for answering complex natural language queries over image data. Unlike existing approaches that rely on contrastive-learning embeddings (e.g., CLIP), which degrade on compositional or nuanced queries, NeedleDB leverages generative AI to synthesize guide images that represent the query in the visual domain, transforming the text-to-image retrieval problem into a more tractable image-to-image search. The system aggregates nearest-neighbor results across multiple vision embedders using a weighted rank-fusion strategy grounded in a Monte Carlo estimator with provable error bounds. NeedleDB ships with a full-featured command-line interface (needlectl), a browser-based Web UI, and a modular microservice architecture backed by PostgreSQL and Milvus. On challenging benchmarks, it improves Mean Average Precision by up to 93% over the strongest baseline while maintaining sub-second query latency. In our demonstration, attendees interact with NeedleDB through three hands-on scenarios that showcase its retrieval capabilities, data ingestion workflow, and pipeline configurability.
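To make the idea of weighted rank fusion across several vision embedders more concrete, here is a minimal Python sketch. It is not NeedleDB's Monte Carlo estimator, which the abstract does not specify; instead it substitutes a generic weighted reciprocal-rank fusion. The function name `weighted_rank_fusion`, the smoothing constant `k`, the embedder names, and the weights are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def weighted_rank_fusion(rankings, weights, k=60, top_n=10):
    """Fuse per-embedder ranked lists of image IDs into a single ranking.

    rankings: dict mapping embedder name -> list of image IDs, best first
    weights:  dict mapping embedder name -> non-negative weight (assumed given)
    k:        smoothing constant of reciprocal-rank fusion (hypothetical default)
    """
    scores = defaultdict(float)
    for name, ranked_ids in rankings.items():
        w = weights.get(name, 0.0)
        for rank, image_id in enumerate(ranked_ids, start=1):
            # Each embedder contributes more for images it ranks higher.
            scores[image_id] += w / (k + rank)
    # Sort by fused score (descending); break ties by ID for determinism.
    fused = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [image_id for image_id, _ in fused[:top_n]]


# Illustrative nearest-neighbor results for one guide image (made-up IDs).
rankings = {
    "embedder_a": ["img3", "img1", "img2"],
    "embedder_b": ["img1", "img3", "img4"],
}
weights = {"embedder_a": 0.4, "embedder_b": 0.6}

fused = weighted_rank_fusion(rankings, weights)
print(fused)
```

In this sketch the weights are fixed inputs; in a system like the one described, they would presumably come from an estimate of each embedder's reliability, which is where the Monte Carlo estimator mentioned in the abstract would enter.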