🤖 AI Summary
In the era of foundation models, large-scale image retrieval faces the challenge of learning hash representations that are simultaneously compact and discriminative. To address this, we propose CroVCA, a cross-view code alignment framework that replaces multi-objective optimization and complex pipelines with a single binary cross-entropy loss regularized by coding-rate maximization, thereby unifying binary code alignment and diversity control. We design HashCoder, a lightweight MLP hashing network incorporating batch normalization, and use LoRA-based fine-tuning for efficient encoder adaptation while keeping backbone features frozen. Evaluated on standard benchmarks, CroVCA achieves state-of-the-art performance within only five training epochs: for 16-bit hashing, it requires less than two minutes for unsupervised hashing on COCO and approximately three minutes for supervised hashing on ImageNet-100, significantly improving both training efficiency and retrieval accuracy.
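The summary above describes HashCoder as a lightweight MLP head with batch normalization that hashes frozen backbone features. A minimal sketch of such a head is below; the layer widths, activation, and embedding dimension are assumptions for illustration, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class HashCoder(nn.Module):
    """Sketch of a lightweight MLP hashing head (dimensions assumed).

    Maps frozen backbone embeddings to n_bits real-valued logits; the
    final BatchNorm1d zero-centers each bit across the batch, which
    encourages balanced binary codes.
    """
    def __init__(self, in_dim=768, hidden_dim=512, n_bits=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, n_bits),
            nn.BatchNorm1d(n_bits),  # balances bit usage across the batch
        )

    def forward(self, x):
        # Real-valued logits; sign() binarizes them at retrieval time.
        return self.net(x)

# Usage as a probing head on frozen embeddings (random stand-ins here):
coder = HashCoder(in_dim=768, n_bits=16)
z = coder(torch.randn(8, 768))   # logits, shape (8, 16)
codes = torch.sign(z)            # binary codes in {-1, +1}
```

When adapting the encoder itself, the same head would sit on top of a LoRA-fine-tuned backbone, with only the low-rank adapters and the head receiving gradients.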
📝 Abstract
Efficient large-scale retrieval requires representations that are both compact and discriminative. Foundation models provide powerful visual and multimodal embeddings, but nearest neighbor search in these high-dimensional spaces is computationally expensive. Hashing offers an efficient alternative by enabling fast Hamming distance search with binary codes, yet existing approaches often rely on complex pipelines, multi-term objectives, designs specialized for a single learning paradigm, and long training times. We introduce CroVCA (Cross-View Code Alignment), a simple and unified principle for learning binary codes that remain consistent across semantically aligned views. A single binary cross-entropy loss enforces alignment, while coding-rate maximization serves as an anti-collapse regularizer to promote balanced and diverse codes. To implement this, we design HashCoder, a lightweight MLP hashing network with a final batch normalization layer to enforce balanced codes. HashCoder can be used as a probing head on frozen embeddings or to adapt encoders efficiently via LoRA fine-tuning. Across benchmarks, CroVCA achieves state-of-the-art results in just 5 training epochs. At 16 bits, it is especially fast: for instance, unsupervised hashing on COCO completes in under 2 minutes and supervised hashing on ImageNet100 in about 3 minutes on a single GPU. These results highlight CroVCA's efficiency, adaptability, and broad applicability.
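The abstract pairs a single binary cross-entropy alignment loss with coding-rate maximization as an anti-collapse regularizer. A hedged sketch of what such an objective could look like is below; the exact CroVCA formulation is not given here, so the stop-gradient targets, the tanh relaxation, the coding-rate form (borrowed from the MCR² literature), and the weight `lam` are all assumptions:

```python
import torch
import torch.nn.functional as F

def crovca_style_loss(z1, z2, eps=0.5, lam=1.0):
    """Illustrative cross-view code alignment objective (assumed form).

    z1, z2: (n, d) hash logits for two semantically aligned views.
    BCE pulls each view's bit probabilities toward the other view's
    detached hard codes; a coding-rate term over the relaxed codes is
    maximized (hence subtracted) to keep the codes diverse.
    """
    # Hard {0, 1} targets from the opposite view, with stop-gradient.
    t1 = (torch.sign(z2).detach() + 1) / 2
    t2 = (torch.sign(z1).detach() + 1) / 2
    bce = (F.binary_cross_entropy_with_logits(z1, t1) +
           F.binary_cross_entropy_with_logits(z2, t2))

    def coding_rate(z):
        # (1/2) * logdet(I + d/(n * eps^2) * Z^T Z) over relaxed codes.
        z = torch.tanh(z)
        n, d = z.shape
        cov = z.T @ z * (d / (n * eps ** 2))
        return 0.5 * torch.logdet(torch.eye(d) + cov)

    rate = coding_rate(z1) + coding_rate(z2)
    return bce - lam * rate

# Usage with two noisy views of the same logits:
z1 = torch.randn(32, 16)
z2 = z1 + 0.1 * torch.randn(32, 16)
loss = crovca_style_loss(z1, z2)
```

Subtracting the coding-rate term means gradient descent on the total loss simultaneously aligns the two views' codes and spreads the codes out, which is the anti-collapse role the abstract assigns to coding-rate maximization.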