HAC: Parameter-Efficient Hyperbolic Adaptation of CLIP for Zero-Shot VQA

📅 2026-04-26
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing hyperbolic CLIP variants are trained from scratch, which is computationally expensive. This work proposes HAC (Hyperbolic Adaptation of CLIP), a parameter-efficient framework that moves a pretrained CLIP model into hyperbolic space through lightweight fine-tuning rather than full retraining. HAC is applied to visual question answering (VQA) and is trained on a dataset with no overlap with any VQA benchmark, so evaluation is strictly zero-shot. Across General, Reasoning, and OCR benchmarks, HAC-S and HAC-B consistently outperform Euclidean baselines and prior hyperbolic approaches, with HAC-B improving on CLIP-B by up to 1.9 points on average on reasoning-intensive tasks.

📝 Abstract
Recent advances in representation learning have shown that hyperbolic geometry can offer a more expressive alternative to the Euclidean embeddings used in CLIP models, capturing hierarchical structures and leading to better-organized representations. However, current hyperbolic CLIP variants are trained entirely from scratch, which is computationally expensive and resource-intensive. In this work, we propose HAC (Hyperbolic Adaptation of CLIP), a parameter-efficient framework that enables pretrained CLIP models to transition into hyperbolic space via lightweight fine-tuning. We apply HAC to Visual Question Answering (VQA), where models must interpret visual elements and align them with textual queries. Notably, HAC's training is performed on a dataset with no overlap with any VQA benchmark, resulting in a strict zero-shot evaluation paradigm that underscores HAC's task-agnostic adaptability. We evaluate HAC across a diverse suite of VQA benchmarks spanning General, Reasoning, and OCR categories. Both HAC-S (small) and HAC-B (medium) consistently surpass Euclidean baselines and prior hyperbolic approaches, with HAC-B delivering up to a +1.9 point average improvement over CLIP-B on reasoning-intensive tasks. Our code is available at https://github.com/fdibiton/HAC
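
To make the idea of moving Euclidean embeddings into hyperbolic space concrete, here is a minimal sketch. It assumes the Poincaré ball model with curvature -c; the abstract does not say which hyperbolic model HAC uses, so the model choice, the function names (`expmap0`, `poincare_distance`, `pick_answer`), and the toy vectors are all illustrative assumptions and not the authors' implementation. The sketch maps vectors into the ball with the exponential map at the origin and picks the candidate answer whose embedding is closest to the image embedding in geodesic distance.

```python
import math

def expmap0(v, c=1.0):
    """Exponential map at the origin of the Poincare ball (curvature -c).

    Maps a Euclidean (tangent) vector to a point strictly inside the ball.
    """
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return list(v)
    sqrt_c = math.sqrt(c)
    scale = math.tanh(sqrt_c * norm) / (sqrt_c * norm)
    return [scale * x for x in v]

def poincare_distance(x, y, c=1.0):
    """Geodesic distance between two points inside the Poincare ball."""
    sq_diff = sum((a - b) ** 2 for a, b in zip(x, y))
    x2 = sum(a * a for a in x)
    y2 = sum(b * b for b in y)
    arg = 1.0 + 2.0 * c * sq_diff / ((1.0 - c * x2) * (1.0 - c * y2))
    # Guard against rounding that could push the argument just below 1.
    return math.acosh(max(1.0, arg)) / math.sqrt(c)

def pick_answer(image_vec, answer_vecs, c=1.0):
    """Return the index of the candidate answer closest to the image.

    Hypothetical zero-shot selection rule: both inputs are Euclidean
    vectors (stand-ins for CLIP features) that are first mapped into
    the ball, then compared by hyperbolic distance.
    """
    img = expmap0(image_vec, c)
    dists = [poincare_distance(img, expmap0(a, c), c) for a in answer_vecs]
    return min(range(len(dists)), key=dists.__getitem__)

# Toy 2-D vectors standing in for image and candidate-answer features.
image = [0.3, 0.4]
answers = [[-0.5, 0.1], [0.25, 0.45], [0.0, -0.6]]
best = pick_answer(image, answers)
```

In this convention the distance from the origin to `expmap0(v)` equals twice the Euclidean norm of `v` when `c = 1`, which gives a quick sanity check on the two functions. A real adaptation would learn how features are projected before the map; that part is outside what the abstract describes.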
Problem

Research questions and friction points this paper is trying to address.

Hyperbolic Geometry
CLIP
Zero-Shot VQA
Parameter-Efficient Adaptation
Visual Question Answering
Innovation

Methods, ideas, or system contributions that make the work stand out.

hyperbolic geometry
parameter-efficient adaptation
zero-shot VQA
CLIP
representation learning