ASK: Adaptive Self-improving Knowledge Framework for Audio Text Retrieval

📅 2025-12-11
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing audio-text retrieval (ATR) methods suffer from the gradient locality bottleneck (GLB) induced by small-batch contrastive learning, which limits their capacity to model fine-grained and long-tail semantics beyond the batch. While external knowledge enhancement alleviates the GLB, it introduces representation-drift mismatch (RDM): a misalignment between static knowledge bases and the model's dynamically evolving representations. To address both challenges jointly, this paper proposes the Adaptive Self-improving Knowledge (ASK) framework, the first to systematically decouple and co-resolve GLB and RDM. ASK integrates three core innovations: (1) multi-granularity knowledge injection, (2) dynamic graph neural network-driven knowledge refinement, and (3) an adaptive reliability-weighting mechanism under cross-modal embedding alignment. Evaluated on the AudioCaps and Clotho benchmarks, ASK achieves state-of-the-art performance, with significant gains in retrieval accuracy for fine-grained and long-tail samples.

๐Ÿ“ Abstract
The dominant paradigm for Audio-Text Retrieval (ATR) relies on mini-batch-based contrastive learning. This process, however, is inherently limited by what we formalize as the Gradient Locality Bottleneck (GLB), which structurally prevents models from leveraging out-of-batch knowledge and thus impairs fine-grained and long-tail learning. While external knowledge-enhanced methods can alleviate the GLB, we identify a critical, unaddressed side effect: the Representation-Drift Mismatch (RDM), where a static knowledge base becomes progressively misaligned with the evolving model, turning guidance into noise. To address this dual challenge, we propose the Adaptive Self-improving Knowledge (ASK) framework, a model-agnostic, plug-and-play solution. ASK breaks the GLB via multi-grained knowledge injection, systematically mitigates RDM through dynamic knowledge refinement, and introduces a novel adaptive reliability-weighting scheme that ensures consistent knowledge contributes to optimization. Experimental results on two benchmark datasets demonstrate superior, state-of-the-art performance, confirming the efficacy of the proposed ASK framework.
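As a concrete illustration of the locality the abstract describes (not the paper's code), a standard symmetric InfoNCE objective over one mini-batch can be sketched as follows; the function name `info_nce`, the batch size, and the temperature `tau` are illustrative assumptions. Every term in the loss depends only on the B items sampled into the batch, so each gradient step sees no out-of-batch negatives.

```python
import numpy as np

def info_nce(audio_emb, text_emb, tau=0.07):
    """Symmetric InfoNCE over one mini-batch of paired embeddings.

    Note: every logit involves only the B in-batch items, so gradients
    never touch out-of-batch samples -- the locality the paper
    formalizes as the Gradient Locality Bottleneck (GLB).
    """
    # L2-normalize both modalities so logits are scaled cosine similarities
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = a @ t.T / tau                 # (B, B); matched pairs on diagonal
    idx = np.arange(len(a))

    # audio -> text direction: log-softmax over each row
    log_p_a2t = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # text -> audio direction: log-softmax over each column (rows of logits.T)
    log_p_t2a = logits.T - np.log(np.exp(logits.T).sum(axis=1, keepdims=True))

    return -0.5 * (log_p_a2t[idx, idx].mean() + log_p_t2a[idx, idx].mean())

rng = np.random.default_rng(0)
B, d = 8, 32
loss = info_nce(rng.normal(size=(B, d)), rng.normal(size=(B, d)))
```

For random, unaligned embeddings the loss sits near log(B); knowledge-injection methods like ASK aim to supply the out-of-batch signal this per-batch objective cannot.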
Problem

Research questions and friction points this paper is trying to address.

Addresses Gradient Locality Bottleneck limiting out-of-batch knowledge use
Mitigates Representation-Drift Mismatch from static knowledge base misalignment
Proposes adaptive framework for dynamic knowledge refinement in retrieval
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-grained knowledge injection breaks gradient locality bottleneck
Dynamic knowledge refinement mitigates representation-drift mismatch
Adaptive reliability weighting ensures consistent knowledge contributes to optimization
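The reliability-weighting idea in the last bullet can be sketched under assumptions: down-weight cached knowledge entries whose stored embeddings no longer agree with the current model's embeddings of the same items (i.e., entries affected by representation drift). The function `reliability_weights` and its softmax-over-cosine form are hypothetical stand-ins, not the paper's actual formulation.

```python
import numpy as np

def reliability_weights(cached_emb, current_emb, tau=0.1):
    """Hypothetical reliability weights for cached knowledge entries.

    Entries whose cached embedding still agrees with the model's current
    embedding (high cosine similarity) receive larger weight; drifted
    entries are softly suppressed, so stale knowledge contributes less
    to the optimization signal.
    """
    c = cached_emb / np.linalg.norm(cached_emb, axis=1, keepdims=True)
    m = current_emb / np.linalg.norm(current_emb, axis=1, keepdims=True)
    agreement = (c * m).sum(axis=1)        # per-entry cosine similarity
    w = np.exp(agreement / tau)            # sharpen with temperature tau
    return w / w.sum()                     # normalized mixture weights

rng = np.random.default_rng(1)
current = rng.normal(size=(3, 8))
cached = current.copy()
cached[2] = -current[2]                    # simulate a fully drifted entry
w = reliability_weights(cached, current)
```

Here the drifted third entry receives a near-zero weight while the two fresh entries split the mass, which is the qualitative behavior the bullet describes.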
Authors

Siyuan Fu, University of Electronic Science and Technology of China
Xuchen Guo, University of Electronic Science and Technology of China
Mingjun Liu, University of Electronic Science and Technology of China
Hongxiang Li, Hong Kong University of Science and Technology
Boyin Tan, The Chinese University of Hong Kong, Shenzhen
Gongxi Zhu, Tsinghua University
Xianwei Zhuang, Peking University
Jinghan Ru, Peking University
Yuxin Xie, Peking University
Yuguo Yin, Peking University