ASK: Adaptive Self-improving Knowledge Framework for Audio Text Retrieval

📅 2025-12-11
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing audio-text retrieval (ATR) methods suffer from the gradient locality bottleneck (GLB) induced by small-batch contrastive learning, which limits their capacity to model fine-grained and long-tail semantics beyond the batch. While external knowledge enhancement alleviates the GLB, it introduces representation-drift mismatch (RDM): a misalignment between static knowledge bases and the model's dynamically evolving representations. To address both challenges jointly, this paper proposes the Adaptive Self-improving Knowledge (ASK) framework, the first to systematically decouple and co-resolve GLB and RDM. ASK integrates three core innovations: (1) multi-granularity knowledge injection, (2) dynamic graph neural network-driven knowledge refinement, and (3) an adaptive reliability-weighting mechanism under cross-modal embedding alignment. Evaluated on the AudioCaps and Clotho benchmarks, ASK achieves state-of-the-art performance, with significant gains in retrieval accuracy for fine-grained and long-tail samples.

๐Ÿ“ Abstract
The dominant paradigm for Audio-Text Retrieval (ATR) relies on mini-batch-based contrastive learning. This process, however, is inherently limited by what we formalize as the Gradient Locality Bottleneck (GLB), which structurally prevents models from leveraging out-of-batch knowledge and thus impairs fine-grained and long-tail learning. While external knowledge-enhanced methods can alleviate the GLB, we identify a critical, unaddressed side effect: the Representation-Drift Mismatch (RDM), where a static knowledge base becomes progressively misaligned with the evolving model, turning guidance into noise. To address this dual challenge, we propose the Adaptive Self-improving Knowledge (ASK) framework, a model-agnostic, plug-and-play solution. ASK breaks the GLB via multi-grained knowledge injection, systematically mitigates RDM through dynamic knowledge refinement, and introduces a novel adaptive reliability-weighting scheme that ensures consistent knowledge contributes to optimization. Experimental results on two benchmark datasets demonstrate superior, state-of-the-art performance, confirming the efficacy of the proposed ASK framework.
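As a concrete illustration of the locality the abstract describes (not the paper's code), a standard symmetric InfoNCE objective over one mini-batch can be sketched as follows; the function name `info_nce`, the batch size, and the temperature `tau` are illustrative assumptions. Every term in the loss depends only on the B items sampled into the batch, so each gradient step sees no out-of-batch negatives.

```python
import numpy as np

def info_nce(audio_emb, text_emb, tau=0.07):
    """Symmetric InfoNCE over one mini-batch of paired embeddings.

    Note: every logit involves only the B in-batch items, so gradients
    never touch out-of-batch samples -- the locality the paper
    formalizes as the Gradient Locality Bottleneck (GLB).
    """
    # L2-normalize both modalities so logits are scaled cosine similarities
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = a @ t.T / tau                 # (B, B); matched pairs on diagonal
    idx = np.arange(len(a))

    # audio -> text direction: log-softmax over each row
    log_p_a2t = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # text -> audio direction: log-softmax over each column (rows of logits.T)
    log_p_t2a = logits.T - np.log(np.exp(logits.T).sum(axis=1, keepdims=True))

    return -0.5 * (log_p_a2t[idx, idx].mean() + log_p_t2a[idx, idx].mean())

rng = np.random.default_rng(0)
B, d = 8, 32
loss = info_nce(rng.normal(size=(B, d)), rng.normal(size=(B, d)))
```

For random, unaligned embeddings the loss sits near log(B); knowledge-injection methods like ASK aim to supply the out-of-batch signal this per-batch objective cannot.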
Problem

Research questions and friction points this paper is trying to address.

Addresses Gradient Locality Bottleneck limiting out-of-batch knowledge use
Mitigates Representation-Drift Mismatch from static knowledge base misalignment
Proposes adaptive framework for dynamic knowledge refinement in retrieval
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-grained knowledge injection breaks gradient locality bottleneck
Dynamic knowledge refinement mitigates representation-drift mismatch
Adaptive reliability weighting ensures consistent knowledge contributes to optimization
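The reliability-weighting idea in the last bullet can be sketched under assumptions: down-weight cached knowledge entries whose stored embeddings no longer agree with the current model's embeddings of the same items (i.e., entries affected by representation drift). The function `reliability_weights` and its softmax-over-cosine form are hypothetical stand-ins, not the paper's actual formulation.

```python
import numpy as np

def reliability_weights(cached_emb, current_emb, tau=0.1):
    """Hypothetical reliability weights for cached knowledge entries.

    Entries whose cached embedding still agrees with the model's current
    embedding (high cosine similarity) receive larger weight; drifted
    entries are softly suppressed, so stale knowledge contributes less
    to the optimization signal.
    """
    c = cached_emb / np.linalg.norm(cached_emb, axis=1, keepdims=True)
    m = current_emb / np.linalg.norm(current_emb, axis=1, keepdims=True)
    agreement = (c * m).sum(axis=1)        # per-entry cosine similarity
    w = np.exp(agreement / tau)            # sharpen with temperature tau
    return w / w.sum()                     # normalized mixture weights

rng = np.random.default_rng(1)
current = rng.normal(size=(3, 8))
cached = current.copy()
cached[2] = -current[2]                    # simulate a fully drifted entry
w = reliability_weights(cached, current)
```

Here the drifted third entry receives a near-zero weight while the two fresh entries split the mass, which is the qualitative behavior the bullet describes.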
Authors

Siyuan Fu, University of Electronic Science and Technology of China
Xuchen Guo, University of Electronic Science and Technology of China
Mingjun Liu, University of Electronic Science and Technology of China
Hongxiang Li, Hong Kong University of Science and Technology
Boyin Tan, The Chinese University of Hong Kong, Shenzhen
Gongxi Zhu, Tsinghua University
Xianwei Zhuang, Peking University
Jinghan Ru, Peking University
Yuxin Xie, Peking University
Yuguo Yin, Peking University