Surface-Form Neural Sparse Retrieval: Robust Fuzzy Matching for Industrial Music Search

📅 2026-05-17

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses robust music retrieval in industrial settings, where user queries often deviate from indexed metadata because of misspellings, transpositions, or phonetic variations. Traditional n-gram matching suffers from poor semantic robustness and high noise, which limits what the system can learn from long-tail queries. The authors propose a neural sparse retrieval system that requires no query-side inference: it combines fine-grained subword tokenization (tokens of at most 3 characters) with neural sparse training, so that the model learns surface-form robustness rather than lexical memorization. Neural embeddings and term expansions are precomputed offline during indexing, so the online phase needs only lightweight tokenization and IDF weighting, which keeps fuzzy matching within millisecond-level latency constraints. On a 6M-document production corpus, the method achieves 91.4% aggregate recall@10, compared with 57.7% for a trigram baseline, at comparable throughput and with effectively zero query-encoding overhead. A simulation of the High Confidence Index (HCI) feedback loop shows 0.8% higher stabilized recall than production trigrams, and domain-specific pretraining is reported as a cost-effective alternative to large-scale general-purpose pretraining.

📝 Abstract

Music search at the scale of Amazon Music presents a unique challenge: queries frequently deviate from indexed metadata due to misspellings, transpositions, and phonetic variations, yet the retrieval system must operate under strict millisecond-level latency constraints. Our existing learning-to-retrieve system, the High Confidence Index (HCI), learns query-entity associations from customer behavior, relying on continual "exploration" to choose candidates. Traditional n-gram matching enables this exploration but suffers from poor semantic robustness and high noise, limiting the system's ability to learn from long-tail queries. In this work, we present a robust neural sparse retrieval system designed to maximize exploration efficiency. We adapt a state-of-the-art inference-free sparse retrieval architecture to the music domain, combining it with an effective domain-specific granular subword tokenization strategy. Our approach utilizes short-length token constraints (max 3 chars) to enforce the learning of surface-form robustness over lexical memorization. By pre-computing the neural embeddings and term expansions during the offline indexing phase, online processing is reduced to minimal tokenization and IDF weighting, achieving effectively zero latency overhead for query encoding. Evaluations on a 6M-document production corpus show an aggregate 91.4% recall@10 (vs. 57.7% for trigrams) at comparable throughput. Simulation of the HCI feedback loop demonstrates improved exploration efficiency, with +0.8% higher stabilized recall than production trigrams. Ablation studies indicate that our sparse training methodology drives the performance gains, while domain-specific pretraining provides a cost-effective alternative to large-scale general-purpose pretraining.
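
The split between offline indexing and online scoring described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the tokenizer is a hypothetical stand-in (overlapping character pieces of at most 3 characters), and the document-side weights are simple placeholders where the real system would store weights and expansion terms produced offline by the trained sparse encoder. All function names (`tokenize`, `build_index`, `search`) are assumptions made for illustration. The only point shown is that the query side needs nothing beyond tokenization and an IDF lookup.

```python
import math
from collections import Counter

MAX_TOKEN_LEN = 3  # mirrors the abstract's "max 3 chars" token constraint


def tokenize(text, n=MAX_TOKEN_LEN):
    """Hypothetical stand-in tokenizer: overlapping pieces of at most n characters."""
    tokens = []
    for word in text.lower().split():
        if len(word) <= n:
            tokens.append(word)
        else:
            tokens.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return tokens


def build_index(docs):
    """Offline phase: precompute sparse document vectors, IDF, and an inverted index."""
    doc_vectors = [Counter(tokenize(d)) for d in docs]
    doc_freq = Counter(t for vec in doc_vectors for t in vec)
    n_docs = len(docs)
    idf = {t: math.log(1 + n_docs / df) for t, df in doc_freq.items()}
    inverted = {}
    for doc_id, vec in enumerate(doc_vectors):
        for token, tf in vec.items():
            # Placeholder weight (assumption): a trained sparse encoder would
            # supply this value and could also add expansion terms here.
            inverted.setdefault(token, []).append((doc_id, 1.0 + math.log(tf)))
    return inverted, idf


def search(query, inverted, idf, k=10):
    """Online phase: tokenization plus IDF weighting only, no model inference."""
    scores = Counter()
    for token in set(tokenize(query)):
        weight = idf.get(token)
        if weight is None:  # token never seen at indexing time
            continue
        for doc_id, doc_weight in inverted[token]:
            scores[doc_id] += weight * doc_weight
    return scores.most_common(k)


docs = ["bohemian rhapsody", "hotel california", "smells like teen spirit"]
inverted, idf = build_index(docs)
# A misspelled query still shares several short tokens with the intended title.
print(search("bohemain rapsody", inverted, idf))
```

With placeholder weights this sketch behaves much like a character n-gram matcher; in the approach described above, the difference would come from the learned document-side weights and expansions, which are not reproduced here.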

Problem

Research questions and friction points this paper is trying to address.

neural sparse retrieval

fuzzy matching

music search

query robustness

long-tail queries

Innovation

Methods, ideas, or system contributions that make the work stand out.

neural sparse retrieval

inference-free

surface-form robustness