Nes2Net: A Lightweight Nested Architecture for Foundation Model Driven Speech Anti-spoofing

📅 2025-04-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address the mismatch between the high-dimensional representations produced by speech foundation models and lightweight downstream detectors, this paper proposes Nested Res2Net (Nes2Net), a back-end architecture that needs no dimensionality-reduction (DR) layer. The nested structure strengthens multi-scale feature extraction and feature interaction while passing high-dimensional features through directly, so the detector avoids the parameter overhead, computational cost, and potential information loss of a dedicated DR layer. On CtrSVDD, a singing voice deepfake detection dataset, Nes2Net improves detection performance by 22% and cuts back-end computational cost by 87% relative to the state-of-the-art baseline. It also shows strong robustness and generalization on four further benchmarks: ASVspoof 2021, ASVspoof 5, PartialSpoof, and In-the-Wild.

📝 Abstract

Speech foundation models have significantly advanced various speech-related tasks by providing exceptional representation capabilities. However, their high-dimensional output features often create a mismatch with downstream task models, which typically require lower-dimensional inputs. A common solution is to apply a dimensionality reduction (DR) layer, but this approach increases parameter overhead and computational cost, and it risks losing valuable information. To address these issues, we propose Nested Res2Net (Nes2Net), a lightweight back-end architecture designed to directly process high-dimensional features without DR layers. The nested structure enhances multi-scale feature extraction, improves feature interaction, and preserves high-dimensional information. We first validate Nes2Net on CtrSVDD, a singing voice deepfake detection dataset, and report a 22% performance improvement and an 87% back-end computational cost reduction over the state-of-the-art baseline. Additionally, extensive testing across four diverse datasets (ASVspoof 2021, ASVspoof 5, PartialSpoof, and In-the-Wild), covering fully spoofed speech, adversarial attacks, partial spoofing, and real-world scenarios, consistently highlights Nes2Net's superior robustness and generalization capabilities. The code package and pre-trained models are available at https://github.com/Liu-Tianchi/Nes2Net.
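
To make the DR-layer overhead mentioned in the abstract concrete, the following is a minimal back-of-the-envelope sketch. It assumes the DR layer is a single frame-wise linear projection; the function name `linear_dr_cost` and the sizes used (1024-dimensional input features, 128-dimensional output, 200 frames) are illustrative assumptions and are not taken from the paper.

```python
def linear_dr_cost(in_dim: int, out_dim: int, num_frames: int) -> dict:
    """Cost of a hypothetical frame-wise linear DR layer (illustrative only)."""
    # Weight matrix (in_dim x out_dim) plus one bias term per output channel.
    params = in_dim * out_dim + out_dim
    # One matrix-vector product per frame; bias additions are not counted.
    macs = in_dim * out_dim * num_frames
    return {"params": params, "macs": macs}


# Assumed sizes for illustration, not values reported by the authors.
cost = linear_dr_cost(in_dim=1024, out_dim=128, num_frames=200)
print(f"DR layer parameters: {cost['params']:,}")
print(f"DR layer MACs per utterance: {cost['macs']:,}")
```

Under these assumptions the projection alone adds over a hundred thousand parameters before any detection logic runs, which is the kind of overhead a DR-free back end avoids.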

Problem

Research questions and friction points this paper is trying to address.

Addresses mismatch between high-dimensional speech features and downstream models

Reduces computational costs and parameter overhead without dimensionality reduction

Enhances robustness in speech anti-spoofing across diverse datasets

Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight nested back-end architecture

Direct high-dimensional feature processing

Enhanced multi-scale feature extraction
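
The multi-scale idea can be illustrated with a simplified sketch of the generic Res2Net-style hierarchical split that the architecture's name refers to. This is not the Nes2Net design itself, whose nesting details are not given on this page: channels are split into groups, the first group passes through unchanged, and each later group is transformed after adding the previous group's output. The names `split_channels` and `res2net_style_mix` are hypothetical, and `transform` stands in for whatever learned layer (for example a convolution) a real model would use.

```python
from typing import Callable, List

Vector = List[float]


def split_channels(x: Vector, scale: int) -> List[Vector]:
    """Split a channel vector into `scale` equal-width groups."""
    if scale < 1 or len(x) % scale != 0:
        raise ValueError("channel count must be divisible by scale")
    width = len(x) // scale
    return [x[i * width:(i + 1) * width] for i in range(scale)]


def res2net_style_mix(
    x: Vector, scale: int, transform: Callable[[Vector], Vector]
) -> Vector:
    """Hierarchical group mixing followed by a residual connection."""
    groups = split_channels(x, scale)
    outputs = [groups[0]]  # first group is passed through untouched
    for i, group in enumerate(groups[1:], start=1):
        if i == 1:
            # Second group is transformed on its own.
            outputs.append(transform(group))
        else:
            # Later groups first absorb the previous group's output.
            mixed = [g + p for g, p in zip(group, outputs[-1])]
            outputs.append(transform(mixed))
    merged = [v for out in outputs for v in out]
    # Residual connection: the full-width input is added back.
    return [a + b for a, b in zip(x, merged)]


# Toy usage: a doubling function stands in for a learned layer.
print(res2net_style_mix([1.0, 2.0, 3.0, 4.0], scale=4,
                        transform=lambda v: [2 * t for t in v]))
```

Because each group sees the accumulated output of earlier groups, later channels have a progressively larger effective receptive field, and the input keeps its full width throughout with no reduction step.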

🔎 Similar Papers

A Comprehensive Survey with Critical Analysis for Deepfake Speech Detection