Enforcing Speech Content Privacy in Environmental Sound Recordings using Segment-wise Waveform Reversal

📅 2025-07-11
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Embedding intelligible speech in environmental sound recordings poses severe privacy risks, hindering data sharing and reuse. To address this, we propose a waveform-level privacy-preserving method based on segment-wise waveform reversal: speech segments are first localized via voice activity detection and speech separation; their local waveforms are then reversed in time, and random splicing of the processed segments strengthens resistance to adversarial reconstruction. Crucially, this approach degrades speech intelligibility while preserving the non-speech acoustic scene structure and overall audio fidelity. Experiments on a synthetic dataset show a 97.9% word error rate (WER), only a 2.7% drop in sound source classification accuracy, and a Fréchet Audio Distance (FAD) of 1.40, substantially outperforming baseline methods. To our knowledge, this is the first work to combine waveform-level perturbation with randomized reassembly for environmental audio privacy protection, achieving a balanced trade-off among strong privacy, high perceptual quality, and downstream task utility.

📝 Abstract
Environmental sound recordings often contain intelligible speech, raising privacy concerns that limit the analysis, sharing, and reuse of data. In this paper, we introduce a method that renders speech unintelligible while preserving both the integrity of the acoustic scene and the overall audio quality. Our approach involves reversing waveform segments to distort speech content. This process is enhanced through a voice activity detection and speech separation pipeline, which allows for more precise targeting of speech. To demonstrate the effectiveness of the proposed approach, we consider a three-part evaluation protocol that assesses: 1) speech intelligibility using Word Error Rate (WER), 2) sound source detectability using the Sound source Classification Accuracy Drop (SCAD) of a widely used pre-trained model, and 3) audio quality using the Fréchet Audio Distance (FAD), computed with our reference dataset that contains unaltered speech. Experiments on this simulated evaluation dataset, which consists of linear mixtures of speech and environmental sound scenes, show that our method achieves a satisfactory reduction in speech intelligibility (97.9% WER), minimal degradation of sound source detectability (2.7% SCAD), and high perceptual quality (FAD of 1.40). An ablation study further highlights the contribution of each component of the pipeline. We also show that incorporating random splicing into our speech content privacy enforcement method enhances the algorithm's robustness to attempts to recover the clean speech, at a slight cost in audio quality.
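The core operation described in the abstract, reversing waveform segments inside detected speech regions, can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the function name `reverse_speech_segments`, the `(start, end)` sample-index format of the VAD output, and the fixed segment length are assumptions for the sake of the example.

```python
import numpy as np

def reverse_speech_segments(audio, speech_regions, segment_len):
    """Time-reverse fixed-length windows inside detected speech regions.

    audio          : 1-D float array of samples.
    speech_regions : list of (start, end) sample indices, e.g. from a VAD.
    segment_len    : reversal window length in samples.
    """
    out = audio.copy()
    for start, end in speech_regions:
        # Walk through the speech region in windows of segment_len samples
        # and reverse each window in time; non-speech audio is untouched.
        for s in range(start, end, segment_len):
            e = min(s + segment_len, end)
            out[s:e] = out[s:e][::-1]
    return out
```

Because each window is reversed in place, the output has the same length and the same sample values as the input; only the local temporal order within speech regions changes, which is what breaks intelligibility while leaving the surrounding scene intact.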
Problem

Research questions and friction points this paper is trying to address.

Protect speech privacy in environmental recordings
Preserve acoustic scene integrity and audio quality
Distort speech via segment-wise waveform reversal
Innovation

Methods, ideas, or system contributions that make the work stand out.

Segment-wise waveform reversal for speech distortion
Voice activity detection for precise speech targeting
Random splicing enhances robustness against recovery
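The random-splicing step listed above can be sketched in the same style: cut each speech region into segments and reassemble them in a random order, so an attacker cannot simply re-reverse the windows to recover clean speech. Again, `random_splice` and its interface are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def random_splice(audio, speech_regions, segment_len, rng=None):
    """Cut each speech region into segments and reassemble them in random order."""
    rng = rng or np.random.default_rng()
    out = audio.copy()
    for start, end in speech_regions:
        segs = [out[s:min(s + segment_len, end)].copy()
                for s in range(start, end, segment_len)]
        rng.shuffle(segs)  # permute segment order within the speech region
        out[start:end] = np.concatenate(segs)
    return out
```

A fixed `rng` seed makes the splicing reproducible for evaluation; in deployment the permutation would be drawn fresh (and kept secret) so the original segment order cannot be inferred.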