Who Speaks What from Afar: Eavesdropping In-Person Conversations via mmWave Sensing

📅 2025-12-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
In mmWave radar-based long-range eavesdropping scenarios, attributing speech segments to individual speakers remains challenging because the attacker has no prior knowledge of the speaker count, identities, or seating layout. Method: This paper proposes an unsupervised, prior-free framework for voice attribution and separation. Leveraging only the reflections of ubiquitous indoor objects, it models the speaker-specific frequency-domain vibration signatures that vocal activity induces on nearby objects and introduces a cross-object signal fusion mechanism to improve feature robustness. The method integrates mmWave sensing, frequency-domain vibration feature extraction, noise-robust unsupervised clustering, and a deep learning-based fusion architecture. Contribution/Results: Evaluated in realistic meeting environments, the approach achieves up to 99% voice attribution classification accuracy and consistently improves speech intelligibility and SNR across long and varying distances. It represents the first zero-prior solution for multi-speaker voice attribution, establishing a novel paradigm for privacy-preserving and secure conferencing systems.
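The paper does not publish code; the sketch below only illustrates the general idea behind prior-free speaker attribution as summarized above. Each segment of object vibration is reduced to a normalized frequency-domain signature, and a greedy clustering pass opens a new cluster whenever a segment matches no existing centroid, so the number of speakers never has to be known in advance. All signals, sample rates, pitch values, and the similarity threshold are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 1000  # sample rate in Hz; illustrative, not the radar's actual rate

def spectrum(seg):
    """Normalized FFT magnitude of one vibration segment."""
    mag = np.abs(np.fft.rfft(seg * np.hanning(len(seg))))
    return mag / (np.linalg.norm(mag) + 1e-12)

# Synthetic vibration segments: two "speakers" with distinct dominant bands.
t = np.arange(0, 0.5, 1 / fs)
segments, labels = [], []
for _ in range(20):
    spk = int(rng.integers(2))
    f0 = 120 if spk == 0 else 220  # hypothetical per-speaker pitch (Hz)
    seg = np.sin(2 * np.pi * f0 * t) + 0.1 * rng.standard_normal(t.size)
    segments.append(spectrum(seg))
    labels.append(spk)

# Prior-free greedy clustering: open a new cluster whenever a segment's
# spectrum is dissimilar to every existing cluster centroid.
centroids, assign = [], []
for s in segments:
    sims = [float(s @ c) for c in centroids]  # cosine similarity (unit vectors)
    if sims and max(sims) > 0.9:
        k = int(np.argmax(sims))
    else:
        centroids.append(s.copy())
        k = len(centroids) - 1
    assign.append(k)

print("clusters found:", len(centroids))
```

With the two synthetic pitches well separated, the pass recovers exactly two clusters without ever being told the participant count, which is the property the summary calls "zero-prior".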

📝 Abstract
Multi-participant meetings occur across various domains, such as business negotiations and medical consultations, during which sensitive information like trade secrets, business strategies, and patient conditions is often discussed. Previous research has demonstrated that attackers with mmWave radars outside the room can overhear meeting content by detecting minute speech-induced vibrations on objects. However, these eavesdropping attacks cannot differentiate which speech content comes from which person in a multi-participant meeting, leading to potential misunderstandings and poor decision-making. In this paper, we answer the question "who speaks what". By leveraging the spatial diversity introduced by ubiquitous objects, we propose an attack system that enables attackers to remotely eavesdrop on in-person conversations without requiring prior knowledge, such as identities, the number of participants, or seating arrangements. Since participants in in-person meetings are typically seated at different locations, their speech induces distinct vibration patterns on nearby objects. To exploit this, we design a noise-robust unsupervised approach for distinguishing participants by detecting speech-induced vibration differences in the frequency domain. Meanwhile, a deep learning-based framework is explored to combine signals from objects for speech quality enhancement. We validate the proof-of-concept attack on speech classification and signal enhancement through extensive experiments. The experimental results show that our attack can achieve a speech classification accuracy of up to 0.99 with several participants in a meeting room. Meanwhile, our attack demonstrates consistent speech quality enhancement across all real-world scenarios, including different distances between the radar and the objects.
Problem

Research questions and friction points this paper is trying to address.

Identifying which participant speaks specific content in a multi-person meeting.
Enhancing the quality of speech recovered from eavesdropped vibrations.
Enabling remote eavesdropping without prior knowledge of the meeting setup.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages the spatial diversity of nearby objects to differentiate participants.
Uses unsupervised frequency-domain analysis for noise-robust speaker distinction.
Applies deep learning to fuse signals from multiple objects for speech enhancement.
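As a point of contrast with the paper's learned deep fusion network, the classical baseline for combining several noisy copies of the same signal is SNR-weighted averaging (maximal-ratio combining). The sketch below, with invented signals and noise levels, shows why fusing reflections from multiple objects helps: the weighted average attains a higher SNR than any single object's copy. In a real attack the per-object SNR would have to be estimated blindly; here the clean reference is used only to form illustrative weights.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: three objects reflect the same speech-induced
# vibration with different noise levels (all values invented).
fs = 1000
t = np.arange(0, 1.0, 1 / fs)
clean = np.sin(2 * np.pi * 150 * t)  # stand-in vibration waveform
noise_levels = [0.3, 0.8, 1.5]       # per-object noise standard deviations
copies = [clean + n * rng.standard_normal(t.size) for n in noise_levels]

def snr(x, ref):
    """Empirical signal-to-noise power ratio of x against the reference."""
    noise = x - ref
    return float(np.sum(ref ** 2) / np.sum(noise ** 2))

# Maximal-ratio combining: weight each copy by its (here, oracle) SNR.
weights = np.array([snr(c, clean) for c in copies])
weights /= weights.sum()
fused = sum(w * c for w, c in zip(weights, copies))

fused_snr = snr(fused, clean)
best_snr = max(snr(c, clean) for c in copies)
print(f"fused SNR {fused_snr:.1f} vs best single-object SNR {best_snr:.1f}")
```

For independent noise, SNR-proportional weights make the fused SNR approach the sum of the individual SNRs, so combining even low-quality object reflections improves on the best single object.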