Semi-Supervised Diseased Detection from Speech Dialogues with Multi-Level Data Modeling

📅 2026-01-08
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Detecting pathological speech signals from noisy conversation-level labels presents significant challenges, including scarce annotations, high label subjectivity, and non-uniform distribution of disease-related acoustic features. To address these issues, this work proposes an end-to-end semi-supervised audio learning framework that introduces, for the first time in medical speech analysis, a multi-granularity modeling mechanism operating jointly across frame-level, segment-level, and conversation-level representations. By integrating dynamic feature aggregation with high-quality pseudo-label generation, the framework efficiently leverages unlabeled clinical dialogue data. The approach substantially improves data efficiency and generalization, achieving 90% of fully supervised performance with only 11 labeled samples and demonstrating strong robustness across cross-lingual and cross-disease scenarios.

Technology Category

Application Category

📝 Abstract
Detecting medical conditions from speech acoustics is fundamentally a weakly-supervised learning problem: a single, often noisy, session-level label must be linked to nuanced patterns within a long, complex audio recording. This task is further hampered by severe data scarcity and the subjective nature of clinical annotations. While semi-supervised learning (SSL) offers a viable path to leverage unlabeled data, existing audio methods often fail to address the core challenge that pathological traits are not uniformly expressed in a patient's speech. We propose a novel, audio-only SSL framework that explicitly models this hierarchy by jointly learning from frame-level, segment-level, and session-level representations within unsegmented clinical dialogues. Our end-to-end approach dynamically aggregates these multi-granularity features and generates high-quality pseudo-labels to efficiently utilize unlabeled data. Extensive experiments show the framework is model-agnostic, robust across languages and conditions, and highly data-efficient-achieving, for instance, 90\% of fully-supervised performance using only 11 labeled samples. This work provides a principled approach to learning from weak, far-end supervision in medical speech analysis.
Problem

Research questions and friction points this paper is trying to address.

weakly-supervised learning
disease detection from speech
data scarcity
clinical annotation subjectivity
pathological speech patterns
Innovation

Methods, ideas, or system contributions that make the work stand out.

semi-supervised learning
multi-level modeling
medical speech analysis
pseudo-labeling
weakly-supervised learning
🔎 Similar Papers
No similar papers found.
X
Xingyuan Li
X-LANCE Lab, Shanghai Jiao Tong University, Shanghai, China
Mengyue Wu
Mengyue Wu
Shanghai Jiao Tong University
Speech perception and productionaffective computingaudio cognition