Semi-Supervised Diseased Detection from Speech Dialogues with Multi-Level Data Modeling

📅 2026-01-08

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

career value

192K/year

🤖 AI Summary

Detecting pathological speech signals from noisy conversation-level labels presents significant challenges, including scarce annotations, high label subjectivity, and non-uniform distribution of disease-related acoustic features. To address these issues, this work proposes an end-to-end semi-supervised audio learning framework that introduces, for the first time in medical speech analysis, a multi-granularity modeling mechanism operating jointly across frame-level, segment-level, and conversation-level representations. By integrating dynamic feature aggregation with high-quality pseudo-label generation, the framework efficiently leverages unlabeled clinical dialogue data. The approach substantially improves data efficiency and generalization, achieving 90% of fully supervised performance with only 11 labeled samples and demonstrating strong robustness across cross-lingual and cross-disease scenarios.

Technology Category

Application Category

📝 Abstract

Detecting medical conditions from speech acoustics is fundamentally a weakly-supervised learning problem: a single, often noisy, session-level label must be linked to nuanced patterns within a long, complex audio recording. This task is further hampered by severe data scarcity and the subjective nature of clinical annotations. While semi-supervised learning (SSL) offers a viable path to leverage unlabeled data, existing audio methods often fail to address the core challenge that pathological traits are not uniformly expressed in a patient's speech. We propose a novel, audio-only SSL framework that explicitly models this hierarchy by jointly learning from frame-level, segment-level, and session-level representations within unsegmented clinical dialogues. Our end-to-end approach dynamically aggregates these multi-granularity features and generates high-quality pseudo-labels to efficiently utilize unlabeled data. Extensive experiments show the framework is model-agnostic, robust across languages and conditions, and highly data-efficient-achieving, for instance, 90\% of fully-supervised performance using only 11 labeled samples. This work provides a principled approach to learning from weak, far-end supervision in medical speech analysis.

Problem

Research questions and friction points this paper is trying to address.

weakly-supervised learning

disease detection from speech

data scarcity

clinical annotation subjectivity

pathological speech patterns

Innovation

Methods, ideas, or system contributions that make the work stand out.

semi-supervised learning

multi-level modeling

medical speech analysis