Resp-Agent: An Agent-Based System for Multimodal Respiratory Sound Generation and Disease Diagnosis

📅 2026-02-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses key challenges in deep learning–based respiratory sound auscultation, including the loss of transient events during spectrogram conversion, lack of clinical context, and data scarcity exacerbated by class imbalance. To overcome these limitations, the authors propose a multimodal autonomous diagnostic system driven by an Active Adversarial Curriculum Agent (Thinker-A²CA). The system employs a closed-loop strategy to schedule and synthesize hard examples, integrates electronic health records (EHR) with audio features, and introduces a Modality-Weaving Diagnoser to fuse multimodal tokens. Furthermore, a Flow Matching generator is leveraged to disentangle pathological semantics from acoustic style for enhanced data augmentation. Evaluated on Resp-229k—a newly curated benchmark comprising 229k samples—the framework demonstrates significantly improved diagnostic robustness under long-tailed distributions, outperforming existing methods across multiple metrics.
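The summary describes a Modality-Weaving Diagnoser that fuses EHR tokens with audio tokens via attention. The paper's exact mechanism (Strategic Global Attention with sparse audio anchors) is not specified here, so the following is only a minimal toy sketch of the general idea — EHR tokens cross-attending over audio frames, with the attended features woven back into the token sequence. All names (`weave_modalities`, dimensions, token counts) are illustrative, not from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def weave_modalities(audio_tokens, ehr_tokens):
    """Toy stand-in for modality weaving: each EHR token attends over
    all audio frames, and the attended features are appended to the
    audio token sequence for a downstream diagnoser."""
    d = audio_tokens.shape[-1]
    scores = ehr_tokens @ audio_tokens.T / np.sqrt(d)  # (n_ehr, n_audio)
    attn = softmax(scores, axis=-1)                    # rows sum to 1
    attended = attn @ audio_tokens                     # (n_ehr, d)
    return np.concatenate([audio_tokens, attended], axis=0)

rng = np.random.default_rng(0)
audio = rng.normal(size=(50, 64))  # e.g. 50 audio frames, 64-dim features
ehr = rng.normal(size=(8, 64))     # e.g. 8 EHR feature tokens
fused = weave_modalities(audio, ehr)
print(fused.shape)  # (58, 64)
```

A real implementation would use learned query/key/value projections and sparse anchor selection rather than raw token dot products; this sketch only shows the cross-modal attention shape of the fusion.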

📝 Abstract
Deep learning-based respiratory auscultation is currently hindered by two fundamental challenges: (i) inherent information loss, as converting signals into spectrograms discards transient acoustic events and clinical context; (ii) limited data availability, exacerbated by severe class imbalance. To bridge these gaps, we present Resp-Agent, an autonomous multimodal system orchestrated by a novel Active Adversarial Curriculum Agent (Thinker-A$^2$CA). Unlike static pipelines, Thinker-A$^2$CA serves as a central controller that actively identifies diagnostic weaknesses and schedules targeted synthesis in a closed loop. To address the representation gap, we introduce a Modality-Weaving Diagnoser that weaves EHR data with audio tokens via Strategic Global Attention and sparse audio anchors, capturing both long-range clinical context and millisecond-level transients. To address the data gap, we design a Flow Matching Generator that adapts a text-only Large Language Model (LLM) via modality injection, decoupling pathological content from acoustic style to synthesize hard-to-diagnose samples. As a foundation for these efforts, we introduce Resp-229k, a benchmark corpus of 229k recordings paired with LLM-distilled clinical narratives. Extensive experiments demonstrate that Resp-Agent consistently outperforms prior approaches across diverse evaluation settings, improving diagnostic robustness under data scarcity and long-tailed class imbalance. Our code and data are available at https://github.com/zpforlove/Resp-Agent.
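The abstract's Flow Matching Generator synthesizes hard samples by regressing a velocity field along a path from noise to data. The paper's conditioning scheme (LLM modality injection, content/style decoupling) is not detailed here, so below is only a generic conditional-flow-matching training objective as a sketch: sample a time `t`, linearly interpolate between a noise sample and a data sample, and regress the model's predicted velocity onto the ground-truth displacement. The placeholder "model" is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_loss(x1, model_velocity):
    """Generic flow-matching step: interpolate noise -> data along a
    linear path x_t = (1 - t) x0 + t x1 and regress the model's
    velocity prediction onto the target velocity (x1 - x0)."""
    x0 = rng.normal(size=x1.shape)           # noise sample
    t = rng.uniform(size=(x1.shape[0], 1))   # per-example time in [0, 1)
    xt = (1 - t) * x0 + t * x1               # point on the linear path
    target_v = x1 - x0                       # ground-truth velocity
    pred_v = model_velocity(xt, t)
    return np.mean((pred_v - target_v) ** 2)

# Untrained placeholder "model" that predicts zero velocity everywhere;
# in practice this would be a conditioned neural network.
x1 = rng.normal(size=(16, 32))  # a batch of 16 "data" vectors
loss = flow_matching_loss(x1, lambda xt, t: np.zeros_like(xt))
print(loss)
```

At sampling time, the learned velocity field is integrated from noise toward data (e.g. with an Euler ODE solver); conditioning on pathological content and acoustic style separately is what the paper's decoupling would add on top of this objective.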
Problem

Research questions and friction points this paper is trying to address.

respiratory sound
information loss
data scarcity
class imbalance
multimodal diagnosis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Agent-based system
Multimodal fusion
Flow matching generation
Modality weaving
Respiratory sound diagnosis
Pengfei Zhang
The Hong Kong University of Science and Technology (Guangzhou)
Tianxin Xie
The Hong Kong University of Science and Technology (Guangzhou)
Minghao Yang
The Hong Kong University of Science and Technology (Guangzhou)
Li Liu
The Hong Kong University of Science and Technology (Guangzhou)
Cued Speech · Audio-Visual Generation and Understanding · Trustworthy AI