🤖 AI Summary
This work addresses key challenges in deep learning-based respiratory sound auscultation: the loss of transient acoustic events during spectrogram conversion, the absence of clinical context, and data scarcity exacerbated by severe class imbalance. To overcome these limitations, the authors propose a multimodal autonomous diagnostic system driven by an Active Adversarial Curriculum Agent (Thinker-A²CA). The agent runs a closed loop that identifies diagnostic weaknesses and schedules the synthesis of hard examples; a Modality-Weaving Diagnoser fuses electronic health record (EHR) tokens with audio features; and a Flow Matching generator disentangles pathological semantics from acoustic style for targeted data augmentation. Evaluated on Resp-229k, a newly curated benchmark comprising 229k samples, the framework demonstrates substantially improved diagnostic robustness under long-tailed distributions, outperforming existing methods across multiple metrics.
📝 Abstract
Deep learning-based respiratory auscultation is currently hindered by two fundamental challenges: (i) inherent information loss, as converting signals into spectrograms discards transient acoustic events and clinical context; (ii) limited data availability, exacerbated by severe class imbalance. To bridge these gaps, we present Resp-Agent, an autonomous multimodal system orchestrated by a novel Active Adversarial Curriculum Agent (Thinker-A$^2$CA). Unlike static pipelines, Thinker-A$^2$CA serves as a central controller that actively identifies diagnostic weaknesses and schedules targeted synthesis in a closed loop. To address the representation gap, we introduce a Modality-Weaving Diagnoser that weaves EHR data with audio tokens via Strategic Global Attention and sparse audio anchors, capturing both long-range clinical context and millisecond-level transients. To address the data gap, we design a Flow Matching Generator that adapts a text-only Large Language Model (LLM) via modality injection, decoupling pathological content from acoustic style to synthesize hard-to-diagnose samples. As a foundation for these efforts, we introduce Resp-229k, a benchmark corpus of 229k recordings paired with LLM-distilled clinical narratives. Extensive experiments demonstrate that Resp-Agent consistently outperforms prior approaches across diverse evaluation settings, improving diagnostic robustness under data scarcity and long-tailed class imbalance. Our code and data are available at https://github.com/zpforlove/Resp-Agent.
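The abstract's closed-loop design (Thinker-A²CA actively identifies diagnostic weaknesses, then schedules targeted synthesis of hard examples) can be sketched as a simple control loop. This is a minimal illustration under assumed interfaces; the function names (`evaluate`, `synthesize`, `retrain`) and the top-k weakness heuristic are placeholders, not the authors' implementation.

```python
import numpy as np

def schedule_hard_classes(per_class_f1, top_k=2):
    # Target the classes where the diagnoser is currently weakest
    # (lowest per-class F1); a stand-in for the agent's scheduling policy.
    return np.argsort(np.asarray(per_class_f1))[:top_k].tolist()

def curriculum_loop(evaluate, synthesize, retrain, rounds=3, top_k=2):
    """Closed loop: measure weaknesses -> synthesize hard samples -> update.

    evaluate()        -> per-class scores for the current diagnoser
    synthesize(weak)  -> generator produces hard samples for weak classes
    retrain(batch)    -> diagnoser trains on the synthesized batch
    All three are assumed callbacks, not APIs from the paper.
    """
    scheduled = []
    for _ in range(rounds):
        scores = evaluate()
        weak = schedule_hard_classes(scores, top_k)
        retrain(synthesize(weak))
        scheduled.append(weak)
    return scheduled
```

With per-class F1 of `[0.9, 0.1, 0.5, 0.2]` and `top_k=2`, the scheduler selects classes `[1, 3]`, i.e. the long-tail classes the diagnoser handles worst.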
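The Flow Matching Generator mentioned above is typically trained with a conditional flow-matching objective: regress a velocity field along a straight-line interpolant between noise and data. The sketch below shows that objective in NumPy; the tensor shapes, the conditioning vector (standing in for the decoupled pathological-content and acoustic-style codes), and the toy zero model are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

def flow_matching_loss(model, x1, cond, rng):
    """One-step conditional flow-matching loss (rectified-flow form).

    x1   : target audio features, shape (B, D), e.g. mel-spectrogram frames
    cond : conditioning codes, shape (B, C) (content/style stand-in)
    model: callable (x_t, t, cond) -> predicted velocity, shape (B, D)
    """
    B, D = x1.shape
    x0 = rng.standard_normal((B, D))      # noise endpoint of the path
    t = rng.uniform(size=(B, 1))          # random time in [0, 1]
    xt = (1.0 - t) * x0 + t * x1          # straight-line interpolant
    v_target = x1 - x0                    # constant target velocity
    v_pred = model(xt, t, cond)
    return float(np.mean((v_pred - v_target) ** 2))

rng = np.random.default_rng(0)
x1 = rng.standard_normal((4, 8))
cond = rng.standard_normal((4, 3))
# Toy model that predicts zero velocity everywhere, just to exercise the loss.
loss = flow_matching_loss(lambda xt, t, c: np.zeros_like(xt), x1, cond, rng)
```

A trained velocity model would then generate samples by integrating `dx/dt = v(x, t, cond)` from noise at `t=0` to data at `t=1`.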