🤖 AI Summary
This work addresses the limitations of edge-based audio systems: lightweight models have weak perceptual capabilities that hinder multi-step reasoning, while full cloud offloading incurs high latency, bandwidth overhead, and privacy risks. To resolve this tension, the authors propose CoFi-Agent, a novel edge-cloud collaborative architecture featuring a conditionally triggered "coarse-to-fine" cooperation mechanism. A local 7B Audio-LLM performs rapid initial inference, and only upon detecting uncertainty does a cloud-side controller orchestrate on-device tools, such as temporal re-listening and local ASR, for fine-grained enhancement. Evaluated on the MMAR benchmark, this approach boosts accuracy from 27.20% to 53.60%, substantially outperforming always-on fine-grained analysis while achieving a superior trade-off between accuracy, resource efficiency, and privacy preservation.
📝 Abstract
Deploying Audio-Language Models (Audio-LLMs) on edge infrastructure exposes a persistent tension between perception depth and computational efficiency. Lightweight local models tend to produce passive perception (generic summaries that miss the subtle evidence required for multi-step audio reasoning), while indiscriminate cloud offloading incurs unacceptable latency, bandwidth cost, and privacy risk. We propose CoFi-Agent (Tool-Augmented Coarse-to-Fine Agent), a hybrid architecture targeting edge servers and gateways. It performs fast local perception and triggers conditional forensic refinement only when uncertainty is detected. CoFi-Agent runs a single initial pass on a local 7B Audio-LLM; a cloud controller then gates difficult cases and issues lightweight plans for on-device tools such as temporal re-listening and local ASR. On the MMAR benchmark, CoFi-Agent improves accuracy from 27.20% to 53.60%, while achieving a better accuracy-efficiency trade-off than an always-on investigation pipeline. Overall, CoFi-Agent bridges the perception gap via tool-enabled, conditional edge-cloud collaboration under practical system constraints.
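The conditional coarse-to-fine flow described above can be sketched as a simple gating loop: a single local pass produces an answer plus a confidence score; only low-confidence cases trigger a cloud-planned round of on-device tool calls. This is a minimal illustrative sketch, not the paper's implementation: all function names, the confidence heuristic, and the tool signatures are assumptions.

```python
from dataclasses import dataclass

@dataclass
class LocalResult:
    answer: str
    confidence: float  # hypothetical score in [0, 1], e.g. from token log-probs

def local_coarse_pass(audio: bytes) -> LocalResult:
    # Placeholder for one inference pass of the on-device 7B Audio-LLM.
    return LocalResult(answer="speech over background music", confidence=0.42)

def cloud_plan(question: str) -> list:
    # Placeholder cloud controller: receives only text, not raw audio,
    # and returns a lightweight plan of on-device tool invocations.
    return [("relisten", {"start_s": 3.0, "end_s": 8.0}), ("local_asr", {})]

def run_tool(name: str, args: dict, audio: bytes) -> str:
    # Placeholder on-device tools (temporal re-listening, local ASR).
    return f"{name}: extracted evidence"

def cofi_answer(audio: bytes, question: str, threshold: float = 0.6) -> str:
    coarse = local_coarse_pass(audio)
    if coarse.confidence >= threshold:
        return coarse.answer  # fast path: no cloud round-trip at all
    # Uncertainty detected: fetch a tool plan and gather fine-grained evidence.
    evidence = [run_tool(name, args, audio) for name, args in cloud_plan(question)]
    # Fine pass: the local model would re-answer with evidence in context.
    return f"{coarse.answer} (refined with {len(evidence)} tool findings)"
```

Note the design point the paper emphasizes: raw audio never leaves the device in this sketch; the cloud sees only the question and returns a plan, which keeps bandwidth and privacy costs bounded to the uncertain cases.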