ConSensus: Multi-Agent Collaboration for Multimodal Sensing

📅 2026-01-10
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge that monolithic large language models struggle to balance accuracy and robustness when fusing heterogeneous multimodal sensor data, primarily due to prior biases and vulnerability to missing modalities. To overcome this limitation, the authors propose ConSensus, a training-free, single-round multi-agent collaboration framework. ConSensus decomposes perception tasks across modality-specific agents and integrates their outputs through a hybrid fusion strategy that combines semantic aggregation with statistical consensus. This approach preserves cross-modal contextual understanding while substantially reducing computational overhead. Evaluated on five standard multimodal perception benchmarks, ConSensus achieves an average accuracy improvement of 7.1% over the single-agent baseline, matching the performance of iterative multi-agent debate approaches while reducing fusion token consumption by a factor of 12.7.

📝 Abstract
Large language models (LLMs) are increasingly grounded in sensor data to perceive and reason about human physiology and the physical world. However, accurately interpreting heterogeneous multimodal sensor data remains a fundamental challenge. We show that a single monolithic LLM often fails to reason coherently across modalities, leading to incomplete interpretations and prior-knowledge bias. We introduce ConSensus, a training-free multi-agent collaboration framework that decomposes multimodal sensing tasks into specialized, modality-aware agents. To aggregate agent-level interpretations, we propose a hybrid fusion mechanism that balances semantic aggregation, which enables cross-modal reasoning and contextual understanding, with statistical consensus, which provides robustness through agreement across modalities. While each approach has complementary failure modes, their combination enables reliable inference under sensor noise and missing data. We evaluate ConSensus on five diverse multimodal sensing benchmarks, demonstrating an average accuracy improvement of 7.1% over the single-agent baseline. Furthermore, ConSensus matches or exceeds the performance of iterative multi-agent debate methods while achieving a 12.7 times reduction in average fusion token cost through a single-round hybrid fusion protocol, yielding a robust and efficient solution for real-world multimodal sensing tasks.
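The hybrid fusion mechanism described in the abstract can be sketched in a few lines. The following is a minimal illustration, not the paper's implementation: it assumes statistical consensus is a majority vote over per-agent labels (tolerating missing modalities) and models semantic aggregation as a caller-supplied fallback, which in the real system would be an LLM reasoning over the agents' rationales. All function names here are hypothetical.

```python
from collections import Counter

def statistical_consensus(votes):
    """Majority vote across modality agents; returns (label, agreement ratio)."""
    counts = Counter(votes)
    label, n = counts.most_common(1)[0]
    return label, n / len(votes)

def hybrid_fusion(agent_outputs, agreement_threshold=0.5, semantic_fuse=None):
    """Single-round hybrid fusion (sketch).

    agent_outputs: dict mapping modality name -> predicted label (None if
    the sensor dropped out). When agreement exceeds the threshold, the
    statistical consensus is trusted; otherwise control falls back to
    semantic aggregation -- stubbed here, an LLM call in the real system.
    """
    votes = [v for v in agent_outputs.values() if v is not None]
    if not votes:
        raise ValueError("no modality produced an output")
    label, agreement = statistical_consensus(votes)
    if agreement > agreement_threshold:
        return label
    if semantic_fuse is not None:
        return semantic_fuse(agent_outputs)
    return label

# Usage: three agents agree despite a missing IMU stream and a dissenting radar agent.
outputs = {"audio": "walking", "imu": None, "video": "walking", "radar": "running"}
print(hybrid_fusion(outputs))  # "walking" (2/3 agreement)
```

The split mirrors the complementary failure modes the abstract mentions: majority voting is cheap and robust when modalities agree, while the semantic path handles the contested cases where cross-modal context is needed.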
Problem

Research questions and friction points this paper is trying to address.

multimodal sensing
heterogeneous sensor data
cross-modal reasoning
LLM bias
sensor fusion
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-agent collaboration
multimodal sensing
hybrid fusion
training-free framework
semantic-statistical consensus