🤖 AI Summary
This work addresses the instability of traditional tabular anomaly detection methods under distribution shift, missing data, and rare anomalies, as well as their inability to explain disagreements among models. To overcome these limitations, the authors propose MAD (Multi-Agent Debating), a framework that, for the first time, brings multi-agent debate mechanisms into anomaly detection. MAD integrates normalized anomaly scores, confidence estimates, and structured evidence from multiple detectors through a coordination layer, and leverages a large language model (LLM) as a critic to strengthen reasoning, producing auditable debate trajectories and a consolidated score. The framework unifies paradigms such as mixture-of-experts gating and learning with expert advice, and admits provable regret bounds and conformal calibration. Experiments demonstrate that MAD improves robustness across multiple benchmarks, controls false positive rates, and makes model disagreements clearly traceable.
📝 Abstract
Tabular anomaly detection is often handled by single detectors or static ensembles, even though strong performance on tabular data typically comes from heterogeneous model families (e.g., tree ensembles, deep tabular networks, and tabular foundation models) that frequently disagree under distribution shift, missingness, and rare-anomaly regimes. We propose MAD, a Multi-Agent Debating framework that treats this disagreement as a first-class signal and resolves it through a mathematically grounded coordination layer. Each agent is a machine learning (ML)-based detector that produces a normalized anomaly score, confidence, and structured evidence, augmented by a large language model (LLM)-based critic. A coordinator converts these messages into bounded per-agent losses and updates agent influence via an exponentiated-gradient rule, yielding both a final debated anomaly score and an auditable debate trace. MAD is a unified agentic framework that can recover existing approaches, such as mixture-of-experts gating and learning-with-expert-advice aggregation, by restricting the message space and synthesis operator. We establish regret guarantees for the synthesized losses and show how conformal calibration can wrap the debated score to control false positives under exchangeability. Experiments on diverse tabular anomaly benchmarks show improved robustness over baselines and clearer traces of model disagreement.
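The coordination step described in the abstract, bounded per-agent losses driving an exponentiated-gradient (Hedge-style) update of agent influence, can be sketched minimally as follows. The function names, the learning rate `eta`, and all numeric values are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def exponentiated_gradient_update(weights, losses, eta=0.5):
    # Hedge-style update: agents with higher bounded loss lose influence
    # exponentially; renormalize so the weights stay a distribution.
    w = weights * np.exp(-eta * np.asarray(losses, dtype=float))
    return w / w.sum()

def debated_score(scores, weights):
    # Final debated anomaly score as an influence-weighted average
    # of the agents' normalized scores.
    return float(np.dot(weights, scores))

# Three hypothetical agents with uniform initial influence.
w = np.ones(3) / 3
# Bounded per-agent losses produced by the coordinator (illustrative values).
losses = np.array([0.1, 0.8, 0.4])
w = exponentiated_gradient_update(w, losses)
score = debated_score(np.array([0.9, 0.2, 0.6]), w)
```

Because each loss is bounded, the standard learning-with-expert-advice analysis of this multiplicative update is what yields the regret guarantees the abstract mentions; the debated score itself can then be wrapped by a conformal calibration layer.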