Among Us: Measuring and Mitigating Malicious Contributions in Model Collaboration Systems

📅 2026-02-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the vulnerability of multi-LLM model collaboration systems to malicious models, which can significantly degrade performance in both reasoning and safety-critical tasks. The study presents the first systematic quantification of the impact of four types of malicious models across four prevalent collaborative architectures, evaluated on ten diverse datasets. To mitigate this threat, the authors propose a real-time defense mechanism leveraging an external supervisor that dynamically isolates malicious components through adaptive masking and routing control. Experimental results show that malicious models reduce average performance by 7.12% on reasoning tasks and 7.94% on safety tasks, whereas the proposed method recovers over 95.31% of the original system performance, substantially enhancing robustness against such adversarial attacks.

📝 Abstract
Language models (LMs) are increasingly used in collaboration: multiple LMs trained by different parties collaborate through routing systems, multi-agent debate, model merging, and more. Critical safety risks remain in this decentralized paradigm: what if some of the models in multi-LLM systems are compromised or malicious? We first quantify the impact of malicious models by engineering four categories of malicious LMs, plugging them into four types of popular model collaboration systems, and evaluating the compromised systems across 10 datasets. We find that malicious models have a severe impact on multi-LLM systems, especially in the reasoning and safety domains, where performance is lowered by 7.12% and 7.94% on average. We then propose mitigation strategies to alleviate the impact of malicious components by employing external supervisors that oversee model collaboration and disable or mask malicious models to reduce their influence. On average, these strategies recover 95.31% of the initial performance, while making model collaboration systems fully resistant to malicious models remains an open research question.
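The supervisor-based mitigation described above can be sketched in miniature. This is a hedged illustration, not the paper's implementation: the `supervise` function, the flag heuristic, and the majority-vote aggregation are all assumptions chosen to show the idea of masking a suspected-malicious model out of a collaboration before aggregating the remaining outputs.

```python
# Hypothetical sketch of an external supervisor for a multi-LM
# collaboration system (names and logic are illustrative, not from
# the paper): models the supervisor flags as malicious are masked
# out, and the surviving answers are aggregated by majority vote.
from collections import Counter


def supervise(answers, flag_fn):
    """Mask out flagged models, then aggregate by majority vote.

    answers: dict mapping model name -> that model's answer string
    flag_fn: callable (model, answer) -> True if the supervisor
             judges the model malicious
    Returns the majority answer among trusted models, or None if
    every model was masked out.
    """
    trusted = {m: a for m, a in answers.items() if not flag_fn(m, a)}
    if not trusted:
        return None
    return Counter(trusted.values()).most_common(1)[0][0]


# Toy flag heuristic: treat a model that spams refusals as malicious.
def refusal_flag(model, answer):
    return answer == "REFUSE"


answers = {"lm_a": "42", "lm_b": "42", "lm_mal": "REFUSE"}
print(supervise(answers, refusal_flag))  # -> 42
```

In a real system the flag function would be a learned or rule-based detector and the masking would feed into routing control, but the structure (detect, mask, aggregate over the remaining models) is the same.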
Problem

Research questions and friction points this paper is trying to address.

malicious models
model collaboration
safety risks
multi-LLM systems
decentralized AI
Innovation

Methods, ideas, or system contributions that make the work stand out.

malicious language models
model collaboration systems
safety mitigation
external supervision
multi-LLM security