Exploiting Leaderboards for Large-Scale Distribution of Malicious Models

📅 2025-07-11
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work exposes a critical security vulnerability: AI model leaderboards are being exploited to disseminate malicious models—such as those embedded with backdoors or bias—at scale. Addressing the low propagation efficiency and high detectability of existing model poisoning attacks, we propose TrojanClimb, the first general adversarial framework specifically designed for leaderboard manipulation. TrojanClimb combines adversarial fine-tuning with behaviorally stealthy payload injection to embed cross-modal malicious functionality (e.g., text-, speech-, and image-based triggers) while preserving top-tier performance on standard benchmark metrics. Extensive experiments across four major AI modalities demonstrate successful generation of highly effective and covert malicious models, confirming broad applicability. This study constitutes the first systematic investigation revealing fundamental security flaws in leaderboard-based model evaluation and distribution, providing both a critical warning and a technical benchmark for developing robust model governance frameworks.

Technology Category

Application Category

📝 Abstract
While poisoning attacks on machine learning models have been extensively studied, the mechanisms by which adversaries can distribute poisoned models at scale remain largely unexplored. In this paper, we shed light on how model leaderboards -- ranked platforms for model discovery and evaluation -- can serve as a powerful channel for adversaries for stealthy large-scale distribution of poisoned models. We present TrojanClimb, a general framework that enables injection of malicious behaviors while maintaining competitive leaderboard performance. We demonstrate its effectiveness across four diverse modalities: text-embedding, text-generation, text-to-speech and text-to-image, showing that adversaries can successfully achieve high leaderboard rankings while embedding arbitrary harmful functionalities, from backdoors to bias injection. Our findings reveal a significant vulnerability in the machine learning ecosystem, highlighting the urgent need to redesign leaderboard evaluation mechanisms to detect and filter malicious (e.g., poisoned) models, while exposing broader security implications for the machine learning community regarding the risks of adopting models from unverified sources.
Problem

Research questions and friction points this paper is trying to address.

Large-scale distribution of poisoned ML models via leaderboards
Stealthy injection of malicious behaviors in competitive models
Vulnerability in leaderboard evaluation mechanisms for detecting threats
Innovation

Methods, ideas, or system contributions that make the work stand out.

TrojanClimb framework for stealthy poisoned model distribution
Maintains high leaderboard rankings with hidden malicious functionalities
Effective across text, speech, and image generation modalities
🔎 Similar Papers
No similar papers found.