InterPol: De-anonymizing LM Arena via Interpolated Preference Learning

📅 2026-03-16

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a critical vulnerability in voting-based model leaderboards such as LMSYS Chatbot Arena: the anonymity of model responses can be broken, even for stylistically similar or same-family models that simple statistical features fail to separate. The authors propose InterPol, a framework that combines model interpolation with adaptive curriculum learning. By synthesizing hard negative samples through interpolation and training on the resulting preference data, InterPol captures fine-grained stylistic features and improves the identification of anonymized models. The method significantly outperforms existing baselines in identification accuracy, and ranking manipulation simulations on Arena battle data show that model rankings are susceptible to manipulation, exposing serious flaws in current anonymity mechanisms.

📝 Abstract

Strict anonymity of model responses is key to the reliability of voting-based leaderboards, such as LM Arena. While prior studies have attempted to compromise this assumption using simple statistical features like TF-IDF or bag-of-words, these methods often lack the discriminative power to distinguish between stylistically similar or within-family models. To overcome these limitations and expose the severity of the vulnerability, we introduce INTERPOL, a model-driven identification framework that learns to distinguish target models from others using interpolated preference data. Specifically, INTERPOL captures deep stylistic patterns that superficial statistical features miss by synthesizing hard negative samples through model interpolation and employing an adaptive curriculum learning strategy. Extensive experiments demonstrate that INTERPOL significantly outperforms existing baselines in identification accuracy. Furthermore, we quantify the real-world threat of our findings through ranking manipulation simulations on Arena battle data.
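
The abstract does not specify how the model interpolation is performed. As a rough illustration only, the sketch below assumes plain linear interpolation between the weights of a target model and another model. The function name `interpolate_weights`, the mixing coefficient `alpha`, and the dict-of-lists weight format are hypothetical and not taken from the paper. Under this assumption, a small `alpha` gives a model close to the target, whose responses would serve as harder negatives.

```python
from typing import Dict, List

# Hypothetical weight format: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def interpolate_weights(target: Weights, other: Weights, alpha: float) -> Weights:
    """Return (1 - alpha) * target + alpha * other for every parameter.

    Assumption: linear weight interpolation; the paper may use a
    different interpolation scheme.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if target.keys() != other.keys():
        raise ValueError("both models must share the same parameter names")

    mixed: Weights = {}
    for name in target:
        a, b = target[name], other[name]
        if len(a) != len(b):
            raise ValueError(f"size mismatch for parameter {name!r}")
        # Element-wise convex combination of the two parameter vectors.
        mixed[name] = [(1.0 - alpha) * x + alpha * y for x, y in zip(a, b)]
    return mixed


if __name__ == "__main__":
    target_model = {"w": [0.0, 2.0]}
    other_model = {"w": [4.0, 2.0]}
    # alpha = 0.25 keeps the interpolated model close to the target.
    print(interpolate_weights(target_model, other_model, 0.25))
```

In a real setting the same convex combination would be applied to full model checkpoints, and responses from the interpolated model would be paired against the target model's responses to form preference data. That pairing step is likewise an assumption about how the pieces fit together.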

Problem

Research questions and friction points this paper is trying to address.

de-anonymization

language model leaderboard

model identification

stylistic similarity

anonymity vulnerability

Innovation

Methods, ideas, or system contributions that make the work stand out.

interpolated preference learning

model de-anonymization

stylistic pattern recognition