🤖 AI Summary
Large language models (LLMs) can imitate individual writing styles, posing an underexplored identity impersonation threat to existing machine-generated text (MGT) detectors. Method: We introduce the first personalized MGT detection benchmark, revealing substantial performance degradation of mainstream detectors under style imitation. We formalize the “feature inversion trap”—a phenomenon where generic discriminative features become ineffective or even misleading in personalized settings—and propose a feature-direction-based evaluation framework. This framework constructs probe datasets dominated by inverted features to quantify detector reliance on error-prone features. Results: Our method accurately predicts both the direction and magnitude of detector performance shifts, achieving an 85% correlation between predicted and actual performance gaps. This establishes a principled approach for diagnosing and mitigating style-imitation vulnerabilities in MGT detection.
📝 Abstract
Large language models (LLMs) have grown more powerful in language generation, producing fluent text and even imitating personal style. Yet this ability also heightens the risk of identity impersonation. To the best of our knowledge, no prior work has examined personalized machine-generated text (MGT) detection. In this paper, we introduce the first benchmark for evaluating detector robustness in personalized settings, built from literary and blog texts paired with their LLM-generated imitations. Our experimental results reveal large performance gaps across detectors in personalized settings: even some state-of-the-art models suffer significant drops. We attribute this limitation to the “feature-inversion trap,” where features that are discriminative in general domains become inverted and misleading when applied to personalized text. Based on this finding, we propose a simple and reliable method for predicting detector performance changes in personalized settings. The method identifies latent directions corresponding to inverted features and constructs probe datasets that differ primarily along these features, thereby measuring how strongly a detector depends on them. Our experiments show that this method accurately predicts both the direction and the magnitude of post-transfer performance changes, achieving an 85% correlation with the actual performance gaps. We hope that this work will encourage further research on personalized text detection.
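The probe-construction idea behind the evaluation framework can be sketched in a toy setting. The snippet below is illustrative only, not the paper's implementation: the 2-D embeddings, the mean-difference direction estimate, and the threshold detector are all hypothetical stand-ins. It shows how a probe set dominated by an inverted feature exposes a detector's reliance on that feature.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-D "embeddings": axis 0 plays the role of a feature that is
# discriminative in the generic domain but inverts under style imitation.
human = rng.normal(loc=[-1.0, 0.0], scale=0.5, size=(500, 2))
machine = rng.normal(loc=[+1.0, 0.0], scale=0.5, size=(500, 2))

# Step 1: estimate a latent feature direction as the difference of
# class means (a simple linear-probe heuristic).
direction = machine.mean(axis=0) - human.mean(axis=0)
direction /= np.linalg.norm(direction)

# Step 2: construct a probe set of style-imitated machine text whose
# projection along this direction looks human-like, i.e. the feature
# is inverted relative to the generic domain.
personalized_machine = rng.normal(loc=[-1.0, 0.0], scale=0.5, size=(500, 2))

# Step 3: a detector that leans on this direction flips on the probe set.
def detector(x):
    # Hypothetical detector: thresholded projection onto the direction.
    return (x @ direction > 0).astype(int)  # 1 = predicted machine

generic_acc = detector(machine).mean()             # high: feature works
probe_acc = detector(personalized_machine).mean()  # low: feature inverted
print(f"generic accuracy: {generic_acc:.2f}, probe accuracy: {probe_acc:.2f}")
```

The gap between the two accuracies quantifies the detector's dependence on the inverted feature; under this view, a large gap on the probe set predicts a large post-transfer performance drop.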