Hacking Neural Evaluation Metrics with Single Hub Text

📅 2025-12-18
🤖 AI Summary
This work exposes a reliability vulnerability of neural text evaluation metrics (e.g., COMET) in black-box settings: their embedding spaces exhibit structural weaknesses that adversaries can exploit. The authors propose a "single hub text attack", searching for one universal adversarial text in discrete token space that consistently inflates scores regardless of the source sentence. The method combines gradient-guided discrete optimization, multilingual embedding alignment, and cross-task robustness validation. Experiments show that the discovered hub text reaches COMET scores of 79.1 and 67.8 on the WMT'24 En–Ja and En–De benchmarks, respectively, surpassing per-sentence translations produced by M2M100, and that it transfers to reverse language pairs (e.g., Ja–En, De–En). This is the first systematic demonstration of such generalization-related security risks in embedding-based evaluation metrics, highlighting their susceptibility to universal adversarial inputs in practical deployment.
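The paper's actual attack uses gradient-guided discrete optimization over a real metric; as a rough intuition for the "universal" objective only, the following toy sketch (all names, the vocabulary, and the scorer are invented, and the scorer is a crude stand-in, not COMET) greedily searches discrete token space for a single candidate that maximizes the *average* score across several source sentences:

```python
import random

# Toy vocabulary and scorer -- purely illustrative, not the paper's setup.
VOCAB = ["good", "great", "translation", "quality", "alpha", "beta",
         "gamma", "delta", "epsilon", "zeta"]

def toy_embed(text):
    # Bag-of-words count vector over VOCAB (a crude stand-in for a
    # neural sentence embedding).
    words = text.split()
    return [words.count(w) for w in VOCAB]

def toy_metric(source, candidate):
    # Cosine-like similarity between toy embeddings; higher = "better".
    a, b = toy_embed(source), toy_embed(candidate)
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5 or 1.0
    nb = sum(x * x for x in b) ** 0.5 or 1.0
    return dot / (na * nb)

def find_hub_text(sources, length=4, iters=200, seed=0):
    # Greedy discrete search: start from random tokens and accept any
    # single-token substitution that raises the AVERAGE metric score
    # across all sources -- the "universal adversarial text" objective.
    rng = random.Random(seed)
    tokens = [rng.choice(VOCAB) for _ in range(length)]

    def avg_score(toks):
        cand = " ".join(toks)
        return sum(toy_metric(s, cand) for s in sources) / len(sources)

    best = avg_score(tokens)
    for _ in range(iters):
        i = rng.randrange(length)
        trial = tokens[:]
        trial[i] = rng.choice(VOCAB)
        s = avg_score(trial)
        if s > best:
            tokens, best = trial, s
    return " ".join(tokens), best
```

The key design point the sketch preserves is that the objective averages over many sources, so the search converges on a text that scores well for *all* of them rather than any single one; the paper replaces this random hill-climbing with gradient guidance from the metric's embeddings.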

📝 Abstract
Strongly human-correlated evaluation metrics serve as an essential compass for the development and improvement of generation models and must be highly reliable and robust. Recent embedding-based neural text evaluation metrics, such as COMET for translation tasks, are widely used in both research and development fields. However, there is no guarantee that they yield reliable evaluation results due to the black-box nature of neural networks. To raise concerns about the reliability and safety of such metrics, we propose a method for finding a single adversarial text in the discrete space that is consistently evaluated as high-quality, regardless of the test cases, to identify the vulnerabilities in evaluation metrics. The single hub text found with our method achieved 79.1 COMET% and 67.8 COMET% in the WMT'24 English-to-Japanese (En–Ja) and English-to-German (En–De) translation tasks, respectively, outperforming translations generated individually for each source sentence by using M2M100, a general translation model. Furthermore, we also confirmed that the hub text found with our method generalizes across multiple language pairs such as Ja–En and De–En.
Problem

Research questions and friction points this paper is trying to address.

Neural text evaluation metrics (e.g., COMET) offer no reliability guarantees due to their black-box nature
Can a single adversarial text be found in discrete space that a metric consistently rates as high-quality?
Does such a hub text outscore genuine per-sentence translations and generalize across language pairs?
Innovation

Methods, ideas, or system contributions that make the work stand out.

A single universal adversarial ("hub") text that exploits embedding-space weaknesses of the metric
The hub text scores consistently high regardless of the source sentence under evaluation
The attack generalizes across multiple language pairs, including reversed directions (Ja–En, De–En)