Searching for Difficult-to-Translate Test Examples at Scale

📅 2025-09-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Finding the most challenging translation test instances in web-scale data by sampling and evaluating examples from every topic is computationally infeasible. Method: We formalize hard-example search as a multi-armed bandit (MAB) problem in which each “seed topic” is an arm and each pull returns stochastic difficulty feedback grounded in machine translation evaluation metrics, so that sampling concentrates on high-difficulty topics under a fixed computational budget. Contribution/Results: The bandit strategies vastly outperform brute-force baselines at locating the most difficult topics within the budget. The core idea is to combine MAB optimization with MT evaluation, giving a scalable, automated way to find challenging test data for NLP models.

📝 Abstract

NLP models require test data that are sufficiently challenging. The difficulty of an example is linked to the topic it originates from (the "seed topic"). The relationship between a topic and the difficulty of its instances is stochastic: an example about a difficult topic can happen to be easy, and vice versa. At the scale of the Internet, there are tens of thousands of potential topics, and finding the most difficult one by drawing and evaluating a large number of examples from every topic is computationally infeasible. We formalize this task and treat it as a multi-armed bandit problem. In this framework, each topic is an "arm," and pulling an arm (at a cost) involves drawing a single example, evaluating it, and measuring its difficulty. The goal is to efficiently identify the most difficult topics within a fixed computational budget. We illustrate this setup on the task of finding difficult examples for machine translation. We find that various bandit strategies vastly outperform baseline methods such as a brute-force search for the most challenging topics.
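
The arm-pulling loop described in the abstract can be sketched with UCB1, a standard bandit strategy. This is a minimal illustration, not the paper's implementation: the abstract does not say which strategies were used, and the names `ucb1_search`, `pull`, and `true_difficulty` are hypothetical. The `pull` function here simulates difficulty with a coin flip; in a real setup it would draw one example from the topic, translate it, and score it with an MT evaluation metric.

```python
import math
import random


def ucb1_search(pull, n_arms, budget, c=2.0):
    """Spend `budget` pulls with UCB1; return (mean difficulty per arm, pull counts)."""
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for t in range(1, budget + 1):
        if t <= n_arms:
            # Initialization: pull every arm (topic) once.
            arm = t - 1
        else:
            # Pick the arm with the highest upper confidence bound.
            arm = max(
                range(n_arms),
                key=lambda k: sums[k] / counts[k]
                + math.sqrt(c * math.log(t) / counts[k]),
            )
        counts[arm] += 1
        sums[arm] += pull(arm)
    means = [s / n if n else 0.0 for s, n in zip(sums, counts)]
    return means, counts


# Hypothetical per-topic mean difficulty in [0, 1] (assumed values, simulation only).
true_difficulty = [0.2, 0.3, 0.25, 0.8, 0.4]
rng = random.Random(0)


def pull(arm):
    # Stand-in for: draw one example from the topic, translate it, measure difficulty.
    return 1.0 if rng.random() < true_difficulty[arm] else 0.0


means, counts = ucb1_search(pull, len(true_difficulty), budget=2000)
best = max(range(len(means)), key=means.__getitem__)
print("estimated hardest topic:", best, "pull counts:", counts)
```

Most of the budget ends up on the topics that look hardest, which is the contrast with a brute-force search that spends the same number of pulls on every topic.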

Problem

Research questions and friction points this paper is trying to address.

Efficiently finding difficult-to-translate examples at scale

Modeling topic difficulty as a stochastic multi-armed bandit problem

Identifying the most challenging translation topics within a fixed computational budget
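
One way to write the problem down is as fixed-budget best-arm identification. The notation below ($K$, $T$, $\mathcal{D}_k$, $\mu_k$, $\hat{k}$) is assumed for illustration and is not taken from the paper, whose exact objective may differ (for example, it may target a set of top topics rather than a single one).

```latex
% Assumed notation: K topics, budget of T pulls, difficulty scores X in [0, 1].
\begin{align*}
&\text{Topics (arms): } k \in \{1, \dots, K\}, \quad
  \mu_k = \mathbb{E}_{X \sim \mathcal{D}_k}[X]
  \ \text{(mean difficulty of topic } k\text{)},\\
&\text{Pull at step } t = 1, \dots, T: \ \text{choose } k_t,\
  \text{observe } X_t \sim \mathcal{D}_{k_t},\\
&\text{Output after } T \text{ pulls: } \hat{k}, \ \text{with simple regret }
  r_T = \max_{k} \mu_k - \mu_{\hat{k}}.
\end{align*}
```

Here $\mathcal{D}_k$ captures the stochastic link between a topic and its examples: a single draw from a hard topic can still be easy, so only the mean $\mu_k$ is compared across topics.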

Innovation

Methods, ideas, or system contributions that make the work stand out.

Formalizes topic difficulty search as a multi-armed bandit problem

Treats each topic as an arm with stochastic difficulty

Uses bandit strategies to efficiently identify challenging topics

🔎 Similar Papers

Automated Test Case Repair Using Language Models