🤖 AI Summary
Problem: Identifying the most challenging translation test instances in large-scale web data is computationally infeasible when every topic has to be sampled and evaluated exhaustively. Method: We formalize hard-example search as a multi-armed bandit (MAB) problem. Each “seed topic” is an arm, and pulling an arm draws one example and returns a stochastic difficulty signal derived from machine translation evaluation metrics, so that high-difficulty topics are prioritized under a fixed computational budget. Contribution/Results: Bandit strategies find difficult topics far more efficiently than brute-force baselines under the same budget. The core contribution is the combination of MAB optimization with MT evaluation, which makes the search for hard test examples scalable and cost-effective.
📝 Abstract
NLP models require test data that are sufficiently challenging. The difficulty of an example is linked to the topic it originates from (the "seed topic"). The relationship between the topic and the difficulty of its instances is stochastic in nature: an example about a difficult topic can happen to be easy, and vice versa. At the scale of the Internet, there are tens of thousands of potential topics, and finding the most difficult one by drawing and evaluating a large number of examples across all topics is computationally infeasible. We formalize this task and treat it as a multi-armed bandit problem. In this framework, each topic is an "arm," and pulling an arm (at a cost) involves drawing a single example, evaluating it, and measuring its difficulty. The goal is to efficiently identify the most difficult topics within a fixed computational budget. We illustrate the bandit problem setup of finding difficult examples for the task of machine translation. We find that various bandit strategies vastly outperform baseline methods such as brute-force search for the most challenging topics.
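The arm-pulling setup described in the abstract can be sketched in a few lines. The abstract does not say which bandit strategies were evaluated, so the sketch below uses UCB1, one standard strategy, purely as an illustration. The names `TRUE_DIFFICULTY`, `make_simulated_topics`, and `ucb1_find_hard_topics` are hypothetical, and difficulty is simulated as a noisy value in [0, 1]. In a real setting, a pull would draw an example from the topic, translate it, and score it (for instance, as 1 minus a normalized quality metric, which is also an assumption here).

```python
import math
import random

# Hypothetical ground-truth mean difficulty per topic (unknown to the bandit).
TRUE_DIFFICULTY = [0.2, 0.3, 0.25, 0.9, 0.4, 0.35, 0.1, 0.5, 0.45, 0.15]


def make_simulated_topics(means, noise=0.15, seed=0):
    """Return a draw(topic) function that simulates one costly arm pull."""
    rng = random.Random(seed)

    def draw(topic):
        # Stochastic feedback: a hard topic can still yield an easy example.
        return min(1.0, max(0.0, rng.gauss(means[topic], noise)))

    return draw


def ucb1_find_hard_topics(draw_difficulty, n_topics, budget, top_k=1):
    """Spend `budget` pulls with UCB1 and return the top_k hardest topics."""
    counts = [0] * n_topics   # pulls per topic
    sums = [0.0] * n_topics   # accumulated difficulty per topic

    def ucb(arm, t):
        # Empirical mean plus an exploration bonus that shrinks with pulls.
        return sums[arm] / counts[arm] + math.sqrt(2 * math.log(t) / counts[arm])

    for t in range(budget):
        if t < n_topics:
            arm = t  # initialization: pull every topic once
        else:
            arm = max(range(n_topics), key=lambda a: ucb(a, t))
        counts[arm] += 1
        sums[arm] += draw_difficulty(arm)

    # Topics never pulled (budget < n_topics) are ranked last.
    means = [sums[a] / counts[a] if counts[a] else float("-inf")
             for a in range(n_topics)]
    ranked = sorted(range(n_topics), key=lambda a: means[a], reverse=True)
    return ranked[:top_k], counts


draw = make_simulated_topics(TRUE_DIFFICULTY)
hardest, pulls = ucb1_find_hard_topics(draw, len(TRUE_DIFFICULTY), budget=5000)
print("estimated hardest topic(s):", hardest)
print("pulls per topic:", pulls)
```

A brute-force baseline would split the same budget evenly across all topics. The bandit instead concentrates pulls on topics whose upper confidence bound stays high, which is what makes a fixed budget go further when there are many topics.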