🤖 AI Summary
This work addresses two gaps: large language models (LLMs) are notably weak at grasping the human intent behind utterances, and no systematic benchmark exists for evaluating that ability. To this end, we introduce IntentGrasp, the first large-scale, multi-domain (spanning 12 domains), structurally unified benchmark for intent understanding. It aggregates 49 high-quality corpora into roughly 260K training samples and two evaluation sets, with contextualized intent labels and a standardized task format. We further propose Intentional Fine-Tuning (IFT) and use leave-one-domain-out (Lodo) cross-validation to assess cross-domain generalization. Experiments show that prevailing LLMs perform poorly on IntentGrasp (F1 below 60% on All Set and below 25% on Gem Set), whereas IFT improves performance substantially (gains of over 30 F1 points on All Set and over 20 on Gem Set), narrowing the gap to estimated human performance (~81.1% F1).
📝 Abstract
Accurately understanding the intent behind speech, conversation, and writing is crucial to the development of helpful Large Language Model (LLM) assistants. This paper introduces IntentGrasp, a comprehensive benchmark for evaluating the intent understanding capability of LLMs. Derived from 49 high-quality, open-licensed corpora spanning 12 diverse domains, IntentGrasp is constructed through source dataset curation, intent label contextualization, and task format unification. IntentGrasp contains a large-scale training set of 262,759 instances and two evaluation sets: an All Set of 12,909 test cases and a more balanced and challenging Gem Set of 470 cases. Extensive evaluations on 20 LLMs across 7 families (including frontier models such as GPT-5.4, Gemini-3.1-Pro, and Claude-Opus-4.7) demonstrate unsatisfactory performance, with scores below 60% on All Set and below 25% on Gem Set. Notably, 17 out of 20 tested models perform worse than a random-guess baseline (15.2%) on Gem Set, while the estimated human performance is ~81.1%, showing substantial room for improvement. To improve this ability, this paper proposes Intentional Fine-Tuning (IFT), which fine-tunes models on the IntentGrasp training set, yielding significant gains of 30+ F1 points on All Set and 20+ points on Gem Set. The leave-one-domain-out (Lodo) experiments further demonstrate the strong cross-domain generalizability of IFT, indicating that it is a promising approach to substantially enhancing the intent understanding of LLMs. Overall, by benchmarking and boosting intent understanding ability, this study sheds light on a promising path towards more intentional, capable, and safe AI assistants for human benefit and social good.
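The leave-one-domain-out (Lodo) protocol mentioned above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `lodo_splits`, the record fields `domain`, `text`, and `intent`, and the toy examples are all hypothetical, chosen only to show how each domain is held out in turn while the remaining domains form the fine-tuning data.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple

# Hypothetical record layout: each example carries a domain tag.
Example = Dict[str, str]


def lodo_splits(
    examples: List[Example],
) -> Iterator[Tuple[str, List[Example], List[Example]]]:
    """Yield (held_out_domain, train, test), one split per domain."""
    by_domain: Dict[str, List[Example]] = defaultdict(list)
    for ex in examples:
        by_domain[ex["domain"]].append(ex)

    for held_out in sorted(by_domain):
        # Train on every domain except the held-out one.
        train = [
            ex
            for domain, items in by_domain.items()
            if domain != held_out
            for ex in items
        ]
        # Evaluate only on the unseen domain.
        test = list(by_domain[held_out])
        yield held_out, train, test


# Toy data (invented for illustration; not drawn from IntentGrasp).
toy = [
    {"domain": "banking", "text": "I lost my card", "intent": "report_lost_card"},
    {"domain": "banking", "text": "What is my balance?", "intent": "check_balance"},
    {"domain": "travel", "text": "Book me a flight", "intent": "book_flight"},
    {"domain": "health", "text": "I need a checkup", "intent": "schedule_visit"},
]

for held_out, train, test in lodo_splits(toy):
    print(held_out, len(train), len(test))
```

Under this setup, a model fine-tuned on `train` never sees the held-out domain, so its score on `test` reflects cross-domain generalization rather than in-domain memorization.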