HammerBench: Fine-Grained Function-Calling Evaluation in Real Mobile Device Scenarios

📅 2024-12-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Evaluating the robustness of function calling in multi-turn LLM dialogues under realistic mobile scenarios remains challenging due to dynamic user behavior and environmental constraints. Method: This paper introduces the first fine-grained, mobile-oriented benchmark for function-call evaluation. It proposes a hybrid data-construction paradigm that combines real-world user logs with dialogues generated by open-source models, and it designs a turn-level interactive snapshot assessment mechanism that enables dynamic trajectory tracking and parameter-level error attribution. Contribution/Results: This is the first work to enable precise diagnosis of complex phenomena in mobile dialogue contexts, including imperfect instructions, intent drift, and pronoun references. Empirical analysis reveals that parameter-name errors (e.g., spelling mistakes, case mismatches, and abbreviations) are the dominant cause of cross-scenario failures. The benchmark thus provides an interpretable, localizable foundation for diagnosing and improving the robustness of mobile-assistant LLMs.
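
The parameter-level error attribution is easiest to picture in code. Below is a minimal, hypothetical Python sketch of a turn-level checker that buckets parameter-name mismatches into the error classes mentioned above; the function names, matching heuristics, and error labels are illustrative assumptions, not the paper's released implementation.

```python
from dataclasses import dataclass, field


@dataclass
class FunctionCall:
    """One predicted or gold function call inside a single dialogue turn."""
    name: str
    arguments: dict[str, str] = field(default_factory=dict)


def attribute_param_name_errors(pred: FunctionCall, gold: FunctionCall) -> list[str]:
    """Classify parameter-name mismatches for one turn-level snapshot.

    The error classes mirror those the summary highlights (spelling mistakes,
    case mismatches, abbreviations); the matching heuristics are illustrative
    guesses, not the authors' actual rules.
    """
    errors: list[str] = []
    gold_keys = set(gold.arguments)
    for key in pred.arguments:
        if key in gold_keys:
            continue  # exact match: no parameter-name error for this slot
        if any(key.lower() == g.lower() for g in gold_keys):
            errors.append(f"case_mismatch:{key}")
        elif any(g.lower().startswith(key.lower()) for g in gold_keys):
            errors.append(f"abbreviation:{key}")
        else:
            errors.append(f"misspelled_or_hallucinated:{key}")
    return errors


# Example: the model mis-cases one parameter name and abbreviates another.
gold = FunctionCall("send_message", {"recipient": "Alice", "content": "Running late"})
pred = FunctionCall("send_message", {"Recipient": "Alice", "cont": "Running late"})
print(attribute_param_name_errors(pred, gold))
# ['case_mismatch:Recipient', 'abbreviation:cont']
```

A full evaluator would also score function-name selection and argument values at each turn; this sketch isolates only the parameter-name dimension that the summary identifies as the dominant failure source.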

📝 Abstract
Evaluating the performance of LLMs in multi-turn human-agent interactions presents significant challenges, particularly due to the complexity and variability of user behavior. In this paper, we introduce HammerBench, a novel benchmark framework for assessing LLMs' function-calling capabilities in real-world, multi-turn dialogues. HammerBench simulates diverse mobile assistant use cases, incorporating imperfect instructions, dynamic question-answer trajectories, intent and argument shifts, and the indirect use of external information through pronouns. To construct this benchmark, we curate a comprehensive dataset derived from popular mobile app functionalities and anonymized user logs, complemented by a cost-effective data generation pipeline leveraging open-source models. HammerBench is further augmented with fine-grained interaction snapshots and metrics, enabling detailed evaluation of function-calling performance across individual conversational turns. We demonstrate the effectiveness of HammerBench by evaluating several leading LLMs and uncovering key performance trends. Our experiments reveal that different types of parameter name errors are a significant source of failure across different interaction scenarios, highlighting critical areas for further improvement in LLM robustness for mobile assistant applications.
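
To make these dialogue phenomena concrete, the sketch below shows what a single turn-level interaction snapshot could look like as data; the schema and field names are illustrative assumptions, not HammerBench's published format.

```python
# Hypothetical schema for one turn-level interaction snapshot (illustrative only).
snapshot = {
    "turn_id": 3,
    # "her" points back to a contact introduced in an earlier turn,
    # exercising the indirect, pronoun-based use of external information.
    "user_utterance": "Actually, send it to her instead.",
    "phenomena": ["argument_shift", "pronoun_reference"],
    "gold_call": {
        "name": "send_message",
        "arguments": {"recipient": "Alice", "content": "Running late"},
    },
}
```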
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs in multi-turn dialogues
Assessing function-calling in real scenarios
Identifying parameter errors in mobile assistants
Innovation

Methods, ideas, or system contributions that make the work stand out.

Simulates diverse mobile assistant use cases
Leverages open-source models for data generation
Provides fine-grained interaction snapshots and metrics
🔎 Similar Papers
No similar papers found.
👥 Authors
Jun Wang
OPPO Research Institute
Jiamu Zhou
OPPO Research Institute
Muning Wen
Research Assistant Professor, Shanghai Jiao Tong University
(multi-agent) reinforcement learning, language agent/LLM-based agent
Xiaoyun Mo
OPPO Research Institute
Haoyu Zhang
OPPO Research Institute
Qiqiang Lin
OPPO Research Institute
Cheng Jin
OPPO Research Institute
Xihuai Wang
Shanghai Jiao Tong University
Reinforcement Learning, Multi-agent System, Language Agent
Weinan Zhang
Shanghai Jiao Tong University
Qiuying Peng
OPPO Research Institute
artificial intelligence