On the Robustness of Agentic Function Calling

📅 2025-04-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the insufficient robustness of function calling (FC) in large language models (LLMs) operating as autonomous agents. It systematically exposes performance degradation under natural query perturbations and toolset expansion. To this end, the authors introduce the first dual-dimensional benchmark for FC robustness, built upon an extended Berkeley Function Calling Leaderboard (BFCL) and incorporating two perturbation test sets: semantically similar tool injection and natural-language query variation. Experiments reveal that state-of-the-art FC models suffer an accuracy drop of over 35% under query variants and exhibit significantly increased mis-calling rates when the toolset expands, highlighting critical deployment risks. The analysis identifies a key limitation in existing evaluations: an overemphasis on accuracy at the expense of stability and consistency. The authors advocate a paradigm shift from "accuracy-centric" to "reliability-centric" evaluation, providing both theoretical grounding and an actionable benchmark for designing and deploying robust FC agents.

📝 Abstract
Large Language Models (LLMs) are increasingly acting as autonomous agents, with function calling (FC) capabilities enabling them to invoke specific tools for tasks. While prior research has primarily focused on improving FC accuracy, little attention has been given to the robustness of these agents to perturbations in their input. We introduce a benchmark assessing FC robustness in two key areas: resilience to naturalistic query variations, and stability in function calling when the toolkit expands with semantically related tools. Evaluating best-performing FC models on a carefully expanded subset of the Berkeley Function Calling Leaderboard (BFCL), we identify critical weaknesses in existing evaluation methodologies, and highlight areas for improvement in real-world agentic deployments.
Problem

Research questions and friction points this paper is trying to address.

Assessing robustness of LLM agents to input perturbations
Evaluating resilience to naturalistic query variations
Testing stability with semantically related toolkit expansions
Innovation

Methods, ideas, or system contributions that make the work stand out.

A dual-dimensional robustness benchmark built on an extended BFCL subset
Perturbation test sets for naturalistic query variation and semantically similar tool injection
A shift from accuracy-centric to reliability-centric evaluation of FC agents
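The two robustness dimensions above can be sketched as simple metrics: consistency of tool selection across paraphrased queries, and the mis-call rate once semantically similar distractor tools are injected. The sketch below is illustrative only; the router, tool names, and queries are hypothetical and not taken from the paper or the BFCL harness.

```python
def paraphrase_consistency(call_fn, query_variants, expected_tool):
    """Fraction of paraphrased queries for which the model still
    selects the expected tool (a stability-style metric)."""
    hits = sum(1 for q in query_variants if call_fn(q) == expected_tool)
    return hits / len(query_variants)

def miscall_rate(call_fn, queries, expected_tool):
    """Fraction of queries routed to a wrong tool, e.g. after the
    toolset is expanded with semantically similar distractors."""
    return sum(1 for q in queries if call_fn(q) != expected_tool) / len(queries)

# Toy keyword router standing in for an FC model. A near-duplicate
# distractor tool ("fetch_weather") steals queries mentioning "fetch",
# mimicking the mis-calls the paper reports under toolset expansion.
def toy_router(query):
    if "fetch" in query:
        return "fetch_weather"   # injected distractor tool (hypothetical)
    if "weather" in query:
        return "get_weather"
    return "unknown"

variants = [
    "what's the weather in Paris?",
    "weather forecast for Paris please",
    "fetch the weather for Paris",
]
print(paraphrase_consistency(toy_router, variants, "get_weather"))  # 2/3
print(miscall_rate(toy_router, variants, "get_weather"))            # 1/3
```

A real evaluation would replace `toy_router` with an actual FC model call and aggregate these metrics over the benchmark's perturbation test sets.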