🤖 AI Summary
This paper addresses the insufficient robustness of function calling (FC) in large language models (LLMs) operating as autonomous agents, systematically exposing performance degradation under natural query perturbations and toolset expansion. To this end, it introduces the first dual-dimensional benchmark for FC robustness, built on an extended Berkeley Function Calling Leaderboard (BFCL) that incorporates test sets for both semantically similar tool injection and natural-language query variation. Experiments reveal that state-of-the-art FC models suffer accuracy drops of over 35% under query variants and exhibit significantly higher mis-calling rates when the toolset expands, highlighting critical deployment risks. The analysis identifies a key limitation of existing evaluations: an overemphasis on accuracy that neglects stability and consistency. The authors advocate a shift from accuracy-centric to reliability-centric evaluation, providing both theoretical grounding and an actionable benchmark for designing and deploying robust FC agents.
📝 Abstract
Large Language Models (LLMs) are increasingly acting as autonomous agents, with function calling (FC) capabilities enabling them to invoke specific tools for tasks. While prior research has primarily focused on improving FC accuracy, little attention has been given to the robustness of these agents to perturbations in their input. We introduce a benchmark assessing FC robustness in two key areas: resilience to naturalistic query variations, and stability in function calling when the toolkit expands with semantically related tools. Evaluating the best-performing FC models on a carefully expanded subset of the Berkeley Function Calling Leaderboard (BFCL), we identify critical weaknesses in existing evaluation methodologies and highlight areas for improvement in real-world agentic deployments.
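The two robustness dimensions described above can be sketched as a small evaluation harness. The code below is a hypothetical illustration, not the paper's actual benchmark: `call_model` is a trivial keyword-matching stub standing in for a real FC model, and all tool names and queries are invented. It measures accuracy across paraphrased queries (dimension 1) and the mis-call rate once semantically related distractor tools are injected into the toolset (dimension 2).

```python
# Hypothetical sketch of the dual-dimensional FC robustness evaluation.
# `call_model` is a stub in place of a real function-calling model, so the
# harness runs end-to-end; its naive matching also exhibits the failure
# modes the benchmark is designed to expose.

def call_model(query: str, tools: list[str]) -> str:
    """Stub FC model: returns the first tool whose name prefix appears in
    the query, falling back to the first tool (a deliberate failure mode)."""
    for tool in tools:
        if tool.split("_")[0] in query.lower():
            return tool
    return tools[0]

def robustness_eval(cases, extra_tools):
    """Each case: (paraphrased queries, base toolset, expected tool call).
    Returns (accuracy over paraphrases, mis-call rate after injecting
    semantically similar distractor tools)."""
    correct = total = miscalls = 0
    for queries, tools, expected in cases:
        for q in queries:
            total += 1
            # Dimension 1: stability under natural query variation.
            if call_model(q, tools) == expected:
                correct += 1
            # Dimension 2: same query, expanded toolset with distractors.
            if call_model(q, tools + extra_tools) != expected:
                miscalls += 1
    return correct / total, miscalls / total

# Invented test cases: each query pair paraphrases the same intent.
cases = [
    (["book a flight to Paris", "I need plane tickets to Paris"],
     ["flight_search", "hotel_search"], "flight_search"),
    (["find a hotel in Rome", "somewhere to stay in Rome"],
     ["flight_search", "hotel_search"], "hotel_search"),
]
acc, miscall_rate = robustness_eval(cases, extra_tools=["stay_finder"])
print(acc, miscall_rate)  # accuracy drops on paraphrases; distractors add mis-calls
```

Even this toy stub shows the reported pattern: paraphrases that avoid a tool's keywords lower accuracy, and a semantically related distractor (`stay_finder`) captures a query its base tool should have handled.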