🤖 AI Summary
Problem: Existing studies lack a deep mechanistic understanding of how function calling (FC) enhances large language models' (LLMs') instruction following and safety.
Method: This paper introduces, for the first time, causal intervention techniques at both the layer level and the token level to systematically dissect FC's impact on internal representations and reasoning pathways. Experiments are conducted on four mainstream LLMs and two benchmark datasets.
Contribution/Results: We find that FC significantly strengthens neural activations associated with compliance in critical layers, thereby improving accuracy in intent interpretation and malicious input detection. Specifically, FC achieves an average performance gain of around 135% over conventional prompting methods on malicious input detection, substantially boosting LLM safety robustness. Our work establishes a novel paradigm for interpretable and controllable regulation of LLM behavior through fine-grained, causally grounded intervention.
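The layer-level causal intervention described above can be pictured as activation patching: cache a layer's activation from one run, substitute it into another run, and measure how the output changes. Below is a minimal toy sketch of that idea, using a two-layer tanh network as a stand-in for transformer layers; the setup and all names are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-layer network standing in for a stack of transformer layers.
W = [rng.normal(size=(4, 4)) for _ in range(2)]

def forward(x, patch_layer=None, patch_act=None):
    """Run the toy model, optionally overwriting one layer's
    activation with a cached one (the causal intervention)."""
    acts = []
    h = x
    for i, Wi in enumerate(W):
        h = np.tanh(h @ Wi)
        if i == patch_layer:
            h = patch_act          # intervene: replace the activation
        acts.append(h)
    return h, acts

x_clean = rng.normal(size=4)
x_corrupt = rng.normal(size=4)

_, clean_acts = forward(x_clean)    # cache activations from the clean run
base_out, _ = forward(x_corrupt)    # baseline run on the corrupted input
patched_out, _ = forward(x_corrupt, patch_layer=0,
                         patch_act=clean_acts[0])

# The size of the output shift attributable to that layer's activation.
effect = np.linalg.norm(patched_out - base_out)
print(effect)
```

A token-level variant would patch the activation at a single token position rather than the whole layer; comparing effect sizes across layers or positions localizes where FC reshapes the model's computation.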
📝 Abstract
Function calling (FC) has emerged as a powerful technique for enabling large language models (LLMs) to interact with external systems and perform structured tasks. However, the mechanisms through which it influences model behavior remain largely under-explored. Moreover, we discover that beyond its regular usage, FC can substantially enhance LLMs' compliance with user instructions. These observations motivate us to leverage causality, a canonical analysis method, to investigate how FC works within LLMs. In particular, we conduct layer-level and token-level causal interventions to dissect FC's impact on the model's internal computational logic when responding to user queries. Our analysis confirms the substantial influence of FC and reveals several in-depth insights into its mechanisms. To further validate our findings, we conduct extensive experiments comparing the effectiveness of FC-based instructions against conventional prompting methods. We focus on enhancing LLM safety robustness, a critical LLM application scenario, and evaluate four mainstream LLMs across two benchmark datasets. The results are striking: FC yields an average performance improvement of around 135% over conventional prompting methods in detecting malicious inputs, demonstrating its promising potential to enhance LLM reliability and capability in practical applications.