Small Models, Big Tasks: An Exploratory Empirical Study on Small Language Models for Function Calling

📅 2025-04-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work systematically evaluates the effectiveness and robustness of small language models (SLMs) for function-calling tasks on resource-constrained edge devices. We assess zero-shot, few-shot, and supervised fine-tuning paradigms, combining prompt engineering with ARM-based on-device deployment, and quantify performance along four dimensions: output-format compliance, semantic accuracy, inference latency (<300 ms), and memory footprint (<1 GB). Our study reveals a notable trade-off: SLMs adhere weakly to output-formatting specifications yet remain largely resilient to prompt-injection attacks. Empirical results show that fine-tuning substantially improves semantic accuracy, while structural output compliance remains a key bottleneck. We open-source several lightweight fine-tuned models, establishing a methodological framework and an empirical benchmark for practical SLM deployment in edge-intelligence scenarios.

📝 Abstract
Function calling is a complex task with widespread applications in domains such as information retrieval, software engineering, and automation. For example, a query to book the shortest flight from New York to London on January 15 requires identifying the correct parameters to generate accurate function calls. Large Language Models (LLMs) can automate this process but are computationally expensive and impractical in resource-constrained settings. In contrast, Small Language Models (SLMs) can operate efficiently, offering faster response times and lower computational demands, making them potential candidates for function calling on edge devices. In this exploratory empirical study, we evaluate the efficacy of SLMs in generating function calls across diverse domains using zero-shot, few-shot, and fine-tuning approaches, both with and without prompt injection, while also providing the fine-tuned models to facilitate future applications. Furthermore, we analyze the model responses across a range of metrics, capturing various aspects of function call generation. Additionally, we perform experiments on an edge device to evaluate their performance in terms of latency and memory usage, providing useful insights into their practical applicability. Our findings show that while SLMs improve from zero-shot to few-shot and perform best with fine-tuning, they struggle significantly with adhering to the given output format. Prompt injection experiments further indicate that the models are generally robust and exhibit only a slight decline in performance. While SLMs demonstrate potential for the function call generation task, our results also highlight areas that need further refinement for real-time functioning.
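The flight-booking query in the abstract can be made concrete as a function-call generation target, together with the kind of structural check that "output format compliance" implies. This is a minimal sketch: the `book_flight` tool name, its parameter names, and the JSON call format are hypothetical stand-ins, not the paper's actual evaluation schema.

```python
import json

# Hypothetical tool schema for the flight-booking example in the abstract.
BOOK_FLIGHT_SCHEMA = {
    "name": "book_flight",
    "parameters": ["origin", "destination", "date", "sort_by"],
}

def check_format(raw: str) -> bool:
    """Structural compliance check: the model output must be valid JSON,
    name the expected function, and supply exactly the expected argument keys."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return False  # common SLM failure mode: free text instead of JSON
    if call.get("name") != BOOK_FLIGHT_SCHEMA["name"]:
        return False
    return set(call.get("arguments", {})) == set(BOOK_FLIGHT_SCHEMA["parameters"])

# A well-formed call for "book the shortest flight from New York to London on January 15":
good = json.dumps({
    "name": "book_flight",
    "arguments": {"origin": "New York", "destination": "London",
                  "date": "2025-01-15", "sort_by": "duration"},
})
print(check_format(good))                      # True: schema-compliant JSON
print(check_format("book_flight(NYC, LON)"))   # False: not JSON at all
```

A check like this only measures structure; the paper's semantic-accuracy metrics would additionally compare argument values against the query's intent.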
Problem

Research questions and friction points this paper is trying to address.

Evaluating the efficacy of SLMs for function-call generation
Comparing SLM performance under zero-shot, few-shot, and fine-tuning approaches
Assessing SLM practicality on edge devices via latency and memory metrics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-metric evaluation of SLM-generated function calls across diverse domains
Release of lightweight fine-tuned SLMs for function calling
Prompt-injection robustness analysis plus on-device latency and memory measurements
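The on-device evaluation described above can be approximated with standard-library instrumentation. This is a hedged sketch, not the paper's harness: `fake_slm_generate` is a placeholder for a real SLM runner, and `tracemalloc` only traces Python-level allocations, so native model weights would need an OS-level measure (e.g. process RSS) instead.

```python
import time
import tracemalloc

def fake_slm_generate(prompt: str) -> str:
    """Placeholder for an SLM inference call; swap in a real model runner."""
    return '{"name": "book_flight", "arguments": {}}'

def measure(prompt: str):
    """Measure wall-clock latency (ms) and peak traced memory (bytes)
    for a single generation. Real deployments should also record process
    RSS, since tracemalloc misses native allocations like model weights."""
    tracemalloc.start()
    t0 = time.perf_counter()
    output = fake_slm_generate(prompt)
    latency_ms = (time.perf_counter() - t0) * 1000
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return output, latency_ms, peak_bytes

out, latency_ms, peak = measure(
    "Book the shortest flight from New York to London on January 15"
)
print(f"latency: {latency_ms:.3f} ms, peak traced memory: {peak} bytes")
```

Averaging over repeated prompts, as benchmark practice suggests, would smooth out timer jitter and cold-start effects.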