TinyLLM: Evaluation and Optimization of Small Language Models for Agentic Tasks on Edge Devices

📅 2025-11-27
🤖 AI Summary
Small language models (SLMs, 1–3B parameters) struggle with accuracy and stability in multi-turn, complex agent tasks—such as tool and API invocation—when deployed standalone on edge devices, limiting their utility without cloud infrastructure. Method: We propose a hybrid optimization framework integrating supervised fine-tuning (SFT), parameter-efficient fine-tuning (PEFT), reinforcement learning, and direct preference optimization (DPO). Leveraging AgentBank, we construct a high-quality DPO dataset; rejected samples are generated by TinyLlama and rigorously validated by human annotators. Contribution/Results: Our optimized SLM achieves 65.74% overall accuracy and 55.62% multi-turn task accuracy on the BFCL benchmark—substantially outperforming baselines. This work provides the first systematic empirical validation of lightweight, autonomous agents operating effectively on resource-constrained edge devices, demonstrating both practical feasibility and performance efficacy without cloud dependency.

📝 Abstract
This paper investigates the effectiveness of small language models (SLMs) for agentic tasks (function/tool/API calling) with a focus on running agents on edge devices without reliance on cloud infrastructure. We evaluate SLMs using the Berkeley Function Calling Leaderboard (BFCL) framework and describe parameter-driven optimization strategies that include supervised fine-tuning (SFT), parameter-efficient fine-tuning (PEFT), reinforcement learning (RL)-based optimization, preference alignment via Direct Preference Optimization (DPO), and hybrid methods. We report results for models including TinyAgent, TinyLlama, Qwen, and xLAM across BFCL categories (simple, multiple, parallel, parallel-multiple, and relevance detection), both in live and non-live settings, and in multi-turn evaluations. We additionally detail a DPO training pipeline constructed from AgentBank data (e.g., ALFRED), including our conversion of SFT data to chosen-rejected pairs using TinyLlama responses as rejected outputs and manual validation. Our results demonstrate clear accuracy differences across model scales where medium-sized models (1-3B parameters) significantly outperform ultra-compact models (<1B parameters), achieving up to 65.74% overall accuracy, and 55.62% multi-turn accuracy with hybrid optimization. This study highlights the importance of hybrid optimization strategies that enable small language models to deliver accurate, efficient, and stable agentic AI on edge devices, making privacy-preserving, low-latency autonomous agents practical beyond the cloud.
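The abstract's DPO pipeline converts SFT data into chosen-rejected preference pairs, using the ground-truth SFT response as "chosen" and a weaker model's (TinyLlama's) output as "rejected", followed by manual validation. A minimal sketch of that conversion step is shown below; the function and field names are illustrative, not the paper's actual implementation.

```python
# Hypothetical sketch of the SFT-to-DPO conversion described in the abstract:
# gold SFT responses become "chosen", a weaker model's generations become
# "rejected". The paper additionally validates the resulting pairs manually.

def build_dpo_pairs(sft_examples, rejected_generator):
    """sft_examples: list of {"prompt": ..., "response": ...} dicts.
    rejected_generator: callable mapping a prompt to a candidate
    rejected response (e.g., a TinyLlama completion)."""
    pairs = []
    for ex in sft_examples:
        rejected = rejected_generator(ex["prompt"])
        # Drop degenerate pairs where the weak model reproduces the gold
        # answer; such pairs carry no preference signal for DPO.
        if rejected.strip() == ex["response"].strip():
            continue
        pairs.append({
            "prompt": ex["prompt"],
            "chosen": ex["response"],
            "rejected": rejected,
        })
    return pairs
```

The resulting `prompt`/`chosen`/`rejected` triples match the dataset format commonly expected by DPO training code.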
Problem

Research questions and friction points this paper is trying to address.

Evaluating small language models for agentic tasks on edge devices
Optimizing models via fine-tuning and hybrid methods for accuracy
Enabling privacy-preserving, low-latency autonomous agents without cloud reliance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid optimization strategies for small language models
Parameter-driven fine-tuning including SFT, PEFT, and DPO
Evaluation on edge devices using BFCL framework
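Among the optimization strategies listed above, preference alignment via DPO trains the policy to widen its log-probability margin on chosen over rejected responses relative to a frozen reference model. A minimal per-example sketch of the standard DPO objective (not code from the paper) is:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard per-example DPO loss:
    -log sigmoid(beta * ((logpi_c - logref_c) - (logpi_l - logref_l))).
    beta controls how strongly the policy may deviate from the reference."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

When the policy matches the reference, the margin is zero and the loss equals log 2; the loss falls as the policy assigns relatively more probability to the chosen response.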
👥 Authors
Mohd Ariful Haque
Research Assistant
LLM, Deep Learning, Machine Learning, Computer Security, Database Design
Fahad Rahman
Department of Computer Science and Engineering, United International University, Bangladesh
Kishor Datta Gupta
Assistant Professor of Computer Science, Clark Atlanta University | Senior Member, IEEE
Physics Guided Neural Network, Physics Informed Machine Learning, Context-Aware Machine Learning
Khalil Shujaee
Department of Cyber-Physical Systems, Clark Atlanta University, USA
Roy George
Department of Cyber-Physical Systems, Clark Atlanta University, USA