AI Summary
Local large language models (LLMs) face two key bottlenecks in on-device tool calling: accurate tool selection from large-scale tool inventories and precise generation of complex, structured parameters. To address these challenges, we propose the first decoupled training and inference framework that separately models tool selection and parameter generation. Specifically, we design dedicated LoRA adapters for each subtask, optimized under task-specific loss masking. During inference, adapters are dynamically loaded, and a hierarchical scheduling mechanism prunes the candidate tool set to improve efficiency. This approach significantly enhances the tool-calling capability of compact models under resource-constrained edge-device conditions. On the MCP-Bench benchmark, our method boosts tool-call accuracy by 46% for Qwen-2.5-7B, outperforming all comparable 7B baselines and surpassing many 14B models across diverse scenarios.
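The task-specific loss masking described above can be illustrated with a minimal sketch. The span labels, the `IGNORE` convention, and the token layout below are illustrative assumptions, not the paper's actual data format; the idea is simply that each adapter's training loss is computed only on the tokens belonging to its subtask.

```python
# Minimal sketch of per-subtask loss masking (hypothetical span labels,
# not the paper's exact training format).
IGNORE = -100  # common cross-entropy convention for "skip this token"

def mask_labels(token_ids, spans, subtask):
    """Keep the loss only on tokens belonging to `subtask`.

    `spans[i]` labels token i as 'selection' (tool-name tokens),
    'arguments' (parameter tokens), or 'context' (everything else).
    """
    return [tid if spans[i] == subtask else IGNORE
            for i, tid in enumerate(token_ids)]

# Toy sequence: prompt context, one tool-name token, two argument tokens.
tokens = [101, 7, 42, 9, 13, 102]
spans  = ['context', 'context', 'selection', 'arguments', 'arguments', 'context']

sel_labels = mask_labels(tokens, spans, 'selection')   # trains the tool-selection adapter
arg_labels = mask_labels(tokens, spans, 'arguments')   # trains the argument-generation adapter
```

With these labels, a standard cross-entropy loss backpropagates through only one subtask's tokens per adapter, which is what decouples the two objectives.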
Abstract
The deployment of Large Language Models (LLMs) as agentic orchestrators has revolutionized task automation, but the need for privacy-preserving, cost-effective solutions demands on-device inference capabilities. However, local LLMs consistently underperform frontier models in tool-calling scenarios, struggling with both tool selection from large tool sets and accurate argument generation for complex parameter structures. We introduce a methodology that disaggregates a tool-calling task into two distinct subtasks: tool selection and argument generation. We propose "decoupled fine-tuning", a novel post-training approach that uses LoRA fine-tuning to create dedicated LoRA adapters for tool selection and tool-specific argument generation, applying separate loss masking to each subtask. Furthermore, we present DualTune, an inference framework that leverages the LoRA adapters created by decoupled fine-tuning to perform efficient agent orchestration with local models on end-user devices. DualTune decomposes the tool-call generation step into tool selection and argument generation, dynamically loading the corresponding LoRA adapters to generate tool calls. Additionally, DualTune implements hierarchical orchestration to restrict the number of tools presented during tool selection. Our experiments on the MCP-Bench benchmark demonstrate that a Qwen-2.5-7B model trained with decoupled fine-tuning improves the tool-calling accuracy of the base model by 46%, outperforms other local reasoning, non-reasoning, and fine-tuned models of similar size in all cases, and outperforms models twice its size in most cases.
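The two-stage inference flow can be sketched as follows. This is a hypothetical outline under stated assumptions: the adapter names, the `model` interface, and the category-based tool index are invented for illustration and are not DualTune's actual API; the lexical relevance score stands in for whatever pruning signal the real system uses.

```python
# Hypothetical sketch of DualTune-style two-stage inference:
# hierarchical pruning, then tool selection, then argument generation,
# each stage under its own dynamically loaded LoRA adapter.

def overlap(query, text):
    # Crude lexical relevance score, used only to keep this sketch self-contained.
    return len(set(query.lower().split()) & set(text.lower().split()))

def prune_tools(query, tool_index):
    """Hierarchical step: pick the most relevant category, then expose
    only that category's tools to the selection stage."""
    category = max(tool_index, key=lambda c: overlap(query, c))
    return tool_index[category]

def generate_tool_call(query, tool_index, model):
    candidates = prune_tools(query, tool_index)     # restrict the candidate set
    model.load_adapter("tool-selection")            # adapter for subtask 1
    tool = model.select(query, candidates)
    model.load_adapter(f"args:{tool}")              # tool-specific adapter for subtask 2
    args = model.generate_args(query, tool)
    return {"tool": tool, "arguments": args}

# Pruning in isolation: a web-related query keeps only the web-search tools.
tool_index = {
    "file operations": ["read_file", "write_file"],
    "web search": ["search"],
}
pruned = prune_tools("search the web for recent news", tool_index)
```

Swapping adapters between the two stages is what lets one compact base model specialize twice per call, while the pruning step keeps the selection prompt short even when the full tool inventory is large.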