Division-of-Thoughts: Harnessing Hybrid Language Model Synergy for Efficient On-Device Agents

📅 2025-02-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of efficiently deploying large language models (LLMs) on resource-constrained edge devices, this paper proposes a three-tier collaborative framework: task decomposition, dependency scheduling, and adaptive allocation. Methodologically, it introduces a lightweight task decomposer and a sub-task dependency graph scheduler, attaches a plug-and-play adapter that leaves the local model's parameters frozen, and proposes a self-reinforced training mechanism driven solely by task execution feedback, enabling dynamic collaborative inference between local small models and cloud-based LLMs. Experiments demonstrate a 66.12% average reduction in reasoning time, an 83.57% decrease in API invocation cost, and reasoning accuracy comparable to the best baselines. Key contributions include: (1) a three-tier collaborative paradigm for edge-cloud LLM orchestration; (2) a label-free, execution-driven self-reinforced training method; and (3) a lightweight adapter interface that adds task-allocation capability without fine-tuning the SLM's parameters.
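
As a concrete illustration of the first stage, the sketch below shows how a decomposition prompt to the local model might be issued and parsed. The prompt wording and the `ask_slm` helper are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch of the task-decomposition step: prompt a local language
# model to split a query into sub-tasks, then parse the numbered list.
DECOMPOSE_PROMPT = (
    "Break the following task into the smallest self-contained sub-tasks, "
    "one per line, numbered:\n\n{query}"
)

def ask_slm(prompt):
    # Stub standing in for a real local SLM inference call.
    return "1. Find the release year\n2. Find the director\n3. Combine both facts"

def decompose(query):
    reply = ask_slm(DECOMPOSE_PROMPT.format(query=query))
    # Keep the text after each "N. " prefix as one sub-task.
    return [line.split(". ", 1)[1] for line in reply.splitlines() if ". " in line]

print(decompose("Who directed the movie released the same year as X?"))
```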

📝 Abstract
The rapid expansion of web content has made on-device AI assistants indispensable for helping users manage the increasing complexity of online tasks. The emergent reasoning ability of large language models offers a promising path for next-generation on-device AI agents. However, deploying full-scale Large Language Models (LLMs) on resource-limited local devices is challenging. In this paper, we propose Division-of-Thoughts (DoT), a collaborative reasoning framework leveraging the synergy between locally deployed Smaller-scale Language Models (SLMs) and cloud-based LLMs. DoT leverages a Task Decomposer to elicit the inherent planning abilities in language models, decomposing user queries into smaller sub-tasks so that the hybrid language models can fully exploit their respective strengths. In addition, DoT employs a Task Scheduler to analyze the pairwise dependencies between sub-tasks and create a dependency graph, facilitating parallel reasoning over sub-tasks and the identification of key steps. To allocate the appropriate model based on the difficulty of each sub-task, DoT leverages a Plug-and-Play Adapter, an additional task head attached to the SLM that does not alter the SLM's parameters. To boost the adapter's task-allocation capability, we propose a self-reinforced training method that relies solely on task execution feedback. Extensive experiments on various benchmarks demonstrate that DoT significantly reduces LLM costs while maintaining competitive reasoning accuracy. Specifically, DoT reduces average reasoning time and API costs by 66.12% and 83.57%, respectively, while achieving reasoning accuracy comparable to the best baseline methods.
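
To make the Task Scheduler idea concrete, here is a minimal runnable sketch that walks a sub-task dependency graph level by level, running independent sub-tasks in parallel. The example dependencies and the `solve` stub are illustrative assumptions, not the paper's code.

```python
# Given pairwise dependencies between sub-tasks, solve all sub-tasks whose
# prerequisites are satisfied in parallel, then move to the next level.
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter

deps = {                       # sub-task -> set of prerequisite sub-tasks
    "parse_question": set(),
    "look_up_fact_a": {"parse_question"},
    "look_up_fact_b": {"parse_question"},
    "combine_answer": {"look_up_fact_a", "look_up_fact_b"},
}

def solve(task):               # stand-in for an SLM/LLM call on one sub-task
    return f"result({task})"

ts = TopologicalSorter(deps)
ts.prepare()
results = {}
with ThreadPoolExecutor() as pool:
    while ts.is_active():
        ready = list(ts.get_ready())          # sub-tasks with no open deps
        for task, out in zip(ready, pool.map(solve, ready)):
            results[task] = out
            ts.done(task)                     # unlock dependent sub-tasks
print(results["combine_answer"])
```

The two fact lookups run concurrently because they only depend on the parsed question, which mirrors the parallelism the dependency graph is meant to expose.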
Problem

Research questions and friction points this paper is trying to address.

Efficient deployment of LLM-based agents on resource-limited local devices
Collaborative reasoning between local SLMs and cloud LLMs
Reducing reasoning latency and API costs while maintaining accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid SLM-LLM collaborative reasoning (Division-of-Thoughts)
Task Decomposer that splits queries into sub-tasks, scheduled via a dependency graph
Plug-and-Play Adapter for sub-task allocation, trained from execution feedback (see the sketch after this list)
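
The sketch below shows one plausible reading of the adapter: a small trainable head on top of a frozen SLM backbone that scores sub-task difficulty and routes hard sub-tasks to the cloud LLM. The dimensions, the embedding stand-in for the SLM, and the 0.5 routing threshold are illustrative assumptions.

```python
# A small allocation head attached to a frozen SLM: only the head trains,
# the backbone's parameters stay untouched, matching the abstract's claim.
import torch
import torch.nn as nn

class AllocationAdapter(nn.Module):
    def __init__(self, hidden_dim=768):
        super().__init__()
        self.head = nn.Sequential(            # the only trainable parameters
            nn.Linear(hidden_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, slm_hidden):            # (batch, hidden_dim) pooled states
        return torch.sigmoid(self.head(slm_hidden)).squeeze(-1)

slm_backbone = nn.Embedding(32000, 768)       # stand-in for the frozen SLM
for p in slm_backbone.parameters():
    p.requires_grad = False                   # SLM weights are never altered

adapter = AllocationAdapter()
tokens = torch.randint(0, 32000, (2, 16))     # two tokenized sub-tasks
pooled = slm_backbone(tokens).mean(dim=1)     # mean-pool token states
route_to_llm = adapter(pooled) > 0.5          # True -> send sub-task to cloud LLM
```

Under the paper's self-reinforced training, a head like this would presumably be updated from execution feedback alone, e.g. treating whether a sub-task succeeded under a given routing choice as a binary training label, with no human-annotated difficulty labels.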