🤖 AI Summary
This work addresses the tension between high cloud inference costs and limited on-device capabilities in long-document reasoning tasks (e.g., in finance, healthcare, and scientific domains) under edge-cloud collaboration. We propose MinionS, a lightweight protocol and hybrid inference framework in which a large cloud model dynamically decomposes the task and partitions the context, without any fine-tuning, and an efficient on-device model then executes the resulting subtasks in parallel. This design sidesteps the on-device model's weaknesses in instruction following and long-context processing. Experiments show that MinionS reduces cloud inference cost by 5.7× relative to cloud-only inference while retaining 97.9% of its performance, and improves accuracy by 10.5 percentage points over a baseline collaborative method. Our core contribution is the first zero-training, low-overhead, high-performance mechanism for dynamic edge-cloud task decomposition and coordinated scheduling.
📝 Abstract
We investigate an emerging setup in which a small, on-device language model (LM) with access to local data communicates with a frontier, cloud-hosted LM to solve real-world tasks involving financial, medical, and scientific reasoning over long documents. Can a local-remote collaboration reduce cloud inference costs while preserving quality? First, we consider a naive collaboration protocol where the local and remote models simply chat back and forth. Because only the local model reads the full context, this protocol achieves a 30.4x reduction in remote costs, but recovers only 87% of the performance of the frontier model. We identify two key limitations of this protocol: the local model struggles to (1) follow the remote model's multi-step instructions and (2) reason over long contexts. Motivated by these observations, we study an extension of this protocol, coined MinionS, in which the remote model decomposes the task into easier subtasks over shorter chunks of the document, which are then executed in parallel by the local model. MinionS reduces costs by 5.7x on average while recovering 97.9% of the performance of the remote model alone. Our analysis reveals several key design choices that influence the trade-off between cost and performance in local-remote systems.
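The decompose-execute-aggregate loop described above can be sketched in a few lines. This is a minimal, hypothetical illustration, not the paper's implementation: `remote_llm` and `local_llm` stand in for whatever cloud and on-device model calls you have, and the fixed-size character chunking is an assumed placeholder for the paper's context-partitioning step.

```python
from concurrent.futures import ThreadPoolExecutor

def chunk_document(doc: str, chunk_size: int = 200) -> list[str]:
    # Assumed chunking strategy: fixed-size character windows.
    return [doc[i:i + chunk_size] for i in range(0, len(doc), chunk_size)]

def minions_round(task, document, remote_llm, local_llm, max_workers=8):
    """One round of a MinionS-style protocol:
    1. the remote model writes a per-chunk subtask (it never sees the document),
    2. the local model answers the subtask on each chunk in parallel,
    3. the remote model aggregates the short local answers."""
    chunks = chunk_document(document)
    subtask = remote_llm(f"Decompose into a single-chunk instruction: {task}")
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        answers = list(pool.map(lambda chunk: local_llm(subtask, chunk), chunks))
    # Only the task and the short chunk-level answers reach the cloud,
    # which is where the remote-token savings come from.
    return remote_llm(f"Task: {task}\nChunk answers: {answers}\nFinal answer:")
```

The key cost lever is that remote tokens scale with the number of chunk answers, not with the document length; the parallel local calls trade extra on-device compute for that reduction.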