Local-Cloud Inference Offloading for LLMs in Multi-Modal, Multi-Task, Multi-Dialogue Settings

📅 2025-02-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the trade-off between the resource constraints of on-device LLM deployment and the high latency/cost of cloud-only execution in multi-modal, multi-task, multi-turn dialogue scenarios, this paper proposes a local-cloud collaborative inference offloading framework. The method introduces (1) a Resource-Constrained Reinforcement Learning (RCRL)-driven dynamic offloading policy that jointly optimizes execution location, modality selection, and task routing; (2) M4A1, a new benchmark dataset covering multi-modal, multi-task, multi-dialogue, and multi-LLM characteristics; and (3) a lightweight local model coupled with a large-scale cloud model, enabling multi-modal input fusion and fine-grained task scheduling. Experiments on realistic multi-turn dialogues demonstrate significant reductions in end-to-end latency and cloud invocation cost while preserving response quality, validating both the effectiveness and practicality of the approach.

📝 Abstract
Compared to traditional machine learning models, recent large language models (LLMs) can exhibit multi-task-solving capabilities through multiple dialogues and multi-modal data sources. These unique characteristics of LLMs, beyond their large size, make their deployment more challenging during the inference stage. Specifically, (i) deploying LLMs on local devices faces computational, memory, and energy resource issues, while (ii) deploying them in the cloud cannot guarantee real-time service and incurs communication/usage costs. In this paper, we design a local-cloud LLM inference offloading (LCIO) system, featuring (i) a large-scale cloud LLM that can handle multi-modal data sources and (ii) a lightweight local LLM that can process simple tasks at high speed. LCIO employs resource-constrained reinforcement learning (RCRL) to determine where to make the inference (i.e., local vs. cloud) and which multi-modal data sources to use for each dialogue/task, aiming to maximize the long-term reward (which incorporates response quality, latency, and usage cost) while adhering to resource constraints. We also propose M4A1, a new dataset that accounts for multi-modal, multi-task, multi-dialogue, and multi-LLM characteristics, to investigate the capabilities of LLMs in various practical scenarios. We demonstrate the effectiveness of LCIO compared to baselines, showing significant savings in latency and cost while achieving satisfactory response quality.
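The abstract describes the core decision RCRL makes: pick an execution location (local vs. cloud) and a modality subset per dialogue/task so as to maximize a reward combining response quality, latency, and usage cost under resource constraints. As a rough illustration (not the paper's actual learned policy), the constrained objective can be folded into a single Lagrangian-penalized reward and maximized over candidate actions; the quality/latency/cost numbers and the `estimate` model below are invented placeholders.

```python
from dataclasses import dataclass
from itertools import combinations

@dataclass(frozen=True)
class Action:
    location: str       # "local" or "cloud"
    extras: tuple       # optional modalities beyond text, e.g. ("image", "imu")

def estimate(action):
    """Toy quality/latency/cost model (illustrative numbers only).

    Cloud inference is assumed higher-quality but slower and metered;
    each extra modality adds quality but also latency and cloud cost.
    """
    quality = (0.6 if action.location == "local" else 0.9) + 0.05 * len(action.extras)
    latency = (0.1 if action.location == "local" else 0.8) + 0.2 * len(action.extras)
    cost = 0.0 if action.location == "local" else 0.01 * (1 + len(action.extras))
    return quality, latency, cost

def penalized_reward(action, lam_latency=0.5, lam_cost=10.0):
    """Quality minus Lagrangian penalties for latency and usage cost --
    a standard way to fold resource constraints into an RL objective."""
    q, lat, c = estimate(action)
    return q - lam_latency * lat - lam_cost * c

def best_action(optional_modalities=("image", "imu")):
    """Enumerate (location, modality subset) actions and pick the best.

    An RCRL agent would learn this mapping from dialogue state instead of
    enumerating; enumeration is only feasible in this tiny toy setting.
    """
    candidates = [
        Action(loc, subset)
        for loc in ("local", "cloud")
        for r in range(len(optional_modalities) + 1)
        for subset in combinations(optional_modalities, r)
    ]
    return max(candidates, key=penalized_reward)
```

With these placeholder numbers, the policy keeps a simple text-only query on the local model; raising the assumed local/cloud quality gap or lowering the latency and cost multipliers tips the decision toward the cloud, mirroring the trade-off the paper optimizes.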
Problem

Research questions and friction points this paper is trying to address.

Optimizes local-cloud inference offloading for LLMs
Balances response quality, latency, and usage costs
Addresses resource constraints in multi-modal, multi-task settings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Local-cloud LLM inference offloading (LCIO) system
Resource-constrained reinforcement learning (RCRL) for offloading decisions
M4A1 dataset for multi-modal, multi-task, multi-dialogue LLM evaluation