On-device System of Compositional Multi-tasking in Large Language Models

📅 2025-10-11

📈 Citations: 0

✨ Influential: 0

career value

171K/year

🤖 AI Summary

To address the inefficiency of deploying large language models (LLMs) for composite multi-task inference—such as simultaneous summarization and translation—on resource-constrained edge devices, this paper proposes a lightweight collaborative inference architecture. The core method introduces a learnable projection layer that fuses multiple LoRA adapters, enabling joint modeling of multi-task features and parameter sharing within a single forward pass—thereby eliminating sequential task execution and redundant retraining. This design significantly reduces computational overhead: on Android devices, it achieves a 42% reduction in inference latency and a 38% decrease in memory footprint compared to baseline approaches, while preserving full accuracy on joint translation-summarization tasks. The system supports cloud-based training and on-device deployment, offering an efficient, practical paradigm for composite-task inference under strict resource constraints.

Technology Category

Application Category

📝 Abstract

Large language models (LLMs) are commonly adapted for diverse downstream tasks via parameter-efficient fine-tuning techniques such as Low-Rank Adapters (LoRA). While adapters can be combined to handle multiple tasks separately, standard approaches struggle when targeting the simultaneous execution of complex tasks, such as generating a translated summary from a long conversation. To address this challenge, we propose a novel approach tailored specifically for compositional multi-tasking scenarios involving summarization and translation. Our technique involves adding a learnable projection layer on top of the combined summarization and translation adapters. This design enables effective integration while maintaining efficiency through reduced computational overhead compared to alternative strategies requiring extensive retraining or sequential processing. We demonstrate the practical viability of our method within an on-device environment by developing an Android app capable of executing compositional tasks seamlessly. Experimental results indicate our solution performs well and is fast in both cloud-based and on-device implementations, highlighting the potential benefits of adopting our framework in real-world applications demanding high-speed operation alongside resource constraints.

Problem

Research questions and friction points this paper is trying to address.

Enabling simultaneous execution of complex tasks

Integrating summarization and translation adapters efficiently

Optimizing on-device performance for resource-constrained environments

Innovation

Methods, ideas, or system contributions that make the work stand out.

Adds learnable projection layer on combined adapters

Enables efficient compositional multi-tasking integration

Optimizes for on-device deployment with reduced overhead

🔎 Similar Papers

Chrono: A Simple Blueprint for Representing Time in MLLMs