🤖 AI Summary
To address the inefficiency of deploying large language models (LLMs) for composite multi-task inference—such as simultaneous summarization and translation—on resource-constrained edge devices, this paper proposes a lightweight collaborative inference architecture. The core method introduces a learnable projection layer that fuses multiple LoRA adapters, enabling joint modeling of multi-task features and parameter sharing within a single forward pass—thereby eliminating sequential task execution and redundant retraining. This design significantly reduces computational overhead: on Android devices, it achieves a 42% reduction in inference latency and a 38% decrease in memory footprint compared to baseline approaches, while preserving full accuracy on joint translation-summarization tasks. The system supports cloud-based training and on-device deployment, offering an efficient, practical paradigm for composite-task inference under strict resource constraints.
📝 Abstract
Large language models (LLMs) are commonly adapted for diverse downstream tasks via parameter-efficient fine-tuning techniques such as Low-Rank Adapters (LoRA). While adapters can be combined to handle multiple tasks separately, standard approaches struggle when targeting the simultaneous execution of complex tasks, such as generating a translated summary from a long conversation. To address this challenge, we propose a novel approach tailored specifically for compositional multi-tasking scenarios involving summarization and translation. Our technique involves adding a learnable projection layer on top of the combined summarization and translation adapters. This design enables effective integration while maintaining efficiency through reduced computational overhead compared to alternative strategies requiring extensive retraining or sequential processing. We demonstrate the practical viability of our method within an on-device environment by developing an Android app capable of executing compositional tasks seamlessly. Experimental results indicate our solution performs well and is fast in both cloud-based and on-device implementations, highlighting the potential benefits of adopting our framework in real-world applications demanding high-speed operation alongside resource constraints.