CIFLEX: Contextual Instruction Flow for Sub-task Execution in Multi-Turn Interactions with a Single On-Device LLM

📅 2025-09-23
🤖 AI Summary
To address the high computational overhead incurred when a single on-device large language model (LLM) repeatedly reprocesses the full dialogue history for multi-turn sub-tasks (e.g., query rewriting, summarization), this paper proposes CIFLEX, a lightweight framework for efficient multi-task conversational inference on-device. Methodologically, CIFLEX introduces: (1) a contextual instruction flow mechanism that reuses the main task's key-value (KV) cache; (2) injection of task-specific instructions into isolated side paths; (3) a lightweight hierarchical binary-classification strategy for dynamic sub-task selection; and (4) context rollback to keep the main-path state consistent. Experiments show that CIFLEX significantly reduces inference latency and memory footprint without degrading main- or sub-task performance, enabling scalable, low-overhead multi-task dialogue on-device.

📝 Abstract
We present CIFLEX (Contextual Instruction Flow for Sub-task Execution), a novel execution system for efficient sub-task handling in multi-turn interactions with a single on-device large language model (LLM). As LLMs become increasingly capable, a single model is expected to handle diverse sub-tasks that support answering user requests more effectively and comprehensively. A naive approach reprocesses the entire conversation context when switching between the main task and sub-tasks (e.g., query rewriting, summarization), incurring significant computational overhead. CIFLEX mitigates this overhead by reusing the key-value (KV) cache from the main task and injecting only task-specific instructions into isolated side paths. After sub-task execution, the model rolls back to the main path via the cached context, thereby avoiding redundant prefill computation. To support sub-task selection, we also develop a hierarchical classification strategy tailored for small-scale models, decomposing multi-choice decisions into binary ones. Experiments show that CIFLEX significantly reduces computational costs without degrading task performance, enabling scalable and efficient multi-task dialogue on-device.
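The hierarchical classification strategy in the abstract can be illustrated with a small sketch: instead of asking a small model one multi-way question ("which sub-task applies?"), the decision is decomposed into a cascade of binary checks. The decision functions below are hypothetical keyword-based stand-ins for the paper's binary classifier calls, and the sub-task names are illustrative.

```python
# Sketch: decomposing multi-choice sub-task selection into binary decisions.
# The predicates are hypothetical stand-ins for small-model binary classifiers.

def needs_sub_task(turn: str) -> bool:
    # Binary decision 1: does this turn require any sub-task at all?
    return "summarize" in turn or "rewrite" in turn

def is_summarization(turn: str) -> bool:
    # Binary decision 2: summarization vs. query rewriting.
    return "summarize" in turn

def select_sub_task(turn: str) -> str:
    """Walk a small binary decision tree instead of one multi-way choice."""
    if not needs_sub_task(turn):
        return "main_task_only"
    if is_summarization(turn):
        return "summarization"
    return "query_rewriting"

print(select_sub_task("please summarize our chat"))  # summarization
print(select_sub_task("what's the weather today?"))  # main_task_only
```

Each node in the cascade is a cheap yes/no call, which is the regime where small on-device models tend to be reliable.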
Problem

Research questions and friction points this paper is trying to address.

Reduces computational overhead in multi-turn LLM interactions
Enables efficient sub-task switching via KV cache reuse
Optimizes on-device multi-task dialogue with minimal performance degradation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reuses KV cache from main task execution
Injects task-specific instructions into side paths
Rolls back to main path via cached context
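The three mechanisms above can be sketched together in a toy simulation: a plain Python list stands in for the per-token KV cache, a side path appends only the sub-task instruction on top of the cached main context, and rollback truncates the cache back to its checkpoint. All class and method names here are illustrative, not the paper's actual API.

```python
# Toy simulation of KV-cache reuse with side-path injection and rollback.
# A list of tokens stands in for the model's per-token KV cache.

class KVCacheSession:
    def __init__(self):
        self.cache = []  # one entry per prefilled token

    def prefill(self, tokens):
        # Main path: extend the shared KV cache with new conversation tokens.
        self.cache.extend(tokens)

    def run_side_task(self, instruction_tokens):
        # Side path: inject only the sub-task instruction on top of the
        # cached main context, then roll back so the main path is untouched.
        checkpoint = len(self.cache)
        self.cache.extend(instruction_tokens)
        result = f"sub-task ran over {len(self.cache)} cached tokens"
        del self.cache[checkpoint:]  # context rollback
        return result

session = KVCacheSession()
session.prefill(["user:", "hello", "assistant:", "hi"])
print(session.run_side_task(["<rewrite_query>"]))  # sub-task ran over 5 cached tokens
print(len(session.cache))  # 4 -- main context intact after rollback
```

The point of the sketch is the cost model: the side task pays only for its short instruction, not for re-prefilling the full conversation, and rollback guarantees the main path resumes from an unmodified cache.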