🤖 AI Summary
This work addresses the challenge of long-context processing in large language models, which is hindered by the quadratic complexity of the Transformer attention mechanism, leading to high memory consumption and inference latency. Existing context distillation methods suffer from prohibitive training costs and limited practicality. To overcome these limitations, the authors propose Doc-to-LoRA (D2L), a lightweight meta-learning-based hypernetwork that internalizes long-context information into task-specific LoRA adapters in a single forward pass, enabling efficient inference without repeated access to the original context. D2L achieves the first end-to-end trainable approximation of context distillation, supporting sequence lengths more than four times the native context window. It attains near-perfect zero-shot accuracy on the "needle-in-a-haystack" task and outperforms conventional distillation approaches on real-world question answering benchmarks, while substantially reducing KV cache memory usage and inference latency.
📄 Abstract
Long input sequences are central to in-context learning, document understanding, and multi-step reasoning of Large Language Models (LLMs). However, the quadratic attention cost of Transformers makes inference memory-intensive and slow. While context distillation (CD) can transfer information into model parameters, per-prompt distillation is impractical due to training costs and latency. To address these limitations, we propose Doc-to-LoRA (D2L), a lightweight hypernetwork that meta-learns to perform approximate CD within a single forward pass. Given an unseen prompt, D2L generates a LoRA adapter for a target LLM, enabling subsequent queries to be answered without re-consuming the original context, which reduces latency and KV-cache memory consumption during target-LLM inference. On a long-context needle-in-a-haystack task, D2L successfully learns to map contexts into adapters that store the needle information, achieving near-perfect zero-shot accuracy at sequence lengths exceeding the target LLM's native context window by more than 4x. On real-world QA datasets with limited compute, D2L outperforms standard CD while significantly reducing peak memory consumption and update latency. We envision that D2L can facilitate rapid adaptation of LLMs, opening up the possibility of frequent knowledge updates and personalized chat behavior.
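To make the mechanism concrete, the following is a minimal PyTorch sketch of the D2L pattern described in the abstract: a hypernetwork consumes a (pooled) document representation and emits low-rank LoRA factors that are plugged into a frozen target layer, so later queries run without the document in the context window. All class names, dimensions, the rank, and the single-layer setup here are illustrative assumptions, not the paper's actual architecture or meta-learning objective.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer whose output is shifted by externally
    supplied low-rank factors: y = Wx + B(Ax)."""
    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)
        for p in self.base.parameters():
            p.requires_grad_(False)   # target model stays frozen
        self.lora_A = None            # (rank, d_in), generated per document
        self.lora_B = None            # (d_out, rank)

    def forward(self, x):
        y = self.base(x)
        if self.lora_A is not None:
            y = y + x @ self.lora_A.T @ self.lora_B.T
        return y

class DocToLoRA(nn.Module):
    """Hypernetwork head: one forward pass over a document embedding
    produces the LoRA factors for the target layer."""
    def __init__(self, d_ctx: int, d_in: int, d_out: int, rank: int = 8):
        super().__init__()
        self.rank = rank
        self.to_A = nn.Linear(d_ctx, rank * d_in)
        self.to_B = nn.Linear(d_ctx, d_out * rank)

    def forward(self, ctx_emb):
        A = self.to_A(ctx_emb).view(self.rank, -1)   # (rank, d_in)
        B = self.to_B(ctx_emb).view(-1, self.rank)   # (d_out, rank)
        return A, B

# Usage sketch: internalize a document once, then answer queries without it.
d_ctx, d_in, d_out = 256, 128, 128
target_layer = LoRALinear(d_in, d_out)
hypernet = DocToLoRA(d_ctx, d_in, d_out)

ctx_emb = torch.randn(d_ctx)          # stand-in for a pooled long-document encoding
target_layer.lora_A, target_layer.lora_B = hypernet(ctx_emb)

queries = torch.randn(4, d_in)        # subsequent queries never re-read the document
answers = target_layer(queries)
print(answers.shape)                  # torch.Size([4, 128])
```

In the paper's setting, the hypernetwork is meta-trained end-to-end over many documents so that the generated adapters approximate context distillation; the sketch only shows the inference-time data flow that makes repeated access to the original context, and its KV cache, unnecessary.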