AI Summary
This work addresses the high computational overhead, redundant reprocessing, and excessive KV-cache usage of multi-agent large language model systems that communicate in natural language. The authors propose TFlow, a framework that uses transient low-rank perturbations in weight space as the inter-agent communication medium. Specifically, a sender's internal state is encoded into a LoRA perturbation that is applied to the receiver's model weights only during generation, enabling efficient instance-level collaboration without expanding the textual context. Combining frozen role-prompted senders, a learned parameter generator, and a dynamic weight fusion mechanism, TFlow improves accuracy by up to 8.5 points over a standalone receiver across five benchmarks while reducing processed tokens by up to 32.69%. Compared with a three-agent text-based communication baseline, it reduces total processed tokens by up to 83.27% and wall-clock inference time by up to 4.6×.
Abstract
Multi-agent LLM systems usually collaborate by exchanging natural-language messages. This interface is simple and interpretable, but it forces each sender's intermediate computation to be serialized into tokens and then reprocessed by the receiver, thereby increasing the generated-token cost, prefill overhead, and KV-cache memory. We study an alternative communication interface: instead of appending a sender's message to the receiver's context, compile the sender's hidden states into a transient, receiver-specific weight perturbation. We introduce TFlow (Thought Flow), a weight-space communication framework for a known and fixed receiver architecture. For each query, frozen role-prompted sender agents process the input, and a learned parameter generator maps their internal activations into low-rank LoRA perturbations targeting the receiver's modules. These perturbations are fused and applied only during the receiver's generation, enabling instance-level adaptation without permanently changing the model or enlarging the receiver's text context. With three Qwen3-4B agents, TFlow improves over a standalone receiver by up to 8.5 accuracy points across five benchmarks while reducing processed tokens by up to 32.69%. Compared with a text-based three-agent baseline, it reduces total processed tokens by up to 83.27% and the wall-clock inference time by up to 4.6$\times$, while maintaining competitive accuracy on four of five benchmarks. These results suggest that transient low-rank weight perturbations can serve as an executable communication medium for efficient multi-agent LLM collaboration.
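The communication path described above (sender activations, then a parameter generator, then fused low-rank perturbations applied only while the receiver runs) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the mean pooling, the linear generator (`proj_a`, `proj_b`), the softmax gating in `fuse_deltas`, and all dimensions are assumptions chosen only to make the idea concrete on a single weight matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
d_h, d_in, d_out, rank = 16, 8, 8, 2  # illustrative sizes, not taken from the paper


def generate_lora(hidden, proj_a, proj_b):
    """Hypothetical parameter generator: sender activations -> LoRA factors (A, B)."""
    pooled = hidden.mean(axis=0)                 # (d_h,) summary of the sender's states
    A = (proj_a @ pooled).reshape(rank, d_in)    # down-projection factor
    B = (proj_b @ pooled).reshape(d_out, rank)   # up-projection factor
    return A, B


def fuse_deltas(pairs, gate_logits):
    """Assumed fusion rule: softmax-weighted sum of the low-rank updates B @ A."""
    weights = np.exp(gate_logits - gate_logits.max())
    weights /= weights.sum()
    # Materialized densely for clarity; a real system would keep the factors separate.
    return sum(w * (B @ A) for w, (A, B) in zip(weights, pairs))


def receiver_forward(x, W, delta=None):
    """Apply the perturbation only for this call; W itself is never modified."""
    W_eff = W if delta is None else W + delta
    return x @ W_eff.T


# One frozen receiver weight matrix and a shared (here random, untrained) generator.
W = rng.normal(size=(d_out, d_in))
proj_a = 0.1 * rng.normal(size=(rank * d_in, d_h))
proj_b = 0.1 * rng.normal(size=(d_out * rank, d_h))

# Two senders, each with 5 token positions of hidden states for the same query.
sender_states = [rng.normal(size=(5, d_h)) for _ in range(2)]
pairs = [generate_lora(h, proj_a, proj_b) for h in sender_states]
delta = fuse_deltas(pairs, gate_logits=np.zeros(len(pairs)))

x = rng.normal(size=(3, d_in))        # receiver activations for 3 token positions
W_before = W.copy()
y_base = receiver_forward(x, W)         # standalone receiver
y_comm = receiver_forward(x, W, delta)  # receiver with the transient perturbation
```

The point of the sketch is that the senders influence the receiver's output without adding any tokens to its input `x`, and that the base weights `W` are identical before and after the call, which mirrors the "transient" and "without enlarging the receiver's text context" properties claimed in the abstract.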