RW-TTT: Batched Serving for Request-Owned Test-Time Training State

📅 2026-05-27

📈 Citations: 0

✨ Influential: 0


🤖 AI Summary

Test-time training (TTT) updates request-specific state during generation, which violates the static shared-weight assumption behind conventional LLM batching: serving requests serially is correct but inefficient, while naive batching can contaminate one request's state with another's. The paper formally defines this as the read-write TTT serving problem and introduces RW-TTT, which tags each decode step with a request identifier, a state version, and a READ/WRITE effect annotation, and batches only compatible steps. An ownership mechanism ensures that state updates are committed only to the originating request. The approach supports several TTT state representations, including fast weights and low-rank deltas. On a single GPU serving eight InPlace-TTT streams, it reaches 274.61 aggregate tokens/s, a 9.31× speedup over serial execution and 3.44× over independent replicas, while preserving behavior on the RULER long-context benchmark.
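The two speedup ratios imply approximate baseline throughputs that the summary does not state directly. The short calculation below derives them from the three reported figures; the variable names are illustrative, and the results are only as precise as the rounded inputs.

```python
# Reported figures from the summary (eight InPlace-TTT streams, one GPU).
aggregate_tok_s = 274.61      # RW-TTT aggregate throughput
speedup_vs_serial = 9.31      # ratio over serial execution
speedup_vs_replicas = 3.44    # ratio over independent replicas

# Implied baselines: throughput divided by the reported speedup.
# These are derived estimates, not numbers quoted by the paper.
implied_serial_tok_s = aggregate_tok_s / speedup_vs_serial
implied_replica_tok_s = aggregate_tok_s / speedup_vs_replicas

print(f"implied serial baseline:  {implied_serial_tok_s:.2f} tok/s")
print(f"implied replica baseline: {implied_replica_tok_s:.2f} tok/s")
```

Under these inputs the serial baseline comes out near 29.5 tok/s and the replica baseline near 79.8 tok/s.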

📝 Abstract

Test-time training (TTT) adapts an LLM during generation by reading and updating request-owned state, such as fast weights, low-rank deltas, or streaming learner state. This breaks batched LLM serving, which assumes shared static weights: serial execution is correct but slow, while naive batching can corrupt request state. We formulate this problem as read-write TTT serving and present RW-TTT, which tags each decode step with its owner, version, and READ/WRITE effect, batches only compatible phases, and commits updates only to the owner. On one GPU with eight fast-weight InPlace-TTT streams, RW-TTT reaches 274.61 aggregate tok/s, which is 9.31× over sequential serving and 3.44× over per-stream replicas under the same memory budget. It preserves behavior on RULER, a long-context benchmark, and passes owner/version checks.
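The tagging and ownership idea in the abstract can be illustrated with a toy scheduler. This is a minimal sketch, not the paper's implementation: the names `Step`, `StateStore`, and `group_compatible` are hypothetical, and the compatibility rule used here (same effect, at most one step per owner per round, version must match the stored version) is an assumption, since the abstract does not define "compatible" precisely.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Effect(Enum):
    READ = "read"
    WRITE = "write"


@dataclass(frozen=True)
class Step:
    """A decode step tagged with owner, expected state version, and effect."""
    owner: str
    version: int
    effect: Effect


class StateStore:
    """Holds per-request state plus a version counter (hypothetical design)."""

    def __init__(self):
        self.state = {}
        self.version = {}

    def register(self, owner, payload):
        self.state[owner] = payload
        self.version[owner] = 0

    def commit(self, step, requester, new_payload):
        # Only WRITE steps may change state.
        if step.effect is not Effect.WRITE:
            raise ValueError("only WRITE steps may commit")
        # Ownership check: an update lands only on the request that owns it.
        if requester != step.owner:
            raise PermissionError("update does not belong to this request")
        # Version check: reject commits computed against stale state.
        if self.version[step.owner] != step.version:
            raise RuntimeError("stale version")
        self.state[step.owner] = new_payload
        self.version[step.owner] += 1


def group_compatible(steps, store):
    """Split steps into per-effect batches; defer conflicting or stale steps.

    Assumed rule: one step per owner per round, and the step's version
    must equal the owner's current stored version.
    """
    batches = defaultdict(list)
    seen_owners = set()
    deferred = []
    for step in steps:
        if step.owner in seen_owners or store.version[step.owner] != step.version:
            deferred.append(step)
            continue
        seen_owners.add(step.owner)
        batches[step.effect].append(step)
    return dict(batches), deferred


store = StateStore()
for request_id in ("a", "b"):
    store.register(request_id, 0.0)

pending = [
    Step("a", 0, Effect.READ),
    Step("b", 0, Effect.WRITE),
    Step("a", 0, Effect.WRITE),  # second step for "a": deferred this round
]
batches, deferred = group_compatible(pending, store)
store.commit(batches[Effect.WRITE][0], requester="b", new_payload=1.0)
```

In this toy run, the READ for request "a" and the WRITE for request "b" land in separate batches, the second step for "a" waits for a later round, and the commit bumps only the version of "b". A real system would batch tensor operations on the GPU rather than Python objects; the sketch only shows the bookkeeping.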

Problem

Research questions and friction points this paper is trying to address.

test-time training

batched serving

request-owned state

LLM inference

state corruption

Innovation

Methods, ideas, or system contributions that make the work stand out.

test-time training

batched serving

request-owned state