🤖 AI Summary
Test-time training (TTT) updates request-specific state during generation, which violates the static shared-weight assumption behind conventional LLM batching: a server must either execute requests serially, which is inefficient, or batch them and risk contaminating state across requests. This work formally defines the read-write TTT serving problem and introduces RW-TTT, which batches at fine granularity and only across compatible steps, using request identifiers, version tracking, and read/write effect annotations. An ownership mechanism ensures that each state update is committed only to its originating request. The approach supports diverse TTT state representations, including fast weights and low-rank deltas. On a single GPU serving eight InPlace-TTT streams, RW-TTT reaches an aggregate 274.61 tokens/s, a 9.31× speedup over serial execution and 3.44× over independent replicas, while preserving behavior on the RULER long-context benchmark.
📝 Abstract
Test-time training (TTT) adapts an LLM during generation by reading and updating request-owned state, such as fast weights, low-rank deltas, or streaming learner state. This breaks batched LLM serving, which assumes shared static weights: serial execution is correct but slow, while naive batching can corrupt request state. We formulate this problem as read-write TTT serving and present RW-TTT, which tags each decode step with its owner, version, and READ/WRITE effect, batches only compatible phases, and commits updates only to the owner. On one GPU with eight fast-weight InPlace-TTT streams, RW-TTT reaches 274.61 aggregate tok/s, 9.31× over sequential serving and 3.44× over per-stream replicas under the same memory budget. It preserves behavior on RULER, a long-context benchmark, and passes owner/version checks.
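The tagging scheme described above can be illustrated with a toy sketch. This is not the paper's implementation: the names `Step`, `StateStore`, and `make_batches` are hypothetical, the per-request state is a single float standing in for fast weights, and the compatibility rule used here (consecutive steps with the same effect and distinct owners may share a batch) is an assumption chosen for illustration, since the abstract does not spell out the exact rule.

```python
from dataclasses import dataclass
from enum import Enum


class Effect(Enum):
    READ = "read"
    WRITE = "write"


@dataclass(frozen=True)
class Step:
    owner: str      # request identifier that owns the TTT state
    version: int    # state version this step expects to see
    effect: Effect  # whether the step only reads or also updates the state


class StateStore:
    """Toy per-request state: owner -> (version, value)."""

    def __init__(self):
        self._state = {}

    def register(self, owner, value=0.0):
        self._state[owner] = (0, value)

    def read(self, step):
        version, value = self._state[step.owner]
        if version != step.version:
            raise ValueError(f"stale step for {step.owner}")
        return value

    def commit(self, step, delta):
        # Only WRITE steps may commit, and only to their own owner's entry.
        if step.effect is not Effect.WRITE:
            raise ValueError("only WRITE steps may commit")
        value = self.read(step)  # also performs the version check
        self._state[step.owner] = (step.version + 1, value + delta)

    def version(self, owner):
        return self._state[owner][0]


def make_batches(steps):
    """Group consecutive steps that share an effect and have distinct owners."""
    batches = []
    for step in steps:
        last = batches[-1] if batches else None
        compatible = (
            last is not None
            and last[0].effect is step.effect
            and all(s.owner != step.owner for s in last)
        )
        if compatible:
            last.append(step)
        else:
            batches.append([step])
    return batches


store = StateStore()
for owner in ("req-a", "req-b"):
    store.register(owner)

steps = [
    Step("req-a", 0, Effect.READ),
    Step("req-b", 0, Effect.READ),
    Step("req-a", 0, Effect.WRITE),
    Step("req-b", 0, Effect.WRITE),
    Step("req-a", 1, Effect.READ),
]
batches = make_batches(steps)
for batch in batches:
    for step in batch:
        if step.effect is Effect.WRITE:
            store.commit(step, delta=1.0)
        else:
            store.read(step)
```

In this toy run the two READ steps share one batch, the two WRITE steps share the next, and the final READ for `req-a` runs alone because it expects the version produced by that request's write. Appending only to the most recent batch keeps each owner's steps in their original order, which is what lets the version check in `read` catch any step that arrives against the wrong state.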