🤖 AI Summary
This work addresses the trade-off between computational efficiency and modeling quality in tokenizer-free, patch-based byte-level language models: larger patches reduce compute and KV-cache usage but degrade prediction quality because of patch lag, where bytes inside a not-yet-complete patch must be predicted from a stale representation of the previous patch. The authors propose Scratchpad Patching (SP), which inserts transient scratchpads inside a patch to aggregate the bytes observed so far and refresh the patch-level context for subsequent predictions while preserving causality. Scratchpads are triggered by next-byte prediction entropy, so compute is allocated selectively to information-dense regions and the inference-time compute budget can be adjusted post hoc. In experiments on natural language and code, SP improves quality at a given patch size; at 16 bytes per patch, SP-augmented models match or closely approach the byte-level baseline on downstream evaluations while using a 16× smaller KV cache over patches and 3–4× less inference compute.
📝 Abstract
Tokenizer-free language models eliminate the tokenizer step of the language modeling pipeline by operating directly on bytes; patch-based variants further aggregate contiguous byte spans into patches for efficiency. However, the average patch size chosen at the model design stage governs a tight trade-off: larger patches reduce compute and KV-cache footprint, but degrade modeling quality. We trace this trade-off to patch lag: until a patch is fully observed, byte predictions within it must rely on a stale representation from the previous patch to preserve causality; this lag widens as patches grow larger. We introduce Scratchpad Patching (SP), which inserts transient scratchpads inside each patch to aggregate the bytes seen so far and refresh patch-level context for subsequent predictions. SP triggers scratchpads using next-byte prediction entropy, selectively allocating compute to information-dense regions and enabling post-hoc adjustment of inference-time compute. Across experiments on natural language and code, SP improves model quality at the same patch size; for example, even at $16$ bytes per patch, SP-augmented models match or closely approach the byte-level baseline on downstream evaluations while using a $16\times$ smaller KV cache over patches and $3$-$4\times$ less inference compute.
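The entropy-based trigger described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the exact trigger rule, so the fixed threshold, the fixed patch size, and the choice to skip the last byte of each patch are assumptions, and the names `entropy_bits` and `scratchpad_positions` are hypothetical.

```python
import math
from typing import List, Sequence


def entropy_bits(probs: Sequence[float]) -> float:
    """Shannon entropy, in bits, of a next-byte probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


def scratchpad_positions(
    next_byte_dists: Sequence[Sequence[float]],
    patch_size: int,
    threshold_bits: float,
) -> List[int]:
    """Return byte positions after which a scratchpad would be inserted.

    Assumed rule (not taken from the paper): insert a scratchpad after
    byte i when the model's next-byte entropy at i exceeds a fixed
    threshold, unless i is the last byte of its patch, where the regular
    patch boundary already refreshes the patch-level context.
    """
    positions = []
    for i, dist in enumerate(next_byte_dists):
        at_patch_end = (i + 1) % patch_size == 0
        if not at_patch_end and entropy_bits(dist) > threshold_bits:
            positions.append(i)
    return positions


# Toy example over a 4-symbol alphabet (real models use 256 byte values).
uniform = [0.25, 0.25, 0.25, 0.25]  # high uncertainty: 2 bits
peaked = [1.0, 0.0, 0.0, 0.0]       # no uncertainty: 0 bits
dists = [uniform, peaked, uniform, uniform]
print(scratchpad_positions(dists, patch_size=4, threshold_bits=1.0))
```

Under this assumed rule, raising `threshold_bits` yields fewer scratchpads and therefore less inference compute, which is one way to read the abstract's "post-hoc adjustment of inference-time compute".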