🤖 AI Summary
This work addresses the poor robustness of character-level Transformers on modular-addition tasks, where models often fail catastrophically under input position shifts or out-of-distribution (OOD) template variations despite high in-distribution accuracy. The study systematically identifies this failure mode and attributes it to the absence of positional and template invariances in the training data. As a remedy, the authors propose a simple yet effective training strategy that explicitly injects the required invariances: expression boundary tokens, a positional curriculum, diverse template mixtures, and multi-variant consistency training. Evaluated on a disjoint split over all ordered pairs modulo 97, the approach consistently improves robustness to both position shifts and template OOD across three random seeds while preserving strong in-distribution accuracy, whereas an ALiBi-style ablation fails to learn the task altogether.
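To make the summarized recipe concrete, here is a minimal sketch of how one might render a single modular-addition pair as several textual variants that differ in template and absolute position, with explicit boundary markers around the expression. All names, templates, and the `<expr>`/`</expr>` marker strings are illustrative assumptions, not taken from the paper's released code:

```python
import random

P = 97  # modulus used in the paper's experiments

# Hypothetical templates standing in for the paper's template mixture.
TEMPLATES = [
    "what is {a} plus {b} mod {p}?",
    "{a}+{b} (mod {p}) =",
]

def make_variants(a, b, n_variants=2, max_shift=10, seed=0):
    """Render one (a, b) pair as several variants that differ in template
    and absolute character position; a position curriculum would widen
    max_shift over the course of training."""
    rng = random.Random(seed)
    variants = []
    for _ in range(n_variants):
        body = rng.choice(TEMPLATES).format(a=a, b=b, p=P)
        shift = rng.randrange(max_shift + 1)  # random absolute-position offset
        # Explicit boundary markers delimit the expression regardless of shift.
        variants.append(" " * shift + "<expr>" + body + "</expr>")
    return variants, (a + b) % P

variants, target = make_variants(3, 95)
```

Consistency training would then encourage the model to predict the same `target` for every variant of a pair.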
📝 Abstract
Building on insights from the grokking literature, we study character-level Transformers trained to compute modular addition from text, and focus on robustness under input-format variation rather than only in-distribution accuracy. We identify a previously under-emphasized failure mode: models that achieve high in-distribution accuracy can fail catastrophically when the same expression is shifted to different absolute character positions ("position shift") or presented under out-of-distribution natural-language templates. Using a disjoint-pair split over all ordered pairs for p=97, we show that a baseline model reaches strong in-distribution performance yet collapses under position shift and template OOD. We then introduce a simple training recipe that combines (i) explicit expression boundary markers, (ii) a position curriculum that broadens the range of absolute positions seen during training, (iii) diverse template mixtures, and (iv) consistency training across multiple variants per example. Across three seeds, this intervention substantially improves robustness to position shift and template OOD while maintaining high in-distribution accuracy, whereas an ALiBi-style ablation fails to learn the task under our setup. Our results suggest that steering procedural generalization under noisy supervision benefits from explicitly training invariances that are otherwise absent from the data distribution, and we provide a reproducible evaluation protocol and artifacts.
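The disjoint-pair split described above can be sketched as follows: every ordered pair (a, b) for p=97 is assigned to exactly one of train or test, so evaluation pairs are never seen during training. The 90/10 ratio and fixed seed here are assumptions for illustration, not the paper's exact settings:

```python
import random

def disjoint_pair_split(p=97, test_frac=0.1, seed=0):
    """Partition all p*p ordered pairs (a, b) into disjoint train/test sets."""
    pairs = [(a, b) for a in range(p) for b in range(p)]
    random.Random(seed).shuffle(pairs)
    n_test = int(len(pairs) * test_frac)
    return pairs[n_test:], pairs[:n_test]  # (train, test)

train, test = disjoint_pair_split()
```

Because the split is over pairs rather than rendered strings, a test pair remains unseen no matter which template or position shift is applied to it at evaluation time.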