The Impossibility of Inverse Permutation Learning in Transformer Models

📅 2025-09-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the expressive limitations of decoder-only Transformers on inverse permutation learning: recovering an original string from its permuted version, given the permutation. We prove that causal decoder architectures, regardless of depth, cannot solve this task, because the causal attention mask precludes modeling the required long-range bidirectional dependencies. To overcome this limitation, we propose two viable alternatives: (i) an encoder-decoder architecture, and (ii) a decoder-only variant whose input is padded with "scratch tokens". We formally construct and analyze both frameworks. This study establishes the first impossibility theorem for inverse permutation learning in causal decoders and provides a formal justification for the necessity of intermediate-step generation (e.g., chain-of-thought prompting) in complex reasoning tasks. Our results highlight the fundamental impact of architectural choices on reasoning capability, particularly for tasks requiring global structural reconstruction.

📝 Abstract
In this technical note, we study the problem of inverse permutation learning in decoder-only transformers. Given a permutation and a string to which that permutation has been applied, the model is tasked with producing the original (``canonical'') string. We argue that this task models a natural robustness property across a variety of reasoning tasks, including long-context retrieval, multiple-choice QA and in-context learning. Our primary contribution is an impossibility result: we show that an arbitrary-depth, decoder-only transformer cannot learn this task. This result concerns the expressive capacity of decoder-only transformer models and is agnostic to training dynamics or sample complexity. We give a pair of alternative constructions under which inverse permutation learning is feasible. The first of these highlights the fundamental role of the causal attention mask, and reveals a gap between the expressivity of encoder-decoder transformers and the more popular decoder-only architecture. The second result is more surprising: we show that simply padding the input with ``scratch tokens'' yields a construction under which inverse permutation learning is possible. We conjecture that this may suggest an alternative mechanism by which chain-of-thought prompting or, more generally, intermediate ``thinking'' tokens can enable reasoning in large language models, even when these tokens encode no meaningful semantic information (e.g., the results of intermediate computations).
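To make the task concrete, here is a minimal sketch of the inverse permutation problem the abstract describes. The function names are illustrative, not taken from the paper; the model is asked to compute what `recover_canonical` computes, given the permutation and the permuted string as input.

```python
# Toy specification of the inverse permutation learning task.
# Convention (an assumption for this sketch): output position i
# of the permuted string receives s[perm[i]].

def apply_permutation(s, perm):
    """Permute s so that position i of the result holds s[perm[i]]."""
    return [s[p] for p in perm]

def invert_permutation(perm):
    """Compute the inverse permutation: inv[perm[i]] = i."""
    inv = [0] * len(perm)
    for i, p in enumerate(perm):
        inv[p] = i
    return inv

def recover_canonical(permuted, perm):
    """Recover the original string from the permutation and its output."""
    return apply_permutation(permuted, invert_permutation(perm))

s = list("abcd")
perm = [2, 0, 3, 1]
permuted = apply_permutation(s, perm)      # ['c', 'a', 'd', 'b']
assert recover_canonical(permuted, perm) == s
```

The recovery itself is trivial as an algorithm; the paper's point is that a causal decoder, despite being given both `perm` and `permuted` in its prompt, cannot express this map without architectural modifications.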
Problem

Research questions and friction points this paper is trying to address.

Studying inverse permutation learning in decoder-only transformer models
Demonstrating impossibility of learning inverse permutations with decoder-only transformers
Exploring alternative constructions enabling inverse permutation learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decoder-only transformers cannot learn inverse permutations
Causal attention mask limits expressivity compared to encoder-decoder
Padding with scratch tokens enables inverse permutation learning