🤖 AI Summary
A rigorous theoretical characterization of the relationship between Transformer depth and expressive power remains lacking—particularly regarding the mechanism underlying length generalization in position-encoding-free (PE-free) models.
Method: We establish, for the first time, a precise correspondence between depth and expressivity by introducing C-RASP, a restricted variant of RASP, together with Counting Temporal Logic (CTL#) to formally characterize sequence-dependency complexity. Theoretical analysis and empirical validation are carried out on a subclass of Transformers that round to fixed precision except inside attention.
Contribution: We prove that greater depth strictly increases expressive power; we derive an exact correspondence between the minimal depth required and task complexity; and we give the first theoretical prediction, with experimental confirmation, of the length-generalization boundary for PE-free Transformers. These results yield interpretable, task-driven guidelines for selecting depth in model design.
📝 Abstract
It has been observed that transformers with greater depth (that is, more layers) have more capabilities, but can we establish formally which capabilities are gained with greater depth? We answer this question with a theoretical proof followed by an empirical study. First, we consider transformers that round to fixed precision except inside attention. We show that this subclass of transformers is expressively equivalent to the programming language C-RASP and this equivalence preserves depth. Second, we prove that deeper C-RASP programs are more expressive than shallower C-RASP programs, implying that deeper transformers are more expressive than shallower transformers (within the subclass mentioned above). These results are established by studying a form of temporal logic with counting operators, which was shown equivalent to C-RASP in previous work. Finally, we provide empirical evidence that our theory predicts the depth required for transformers without positional encodings to length-generalize on a family of sequential dependency tasks.
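To make the C-RASP side of the equivalence concrete, here is a rough Python sketch (not code from the paper, and the functions and depth assignments are illustrative assumptions) of the style of computation C-RASP expresses: each "layer" forms prefix counts of Boolean predicates over positions up to the current one, and comparisons between counts define new predicates that the next layer may count again. Composing a second round of counting over the outputs of the first is the flavor of nesting that, in the paper's framework, corresponds to needing more depth.

```python
# Hypothetical illustration of C-RASP-style computation (not the paper's code).
# A "layer" computes prefix counts of Boolean predicates; comparisons between
# counts yield new predicates that a deeper layer can count in turn.

def prefix_count(pred, seq):
    """Number of positions j <= i where pred holds -- the counting primitive."""
    out, total = [], 0
    for tok in seq:
        total += pred(tok)
        out.append(total)
    return out

def majority_ones(seq):
    """One round of counting: at the last position, is #1 > #0?"""
    ones = prefix_count(lambda t: t == "1", seq)
    zeros = prefix_count(lambda t: t == "0", seq)
    return ones[-1] > zeros[-1]

def dyck1_prefixes_ok(seq):
    """Nested counting: every prefix satisfies #( >= #), checked by a second
    count over the Boolean outputs of the first comparison."""
    opens = prefix_count(lambda t: t == "(", seq)
    closes = prefix_count(lambda t: t == ")", seq)
    ok = [o >= c for o, c in zip(opens, closes)]    # first-round predicate
    violations = prefix_count(lambda b: not b, ok)  # second-round count
    return violations[-1] == 0
```

For example, `majority_ones("1101")` holds while `majority_ones("1001")` does not, and `dyck1_prefixes_ok` rejects `"())("` because one prefix closes more parentheses than it opens. Whether a given composition of counts maps to depth 1 or 2 in the paper's exact sense is an assumption of this sketch, not a claim from the source.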