🤖 AI Summary
This work addresses the training instability of rationalized Transformer classifiers and the complexity of multi-agent game-theoretic frameworks. We propose a unified, end-to-end differentiable architecture that jointly performs classification, input-token importance scoring, and counterfactual classification within a single Transformer, eliminating the conventional three-player adversarial game. Instead, we introduce category-aware attention control and implicit rationale regularization to enforce interpretability. Leveraging parameterized self-training and class-level rationale generation, our method significantly improves alignment with human-annotated rationales without requiring explicit rationale supervision. Evaluated on multiple benchmark tasks, it achieves state-of-the-art performance while effectively mitigating gradient conflicts and optimization oscillations, two key sources of training instability in rationalization models.
📝 Abstract
We propose an end-to-end differentiable training paradigm for stable training of a rationalized transformer classifier. Our approach yields a single model that simultaneously classifies a sample and scores input tokens by their relevance to that classification. To this end, we build on the widely used three-player game for training rationalized models, which typically relies on training a rationale selector, a classifier, and a complement classifier. We simplify this approach by making a single model fulfill all three roles, leading to a more efficient training paradigm that is not susceptible to the common training instabilities that plague existing approaches. Further, we extend this paradigm to produce class-wise rationales while incorporating recent advances in parameterizing and regularizing the resulting rationales, leading to substantially improved, state-of-the-art alignment with human annotations without any explicit supervision.
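To make the three-roles-in-one-model idea concrete, here is a minimal toy sketch (not the paper's actual implementation; all names, shapes, and the hard top-k selection are illustrative assumptions). A single set of parameters produces both class probabilities and token-importance scores, and the training objective combines three terms that correspond to the three players of the classic game: full-input classification, rationale-only classification, and a complement term pushed toward an uninformative (uniform) prediction.

```python
import numpy as np

rng = np.random.default_rng(0)

D, C, T = 8, 3, 6                        # embedding dim, classes, tokens (toy sizes)
W_cls = rng.normal(size=(D, C)) * 0.1    # shared classification head
w_score = rng.normal(size=(D,)) * 0.1    # token-importance scoring head

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def forward(tokens, mask=None):
    """Mean-pool (optionally masked) token embeddings, then classify.
    Stands in for the shared Transformer encoder + head."""
    if mask is None:
        mask = np.ones(tokens.shape[0])
    pooled = (tokens * mask[:, None]).sum(axis=0) / max(mask.sum(), 1e-6)
    return softmax(pooled @ W_cls)

def joint_losses(tokens, label, k=2):
    """All three roles of the three-player game, computed with the SAME weights."""
    scores = tokens @ w_score                # token-importance scores
    keep = np.zeros(T)
    keep[np.argsort(scores)[-k:]] = 1.0      # hard top-k rationale mask (a soft,
                                             # differentiable relaxation is assumed
                                             # in actual end-to-end training)

    p_full = forward(tokens)                 # role 1: plain classifier
    p_rat = forward(tokens, keep)            # role 2: rationale-only prediction
    p_comp = forward(tokens, 1.0 - keep)     # role 3: complement prediction

    nll_full = -np.log(p_full[label] + 1e-9)
    nll_rat = -np.log(p_rat[label] + 1e-9)
    # The complement should be uninformative: KL(uniform || p_comp) penalizes
    # any label information left outside the rationale.
    uniform_kl = np.sum((1.0 / C) * np.log((1.0 / C) / (p_comp + 1e-9)))
    return nll_full + nll_rat + uniform_kl

tokens = rng.normal(size=(T, D))
loss = joint_losses(tokens, label=1)
```

Because every term is evaluated with one shared parameter set, there is no adversarial alternation between separate players, which is the property the paper credits for avoiding gradient conflicts and optimization oscillations.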