ReBeCA: Unveiling Interpretable Behavior Hierarchy behind the Iterative Self-Reflection of Language Models with Causal Analysis

📅 2026-02-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited interpretability and poor generalizability of existing self-reflection mechanisms in language models. The authors propose a causal analysis–based approach to model self-reflection trajectories, employing a three-stage Invariant Causal Prediction (ICP) pipeline to construct causal graphs that identify the true semantic behaviors—and their hierarchical structure—underlying self-reflection outcomes. Their analysis reveals, for the first time, that only a sparse subset of semantic behaviors exhibits generalization capability, and that the concurrent presence of multiple positive behaviors can paradoxically diminish overall effectiveness. Experimental results demonstrate that the identified sparse set of causal parent nodes yields significantly improved stability on out-of-distribution data (p = .013, η²ₚ = .071), with a 49.6% increase in structural likelihood.

📝 Abstract
While self-reflection can enhance language model reliability, its underlying mechanisms remain opaque, and existing analyses often yield correlation-based insights that fail to generalize. To address this, we introduce ReBeCA (self-Reflection Behavior explained through Causal Analysis), a framework that unveils the interpretable behavioral hierarchy governing self-reflection outcomes. By modeling self-reflection trajectories as causal graphs, ReBeCA isolates the genuine determinants of performance through a three-stage Invariant Causal Prediction (ICP) pipeline. We establish three critical findings: (1) Behavioral hierarchy: the model's semantic behaviors influence the final self-reflection outcome hierarchically, either directly or indirectly; (2) Causation matters: generalizable self-reflection effects are limited to just a few semantic behaviors; (3) More ≠ better: the confluence of seemingly positive semantic behaviors, even among direct causal factors, can impair the efficacy of self-reflection. ICP-based verification identifies sparse causal parents achieving up to 49.6% gains in structural likelihood, stable across tasks where correlation-based patterns fail. Intervention studies on novel datasets confirm that these causal relationships hold out-of-distribution (p = .013, η²ₚ = .071). ReBeCA thus provides a rigorous methodology for disentangling genuine causal mechanisms from spurious associations in self-reflection dynamics.
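The paper's three-stage ICP pipeline is not detailed here, but the core idea behind Invariant Causal Prediction can be sketched on synthetic data: a candidate predictor set is accepted only if the residuals of a pooled regression look the same in every environment, and the estimated causal parents are the intersection of all accepted sets. A spurious correlate of the outcome fails this test because its relationship to the outcome changes across environments. Everything below (variable names, the two synthetic environments, the Welch-t invariance criterion and its threshold) is an illustrative assumption, not the authors' implementation.

```python
import random
import statistics

random.seed(0)

def make_env(n, x1_shift, spurious):
    """Synthetic environment: Y is invariantly caused by X1 (slope 2);
    X2 tracks Y only when `spurious` is True (a non-causal correlate)."""
    x1 = [random.gauss(x1_shift, 1.0) for _ in range(n)]
    y = [2.0 * v + random.gauss(0.0, 0.5) for v in x1]
    if spurious:
        x2 = [v + random.gauss(0.0, 0.3) for v in y]
    else:
        x2 = [random.gauss(0.0, 1.0) for _ in range(n)]
    return {"X1": x1, "X2": x2, "Y": y}

# Two environments: the X1 -> Y mechanism is shared; X2's link to Y is not.
envs = [make_env(500, 0.0, spurious=True), make_env(500, 2.0, spurious=False)]

def residuals(subset, data):
    """Pooled OLS residuals of Y on at most one predictor (sketch only)."""
    y = data["Y"]
    ybar = statistics.fmean(y)
    if not subset:
        return [v - ybar for v in y]
    x = data[subset[0]]
    xbar = statistics.fmean(x)
    sxx = sum((v - xbar) ** 2 for v in x)
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    slope = sxy / sxx
    return [b - (ybar + slope * (a - xbar)) for a, b in zip(x, y)]

def welch_t(a, b):
    """Welch t-statistic for a difference in residual means."""
    va, vb = statistics.variance(a), statistics.variance(b)
    return (statistics.fmean(a) - statistics.fmean(b)) / (
        (va / len(a) + vb / len(b)) ** 0.5
    )

# Pool both environments, then check residual invariance per environment.
pooled = {k: envs[0][k] + envs[1][k] for k in envs[0]}
n0 = len(envs[0]["Y"])

accepted = []
for subset in [(), ("X1",), ("X2",)]:
    r = residuals(subset, pooled)
    t = welch_t(r[:n0], r[n0:])   # compare env-1 vs env-2 residual means
    if abs(t) < 3.0:              # crude invariance criterion (assumed)
        accepted.append(subset)

# ICP estimate: intersection of all invariant predictor sets.
parents = set.intersection(*(set(s) for s in accepted)) if accepted else set()
print(parents)
```

On this toy data only the set containing X1 passes the invariance check: the empty set fails because the mean of Y shifts across environments, and X2 fails because its correlation with Y exists in only one environment, so the intersection recovers X1 as the sole causal parent.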
Problem

Research questions and friction points this paper is trying to address.

self-reflection
causal analysis
behavior hierarchy
language models
generalizability
Innovation

Methods, ideas, or system contributions that make the work stand out.

causal analysis
self-reflection
behavior hierarchy
invariant causal prediction
language models