Microsaccade-Inspired Probing: Positional Encoding Perturbations Reveal LLM Misbehaviours

📅 2025-09-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of detecting undesirable behaviors in large language models (LLMs), such as factual inaccuracies, safety violations, toxicity, and backdoor attacks, without fine-tuning, supervision, or task-specific adaptation. Inspired by microsaccades, the tiny involuntary eye movements of biological vision, we propose a lightweight, unsupervised self-diagnostic method that introduces fine-grained positional encoding perturbations to activate latent "self-failure alert" signals inherently embedded in pretrained LLMs. Our approach integrates positional perturbation, cross-modal analogy modeling, and unsupervised response analysis. Evaluated across multiple mainstream LLMs, it significantly improves detection rates for diverse undesirable responses while incurring negligible computational overhead. The key contribution lies in identifying and leveraging the model's intrinsic robustness boundary as an "internal diagnostic interface," establishing a novel paradigm for trustworthy LLM evaluation.
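The summary does not spell out how the positional perturbations are applied. As a rough, hypothetical illustration of the general idea only, the sketch below jitters the input positions of a standard sinusoidal positional encoding by a tiny epsilon (the "microsaccade") and measures per-token drift from the clean encoding. The function names and the choice of sinusoidal encodings are assumptions for illustration; the paper's actual perturbation scheme and its response-level analysis may differ substantially.

```python
import numpy as np

def sinusoidal_encoding(positions, d_model):
    # Standard sinusoidal positional encoding; accepts fractional positions.
    pe = np.zeros((len(positions), d_model))
    div = np.exp(-np.log(10000.0) * np.arange(0, d_model, 2) / d_model)
    pe[:, 0::2] = np.sin(positions[:, None] * div)
    pe[:, 1::2] = np.cos(positions[:, None] * div)
    return pe

def microsaccade_drift(seq_len, d_model, eps=0.05, seed=0):
    # Hypothetical probe: jitter each integer position by a small random
    # amount in [-eps, eps], then measure how far each token's encoding
    # drifts. The paper analyses model *responses* under perturbation;
    # encoding drift is shown here only to make the perturbation concrete.
    rng = np.random.default_rng(seed)
    pos = np.arange(seq_len, dtype=float)
    jitter = rng.uniform(-eps, eps, size=seq_len)
    clean = sinusoidal_encoding(pos, d_model)
    perturbed = sinusoidal_encoding(pos + jitter, d_model)
    return np.linalg.norm(perturbed - clean, axis=1)

drift = microsaccade_drift(seq_len=16, d_model=64)
print(drift.shape)
```

In the paper's framing, one would feed perturbed positions into a pretrained LLM and compare its responses against the unperturbed run; large response divergence would then serve as the unsupervised failure signal.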

📝 Abstract
We draw inspiration from microsaccades, tiny involuntary eye movements that reveal hidden dynamics of human perception, to propose an analogous probing method for large language models (LLMs). Just as microsaccades expose subtle but informative shifts in vision, we show that lightweight position encoding perturbations elicit latent signals that indicate model misbehaviour. Our method requires no fine-tuning or task-specific supervision, yet detects failures across diverse settings including factuality, safety, toxicity, and backdoor attacks. Experiments on multiple state-of-the-art LLMs demonstrate that these perturbation-based probes surface misbehaviours while remaining computationally efficient. These findings suggest that pretrained LLMs already encode the internal evidence needed to flag their own failures, and that microsaccade-inspired interventions provide a pathway for detecting and mitigating undesirable behaviours.
Problem

Research questions and friction points this paper is trying to address.

Detecting LLM failures through positional encoding perturbations
Identifying model misbehaviors without fine-tuning or supervision
Surfacing internal evidence of failures in pretrained language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Microsaccade-inspired positional encoding perturbations probe LLMs
Perturbations detect failures without fine-tuning or supervision
Computationally efficient method surfaces misbehaviours across diverse settings
Authors
Rui Melo (Carnegie Mellon University)
Rui Abreu (Meta Platforms, Inc. and University of Porto/INESC-ID)
Corina S. Pasareanu (Carnegie Mellon University)