Eye Gaze Tells You Where to Compute: Gaze-Driven Efficient VLMs

📅 2025-09-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Vision-language models (VLMs) suffer from inefficient inference on edge devices because of redundant visual tokens; existing token-pruning methods rely on architectural modifications, intermediate activations, or saliency modeling, which introduces accuracy degradation and misalignment between prompts and image regions. Method: We propose GazeVLM, the first training-free framework that dynamically allocates visual tokens guided by human eye-tracking gaze signals. It emulates foveal-peripheral perception by fusing high-resolution gaze-centered crops with a low-resolution global view, yielding a multi-scale input, and integrates plug-and-play token pruning. Contribution/Results: Without altering the model architecture or accessing intermediate layers, GazeVLM reduces visual tokens by up to 93.1%, total tokens by up to 59.6%, and FLOPs by 50%, while preserving, and in some cases improving, answer quality. This significantly enhances the feasibility of real-time inference for AR/VR and other edge applications.
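The foveal-peripheral input described above can be sketched as follows: crop a high-resolution region around the gaze point and pair it with a coarse global view. This is an illustrative sketch only; the function name, default sizes, and stride-based downsampling are assumptions, not GazeVLM's actual API.

```python
import numpy as np

def gaze_multiscale_views(image, gaze_xy, roi_size=224, global_size=112):
    """Build a foveal-peripheral input pair: a high-resolution crop
    centered on the gaze point plus a downsampled global view.
    Names and sizes here are illustrative, not the paper's configuration."""
    h, w = image.shape[:2]
    gx, gy = gaze_xy
    # Clamp the ROI so the crop stays inside the image bounds.
    half = roi_size // 2
    x0 = min(max(gx - half, 0), max(w - roi_size, 0))
    y0 = min(max(gy - half, 0), max(h - roi_size, 0))
    roi = image[y0:y0 + roi_size, x0:x0 + roi_size]
    # Naive strided subsampling stands in for proper image resizing.
    sy, sx = max(h // global_size, 1), max(w // global_size, 1)
    global_view = image[::sy, ::sx]
    return roi, global_view
```

In a real pipeline the two views would be tokenized separately and concatenated, so the VLM spends most of its visual tokens on the gazed region.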

📝 Abstract
Vision-Language Models (VLMs) deliver impressive performance in understanding visual content with language instructions. However, redundancy in vision tokens degrades the inference efficiency of VLMs, which hinders real-time use on edge consumer devices such as AR/VR headsets. Existing efficiency methods commonly prune visual tokens using learned saliency, sparse attention schedules, or controller policies, but they often require architectural modification or access to intermediate activations. These pipelines add inference-time modules that increase compute and memory and often trade away accuracy. Moreover, they suffer from misalignment between the prompts and the regions of interest in the images: without human guidance, the model may focus on the wrong regions and miss small, high-frequency details when prompts or scenes change. In this paper, we propose GazeVLM, a training-free framework that uses human eye gaze as a natural supervisory signal to allocate computation where it matters. By extracting gaze-driven regions of interest (ROIs) and optionally combining them with a low-resolution global view, GazeVLM mimics fovea-periphery perception to cut redundant visual tokens while preserving task-relevant details. We evaluate visual question answering with Qwen2.5-VL-3B/7B on the VOILA-COCO benchmark with human gaze. Answer quality is assessed by GPT-4o pairwise judging and a weighted score over coverage, accuracy, details, and fluency; efficiency is measured by token counts and FLOPs. GazeVLM reduces visual tokens by up to 93.1%, total tokens by up to 59.6%, and FLOPs by 50%, while maintaining answer quality on par with or better than full-resolution baselines. Our results show that aligning model computation with human gaze offers a simple, plug-and-play path toward efficient VLM inference on consumer devices.
Problem

Research questions and friction points this paper is trying to address.

- Vision-Language Models have inefficient inference due to redundant visual tokens
- Existing methods require architectural changes and suffer from accuracy trade-offs
- Models often misalign with image regions of interest without human guidance
Innovation

Methods, ideas, or system contributions that make the work stand out.

- Uses human eye gaze as a supervisory signal
- Extracts gaze-driven regions of interest for computation
- Combines ROIs with a low-resolution global view
Qinyu Chen
Assistant Professor, Leiden University
Edge AI, IC design, Neuromorphic Computing, Event-based vision, AR/VR
Jiawen Qi
Leiden Institute of Advanced Computer Science (LIACS), Leiden University, Leiden, The Netherlands