LLMind: Bio-inspired Training-free Adaptive Visual Representations for Vision-Language Models

📅 2026-03-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of conventional vision-language models, which rely on uniform sampling and fail to emulate the adaptive nature of human vision, leading to significant performance degradation under low pixel budgets. To overcome this, the authors propose LLMind, a training-free, plug-and-play framework that, for the first time, integrates human foveated encoding and cortical magnification mechanisms into vision-language modeling. LLMind employs a bio-inspired adaptive sampling strategy (BASS) coupled with closed-loop semantic feedback (CSF) at test time to dynamically attend to semantically critical regions and achieve perceptual-linguistic alignment. Using only 1%–5% of the original pixels, the method recovers 82%–97% of full-resolution performance on VQAv2, Seed-Bench, and A-OKVQA, yielding average improvements over uniform sampling of +20%, +38%, and +37%, respectively.
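
To make the sampling mechanism concrete, below is a minimal, hypothetical sketch of a Möbius-parameterized foveated sampling grid of the kind the abstract describes for BASS. The function name `mobius_foveated_grid`, the specific disk-automorphism form, and the `strength` knob are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a Möbius-parameterized foveated sampling grid (not the authors' code).
import torch
import torch.nn.functional as F

def mobius_foveated_grid(h_out, w_out, fixation=(0.0, 0.0), strength=0.8):
    """Build a non-uniform sampling grid that concentrates samples near `fixation`.

    Coordinates follow the grid_sample convention ([-1, 1]^2). The uniform output
    grid is pushed through a Möbius disk automorphism w = (z + a) / (1 + conj(a) z),
    which magnifies the neighbourhood of the fixation point a, so more output
    pixels are spent on the fixated region. `fixation` should lie well inside
    the unit square; (0, 0) reduces to uniform sampling.
    """
    ys = torch.linspace(-1.0, 1.0, h_out)
    xs = torch.linspace(-1.0, 1.0, w_out)
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")
    z = torch.complex(gx, gy)                          # uniform output grid as complex numbers
    a = torch.complex(torch.tensor(fixation[0] * strength),
                      torch.tensor(fixation[1] * strength))
    w = (z + a) / (1 + torch.conj(a) * z)              # Möbius map centred on the fixation
    grid = torch.stack([w.real, w.imag], dim=-1)       # (h_out, w_out, 2) sample locations
    return grid.unsqueeze(0)                           # add batch dimension

# Usage: a low pixel-budget view (96x96, roughly 8% of 336x336) that keeps detail near the fixation.
image = torch.rand(1, 3, 336, 336)                     # stand-in for a VLM input image
grid = mobius_foveated_grid(96, 96, fixation=(0.2, -0.1))
foveated = F.grid_sample(image, grid, padding_mode="border", align_corners=True)
```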

📝 Abstract
Vision-Language Models (VLMs) typically assume a uniform spatial fidelity across the entire field of view of visual inputs, dedicating equal precision even to uninformative regions. By contrast, human vision is neither uniform nor static; it is adaptive, selective, and resource-efficient. In light of this, we present the first systematic analysis of bio-inspired visual representation methods, providing insights for more efficient and adaptive VLMs. We propose LLMind (Looking Like the Mind), a novel training-free framework that mimics foveated encoding and cortical magnification in human vision to achieve adaptive, efficient representations for VLMs under tight pixel budgets. Our key idea is a Bio-inspired Adaptive Sampling Strategy (BASS): a Möbius-parameterized module that performs non-uniform sampling while preserving global scene structure. On top of BASS, we introduce closed-loop semantic feedback (CSF) via test-time adaptation to align perceptual saliency with textual information from the frozen VLM. We evaluate LLMind against uniform and other sampling baselines across diverse scene-level and region-guided visual question answering benchmarks. The results show dramatic gains, with average improvements of +20% on VQAv2, +38% on Seed-Bench, and +37% on A-OKVQA compared to uniform sampling under tight pixel budgets. Notably, LLMind retains up to 82%, 92%, and 97% of the full-resolution performance using only 1%, 3%, and 5% of the pixels, respectively. Moreover, LLMind is lightweight, plug-and-play, and compatible with existing VLMs without requiring architectural changes.
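
To picture the closed-loop semantic feedback idea, here is a minimal, hypothetical sketch of a test-time loop that nudges a fixation point toward the peak of a text-conditioned saliency map. The `text_conditioned_saliency` stub, the damped update rule, and the name `csf_refine_fixation` are assumptions for illustration; the paper's CSF derives saliency from the frozen VLM rather than the random placeholder used here.

```python
# Hypothetical closed-loop semantic feedback (CSF) sketch; the saliency stub and the
# damped fixation-update rule are illustrative assumptions, not the paper's method.
import torch

def text_conditioned_saliency(image, question):
    """Stand-in for saliency derived from a frozen VLM (e.g. text-conditioned
    cross-attention over image patches). Returns a random map so the example runs."""
    return torch.rand(image.shape[-2], image.shape[-1])

def csf_refine_fixation(image, question, steps=3):
    """Iteratively move the fixation toward the peak of the text-conditioned saliency."""
    h, w = image.shape[-2:]
    fixation = torch.tensor([0.0, 0.0])                # start at the image centre, coords in [-1, 1]
    for _ in range(steps):
        saliency = text_conditioned_saliency(image, question)
        peak_y, peak_x = divmod(saliency.flatten().argmax().item(), w)
        target = torch.tensor([2 * peak_x / (w - 1) - 1,
                               2 * peak_y / (h - 1) - 1])
        fixation = 0.5 * fixation + 0.5 * target       # damped step toward the salient peak
        # In a full pipeline the image would be re-sampled with the foveated grid
        # (see the sketch above) at the new fixation before the next feedback step.
    return fixation

# Usage with the same stand-in image size as above.
fix = csf_refine_fixation(torch.rand(1, 3, 336, 336), "What is the person holding?")
```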
Problem

Research questions and friction points this paper is trying to address.

Vision-Language Models
adaptive visual representation
foveated vision
pixel budget
bio-inspired perception
Innovation

Methods, ideas, or system contributions that make the work stand out.

bio-inspired vision
training-free adaptation
foveated representation
adaptive sampling
vision-language models
Soumyaratna Debnath
School of Electrical and Electronic Engineering (EEE)
Computer Vision, Computer Graphics, 3D Shape Analysis, Pose Estimation, Metaheuristics
Bui Duc Manh
Nanyang Technological University, Singapore
Zinan Liu
Nanyang Technological University, Singapore
Lin Wang
Nanyang Technological University, Singapore