🤖 AI Summary
To address the high computational cost of multimodal large language models (MLLMs) and the semantic inaccuracies that visual blind spots cause in lightweight models, this paper proposes the Sharp-Eyed Refinement framework: a lightweight, locally deployable image captioning model. Methodologically, it introduces (1) the DeepLens module, which enhances fine-grained visual localization and mitigates visual blind spots in compact architectures; and (2) a vision-expert system built on a 125M-parameter language model, integrating attention optimization with region-focused visual representation extraction for efficient vision–language alignment. Experimental results demonstrate that the model significantly outperforms existing small-scale captioning models on both single-sentence and fine-grained description tasks, achieving performance comparable to mainstream MLLMs. These findings validate the effectiveness and practical viability of lightweight vision-expert systems for resource-constrained deployment.
📝 Abstract
Image captioning is fundamental for applications like video instruction systems and exploration robots, yet deploying such models on local devices is challenging due to the high computational demands of multimodal large language models (MLLMs). To address this, we first explore lightweight captioning by implementing a specialist based on a 125M-parameter language model, 56 times smaller than LLaMA-7B, and evaluating its performance on both single-sentence and detailed captioning tasks. Surprisingly, we find that our model can achieve performance comparable to large multimodal generalists, suggesting its potential to serve as a strong visual specialist for on-device applications. While promising, our model also exhibits a limitation: like other MLLMs, it suffers from visual blindness, occasionally producing semantic captioning errors. Through toy experiments, we investigate the underlying causes and observe that the problems arise from ineffective attention mechanisms and limited visual representations. To alleviate them, we develop a novel captioning framework, Sharp-Eyed Refinement, which enhances caption quality through improved visual grounding. At its core, our DeepLens extracts detailed visual representations by concentrating on informative regions identified during the initial glance. Our experiments confirm both the advantages of our specialist over prior small captioning models and large generalists, and the effectiveness of our framework.
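The glance-then-refine idea behind Sharp-Eyed Refinement can be sketched in a few lines. The snippet below is a minimal illustrative mock-up, not the paper's implementation: the function name, the top-k region selection, and the mean-pooling fusion are all assumptions standing in for DeepLens's actual detailed re-extraction step.

```python
import numpy as np

def sharp_eyed_refinement(region_feats, glance_attn, top_k=2):
    """Hypothetical sketch of glance-then-refine captioning conditioning.

    region_feats: (num_regions, dim) coarse visual features, one per region
    glance_attn:  (num_regions,) attention weights from the initial glance
    Returns a single vector emphasizing the top-k informative regions.
    """
    # 1. Initial glance: rank regions by attention mass, keep the top-k.
    informative = np.argsort(glance_attn)[::-1][:top_k]
    # 2. Refinement: re-weight the focused regions by their attention,
    #    a stand-in for DeepLens's detailed feature re-extraction.
    focused = region_feats[informative] * glance_attn[informative, None]
    # 3. Fuse into one conditioning vector for the caption decoder.
    return focused.mean(axis=0)

# Toy example: 4 image regions with 3-dim features.
feats = np.eye(4, 3)
attn = np.array([0.1, 0.6, 0.25, 0.05])
vec = sharp_eyed_refinement(feats, attn, top_k=2)  # shape (3,)
```

The design choice to re-use the glance attention as a selection signal (rather than re-running a full encoder) is what keeps such a refinement pass cheap enough for a 125M-parameter specialist.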