🤖 AI Summary
This work addresses the poor feature robustness and high computational overhead of vision-language models (VLMs) in real-time scenarios with image noise and low visibility. We propose a frequency-spatial dual-domain co-modeling framework that, for the first time, integrates low-rank frequency-domain features derived from the discrete Fourier transform (DFT) with the retained spatial weights of pre-trained VLMs (e.g., CLIP or SigLIP), combined with low-rank adaptation (LoRA) for efficient fine-tuning. Crucially, the method incurs no additional inference latency while enhancing cross-modal representation robustness against image noise and degradation. On caption generation and visual question answering (VQA) benchmarks with varying levels of Gaussian noise, our approach achieves metrics comparable to the CLIP ViT-L/14 and SigLIP baselines. On real-world images captured by a RealSense camera mounted on an unmanned ground vehicle (UGV), it produces more detailed and contextually relevant responses. These results indicate that frequency-domain priors can support real-time, robust VLM understanding under challenging visual conditions.
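The idea of folding a frequency-domain low-rank update into retained spatial weights can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the parameterization (a low-rank complex spectrum `U @ V` mapped to the spatial domain with an inverse 2-D DFT), the matrix sizes, and the names `freq_lowrank_delta` and `merge_weights` are assumptions made for illustration. The sketch only shows why a merged update adds no extra matrix multiplications at inference time.

```python
import numpy as np

def freq_lowrank_delta(U, V):
    """Map a low-rank complex spectrum U @ V to a real spatial-domain update.

    Assumption: the frequency-domain features are parameterized as a low-rank
    spectrum; the paper's actual parameterization may differ.
    """
    spectrum = U @ V                    # (d_out, d_in) complex, rank <= r
    return np.fft.ifft2(spectrum).real  # inverse 2-D DFT, keep the real part

def merge_weights(W0, A, B, U, V):
    """Fold the LoRA update (B @ A) and the frequency update into frozen W0."""
    return W0 + B @ A + freq_lowrank_delta(U, V)

rng = np.random.default_rng(0)
d_out, d_in, rank = 8, 6, 2             # hypothetical sizes

W0 = rng.standard_normal((d_out, d_in))             # stand-in for a pretrained weight
A = rng.standard_normal((rank, d_in)) * 0.01        # LoRA factors
B = rng.standard_normal((d_out, rank)) * 0.01
U = (rng.standard_normal((d_out, rank))
     + 1j * rng.standard_normal((d_out, rank))) * 0.01  # spectral factors
V = (rng.standard_normal((rank, d_in))
     + 1j * rng.standard_normal((rank, d_in))) * 0.01

W = merge_weights(W0, A, B, U, V)
x = rng.standard_normal(d_in)

# One matmul with the merged weight matches the three-branch computation.
y_merged = W @ x
y_branches = W0 @ x + (B @ A) @ x + freq_lowrank_delta(U, V) @ x
```

Because both updates are added to `W0` once after training, the deployed layer keeps the original shape and cost, which is the usual argument for zero added latency in LoRA-style methods.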
📝 Abstract
Situational awareness applications rely heavily on real-time processing of visual and textual data to provide actionable insights. Vision-language models (VLMs) have become essential tools for interpreting complex environments by connecting visual inputs with natural language descriptions. However, these models often face computational challenges, especially when required to run efficiently in real-world environments. This research presents a novel VLM framework that leverages frequency-domain transformations and low-rank adaptation (LoRA) to enhance feature extraction, scalability, and efficiency. Unlike traditional VLMs, which rely solely on spatial-domain representations, our approach incorporates Discrete Fourier Transform (DFT)-based low-rank features while retaining pretrained spatial weights, enabling robust performance in noisy or low-visibility scenarios. We evaluated the proposed model on caption generation and Visual Question Answering (VQA) tasks using benchmark datasets with varying levels of Gaussian noise. Quantitative results demonstrate that our model achieves evaluation metrics comparable to state-of-the-art VLMs, such as CLIP ViT-L/14 and SigLIP. Qualitative analysis further reveals that our model provides more detailed and contextually relevant responses, particularly for real-world images captured by a RealSense camera mounted on an Unmanned Ground Vehicle (UGV).
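The noise-robustness evaluation described above depends on corrupting images with Gaussian noise at several levels. A minimal sketch of that step follows. The noise levels, the image size, and the helper name `add_gaussian_noise` are hypothetical, since the abstract does not state the values used.

```python
import numpy as np

def add_gaussian_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation sigma.

    image is a float array scaled to [0, 1]; the result is clipped back
    to that range.
    """
    noisy = image + rng.normal(0.0, sigma, size=image.shape)
    return np.clip(noisy, 0.0, 1.0)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))        # stand-in for a benchmark image
noise_levels = [0.0, 0.05, 0.1, 0.2]     # hypothetical sigma values

noisy_images = {s: add_gaussian_noise(image, s, rng) for s in noise_levels}

# Mean absolute deviation from the clean image at each noise level.
mean_abs_error = {s: float(np.abs(noisy_images[s] - image).mean())
                  for s in noise_levels}
```

Each noisy copy would then be passed to the model under test, and captioning or VQA metrics compared across noise levels.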