AI Summary
This work addresses the challenge of distinguishing visually similar materials such as glass and plastic, whose specular reflections and transparency hinder vision-based identification in open-world settings. Existing camera-radar fusion methods are limited to closed-set categories and offer little semantic interpretability. The authors propose VLMaterial, a training-free multimodal fusion framework that, for the first time, integrates radar-based dielectric constant estimation into a vision-language model (VLM). Its dual-pipeline architecture produces material candidates from an optical pipeline and physical parameters from a radar pipeline, using a peak reflection cell area (PRCA) method to extract the dielectric constant. Combined with context-augmented generation (CAG) and an uncertainty-driven adaptive fusion mechanism, the approach achieves 96.08% material recognition accuracy in more than 120 real-world experiments involving 41 everyday objects and 4 visually deceptive counterfeits. This is on par with state-of-the-art closed-set methods without task-specific data collection or training, and it yields interpretable, open-set material recognition.
Abstract
Accurate material recognition is a fundamental capability for intelligent perception systems to interact safely and effectively with the physical world. For instance, distinguishing visually similar objects like glass and plastic cups is critical for safety but challenging for vision-based methods due to specular reflections, transparency, and visual deception. While millimeter-wave (mmWave) radar offers robust material sensing regardless of lighting, existing camera-radar fusion methods are limited to closed-set categories and lack semantic interpretability. In this paper, we introduce VLMaterial, a training-free framework that fuses vision-language models (VLMs) with domain-specific radar knowledge for physics-grounded material identification. First, we propose a dual-pipeline architecture: an optical pipeline uses the Segment Anything Model (SAM) and a VLM for material candidate proposals, while an electromagnetic characterization pipeline extracts the intrinsic dielectric constant from radar signals via an effective peak reflection cell area (PRCA) method and weighted vector synthesis. Second, we employ a context-augmented generation (CAG) strategy to equip the VLM with radar-specific physical knowledge, enabling it to interpret electromagnetic parameters as stable references. Third, an adaptive fusion mechanism integrates the outputs of both sensors by resolving cross-modal conflicts based on uncertainty estimation. We evaluated VLMaterial in over 120 real-world experiments involving 41 diverse everyday objects and 4 typical visually deceptive counterfeits across varying environments. Experimental results demonstrate that VLMaterial achieves a recognition accuracy of 96.08%, delivering performance on par with state-of-the-art closed-set benchmarks while eliminating the need for extensive task-specific data collection and training.
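The idea of resolving cross-modal conflicts through uncertainty estimation can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not specify the fusion rule, so the entropy-based weighting, the Gaussian radar likelihood, the function names, and the reference permittivity values below are all assumptions chosen for illustration.

```python
import math

# Hypothetical reference relative permittivities. These are illustrative
# placeholders, not values taken from the paper.
REFERENCE_EPS = {"glass": 6.0, "plastic": 2.5}


def entropy_confidence(probs):
    """Return 1 minus the normalized Shannon entropy of a distribution.

    The result is 1.0 for a one-hot distribution and 0.0 for a uniform one.
    """
    n = len(probs)
    if n < 2:
        return 1.0
    h = -sum(p * math.log(p) for p in probs.values() if p > 0)
    return 1.0 - h / math.log(n)


def radar_likelihoods(eps_hat, eps_sigma, candidates):
    """Score each candidate by a Gaussian likelihood of the measured permittivity.

    eps_hat is the estimated dielectric constant and eps_sigma its assumed
    standard deviation. Scores are normalized to sum to 1.
    """
    scores = {
        m: math.exp(-0.5 * ((eps_hat - REFERENCE_EPS[m]) / eps_sigma) ** 2)
        for m in candidates
    }
    total = sum(scores.values())
    if total == 0.0:
        # The measurement is far from every reference: fall back to uniform.
        return {m: 1.0 / len(scores) for m in scores}
    return {m: s / total for m, s in scores.items()}


def fuse(vision_probs, eps_hat, eps_sigma):
    """Blend vision and radar distributions, weighting each by its confidence."""
    radar_probs = radar_likelihoods(eps_hat, eps_sigma, vision_probs)
    w_vision = entropy_confidence(vision_probs)
    w_radar = entropy_confidence(radar_probs)
    total = w_vision + w_radar
    alpha = 0.5 if total == 0.0 else w_vision / total
    fused = {
        m: alpha * vision_probs[m] + (1.0 - alpha) * radar_probs[m]
        for m in vision_probs
    }
    return max(fused, key=fused.get), fused


# The vision pipeline is nearly undecided, while the radar estimate sits
# close to the glass reference, so the radar evidence dominates the blend.
label, fused = fuse({"plastic": 0.55, "glass": 0.45}, eps_hat=5.8, eps_sigma=0.5)
print(label, fused)
```

Weighting each modality by the sharpness of its own distribution means that an ambiguous visual guess, such as a transparent cup, defers to a confident radar measurement, and a noisy radar reading defers to a confident visual guess.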