🤖 AI Summary
To address low accuracy and poor interpretability in fabric attribute recognition for textile manufacturing, apparel production, and smart retail, this paper proposes a robotic sorting system driven by multimodal large language models (MLLMs). The system integrates RGB vision, visuotactile, and pressure-sensing data into an end-to-end framework for fabric attribute understanding and decision-making. We introduce a multimodal explanation-guided knowledge distillation method combined with supervised fine-tuning, improving both attribute ranking accuracy and decision interpretability. The released Fabric-Llama-90B model outperforms pretrained vision-language baselines on fabric attribute ranking and selection tasks. We also open-source a multimodal dataset of 220 fabric samples with synchronized RGB, visuotactile, and pressure data, establishing a new benchmark and resource for MLLM research in embodied interaction scenarios.
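The snippet below is a minimal sketch of how explanation-guided distillation might be combined with supervised fine-tuning; it is not the paper's implementation. The toy dimensions, variable names, mixing weight `alpha`, and temperature are all illustrative assumptions: the student model is supervised on ground-truth attribute labels (the SFT term) while also matching the token distribution of a teacher model's explanations (the distillation term).

```python
# Sketch only: combining supervised fine-tuning with explanation-guided
# distillation. All names and hyperparameters are assumptions, not the
# authors' released code.
import torch
import torch.nn.functional as F

vocab_size = 32000          # assumed tokenizer vocabulary size
batch, seq_len = 2, 16      # toy dimensions for illustration only

# Toy "student" logits; in practice these would come from the fine-tuned
# MLLM conditioned on RGB, visuotactile, and pressure inputs.
student_logits = torch.randn(batch, seq_len, vocab_size, requires_grad=True)

# Ground-truth answer tokens (e.g., the correct fabric-attribute ranking).
gt_tokens = torch.randint(0, vocab_size, (batch, seq_len))

# Frozen teacher logits for its natural-language explanation of the decision.
teacher_logits = torch.randn(batch, seq_len, vocab_size)

# SFT term: standard next-token cross-entropy against ground-truth labels.
sft_loss = F.cross_entropy(
    student_logits.reshape(-1, vocab_size), gt_tokens.reshape(-1)
)

# Distillation term: KL divergence between student and teacher token
# distributions, so the student inherits the teacher's explanation, not
# just its final answer.
temperature = 2.0
kd_loss = F.kl_div(
    F.log_softmax(student_logits / temperature, dim=-1),
    F.softmax(teacher_logits / temperature, dim=-1),
    reduction="batchmean",
) * temperature**2

# Weighted combination; alpha is an assumed hyperparameter.
alpha = 0.5
total_loss = alpha * sft_loss + (1 - alpha) * kd_loss
total_loss.backward()
print(float(sft_loss), float(kd_loss), float(total_loss))
```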
📝 Abstract
Choosing the right fabric is crucial for meeting functional and quality requirements in robotic applications for textile manufacturing, apparel production, and smart retail. We present MLLM-Fabric, a robotic framework powered by multimodal large language models (MLLMs) for fabric sorting and selection. The system includes a robotic arm, a camera, a visuotactile sensor, and a pressure sensor. It employs supervised fine-tuning and multimodal explanation-guided knowledge distillation to accurately classify and rank fabric properties. To facilitate further research, we release a dataset of 220 unique fabric samples, including RGB images and synchronized visuotactile and pressure data. Experimental results show that our Fabric-Llama-90B model consistently outperforms pretrained vision-language baselines in both property ranking accuracy and selection reliability.
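For concreteness, the following is a minimal sketch of how one synchronized sample from the released dataset could be represented in code. The directory layout, file names (`rgb.npy`, `visuotactile.npy`, `pressure.npy`), and the `FabricSample` structure are illustrative assumptions, not the dataset's documented schema.

```python
# Sketch of an assumed per-sample layout for the 220-fabric dataset:
# each fabric directory holds one RGB image, one visuotactile image, and
# a synchronized pressure trace. Names and shapes are assumptions.
from dataclasses import dataclass
from pathlib import Path

import numpy as np


@dataclass
class FabricSample:
    fabric_id: str            # one of the 220 unique fabric samples
    rgb: np.ndarray           # RGB image of the fabric, shape (H, W, 3)
    visuotactile: np.ndarray  # visuotactile sensor image, shape (H, W, 3)
    pressure: np.ndarray      # pressure readings during contact, shape (T,)


def load_sample(root: Path, fabric_id: str) -> FabricSample:
    """Load one fabric's synchronized modalities from an assumed layout."""
    sample_dir = root / fabric_id
    return FabricSample(
        fabric_id=fabric_id,
        rgb=np.load(sample_dir / "rgb.npy"),
        visuotactile=np.load(sample_dir / "visuotactile.npy"),
        pressure=np.load(sample_dir / "pressure.npy"),
    )
```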