🤖 AI Summary
This work addresses the challenge in social robot navigation that lightweight vision-language models suffer from limited reasoning capabilities, while large models incur prohibitive computational costs, making it difficult to balance efficiency and performance. To this end, the authors propose a Group Competitive Learning (GCL) strategy that integrates global semantics and distributional regularization through a Group Competitive Objective (GCO), alongside an Asymmetric Group Optimization (AGO) mechanism, to enhance semantic understanding and decision-making in compact models. Experiments based on Qwen2.5-VL-3B demonstrate that the proposed method achieves an F1 score of 0.968, a 40% improvement over supervised fine-tuning that even outperforms the original 8B model by 28%, thereby significantly boosting both accuracy and deployment efficiency.
📝 Abstract
Social robot navigation requires a sophisticated integration of scene semantics and human social norms. Scaling up Vision Language Models (VLMs) generally improves reasoning and decision-making capabilities for socially compliant navigation. However, increased model size incurs substantial computational overhead, limiting suitability for real-time robotic deployment. Conversely, lightweight VLMs enable efficient inference but often exhibit weaker reasoning and decision-making performance in socially complex environments. Achieving both strong reasoning ability and efficiency remains an open challenge. To bridge this gap, we propose Group Competitive Learning (GCL), a strategy designed to amplify the capabilities of lightweight VLMs. Our strategy introduces the Group Competitive Objective (GCO) to harmonize global semantics with distributional regularization, alongside Asymmetric Group Optimization (AGO) to explore the upper limits of model performance. Empirical evaluations on social navigation benchmarks demonstrate that GCL significantly elevates VLM performance. Specifically, GCL enables the Qwen2.5-VL-3B learner model and the Qwen3-VL-4B guide model to achieve F1 scores of 0.968 and 0.914, respectively, representing 40% and 12% improvements over vanilla supervised fine-tuning (SFT). Notably, under vanilla SFT, the 3B model initially trails the 8B model (F1: 0.692 vs. 0.755). However, through GCL, the 3B model outperforms the 8B baseline by 28%. These results suggest that GCL provides an effective solution for achieving both high accuracy and computational efficiency in real-world deployment.
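The percentage gains quoted above follow directly from the reported F1 scores; a minimal sanity check (the helper name `rel_improvement` is ours, and all numbers are taken verbatim from the abstract):

```python
def rel_improvement(new: float, baseline: float) -> float:
    """Percentage improvement of `new` over `baseline`."""
    return (new / baseline - 1) * 100

f1_3b_gcl = 0.968  # Qwen2.5-VL-3B trained with GCL
f1_3b_sft = 0.692  # same 3B model under vanilla SFT
f1_8b_sft = 0.755  # 8B model under vanilla SFT

print(round(rel_improvement(f1_3b_gcl, f1_3b_sft)))  # 40 (% over vanilla SFT)
print(round(rel_improvement(f1_3b_gcl, f1_8b_sft)))  # 28 (% over the 8B baseline)
```

Both figures round to the 40% and 28% improvements stated in the abstract.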