🤖 AI Summary
This work addresses the challenge in social robot navigation that lightweight vision-language models suffer from limited reasoning capabilities, while large models incur prohibitive computational costs, making it difficult to balance efficiency and performance. To this end, the authors propose a Group Competitive Learning (GCL) strategy that integrates global semantics and distributional regularization through a Group Competitive Objective (GCO), alongside an Asymmetric Group Optimization (AGO) mechanism, to enhance semantic understanding and decision-making in compact models. Experiments based on Qwen2.5-VL-3B demonstrate that the proposed method achieves an F1 score of 0.968, a 40% improvement over supervised fine-tuning that even outperforms the original 8B model by 28%, thereby significantly boosting both accuracy and deployment efficiency.
📝 Abstract
Social robot navigation requires a sophisticated integration of scene semantics and human social norms. Scaling up Vision Language Models (VLMs) generally improves reasoning and decision-making capabilities for socially compliant navigation. However, increased model size incurs substantial computational overhead, limiting suitability for real-time robotic deployment. Conversely, lightweight VLMs enable efficient inference but often exhibit weaker reasoning and decision-making performance in socially complex environments. Achieving both strong reasoning ability and efficiency remains an open challenge. To bridge this gap, we propose Group Competitive Learning (GCL), a strategy designed to amplify the capabilities of lightweight VLMs. Our strategy introduces the Group Competitive Objective (GCO) to harmonize global semantics with distributional regularization, alongside Asymmetric Group Optimization (AGO) to explore the upper limits of model performance. Empirical evaluations on social navigation benchmarks demonstrate that GCL significantly elevates VLM performance. Specifically, GCL enables the Qwen2.5-VL-3B learner model and the Qwen3-VL-4B guide model to achieve F1 scores of 0.968 and 0.914, respectively, representing 40% and 12% improvements over vanilla supervised fine-tuning (SFT). Notably, under vanilla SFT, the 3B model initially trails the 8B model (F1: 0.692 vs. 0.755). However, through GCL, the 3B model outperforms the 8B baseline by 28%. These results suggest that GCL provides an effective solution for achieving both high accuracy and computational efficiency in real-world deployment.
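The percentage gains quoted above follow directly from the reported F1 scores; a minimal sanity check (the helper name `rel_improvement` is ours, and all numbers are taken verbatim from the abstract):

```python
def rel_improvement(new: float, baseline: float) -> float:
    """Percentage improvement of `new` over `baseline`."""
    return (new / baseline - 1) * 100

f1_3b_gcl = 0.968  # Qwen2.5-VL-3B trained with GCL
f1_3b_sft = 0.692  # same 3B model under vanilla SFT
f1_8b_sft = 0.755  # 8B model under vanilla SFT

print(round(rel_improvement(f1_3b_gcl, f1_3b_sft)))  # 40 (% over vanilla SFT)
print(round(rel_improvement(f1_3b_gcl, f1_8b_sft)))  # 28 (% over the 8B baseline)
```

Both figures round to the 40% and 28% improvements stated in the abstract.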