🤖 AI Summary
Existing reasoning segmentation methods rely on multimodal large language models (MLLMs) with billions of parameters, rendering them impractical for edge-device deployment. Moreover, conventional knowledge distillation—focusing solely on output logits or intermediate feature alignment—fails to preserve the multi-step reasoning chain essential for compositional segmentation. To address this, we propose a digital twin representation that explicitly models and retains the full reasoning path during distillation, effectively decoupling perception from reasoning. Our framework jointly optimizes the student model via supervised fine-tuning and multi-objective reward-based reinforcement learning. The resulting method supports open-vocabulary reasoning segmentation for both images and videos. With only 0.6B parameters, it outperforms models 20× larger across four benchmarks and runs at 7.79 FPS with a 2.1 GB memory footprint, enabling real-time edge deployment.
📝 Abstract
Reasoning segmentation enables open-set object segmentation via implicit text queries, thereby serving as a foundation for embodied agents that must operate autonomously in real-world environments. However, existing methods for reasoning segmentation require multimodal large language models with billions of parameters, exceeding the computational capabilities of the edge devices on which embodied AI systems are typically deployed. Distillation offers a pathway to compress these models while preserving their capabilities. Yet existing distillation approaches fail to transfer the multi-step reasoning capabilities that reasoning segmentation demands, as they focus on matching output predictions and intermediate features rather than preserving reasoning chains. The emerging paradigm of reasoning over digital twin representations presents an opportunity for more effective distillation by re-framing the problem. Consequently, we propose FastReasonSeg, which employs digital twin representations that decouple perception from reasoning to enable more effective distillation. Our distillation scheme first performs supervised fine-tuning on teacher-generated reasoning chains, followed by reinforcement fine-tuning with joint rewards that evaluate both segmentation accuracy and reasoning quality alignment. Experiments on two video benchmarks (JiTBench, RVTBench) and two image benchmarks (ReasonSeg, LLM-Seg40K) demonstrate that FastReasonSeg achieves state-of-the-art reasoning segmentation performance. Moreover, the distilled 0.6B variant outperforms models with 20 times more parameters while achieving 7.79 FPS throughput with only 2.1 GB memory consumption. This efficiency makes real-time reasoning segmentation feasible in resource-constrained environments.
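To make the joint-reward idea in the reinforcement fine-tuning stage concrete, here is a minimal sketch of how a reward combining segmentation accuracy with reasoning-quality alignment could be composed. The function names (`mask_iou`, `reasoning_alignment`, `joint_reward`), the step-matching proxy for reasoning quality, and the weighting parameter `alpha` are illustrative assumptions, not the paper's actual reward design.

```python
# Hypothetical sketch of a joint reward for reinforcement fine-tuning.
# Assumption: the reward is a weighted sum of mask IoU (segmentation
# accuracy) and a crude reasoning-alignment score against teacher chains.

def mask_iou(pred: set, gt: set) -> float:
    """IoU between two binary masks represented as sets of pixel indices."""
    union = len(pred | gt)
    return len(pred & gt) / union if union else 1.0

def reasoning_alignment(student_steps: list, teacher_steps: list) -> float:
    """Toy proxy for reasoning quality: fraction of teacher reasoning
    steps that also appear in the student's chain."""
    if not teacher_steps:
        return 1.0
    matched = sum(1 for step in teacher_steps if step in student_steps)
    return matched / len(teacher_steps)

def joint_reward(pred_mask, gt_mask, student_steps, teacher_steps,
                 alpha: float = 0.5) -> float:
    """Weighted combination of segmentation accuracy and reasoning quality.
    alpha is an assumed trade-off hyperparameter, not from the paper."""
    seg = mask_iou(pred_mask, gt_mask)
    reason = reasoning_alignment(student_steps, teacher_steps)
    return alpha * seg + (1 - alpha) * reason
```

For example, a prediction overlapping the ground truth on 2 of 4 pixels with a fully matched reasoning chain yields a reward of 0.5 * 0.5 + 0.5 * 1.0 = 0.75 under this toy scoring.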