Xiaomi MiMo-VL-Miloco Technical Report

📅 2025-12-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of balancing domain-specific understanding with general multimodal reasoning in smart home scenarios, this paper introduces MiMo-VL-Miloco-7B—the first vision-language model explicitly designed for domestic environments. Our method employs a novel two-stage training paradigm: supervised fine-tuning followed by group-relative policy optimization (GRPO)-based reinforcement learning, augmented with chain-of-thought supervision and token-budget-aware inference to jointly enhance domain adaptation and generalization. Built upon the MiMo-VL-7B architecture, the model integrates heterogeneous smart-home data, chain-of-reasoning modeling, and GGUF-based quantization for efficient deployment. Experiments demonstrate state-of-the-art performance on domain-specific tasks—including gesture recognition and household scene understanding—achieving superior F1 scores over leading closed- and open-source VLMs. Moreover, MiMo-VL-Miloco-7B maintains top-tier performance on general multimodal benchmarks such as Video-MME and MMMU-Pro.

📝 Abstract
We open-source **MiMo-VL-Miloco-7B** and its quantized variant **MiMo-VL-Miloco-7B-GGUF**, a pair of home-centric vision-language models that achieve strong performance on both home-scenario understanding and general multimodal reasoning. Built on the MiMo-VL-7B backbone, MiMo-VL-Miloco-7B is specialized for smart-home environments, attaining leading F1 scores on gesture recognition and common home-scenario understanding, while also delivering consistent gains across video benchmarks such as Video-MME, Video-MMMU, and Charades-STA, as well as language understanding benchmarks including MMMU-Pro and MMLU-Pro. In our experiments, MiMo-VL-Miloco-7B outperforms strong closed-source and open-source baselines on home-scenario understanding and several multimodal reasoning benchmarks. To balance specialization and generality, we design a two-stage training pipeline that combines supervised fine-tuning with reinforcement learning based on Group Relative Policy Optimization, leveraging efficient multi-domain data. We further incorporate chain-of-thought supervision and token-budget-aware reasoning, enabling the model to learn knowledge in a data-efficient manner while also performing reasoning efficiently. Our analysis shows that targeted home-scenario training not only enhances activity and gesture understanding, but also improves text-only reasoning with only modest trade-offs on document-centric tasks. Model checkpoints, quantized GGUF weights, and our home-scenario evaluation toolkit are publicly available at https://github.com/XiaoMi/xiaomi-mimo-vl-miloco to support research and deployment in real-world smart-home applications.
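The abstract's second training stage uses Group Relative Policy Optimization, whose defining feature is that advantages are computed relative to a group of sampled responses for the same prompt rather than from a learned value network. As a hedged illustration (not the paper's implementation), the core normalization can be sketched as:

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantage sketch: for one prompt, score a group of
    sampled responses with a reward model, then normalize each reward by
    the group's mean and standard deviation. GRPO uses these advantages
    in place of a critic's value estimates."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a uniform group
    return [(r - mean) / std for r in rewards]

# One prompt, four sampled responses, hypothetical reward-model scores:
adv = grpo_advantages([1.0, 0.0, 0.5, 0.5])
```

By construction the advantages sum to zero within the group, so above-average responses are reinforced and below-average ones suppressed without any critic network.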
Problem

Research questions and friction points this paper is trying to address.

Develops vision-language models specialized for smart-home environment understanding
Balances home-scenario specialization with general multimodal reasoning capabilities
Addresses efficient knowledge learning and reasoning for real-world smart-home applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage training with supervised fine-tuning and reinforcement learning
Incorporates chain-of-thought supervision and token-budget-aware reasoning
Specialized for smart-home environments using multi-domain data
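The token-budget-aware reasoning mentioned above caps how many chain-of-thought tokens the model may spend before it must answer. The paper does not publish decoding code; a minimal sketch of the idea, assuming a hypothetical per-token `generate_step` callable and a `</think>` delimiter, might look like:

```python
def budgeted_reasoning(generate_step, budget, stop_token="</think>"):
    """Token-budget-aware decoding sketch: emit reasoning tokens until the
    model closes its chain of thought or the budget runs out, then force
    the close so the final answer is produced within a known cost.
    `generate_step` is a hypothetical zero-argument callable returning
    one token per call."""
    tokens = []
    for _ in range(budget):
        tok = generate_step()
        tokens.append(tok)
        if tok == stop_token:  # model finished reasoning on its own
            return tokens
    tokens.append(stop_token)  # budget exhausted: truncate the reasoning
    return tokens
```

This keeps worst-case inference cost bounded, which matters for the on-device smart-home deployment the quantized GGUF variant targets.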