Xuanwu: Evolving General Multimodal Models into an Industrial-Grade Foundation for Content Ecosystems

📅 2026-03-30
🤖 AI Summary
This work addresses the degradation in generalization and catastrophic forgetting observed in current mainstream multimodal models under content moderation and adversarial scenarios, primarily due to insufficient fine-grained visual perception and weak modeling of long-tailed noise. To mitigate these limitations, the authors propose a data-training co-optimization paradigm that integrates a compact architecture—comprising InternViT-300M, an MLP head, and Qwen3-1.7B—with a three-stage progressive training pipeline (pre-training, mid-training, and post-training). This approach effectively balances general-purpose capability retention and domain-specific adaptability within a constrained parameter budget. The resulting model achieves an average score of 67.90 across seven multimodal benchmarks on OpenCompass, an average recall of 94.38% on seven content moderation tasks, and a weighted recall of 82.82% on adversarial OCR-based violation detection, outperforming Gemini-2.5-Pro.
📝 Abstract
In recent years, multimodal large models have continued to improve on general benchmarks. However, in real-world content moderation and adversarial settings, mainstream models still suffer from degraded generalization and catastrophic forgetting because of limited fine-grained visual perception and insufficient modeling of long-tail noise. In this paper, we present Xuanwu VL-2B as a case study of how general multimodal models can be developed into an industrial-grade foundation model for content ecosystems. The model adopts a compact InternViT-300M + MLP + Qwen3 1.7B architecture, balancing fine-grained visual perception, language-semantic alignment, and deployment cost within an approximately 2B-parameter budget. To balance business specialization with the retention of general capabilities, we developed a data iteration and curation mechanism and trained the model through a progressive three-stage pipeline: pre-training, mid-training, and post-training. Ablation studies and offline business evaluations show that Xuanwu VL-2B achieves an average score of 67.90 across seven OpenCompass multimodal metrics (vs. 64.27 for InternVL 3.5 2B), an average recall of 94.38% over seven independent business moderation tasks, and a weighted overall recall of 82.82% on policy-violating text in challenging adversarial OCR scenarios, outperforming Gemini-2.5-Pro (76.72%). These results show that, under a limited parameter budget, Xuanwu VL-2B achieves a practical balance among business alignment, visual perception, general capability retention, and deployment cost.
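The compact architecture described in the abstract (vision encoder → MLP projector → language model) follows the common LLaVA-style composition. As a rough illustration only — the dimensions, function names, and the ReLU projector below are hypothetical stand-ins, not details from the paper — the data flow from InternViT-300M through the MLP into Qwen3-1.7B can be sketched as:

```python
import numpy as np

# Hypothetical widths for illustration; the real encoders are
# InternViT-300M (vision) and Qwen3-1.7B (language).
VIT_DIM = 1024   # assumed vision-encoder output width
LLM_DIM = 2048   # assumed language-model hidden width
rng = np.random.default_rng(0)

def vision_encoder(image_patches):
    """Stand-in for InternViT-300M: image patches -> visual tokens."""
    n = image_patches.shape[0]
    return rng.standard_normal((n, VIT_DIM))

# Two-layer MLP projector aligning visual tokens to the LLM embedding space.
W1 = rng.standard_normal((VIT_DIM, LLM_DIM)) * 0.02
W2 = rng.standard_normal((LLM_DIM, LLM_DIM)) * 0.02

def mlp_projector(v):
    # Real projectors typically use GELU; ReLU keeps this sketch minimal.
    return np.maximum(v @ W1, 0.0) @ W2

patches = rng.standard_normal((256, VIT_DIM))   # 256 image patches
visual_tokens = mlp_projector(vision_encoder(patches))
# These projected tokens would be prepended to the text embeddings
# and consumed by the language model (Qwen3-1.7B in the paper).
print(visual_tokens.shape)  # (256, 2048)
```

The projector is the only glue the sketch trains from scratch in such designs; the roughly 2B-parameter budget cited in the abstract comes almost entirely from the two pretrained backbones it connects.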
Problem

Research questions and friction points this paper is trying to address.

multimodal models
content moderation
catastrophic forgetting
fine-grained visual perception
long-tail noise
Innovation

Methods, ideas, or system contributions that make the work stand out.

industrial-grade foundation model
multimodal alignment
progressive training pipeline
fine-grained visual perception
adversarial content moderation
Zhiqian Zhang
Computational Intelligence Dept, Hello Group Inc.
Xu Zhao
Computational Intelligence Dept, Hello Group Inc.
Xiaoqing Xu
Google DeepMind
Guangdong Liang
Computational Intelligence Dept, Hello Group Inc.
Weijia Wang
PhD in Applied Physics, Northwestern University
Xiaolei Lv
Computational Intelligence Dept, Hello Group Inc.
Bo Li
Computational Intelligence Dept, Hello Group Inc.
Jun Gao
Computational Intelligence Dept, Hello Group Inc.