Ruyi2.5 Technical Report

📅 2026-03-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses two challenges in deploying multimodal models: semantic inconsistency across deployment tiers, and the trade-off between efficiency and privacy in privacy-sensitive scenarios. Building on the AI Flow framework, the authors propose a family of multimodal models with a shared backbone, extending the "Train Once, Deploy Many" paradigm to the multimodal domain. To preserve privacy, they introduce an information-bottleneck-guided irreversible feature desensitization mechanism within a two-stage edge-cloud inference architecture. They also design Binary Prefix Policy Optimization (BPPO), an algorithm for reinforcement learning fine-tuning. The approach matches Qwen3-VL on general multimodal benchmarks and substantially outperforms it on privacy-constrained surveillance tasks, and BPPO trains 2 to 3 times faster than GRPO.

📝 Abstract
We present Ruyi2.5, a multimodal familial model built on the AI Flow framework. Extending Ruyi2's "Train Once, Deploy Many" paradigm to the multimodal domain, Ruyi2.5 uses a shared-backbone architecture that co-trains models of varying scales within a single unified pipeline, ensuring semantic consistency across all deployment tiers. Building on Ruyi2.5, we develop Ruyi2.5-Camera, a privacy-preserving camera service system instantiated as a two-stage recognition pipeline: an edge model applies information-bottleneck-guided irreversible feature mapping to de-identify raw frames at the source, while a cloud model performs deep behavior reasoning. To accelerate reinforcement learning fine-tuning, we further propose Binary Prefix Policy Optimization (BPPO), which reduces sample redundancy via binary response selection and focuses gradient updates on response prefixes, achieving a 2 to 3 times training speedup over GRPO. Experiments show that Ruyi2.5 matches Qwen3-VL on general multimodal benchmarks, while Ruyi2.5-Camera substantially outperforms Qwen3-VL on privacy-constrained surveillance tasks.
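The abstract describes BPPO only at a high level, so the following is a minimal illustrative sketch, not the authors' algorithm. It assumes that "binary response selection" means keeping just the highest- and lowest-reward responses from a sampled group, and that "focusing gradient updates on response prefixes" means masking the loss to the first `prefix_len` tokens of each kept response. All function names, the +1/-1 advantages, and the `prefix_len` parameter are hypothetical.

```python
from typing import List, Tuple


def select_binary_pair(rewards: List[float]) -> Tuple[int, int]:
    """Return indices of the highest- and lowest-reward responses.

    Assumption: "binary selection" keeps only these two samples per group.
    """
    if len(rewards) < 2:
        raise ValueError("need at least two sampled responses")
    best = max(range(len(rewards)), key=lambda i: rewards[i])
    worst = min(range(len(rewards)), key=lambda i: rewards[i])
    return best, worst


def prefix_mask(length: int, prefix_len: int) -> List[int]:
    """Build a 0/1 mask that keeps only the first `prefix_len` tokens."""
    keep = min(length, prefix_len)
    return [1] * keep + [0] * (length - keep)


def prefix_loss(token_logps: List[float], advantage: float, prefix_len: int) -> float:
    """Policy-gradient surrogate averaged over prefix tokens only."""
    mask = prefix_mask(len(token_logps), prefix_len)
    kept = sum(mask)
    if kept == 0:
        return 0.0
    masked_sum = sum(lp * m for lp, m in zip(token_logps, mask))
    return -advantage * masked_sum / kept


def binary_prefix_loss(
    group_logps: List[List[float]], rewards: List[float], prefix_len: int
) -> float:
    """Combine the two selected responses with hypothetical +1 / -1 advantages."""
    best, worst = select_binary_pair(rewards)
    if rewards[best] == rewards[worst]:
        return 0.0  # no learning signal when all rewards tie
    return (
        prefix_loss(group_logps[best], 1.0, prefix_len)
        + prefix_loss(group_logps[worst], -1.0, prefix_len)
    )


if __name__ == "__main__":
    logps = [[-1.0, -1.0], [-0.5, -0.5, -9.0], [-2.0, -2.0, -2.0], [-1.0]]
    rewards = [0.2, 0.9, 0.1, 0.5]
    print(binary_prefix_loss(logps, rewards, prefix_len=2))
```

Under these assumptions, only two of the sampled responses contribute to each update, and tokens beyond the prefix receive no gradient, which is one plausible reading of where a training speedup could come from.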
Problem

Research questions and friction points this paper is trying to address.

multimodal AI
privacy-preserving
edge-cloud collaboration
semantic consistency
surveillance tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal familial model
privacy-preserving camera system
shared-backbone architecture
Binary Prefix Policy Optimization
information-bottleneck-guided feature mapping
Huan Song
Amazon AWS AI
Deep learning, machine learning, graph neural networks, time-series analysis
Shuyu Tian
Institute of Artificial Intelligence (TeleAI), China Telecom
Qingfei Zhao
University of the Chinese Academy of Sciences
Natural Language Processing, Artificial Intelligence
Wenhao Hong
Institute of Artificial Intelligence (TeleAI), China Telecom
Jiang Liu
Institute of Artificial Intelligence (TeleAI), China Telecom
Ting Long
Institute of Artificial Intelligence (TeleAI), China Telecom
Jiawei Shao
Institute of Artificial Intelligence (TeleAI), China Telecom
Xuelong Li
Institute of Artificial Intelligence (TeleAI), China Telecom