UniMIC: Token-Based Multimodal Interactive Coding for Human-AI Collaboration

📅 2025-09-26
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing codecs are designed for unimodal, unidirectional communication and suffer progressive performance degradation across the compression-transmission-reconstruction pipeline. This work proposes the first unified multimodal interactive coding framework tailored for human-AI collaboration, replacing raw pixels and text with tokenized representations to enable efficient bidirectional communication between edge devices and cloud-based AI agents. Methodologically, we introduce a scene-adaptive, lightweight Transformer-based entropy model integrated with hybrid compression strategies (general, masked, and text-conditioned) to substantially reduce inter-token redundancy. Evaluated on diverse downstream tasks, including text-to-image generation, image inpainting, outpainting, and visual question answering, the framework achieves transmission bitrates below 0.05 bits per pixel (bpp) while preserving full task performance, demonstrating the paradigm's efficiency and robustness under ultra-low-bitrate constraints.
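As a back-of-envelope check on the sub-0.05 bpp figure, the arithmetic below works through a token budget. The grid size, codebook size, and entropy-coded rate are hypothetical assumptions for illustration, not values from the paper:

```python
import math

# Back-of-envelope bitrate for token-based image coding.
# All numbers below are illustrative assumptions, not values from the paper.
H, W = 512, 512                  # image resolution in pixels
n_tokens = 32 * 32               # hypothetical visual-tokenizer grid (1024 tokens)
codebook_size = 8192             # hypothetical VQ codebook

raw_bits_per_token = math.log2(codebook_size)   # 13 bits/token with no entropy model
coded_bits_per_token = 8.0                      # assumed average after entropy coding

raw_bpp = n_tokens * raw_bits_per_token / (H * W)
coded_bpp = n_tokens * coded_bits_per_token / (H * W)
print(f"raw: {raw_bpp:.4f} bpp, entropy-coded: {coded_bpp:.4f} bpp")
# Even the uncoded token stream is only ~0.051 bpp; a modest entropy
# model is enough to drop below the 0.05 bpp regime.
```

The point of the sketch: transmitting a short token grid instead of pixels already puts the rate near the ultra-low-bitrate regime, and the entropy model closes the remaining gap.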

📝 Abstract
The rapid progress of Large Multimodal Models (LMMs) and cloud-based AI agents is transforming human-AI collaboration into bidirectional, multimodal interaction. However, existing codecs remain optimized for unimodal, one-way communication, resulting in repeated degradation under conventional compress-transmit-reconstruct pipelines. To address this limitation, we propose UniMIC, a Unified token-based Multimodal Interactive Coding framework that bridges edge devices and cloud AI agents. Instead of transmitting raw pixels or plain text, UniMIC employs compact tokenized representations as the communication medium, enabling efficient low-bitrate transmission while maintaining compatibility with LMMs. To further enhance compression, lightweight Transformer-based entropy models with scenario-specific designs (generic, masked, and text-conditioned) effectively minimize inter-token redundancy. Extensive experiments on text-to-image generation, text-guided inpainting, outpainting, and visual question answering show that UniMIC achieves substantial bitrate savings and remains robust even at ultra-low bitrates (<0.05 bpp), without compromising downstream task performance. These results establish UniMIC as a practical and forward-looking paradigm for next-generation multimodal interactive communication.
Problem

Research questions and friction points this paper is trying to address.

UniMIC enables efficient multimodal communication between edge devices and cloud AI
It replaces raw data with compact token representations for low-bitrate transmission
The framework maintains task performance while achieving ultra-low bitrate compression
Innovation

Methods, ideas, or system contributions that make the work stand out.

Token-based multimodal interactive coding framework
Lightweight Transformer entropy models minimize redundancy
Compact tokenized representations enable low-bitrate transmission
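The redundancy-reduction idea behind the entropy models can be illustrated with a toy stand-in: a bigram predictor (used here in place of the paper's Transformer, purely for illustration) drives the bits per token of a correlated token stream well below the uniform-code baseline. The stream and all numbers are synthetic assumptions:

```python
import math
import random
from collections import Counter, defaultdict

# Toy illustration of inter-token redundancy removal. An order-1 (bigram)
# model stands in for the paper's Transformer entropy model.
random.seed(0)
vocab = 16

# Synthetic token stream with strong sequential correlation:
# 90% of the time the next token repeats or increments the previous one.
tokens = [0]
for _ in range(20000):
    prev = tokens[-1]
    if random.random() < 0.9:
        tokens.append((prev + random.choice([0, 1])) % vocab)
    else:
        tokens.append(random.randrange(vocab))

# Bits/token under a uniform code (no entropy model at all).
uniform_bits = math.log2(vocab)

# Fit bigram counts on the stream (Laplace smoothing); for this toy demo
# we evaluate on the same stream we fit on.
pair_counts = defaultdict(Counter)
for a, b in zip(tokens, tokens[1:]):
    pair_counts[a][b] += 1

total_bits = 0.0
for a, b in zip(tokens, tokens[1:]):
    p = (pair_counts[a][b] + 1) / (sum(pair_counts[a].values()) + vocab)
    total_bits += -math.log2(p)
bigram_bits = total_bits / (len(tokens) - 1)

print(f"uniform: {uniform_bits:.2f} bits/token, bigram model: {bigram_bits:.2f}")
```

Because consecutive tokens are highly predictable, the conditional model needs far fewer bits per token than the 4-bit uniform code; a Transformer conditioned on longer context (or on accompanying text, in the text-conditioned variant) exploits the same effect more aggressively.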
Qi Mao
State Key Laboratory of Media Convergence and Communication, Communication University of China, Beijing 100024, China
Tinghan Yang
Purdue University
Mobile computing and mobile sensing, wireless communications
Jiahao Li
Microsoft Research Asia, Beijing 100080, China
Bin Li
Microsoft Research Asia, Beijing 100080, China
Libiao Jin
State Key Laboratory of Media Convergence and Communication, Communication University of China, Beijing 100024, China
Yan Lu
Microsoft Research Asia, Beijing 100080, China