Symmetric Entropy-Constrained Video Coding for Machines

📅 2025-10-17
🤖 AI Summary
Video coding for machine vision systems (MVS) must preserve semantic fidelity while generalizing across tasks, yet existing video coding for machines (VCM) methods generalize poorly because they are coupled to specific downstream models or depend on supervised fine-tuning. To address this, we propose SEC-VCM, a Symmetric Entropy-Constrained VCM framework. It introduces a bi-directional entropy constraint (BiEC) between the codec and a visual backbone (VB) that suppresses conditional entropy, so the codec preserves semantics while discarding MVS-irrelevant information. A semantic-pixel dual-path fusion (SPDF) module injects pixel-level priors into the final reconstruction and suppresses artifacts harmful to MVS. Furthermore, VB-driven unsupervised semantic-guided encoding removes the reliance on labeled data and task-specific downstream models. SEC-VCM achieves state-of-the-art rate-task performance, with bitrate savings over VTM of 37.41% on video instance segmentation, 29.83% on video object segmentation, 46.22% on object detection, and 44.94% on multiple object tracking.

📝 Abstract
As video transmission increasingly serves machine vision systems (MVS) instead of human vision systems (HVS), video coding for machines (VCM) has become a critical research topic. Existing VCM methods often bind codecs to specific downstream models, which requires retraining or supervised data and thus limits generalization in multi-task scenarios. Recently, unified VCM frameworks have employed visual backbones (VB) and visual foundation models (VFM) to support multiple video understanding tasks with a single codec. They mainly use the VB/VFM to maintain semantic consistency or suppress non-semantic information, but seldom explore how to link video coding directly with video understanding under VB/VFM guidance. Hence, we propose a Symmetric Entropy-Constrained Video Coding framework for Machines (SEC-VCM). It establishes a symmetric alignment between the video codec and the VB, allowing the codec to leverage the VB's representation capabilities to preserve semantics and discard MVS-irrelevant information. Specifically, a bi-directional entropy-constraint (BiEC) mechanism enforces symmetry between video decoding and VB encoding by suppressing the conditional entropy between the two processes. This helps the codec explicitly handle semantic information beneficial to MVS while squeezing out useless information. Furthermore, a semantic-pixel dual-path fusion (SPDF) module injects pixel-level priors into the final reconstruction. Through semantic-pixel fusion, it suppresses artifacts harmful to MVS and improves machine-oriented reconstruction quality. Experimental results show that our framework achieves state-of-the-art (SOTA) rate-task performance, with significant bitrate savings over VTM on video instance segmentation (37.41%), video object segmentation (29.83%), object detection (46.22%), and multiple object tracking (44.94%). We will release our code.
Problem

Research questions and friction points this paper is trying to address.

Develops symmetric video coding for machine vision systems
Aligns video codec with visual backbone to preserve semantics
Reduces bitrate while maintaining machine task performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Symmetric alignment between video codec and visual backbone
Bi-directional entropy constraint suppresses conditional entropy
Semantic-pixel dual-path fusion improves machine reconstruction quality
Yuxiao Sun
Institute of Information Science, Beijing Jiaotong University, Beijing 100044, China, and also with the Beijing Key Laboratory of Advanced Information Science and Network Technology, Beijing 100044, China
Yao Zhao
Institute of Information Science, Beijing Jiaotong University, Beijing 100044, China, and also with the Beijing Key Laboratory of Advanced Information Science and Network Technology, Beijing 100044, China
Meiqin Liu
Zhejiang University
Control Theory and Control Engineering
Chao Yao
Northwestern Polytechnical University
Jian Jin
Alibaba-NTU Joint Research Institute, Singapore
3D imaging, video coding, perceptual modeling, computer vision
Weisi Lin
President's Chair Professor in Computer Science, CCDS, Nanyang Technological University
Perception-inspired signal modeling, perceptual multimedia quality evaluation, video compression, image processing & analysis