🤖 AI Summary
Existing human-machine collaborative compression methods are built on the human-vision compression pipeline, so they fail to satisfy machine vision’s low-information requirements and human vision’s high-fidelity demands at the same time. This paper proposes the first collaborative compression framework that takes machine-vision-oriented compression as its basis and, in turn, supports reconstruction for human viewing. Specifically, we first design a lightweight, task-aware encoder to extract semantics-critical features for downstream machine vision tasks; second, we introduce a diffusion-prior-guided semantic aggregation module to progressively restore perceptually essential details for human viewing; third, we devise a plug-and-play variable-bitrate strategy enabling multi-task adaptive bit allocation. Experiments demonstrate that our method significantly outperforms state-of-the-art approaches on both machine vision performance (e.g., detection and recognition accuracy) and human visual quality (PSNR/MS-SSIM), combining low-bitrate coding (37% bitrate reduction) with high fidelity (2.1 dB PSNR gain). These results validate the effectiveness and generality of the “machine-first, human-enhanced” paradigm.
📝 Abstract
Human-machine collaborative compression has attracted growing research interest as a way to reduce image/video data while serving both human perception and machine intelligence. Existing collaborative methods are predominantly built upon the de facto human-vision compression pipeline, and they suffer from high complexity and high bit-rates when machine-vision compression is integrated into that pipeline. Machine vision, however, focuses solely on the core regions of an image/video and therefore requires much less information than human vision does. In this paper, we make the first successful attempt at a collaborative compression method built on machine-vision-oriented compression rather than on the human-vision pipeline. In other words, machine vision serves as the basis for human vision within collaborative compression. A plug-and-play variable bit-rate strategy is also developed for machine vision tasks. We then progressively aggregate the semantics from the machine-vision compression while seamlessly tailoring the diffusion prior to restore high-fidelity details for human vision; we name the resulting method diffusion-prior based feature compression for human and machine visions (Diff-FCHM). Experimental results verify the consistently superior performance of Diff-FCHM on both machine-vision and human-vision compression, by remarkable margins. Our code will be released upon acceptance.