🤖 AI Summary
Existing surgical video generation methods predominantly adopt unconditional modeling, which aligns poorly with surgical phases and action semantics and thereby compromises factual accuracy and visual realism. To address this, the authors propose a two-stage hierarchy-aware diffusion model, the first to jointly model surgical phases, action triplets, and panoptic segmentation maps, enabling synergistic optimization across semantic abstraction and visual detail. Built on a conditional diffusion framework, the model couples a segmentation-prediction backbone with a texture-enhanced generator and incorporates multi-granularity surgical priors (phase, action, and anatomical structure). Evaluated on a cholecystectomy dataset, it significantly outperforms baseline methods, supports high-frame-rate generation, and achieves superior semantic fidelity, fine-grained texture quality, and strong cross-scenario generalization.
📝 Abstract
Surgical video synthesis has emerged as a promising research direction following the success of diffusion models in general-domain video generation. Although existing approaches achieve high-quality video generation, most are unconditional and fail to maintain consistency with surgical actions and phases, lacking the surgical understanding and fine-grained guidance necessary for factual simulation. We address these challenges with HieraSurg, a hierarchy-aware surgical video generation framework consisting of two specialized diffusion models. Given a surgical phase and an initial frame, HieraSurg first predicts future coarse-grained semantic changes with a segmentation prediction model. A second-stage model then generates the final video by augmenting these temporal segmentation maps with fine-grained visual features, yielding effective texture rendering and integration of semantic information in the video space. Our approach leverages surgical information at multiple levels of abstraction: surgical phase, action triplets, and panoptic segmentation maps. Experimental results on cholecystectomy surgical video generation show that the model significantly outperforms prior work both quantitatively and qualitatively, generalizes well, and can generate higher-frame-rate videos. The model adheres particularly closely to segmentation maps when they are provided directly, suggesting its potential for practical surgical applications.
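The two-stage flow described above can be sketched schematically. Everything here is a hypothetical illustration of the data flow only (the function names, shapes, and dummy computations are assumptions, not the authors' implementation): stage 1 maps an initial frame plus a phase label to temporal segmentation maps, and stage 2 renders video frames conditioned on those maps and the initial frame's appearance.

```python
import numpy as np

# Hypothetical sketch of a hierarchy-aware two-stage pipeline like HieraSurg's.
# Stand-in arrays replace the two diffusion models; only the conditioning
# interfaces (phase -> segmentation -> video) mirror the description.

def predict_segmentation(initial_frame, phase_label, num_frames=8):
    """Stage 1: predict future panoptic segmentation maps from the initial
    frame and the surgical phase (dummy stand-in for the diffusion model)."""
    h, w = initial_frame.shape[:2]
    seg = np.zeros((num_frames, h, w), dtype=np.int64)
    seg[:] = phase_label  # placeholder: coarse semantics conditioned on phase
    return seg

def render_video(seg_maps, initial_frame):
    """Stage 2: render RGB frames conditioned on the temporal segmentation
    maps plus fine-grained appearance from the initial frame (stand-in)."""
    t, h, w = seg_maps.shape
    video = np.broadcast_to(initial_frame, (t, h, w, 3)).copy()
    video += seg_maps[..., None]  # dummy texture modulation by segment id
    return video

frame0 = np.zeros((64, 64, 3), dtype=np.int64)
seg = predict_segmentation(frame0, phase_label=3)
video = render_video(seg, frame0)
print(video.shape)  # (8, 64, 64, 3)
```

The point of the split is visible in the interfaces: the stage-2 renderer never sees the phase label directly, only the segmentation maps, so existing ground-truth maps can be substituted for stage-1 predictions to obtain the fine-grained adherence noted in the abstract.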