Hi-GaTA: Hierarchical Gated Temporal Aggregation Adapter for Surgical Video Report Generation

📅 2026-05-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work targets two obstacles to automated surgical report generation: aligning dense spatio-temporal video representations with language-based reasoning, and the scarcity of high-quality, privacy-preserving data. The authors introduce a benchmark of 214 simulated surgical videos paired with surgeon-authored evaluation reports, and propose a Perception-Alignment-Reasoning framework built around Hi-GaTA, a lightweight temporal adapter. Hi-GaTA builds a temporal pyramid with text-conditioned dual cross-attention and uses cross-level gated fusion and an increasing-depth strategy to aggregate short-to-long-range temporal information, compressing long videos into compact visual prefix tokens. It sits between Sur40k, a ViViT-style video encoder pretrained on 40,000 minutes of public surgical video, and an LLM backbone fine-tuned with LoRA. The approach achieves the best overall performance among the compared multimodal large language model (MLLM) baselines, and ablation studies support the contribution of each component.
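For readers unfamiliar with LoRA, the low-rank update it applies to a frozen linear layer can be sketched in a few lines of plain Python. This is a generic illustration of the standard LoRA formulation, h = W x + (alpha / r) B A x, not the authors' configuration: the function names, the matrix shapes, and the `alpha` value are assumptions chosen for the example, and the summary does not state which rank or layers the paper uses.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_linear(x, w, a, b, alpha):
    """Frozen weight w plus a trainable low-rank update (alpha / r) * b @ a.

    w: d_out x d_in frozen weights
    a: r x d_in trainable down-projection
    b: d_out x r trainable up-projection (typically initialised to zeros)
    """
    rank = len(a)                       # r is the number of rows in a
    base = matvec(w, x)                 # frozen path: W x
    update = matvec(b, matvec(a, x))    # low-rank path: B (A x)
    scale = alpha / rank
    return [h + scale * u for h, u in zip(base, update)]


# Hypothetical toy values: identity base layer, rank-1 adapter.
w = [[1.0, 0.0], [0.0, 1.0]]
a = [[1.0, 1.0]]
b_zero = [[0.0], [0.0]]
print(lora_linear([1.0, 2.0], w, a, b_zero, alpha=2.0))  # [1.0, 2.0]
```

With `b` initialised to zeros the adapted layer reproduces the frozen layer exactly, which is why fine-tuning starts from the pretrained behaviour and only the small `a` and `b` matrices need to be trained.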

📝 Abstract

Automated, clinician-grade assessment reports for surgical procedures could reduce documentation burden and provide objective feedback, yet remain challenging due to the difficulty of aligning dense spatio-temporal video representations with language-based reasoning and the scarcity of high-quality, privacy-preserving datasets. To address this gap, we establish a benchmark comprising 214 high-quality simulated surgical videos paired with surgeon-authored evaluation reports. Building on this resource, we propose a Perception-Alignment-Reasoning framework for surgical video report generation, featuring Hi-GaTA, a novel lightweight temporal adapter that efficiently compresses long video sequences into compact, LLM-compatible visual prefix tokens through short-to-long-range temporal aggregation. For robust visual perception, we pretrain Sur40k, a surgical-specific ViViT-style video encoder, on 40,000 minutes of public surgical videos to capture fine-grained spatio-temporal procedural priors. Hi-GaTA employs a temporal pyramid with text-conditioned dual cross-attention, and improves multi-scale consistency through cross-level gated fusion and an increasing-depth strategy. Finally, we fine-tune the LLM backbone using LoRA to enable coherent and stylistically consistent surgical report generation under limited supervision. Experiments show that our approach achieves the best overall performance, with consistent gains over strong Multimodal Large Language Model (MLLM) baselines. Ablation studies further validate the effectiveness of each proposed component.
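The general idea of compressing a long frame sequence into a fixed number of prefix tokens through a temporal pyramid and gated cross-level fusion can be illustrated with a toy sketch. This is not the Hi-GaTA architecture: the text-conditioned dual cross-attention and increasing-depth strategy are omitted, mean pooling stands in for learned aggregation, and every name and parameter below (`mean_pool`, `resample`, `gated_fuse`, `aggregate`, `windows`, `num_prefix`, the scalar gate) is a hypothetical choice made for illustration.

```python
import math


def mean_pool(frames, window):
    """Average consecutive groups of `window` feature vectors."""
    pooled = []
    for start in range(0, len(frames), window):
        chunk = frames[start:start + window]
        dim = len(chunk[0])
        pooled.append([sum(v[d] for v in chunk) / len(chunk) for d in range(dim)])
    return pooled


def resample(tokens, n):
    """Nearest-neighbour resample of a token list to exactly n tokens."""
    return [tokens[min(len(tokens) - 1, i * len(tokens) // n)] for i in range(n)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fuse(fine, coarse, gate_weight, gate_bias):
    """Blend each fine token with its coarse counterpart via a scalar gate.

    The gate is an assumed form: sigmoid of a linear function of (fine - coarse).
    """
    fused = []
    for f, c in zip(fine, coarse):
        g = sigmoid(sum(w * (a - b) for w, a, b in zip(gate_weight, f, c)) + gate_bias)
        fused.append([g * a + (1.0 - g) * b for a, b in zip(f, c)])
    return fused


def aggregate(frames, windows=(2, 4, 8), num_prefix=4, gate_weight=None, gate_bias=0.0):
    """Build short-to-long-range pyramid levels and fuse them into num_prefix tokens."""
    dim = len(frames[0])
    gate_weight = gate_weight or [0.0] * dim   # untrained gate: equal blend of levels
    levels = [resample(mean_pool(frames, w), num_prefix) for w in windows]
    fused = levels[0]
    for coarse in levels[1:]:
        fused = gated_fuse(fused, coarse, gate_weight, gate_bias)
    return fused


# 16 toy frame features of dimension 2, compressed to 4 prefix tokens.
frames = [[float(t), 1.0] for t in range(16)]
prefix = aggregate(frames)
print(len(prefix), len(prefix[0]))  # 4 2
```

The output length depends only on `num_prefix`, not on the number of input frames, which is the property that lets a long video fit into a fixed-size visual prefix for a language model. In a trained adapter the gate parameters would be learned rather than fixed at zero.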

Problem

Research questions and friction points this paper is trying to address.

surgical video report generation

spatio-temporal alignment

clinician-grade assessment

privacy-preserving dataset

multimodal reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Hi-GaTA

surgical video report generation

temporal aggregation adapter