Chain-of-Look Spatial Reasoning for Dense Surgical Instrument Counting

📅 2026-02-11
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of accurately counting densely packed surgical instruments in operating rooms by proposing a visual reasoning framework that mimics human sequential counting behavior. The method introduces a structured spatial visual chain, termed Chain-of-Look, that guides the model along a coherent visual trajectory, replacing conventional unordered detection paradigms. A proximity-aware loss function explicitly models spatial constraints among neighboring instruments, and high-resolution image analysis further improves counting precision. The contributions are a novel visual reasoning mechanism, the release of SurgCount-HD, a new high-density surgical instrument counting dataset, and state-of-the-art results that significantly surpass existing counting methods such as CountGD and REC, as well as multimodal large language models including Qwen and ChatGPT.
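The summary describes counting along a coherent spatial trajectory instead of over an unordered set of detections. As a rough illustration of that idea only (not the paper's actual algorithm), the sketch below orders hypothetical detected instrument centers into a chain by greedy nearest-neighbor traversal; the function name `build_visual_chain` and its top-left scanning heuristic are assumptions made for illustration.

```python
# Minimal sketch (not the paper's implementation): order detected instrument
# centers into a single spatial trajectory so counting proceeds along a
# coherent "visual chain" rather than over an unordered detection set.
import numpy as np

def build_visual_chain(centers: np.ndarray) -> list[int]:
    """Return indices of `centers` (an N x 2 array of (x, y)) in chain order."""
    if len(centers) == 0:
        return []
    remaining = set(range(len(centers)))
    # Start from the top-left-most detection, mimicking a human scanning pattern.
    current = min(remaining, key=lambda i: centers[i, 0] + centers[i, 1])
    chain = [current]
    remaining.remove(current)
    while remaining:
        # Step to the nearest unvisited detection.
        nxt = min(remaining, key=lambda i: np.linalg.norm(centers[i] - centers[current]))
        chain.append(nxt)
        remaining.remove(nxt)
        current = nxt
    return chain

# Example: five clustered instruments; the chain visits them in spatial order,
# and the count is simply the chain length.
centers = np.array([[10, 12], [14, 12], [18, 13], [60, 40], [62, 44]], dtype=float)
order = build_visual_chain(centers)
print(order, "count =", len(order))
```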

📝 Abstract
Accurate counting of surgical instruments in Operating Rooms (ORs) is a critical prerequisite for ensuring patient safety during surgery. Despite recent progress in large vision-language models and agentic AI, accurately counting such instruments remains highly challenging, particularly in dense scenarios where instruments are tightly clustered. To address this problem, we introduce Chain-of-Look, a novel visual reasoning framework that mimics the sequential human counting process by enforcing a structured visual chain, rather than relying on classic object detection, which is unordered. This visual chain guides the model to count along a coherent spatial trajectory, improving accuracy in complex scenes. To further enforce the physical plausibility of the visual chain, we introduce a neighboring loss function that explicitly models the spatial constraints inherent to densely packed surgical instruments. We also present SurgCount-HD, a new dataset comprising 1,464 high-density surgical instrument images. Extensive experiments demonstrate that our method outperforms state-of-the-art counting approaches (e.g., CountGD, REC) as well as Multimodal Large Language Models (e.g., Qwen, ChatGPT) on the challenging task of dense surgical instrument counting.
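The abstract mentions a neighboring loss that enforces spatial constraints among densely packed instruments but does not spell out its form. Below is a minimal sketch of one plausible reading, assuming the model predicts an ordered chain of instrument centers; the hinge-style penalty on consecutive gaps and the `max_gap` parameter are illustrative assumptions, not the paper's released loss.

```python
# Minimal sketch (assumption, not the released loss): a "neighboring" penalty on an
# ordered chain of predicted instrument centers. It discourages physically
# implausible chains by penalizing consecutive points that sit farther apart
# than a maximum plausible instrument spacing `max_gap` (in pixels).
import torch

def neighboring_loss(chain_points: torch.Tensor, max_gap: float = 32.0) -> torch.Tensor:
    """chain_points: (N, 2) tensor of (x, y) predictions in chain order."""
    if chain_points.shape[0] < 2:
        return chain_points.new_zeros(())
    gaps = torch.norm(chain_points[1:] - chain_points[:-1], dim=-1)  # (N-1,) distances
    # Hinge penalty: zero while neighbors stay within max_gap, linear beyond it.
    return torch.relu(gaps - max_gap).mean()

# Usage: add to the main counting/detection objective with a small weight, e.g.
# total_loss = detection_loss + 0.1 * neighboring_loss(predicted_chain)
```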
Problem

Research questions and friction points this paper is trying to address.

surgical instrument counting
dense object counting
operating room safety
visual reasoning
high-density scenes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Chain-of-Look
spatial reasoning
dense object counting
neighboring loss
surgical instrument counting
🔎 Similar Papers
No similar papers found.
Rishikesh Bhyri
State University of New York at Buffalo
Brian R Quaranto
State University of New York at Buffalo
Philip J Seger
State University of New York at Buffalo
Kaity Tung
State University of New York at Buffalo
Brendan Fox
State University of New York at Buffalo
Gene Yang
State University of New York at Buffalo
Steven D. Schwaitzberg
State University of New York at Buffalo
Junsong Yuan
State University of New York at Buffalo
computer vision, video analytics, action and gesture analysis, multimedia, pattern recognition
Nan Xi
University at Buffalo
Computer Vision, Pattern Recognition, Medical AI
Peter C W Kim
State University of New York at Buffalo