🤖 AI Summary
This work addresses the challenge of recognizing laboratory equipment, reagents, and containers in biochemical instructional videos, a task complicated by cluttered environments and the visual similarity of objects. To this end, we propose a Micro QR Code–assisted vision-language alignment method. We introduce BioVL-QR, the first egocentric biochemical-experiment vision-language dataset, comprising 23 instructional videos, their corresponding protocols, and fine-grained step-level annotations. We design an object labeling framework that combines a custom Micro QR Code detector with pretrained hand-object detectors (e.g., YOLO, HRNet), substantially reducing annotation effort. We further formulate video-text alignment and step localization as core tasks, and our approach yields significant gains on biochemical instructional video understanding. All data, annotations, and code are publicly released to advance embodied vision-language understanding in the biochemical domain.
📝 Abstract
This paper introduces BioVL-QR, a biochemical vision-and-language dataset comprising 23 egocentric experiment videos, their corresponding protocols, and vision-and-language alignments. A major challenge in understanding biochemical videos is detecting equipment, reagents, and containers, owing to the cluttered environment and visually indistinguishable objects. Previous studies relied on manual object annotation, which is costly and time-consuming. To address this issue, we focus on Micro QR Codes. However, detecting objects using only Micro QR Codes remains difficult because object manipulation causes blur and occlusion. To overcome this, we propose an object labeling method that combines a Micro QR Code detector with an off-the-shelf hand-object detector. As an application of the method and BioVL-QR, we tackle the task of localizing procedural steps in an instructional video. The experimental results show that using Micro QR Codes together with our method improves biochemical video understanding. Data and code are available at https://nishi10mo.github.io/BioVL-QR/.
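To make the labeling idea concrete, the following is a minimal, hypothetical sketch (illustrative names only, not the paper's released code): each Micro QR detection carries an object label and a bounding box, while the hand-object detector yields unlabeled object boxes. An object box inherits the label of the QR code whose center falls inside it; when the QR is blurred or occluded, the box falls back to the label of the best-overlapping box remembered from earlier frames.

```python
from dataclasses import dataclass


@dataclass
class Box:
    x1: float
    y1: float
    x2: float
    y2: float

    def contains(self, x: float, y: float) -> bool:
        return self.x1 <= x <= self.x2 and self.y1 <= y <= self.y2

    def center(self) -> tuple:
        return ((self.x1 + self.x2) / 2, (self.y1 + self.y2) / 2)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two boxes, used to match boxes across frames."""
    ix1, iy1 = max(a.x1, b.x1), max(a.y1, b.y1)
    ix2, iy2 = min(a.x2, b.x2), min(a.y2, b.y2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a.x2 - a.x1) * (a.y2 - a.y1) + (b.x2 - b.x1) * (b.y2 - b.y1) - inter
    return inter / union if union > 0 else 0.0


def label_objects(object_boxes, qr_detections, memory, iou_thresh=0.3):
    """Assign a label to each object box from the hand-object detector.

    object_boxes:  list[Box] from the hand-object detector.
    qr_detections: list[(label, Box)] decoded by the Micro QR Code detector.
    memory:        list[(label, Box)] of labeled boxes from earlier frames;
                   consulted when the QR is blurred or occluded, updated in place.
    """
    labels = []
    for obj in object_boxes:
        assigned = None
        # 1) Prefer a decoded QR code whose center lies inside the object box.
        for qr_label, qr_box in qr_detections:
            cx, cy = qr_box.center()
            if obj.contains(cx, cy):
                assigned = qr_label
                break
        # 2) Fall back to the most overlapping previously labeled box.
        if assigned is None:
            best = max(memory, key=lambda m: iou(obj, m[1]), default=None)
            if best is not None and iou(obj, best[1]) >= iou_thresh:
                assigned = best[0]
        if assigned is not None:
            memory.append((assigned, obj))
        labels.append(assigned)
    return labels
```

In this sketch, the frame where the QR is readable seeds the label, and subsequent frames where the code is blurred or hidden by the hand recover it via spatial overlap; the actual method's association and tracking details may differ.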