Lecture Video Visual Objects (LVVO) Dataset: A Benchmark for Visual Object Detection in Educational Videos

📅 2025-06-16

🤖 AI Summary

Problem: A lack of benchmark datasets hinders visual object detection research in educational videos. Method: This paper introduces LVVO, an education-specific benchmark for visual object detection, comprising 4,000 frames extracted from 245 lecture videos in biology, computer science, and geosciences. Annotations cover four object categories defined for educational contexts: Table, Chart-Graph, Photographic-image, and Visual-illustration. A 1,000-frame subset was labeled independently by two annotators, with a third expert resolving disagreements; inter-annotator F1 agreement was 83.41%. A semi-supervised strategy that integrates confidence-based filtering and model self-training was then used to annotate the remaining 3,000 frames automatically, forming LVVO_3k. Contribution/Results: The authors publicly release LVVO_1k (1,000 human-verified frames) and LVVO_3k (3,000 automatically annotated frames), providing a dedicated benchmark for educational video understanding that supports development and evaluation of both supervised and semi-supervised detection methods.

📝 Abstract

We introduce the Lecture Video Visual Objects (LVVO) dataset, a new benchmark for visual object detection in educational video content. The dataset consists of 4,000 frames extracted from 245 lecture videos spanning biology, computer science, and geosciences. A subset of 1,000 frames, referred to as LVVO_1k, has been manually annotated with bounding boxes for four visual categories: Table, Chart-Graph, Photographic-image, and Visual-illustration. Each frame was labeled independently by two annotators, resulting in an inter-annotator F1 score of 83.41%, indicating strong agreement. To ensure high-quality consensus annotations, a third expert reviewed and resolved all cases of disagreement through a conflict resolution process. To expand the dataset, a semi-supervised approach was employed to automatically annotate the remaining 3,000 frames, forming LVVO_3k. The complete dataset offers a valuable resource for developing and evaluating both supervised and semi-supervised methods for visual content detection in educational videos. The LVVO dataset is publicly available to support further research in this domain.
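The abstract reports an inter-annotator F1 score but does not spell out how boxes from the two annotators were matched. As a minimal sketch of one common way to compute such a score, the code below greedily pairs same-label boxes by intersection over union (IoU) and treats one annotator as the reference. The 0.5 IoU threshold, the greedy matching, and the names `iou` and `agreement_f1` are assumptions for illustration, not the authors' protocol.

```python
from typing import List, Tuple

# A box is (x1, y1, x2, y2, label); this layout is an assumption for the sketch.
Box = Tuple[float, float, float, float, str]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def agreement_f1(boxes_a: List[Box], boxes_b: List[Box],
                 iou_threshold: float = 0.5) -> float:
    """F1 between two annotators' boxes for one frame.

    Same-label pairs are matched greedily in order of decreasing IoU;
    each box can be matched at most once.
    """
    pairs = sorted(
        ((iou(a, b), i, j)
         for i, a in enumerate(boxes_a)
         for j, b in enumerate(boxes_b)
         if a[4] == b[4]),
        reverse=True,
    )
    used_a, used_b = set(), set()
    matches = 0
    for score, i, j in pairs:
        if score < iou_threshold:
            break  # remaining pairs overlap even less
        if i in used_a or j in used_b:
            continue
        used_a.add(i)
        used_b.add(j)
        matches += 1
    total = len(boxes_a) + len(boxes_b)
    # F1 = 2*TP / (2*TP + FP + FN) = 2*matches / (|A| + |B|)
    return 2 * matches / total if total else 1.0
```

Because F1 is symmetric in precision and recall, the result does not depend on which annotator is treated as the reference.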

Problem

Research questions and friction points this paper is trying to address.

Benchmark for detecting visual objects in educational videos

Dataset with annotated frames from diverse academic disciplines

Resource for supervised and semi-supervised detection methods

Innovation

Methods, ideas, or system contributions that make the work stand out.

Manually annotated 1k frames with bounding boxes

Semi-supervised auto-annotation for 3k frames

Inter-annotator F1 score of 83.41% achieved
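The semi-supervised auto-annotation item above can be pictured as a pseudo-labeling step: a detector trained on the manually annotated frames predicts boxes on unlabeled frames, and only confident predictions are kept as labels. The sketch below shows just the filtering part, under stated assumptions: the `Detection` record, the `filter_pseudo_labels` helper, and the 0.8 score threshold are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Detection:
    """One predicted box on a frame (hypothetical record layout)."""
    box: Tuple[float, float, float, float]  # (x1, y1, x2, y2)
    label: str                              # e.g. "Table", "Chart-Graph"
    score: float                            # detector confidence in [0, 1]


def filter_pseudo_labels(
    predictions: Dict[str, List[Detection]],
    min_score: float = 0.8,  # assumed threshold, not from the paper
) -> Dict[str, List[Detection]]:
    """Keep only confident detections; drop frames left with none."""
    kept: Dict[str, List[Detection]] = {}
    for frame_id, detections in predictions.items():
        confident = [d for d in detections if d.score >= min_score]
        if confident:
            kept[frame_id] = confident
    return kept
```

In a self-training loop, the frames returned here would be added to the training set before the detector is retrained; how many rounds to run and how to set the threshold are design choices this sketch leaves open.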
