Loss Knows Best: Detecting Annotation Errors in Videos via Loss Trajectories

📅 2026-02-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses how annotation errors in video datasets, such as mislabeling and temporal misalignment, degrade models for temporally sensitive tasks. The authors propose a model-agnostic approach based on dynamic loss trajectory analysis: by tracking each frame's average loss across multiple training checkpoints, they construct Cumulative Sample Loss (CSL) trajectories that serve as frame-level learnability fingerprints. This enables the identification of hard-to-learn samples without requiring ground-truth error labels. Experiments on the EgoPER and Cholec80 datasets demonstrate that the proposed technique effectively detects subtle annotation inaccuracies, exhibiting strong generalization and practical utility in real-world scenarios.

📝 Abstract
High-quality video datasets are foundational for training robust models in tasks like action recognition, phase detection, and event segmentation. However, many real-world video datasets suffer from annotation errors such as *mislabeling*, where segments are assigned incorrect class labels, and *disordering*, where the temporal sequence does not follow the correct progression. These errors are particularly harmful in phase-annotated tasks, where temporal consistency is critical. We propose a novel, model-agnostic method for detecting annotation errors by analyzing the Cumulative Sample Loss (CSL), defined as the average loss a frame incurs when passing through model checkpoints saved across training epochs. This per-frame loss trajectory acts as a dynamic fingerprint of frame-level learnability. Mislabeled or disordered frames tend to show consistently high or irregular loss patterns, as they remain difficult for the model to learn throughout training, while correctly labeled frames typically converge to low loss early. To compute CSL, we train a video segmentation model and store its weights at each epoch. These checkpoints are then used to evaluate the loss of each frame in a test video. Frames with persistently high CSL are flagged as likely candidates for annotation errors, including mislabeling or temporal misalignment. Our method does not require ground truth on annotation errors and is generalizable across datasets. Experiments on EgoPER and Cholec80 demonstrate strong detection performance, effectively identifying subtle inconsistencies such as mislabeling and frame disordering. The proposed approach provides a powerful tool for dataset auditing and improving training reliability in video-based machine learning.
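The CSL computation described in the abstract can be sketched in a few lines. The following is a minimal, illustrative implementation, not the paper's code: function names, the per-frame loss matrix layout, and the mean-plus-k-sigma flagging threshold are all assumptions (the paper only says frames with "persistently high CSL" are flagged).

```python
import numpy as np

def cumulative_sample_loss(loss_per_checkpoint: np.ndarray) -> np.ndarray:
    """Average each frame's loss over all saved checkpoints.

    loss_per_checkpoint: shape (num_checkpoints, num_frames), where entry
    [e, t] is the loss of frame t under the checkpoint saved at epoch e.
    Returns one CSL value per frame.
    """
    return loss_per_checkpoint.mean(axis=0)

def flag_suspect_frames(csl: np.ndarray, k: float = 2.0) -> np.ndarray:
    """Flag frames whose CSL is an outlier under a mean + k*std heuristic.

    The exact thresholding rule is an assumption for illustration; any
    outlier detector over the CSL values could be substituted here.
    """
    threshold = csl.mean() + k * csl.std()
    return np.flatnonzero(csl > threshold)

# Toy example: 5 checkpoints x 8 frames; frame 3 never converges,
# mimicking a mislabeled frame that stays hard throughout training.
rng = np.random.default_rng(0)
losses = rng.uniform(0.0, 0.2, size=(5, 8))
losses[:, 3] += 2.0
csl = cumulative_sample_loss(losses)
print(flag_suspect_frames(csl))  # → [3]
```

In practice the loss matrix would come from running each saved checkpoint of the segmentation model over the audited video and recording the per-frame loss, exactly the checkpoint-replay procedure the abstract describes.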
Problem

Research questions and friction points this paper is trying to address.

annotation errors
mislabeling
disordering
video datasets
temporal consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cumulative Sample Loss
annotation error detection
loss trajectory
video dataset auditing
model-agnostic