Live Interactive Training for Video Segmentation

📅 2026-03-27
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses a limitation of existing interactive video segmentation methods: they cannot continually learn from user feedback, often necessitating numerous repetitive corrections in complex scenes. To overcome this, the authors propose the Live Interactive Training (LIT) framework, the first approach enabling online continual learning for vision models during inference. At its core, LIT employs a lightweight LIT-LoRA module that integrates user corrections on-the-fly via a LoRA-based architecture and generalizes these updates to subsequent video frames in real time. The method substantially reduces human intervention, achieving an 18–34% average reduction in total correction counts across multiple video segmentation benchmarks. Each online update takes only about 0.5 seconds, and the approach transfers readily to other segmentation models as well as to CLIP-based image classification tasks.
πŸ“ Abstract
Interactive video segmentation often requires many user interventions for robust performance in challenging scenarios (e.g., occlusions, object separations, camouflage, etc.). Yet, even state-of-the-art models like SAM2 use corrections only for immediate fixes without learning from this feedback, leading to inefficient, repetitive user effort. To address this, we introduce Live Interactive Training (LIT), a novel framework for prompt-based visual systems where models also learn online from human corrections at inference time. Our primary instantiation, LIT-LoRA, implements this by continually updating a lightweight LoRA module on-the-fly. When a user provides a correction, this module is rapidly trained on that feedback, allowing the vision system to improve performance on subsequent frames of the same video. Leveraging the core principles of LIT, our LIT-LoRA implementation achieves an average 18-34% reduction in total corrections on challenging video segmentation benchmarks, with a negligible training overhead of ~0.5s per correction. We further demonstrate its generality by successfully adapting it to other segmentation models and extending it to CLIP-based fine-grained image classification. Our work highlights the promise of live adaptation to transform interactive tools and significantly reduce redundant human effort in complex visual tasks. Project: https://youngxinyu1802.github.io/projects/LIT/.
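The core mechanism described above (freeze the base model, train only a small low-rank adapter on each user correction) can be illustrated with a minimal pure-Python sketch. This is not the paper's implementation: the base weights stand in for a frozen segmentation head, the rank-1 factors, dimensions, learning rate, and plain SGD loop are all illustrative assumptions chosen to show the LoRA-style online update pattern in isolation.

```python
# Minimal sketch of a LoRA-style online update, in the spirit of LIT-LoRA:
# the base weights W stay frozen; only the low-rank factors A and B are
# trained on a user correction. All dimensions and hyperparameters here
# are illustrative, not taken from the paper.

def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def lora_forward(W, A, B, x):
    """y = (W + B @ A) x, computed as W x + B (A x) to keep the low-rank path."""
    base = matvec(W, x)
    low = matvec(B, matvec(A, x))
    return [b + l for b, l in zip(base, low)]

def lora_update(W, A, B, x, target, lr=0.05):
    """One SGD step on squared error; only A and B (the adapter) change.
    Returns the loss *before* the step."""
    y = lora_forward(W, A, B, x)
    grad_y = [2.0 * (yi - ti) for yi, ti in zip(y, target)]
    Ax = matvec(A, x)
    # B^T grad_y, computed before B moves, so A and B take a simultaneous step.
    BTg = [sum(B[o][k] * grad_y[o] for o in range(len(B))) for k in range(len(A))]
    for o in range(len(B)):           # grad wrt B: outer(grad_y, A x)
        for k in range(len(A)):
            B[o][k] -= lr * grad_y[o] * Ax[k]
    for k in range(len(A)):           # grad wrt A: outer(B^T grad_y, x)
        for i in range(len(A[0])):
            A[k][i] -= lr * BTg[k] * x[i]
    return sum((yi - ti) ** 2 for yi, ti in zip(y, target))

# Frozen base weights (a stand-in for the pretrained model), rank-1 adapter.
W = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.4]]   # 2x3, frozen
A = [[0.1, 0.2, -0.1]]                      # 1x3, trainable
B = [[0.0], [0.0]]                          # 2x1, trainable, zero-init as in LoRA

x = [1.0, 0.5, -1.0]                        # features of a corrected frame
target = [1.0, 0.0]                         # the user-corrected output

losses = [lora_update(W, A, B, x, target) for _ in range(15)]
```

After a handful of these ~cheap updates, the adapted output moves toward the user's correction (`losses` decreases) while `W` is untouched, which is what lets such an adapter be trained in roughly half a second per correction and discarded or transferred independently of the base model.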
Problem

Research questions and friction points this paper is trying to address.

interactive video segmentation
user corrections
online learning
redundant human effort
challenging scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Live Interactive Training
Online Learning
LoRA
Interactive Video Segmentation
Human-in-the-loop
Authors

Xinyu Yang (Cornell University, Machine Learning)
Haozheng Yu (Cornell University)
Yihong Sun (Cornell University)
Bharath Hariharan (Cornell University)
Jennifer J. Sun (Assistant Professor at Cornell CS; machine learning, computer vision, AI for science)