Live Interactive Training for Video Segmentation

📅 2026-03-27
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses a limitation of existing interactive video segmentation methods: they cannot continually learn from user feedback, often necessitating numerous repetitive corrections in complex scenes. To overcome this, the authors propose the Live Interactive Training (LIT) framework, the first approach enabling online continual learning for vision models during inference. At its core, LIT employs a lightweight LIT-LoRA module that integrates user corrections on-the-fly via a LoRA-based architecture and generalizes these updates to subsequent video frames in real time. The method substantially reduces human intervention, achieving an 18–34% average reduction in total correction counts across multiple video segmentation benchmarks. Each online update takes only about 0.5 seconds, and the approach transfers readily to other segmentation models as well as to CLIP-based image classification tasks.
πŸ“ Abstract
Interactive video segmentation often requires many user interventions for robust performance in challenging scenarios (e.g., occlusions, object separations, camouflage, etc.). Yet, even state-of-the-art models like SAM2 use corrections only for immediate fixes without learning from this feedback, leading to inefficient, repetitive user effort. To address this, we introduce Live Interactive Training (LIT), a novel framework for prompt-based visual systems where models also learn online from human corrections at inference time. Our primary instantiation, LIT-LoRA, implements this by continually updating a lightweight LoRA module on-the-fly. When a user provides a correction, this module is rapidly trained on that feedback, allowing the vision system to improve performance on subsequent frames of the same video. Leveraging the core principles of LIT, our LIT-LoRA implementation achieves an average 18-34% reduction in total corrections on challenging video segmentation benchmarks, with a negligible training overhead of ~0.5s per correction. We further demonstrate its generality by successfully adapting it to other segmentation models and extending it to CLIP-based fine-grained image classification. Our work highlights the promise of live adaptation to transform interactive tools and significantly reduce redundant human effort in complex visual tasks. Project: https://youngxinyu1802.github.io/projects/LIT/.
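The core mechanism described above (freeze the base model, train only a small low-rank adapter on each user correction) can be illustrated with a minimal pure-Python sketch. This is not the paper's implementation: the base weights stand in for a frozen segmentation head, the rank-1 factors, dimensions, learning rate, and plain SGD loop are all illustrative assumptions chosen to show the LoRA-style online update pattern in isolation.

```python
# Minimal sketch of a LoRA-style online update, in the spirit of LIT-LoRA:
# the base weights W stay frozen; only the low-rank factors A and B are
# trained on a user correction. All dimensions and hyperparameters here
# are illustrative, not taken from the paper.

def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def lora_forward(W, A, B, x):
    """y = (W + B @ A) x, computed as W x + B (A x) to keep the low-rank path."""
    base = matvec(W, x)
    low = matvec(B, matvec(A, x))
    return [b + l for b, l in zip(base, low)]

def lora_update(W, A, B, x, target, lr=0.05):
    """One SGD step on squared error; only A and B (the adapter) change.
    Returns the loss *before* the step."""
    y = lora_forward(W, A, B, x)
    grad_y = [2.0 * (yi - ti) for yi, ti in zip(y, target)]
    Ax = matvec(A, x)
    # B^T grad_y, computed before B moves, so A and B take a simultaneous step.
    BTg = [sum(B[o][k] * grad_y[o] for o in range(len(B))) for k in range(len(A))]
    for o in range(len(B)):           # grad wrt B: outer(grad_y, A x)
        for k in range(len(A)):
            B[o][k] -= lr * grad_y[o] * Ax[k]
    for k in range(len(A)):           # grad wrt A: outer(B^T grad_y, x)
        for i in range(len(A[0])):
            A[k][i] -= lr * BTg[k] * x[i]
    return sum((yi - ti) ** 2 for yi, ti in zip(y, target))

# Frozen base weights (a stand-in for the pretrained model), rank-1 adapter.
W = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.4]]   # 2x3, frozen
A = [[0.1, 0.2, -0.1]]                      # 1x3, trainable
B = [[0.0], [0.0]]                          # 2x1, trainable, zero-init as in LoRA

x = [1.0, 0.5, -1.0]                        # features of a corrected frame
target = [1.0, 0.0]                         # the user-corrected output

losses = [lora_update(W, A, B, x, target) for _ in range(15)]
```

After a handful of these ~cheap updates, the adapted output moves toward the user's correction (`losses` decreases) while `W` is untouched, which is what lets such an adapter be trained in roughly half a second per correction and discarded or transferred independently of the base model.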
Problem

Research questions and friction points this paper is trying to address.

interactive video segmentation
user corrections
online learning
redundant human effort
challenging scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Live Interactive Training
Online Learning
LoRA
Interactive Video Segmentation
Human-in-the-loop
Authors

Xinyu Yang (Cornell University, Machine Learning)
Haozheng Yu (Cornell University)
Yihong Sun (Cornell University)
Bharath Hariharan (Cornell University)
Jennifer J. Sun (Assistant Professor at Cornell CS; machine learning, computer vision, AI for science)