Improving Token-based Object Detection with Video

📅 2025-06-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Traditional video object detection methods rely heavily on hand-crafted components—including region proposals, heuristic post-processing (e.g., NMS), and strong spatiotemporal priors—limiting end-to-end learnability and generalization. To address this, we propose VideoPix2Seq: the first end-to-end, discrete token-based sequence modeling framework extending Pix2Seq to the video domain. Its core innovation lies in representing objects as atomic, variable-length 3D spatiotemporal tracklets—eliminating anchors, NMS, and explicit trajectory association entirely. The model employs a Transformer architecture for video-level autoregressive sequence modeling, naturally supporting arbitrary input lengths and seamless integration with multi-object tracking. Evaluated on benchmarks including UA-DETRAC, VideoPix2Seq significantly outperforms static Pix2Seq baselines and achieves performance on par with state-of-the-art methods, demonstrating the effectiveness and robustness of a purely sequence-based paradigm for video object detection.
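To make the token-based representation concrete, here is a minimal sketch (not the authors' code) of how a 3D spatiotemporal tracklet could be serialized into discrete tokens in the Pix2Seq style. The bin count `n_bins`, the token layout (start/end frame tokens, per-frame corner tokens, trailing class token), and the helper names are assumptions for illustration only.

```python
# Illustrative sketch, not the paper's implementation: Pix2Seq-style
# quantization of a 3D tracklet into a flat sequence of discrete tokens.

def quantize(value, max_value, n_bins=1000):
    """Map a continuous coordinate in [0, max_value] to a discrete bin index."""
    v = min(max(value, 0.0), max_value)
    return round(v / max_value * (n_bins - 1))

def tracklet_to_tokens(boxes, class_id, width, height, n_frames, n_bins=1000):
    """Serialize one tracklet as:
    [t_start, t_end, x1, y1, x2, y2 (per spanned frame)..., class_token].
    `boxes` is a list of (frame_index, x1, y1, x2, y2) tuples."""
    t_start, t_end = boxes[0][0], boxes[-1][0]
    tokens = [quantize(t_start, n_frames - 1, n_bins),
              quantize(t_end, n_frames - 1, n_bins)]
    for _, x1, y1, x2, y2 in boxes:
        tokens += [quantize(x1, width, n_bins), quantize(y1, height, n_bins),
                   quantize(x2, width, n_bins), quantize(y2, height, n_bins)]
    # Class tokens occupy vocabulary slots beyond the coordinate bins.
    tokens.append(n_bins + class_id)
    return tokens

# A two-frame tracklet in a 1280x720, 8-frame clip.
boxes = [(0, 10.0, 20.0, 110.0, 220.0), (1, 12.0, 22.0, 112.0, 222.0)]
print(tracklet_to_tokens(boxes, class_id=3, width=1280, height=720, n_frames=8))
```

Because each tracklet is one contiguous token span, the decoder emits whole video objects directly, with no per-frame boxes to associate afterwards.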

📝 Abstract
This paper improves upon the Pix2Seq object detector by extending it for videos. In the process, it introduces a new way to perform end-to-end video object detection that improves upon existing video detectors in two key ways. First, by representing objects as variable-length sequences of discrete tokens, we can succinctly represent widely varying numbers of video objects, with diverse shapes and locations, without having to inject any localization cues in the training process. This eliminates the need to sample the space of all possible boxes that constrains conventional detectors and thus solves the dual problems of loss sparsity during training and heuristics-based postprocessing during inference. Second, it conceptualizes and outputs the video objects as fully integrated and indivisible 3D boxes or tracklets instead of generating image-specific 2D boxes and linking these boxes together to construct the video object, as done in most conventional detectors. This allows it to scale effortlessly with available computational resources by simply increasing the length of the video subsequence that the network takes as input, even generalizing to multi-object tracking if the subsequence can span the entire video. We compare our video detector with the baseline Pix2Seq static detector on several datasets and demonstrate consistent improvement, although with strong signs of being bottlenecked by our limited computational resources. We also compare it with several video detectors on UA-DETRAC to show that it is competitive with the current state of the art even with the computational bottleneck. We make our code and models publicly available.
Problem

Research questions and friction points this paper is trying to address.

Conventional detectors must sample the space of all possible boxes, causing loss sparsity during training
Heuristics-based postprocessing (e.g., NMS) is required at inference time in conventional pipelines
Video objects are typically built by linking per-frame 2D boxes, requiring explicit trajectory association
Innovation

Methods, ideas, or system contributions that make the work stand out.

Variable-length discrete tokens for object representation
End-to-end 3D box or tracklet output
Scalable with longer video subsequence inputs
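The scalability claim above can be made concrete with a back-of-envelope sketch (my own, not from the paper): under the assumed token layout of 2 time tokens, 4 corner tokens per spanned frame, and 1 class token per tracklet, the decoded sequence grows linearly with the subsequence length T, so longer clips cost only a longer output sequence, not a different architecture.

```python
# Hypothetical token-budget estimate; the per-tracklet token layout is an
# assumption for illustration, not the paper's exact scheme.

def sequence_length(n_tracklets, frames_per_tracklet):
    """Total decoded tokens for n_tracklets, each spanning frames_per_tracklet."""
    tokens_per_tracklet = 2 + 4 * frames_per_tracklet + 1
    return n_tracklets * tokens_per_tracklet + 1  # +1 end-of-sequence token

for T in (4, 8, 16):  # tracklets spanning the whole T-frame clip
    print(T, sequence_length(n_tracklets=10, frames_per_tracklet=T))
```

Doubling T roughly doubles the sequence length per tracklet, which matches the paper's observation that the approach scales with available compute simply by lengthening the input subsequence.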