🤖 AI Summary
This survey reviews few-shot learning (FSL) for video and 3D object detection, where the goal is to recognize novel categories from very few annotated samples. Doing so reduces the burden of cross-frame video annotation and addresses the generalization challenges posed by sparse, textureless 3D point clouds. For video, the survey covers temporal methods such as temporal matching networks and tube proposals that propagate information across frames. For 3D, it covers approaches that pair FSL with specialized point cloud networks and losses tailored for class imbalance. Core issues in both domains include balancing generalization against overfitting, integrating prototype matching, and handling modality-specific data properties. The survey concludes that FSL is a promising way to cut annotation cost and support real-world applications such as autonomous driving.
📝 Abstract
Few-shot learning (FSL) enables object detection models to recognize novel classes given only a few annotated examples, thereby reducing expensive manual data labeling. This survey examines recent FSL advances for video and 3D object detection. For video, FSL is especially valuable because annotating objects across frames is more laborious than annotating static images. By propagating information across frames, techniques such as tube proposals and temporal matching networks can detect new classes from only a couple of examples, efficiently leveraging spatiotemporal structure. FSL for 3D detection from LiDAR or depth data faces challenges such as sparsity and lack of texture. Solutions integrate FSL with specialized point cloud networks and losses tailored for class imbalance. Few-shot 3D detection supports practical autonomous driving deployment by minimizing costly 3D annotation needs. Core issues in both domains include balancing generalization against overfitting, integrating prototype matching, and handling the properties of each data modality. Overall, this survey shows FSL's promise for reducing annotation and supervision requirements and for enabling deployment across video, 3D, and other real-world applications by efficiently leveraging information across features, time, and data modalities.
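To make the "prototype matching" idea mentioned above concrete, the following is a minimal sketch in the style of prototypical networks: each class prototype is the mean of its few support embeddings, and a query is assigned to the nearest prototype. It is an illustration under stated assumptions, not a method taken from the survey. The function names `class_prototypes` and `match`, the class labels, and the 2-D embeddings are all hypothetical; in a real detector the embeddings would come from a learned backbone applied to region or point cloud features.

```python
from collections import defaultdict
import math


def class_prototypes(support):
    """Average the support embeddings of each class.

    support: list of (label, embedding) pairs, where embedding is a list of floats.
    Returns a dict mapping label -> mean embedding (the class prototype).
    """
    grouped = defaultdict(list)
    for label, emb in support:
        grouped[label].append(emb)
    return {
        label: [sum(dim) / len(embs) for dim in zip(*embs)]
        for label, embs in grouped.items()
    }


def match(query, prototypes):
    """Return the label whose prototype is closest to query in Euclidean distance."""
    return min(prototypes, key=lambda label: math.dist(query, prototypes[label]))


# Toy 2-way, 2-shot episode with hypothetical 2-D embeddings.
support = [
    ("car", [1.0, 0.0]), ("car", [0.8, 0.2]),
    ("cyclist", [0.0, 1.0]), ("cyclist", [0.2, 0.8]),
]
prototypes = class_prototypes(support)
print(match([0.9, 0.1], prototypes))
```

Because a prototype is just an average, adding a novel class requires only a handful of labeled embeddings and no retraining of the classifier head, which is one reason this kind of matching is a common building block in few-shot detection.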