🤖 AI Summary
To address background shift and catastrophic forgetting in incremental object detection—caused by background class overlap across tasks—this paper proposes the Class-Agnostic Shared Attribute (CASA) framework. CASA is the first to incorporate generic semantic attributes from vision-language foundation models into incremental detection, enabling class-agnostic, decoupled attribute representations. It introduces a dynamic attribute selection-and-freezing mechanism alongside an attribute assignment matrix to support adaptive evolution of the semantic space. Built upon OWL-ViT, CASA integrates LLM-generated textual attributes, parameter-efficient fine-tuning (PEFT), and attribute importance modeling. Evaluated on COCO under both two-stage and multi-stage incremental settings, CASA achieves state-of-the-art performance with only a 0.7% increase in parameter storage, significantly enhancing scalability and cross-task generalization.
📝 Abstract
Incremental object detection (IOD) is challenged by background shift, where background categories in sequential data may include previously learned or future classes. Vision-language foundation models such as CLIP capture shared attributes from extensive image-text paired data during pre-training, which inspires us to exploit these attributes for incremental object detection. Our method constructs a Class-Agnostic Shared Attribute base (CASA) to capture common semantic information across incremental classes. Specifically, we use large language models to generate candidate textual attributes, select the most relevant ones based on the current training data, and record their significance in an attribute assignment matrix. For subsequent tasks, we freeze the retained attributes and continue selecting from the remaining candidates while updating the attribute assignment matrix accordingly. We adopt OWL-ViT as our baseline and preserve the original parameters of the pre-trained foundation model; through parameter-efficient fine-tuning, our method adds only 0.7% to parameter storage while significantly enhancing the scalability and adaptability of IOD. Extensive two-phase and multi-phase experiments on the COCO dataset demonstrate that our method achieves state-of-the-art performance.
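The select-then-freeze loop over candidate attributes can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the function name, the use of cosine similarity as the relevance score, and the max-over-classes aggregation are all assumptions; the paper's actual attribute embeddings, scoring, and assignment-matrix updates may differ.

```python
import numpy as np

def select_attributes(candidates, class_embeds, frozen_mask, k):
    """Pick the k most relevant *unfrozen* candidate attributes for the
    current task's classes, and build the weight block recorded in the
    attribute assignment matrix.

    candidates   : (num_attrs, dim) text embeddings of LLM-generated attributes
    class_embeds : (num_classes, dim) embeddings of the current task's classes
    frozen_mask  : (num_attrs,) bool, True for attributes frozen by earlier tasks
    k            : number of attributes to select for this task
    """
    # Cosine similarity between every class and every candidate attribute.
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    q = class_embeds / np.linalg.norm(class_embeds, axis=1, keepdims=True)
    sim = q @ c.T                                  # (num_classes, num_attrs)

    # Relevance of an attribute = its best match over the current classes.
    relevance = sim.max(axis=0).copy()
    relevance[frozen_mask] = -np.inf               # frozen attributes are excluded

    chosen = np.sort(np.argsort(relevance)[-k:])   # indices of this task's attributes

    # Assignment-matrix block: each class keeps weights only on the
    # attributes selected for this task (frozen attributes retain the
    # weights recorded when *their* task selected them).
    weights = np.zeros_like(sim)
    weights[:, chosen] = sim[:, chosen]
    return chosen, weights
```

A caller would invoke this once per incremental task, freezing the returned indices afterwards (`frozen_mask[chosen] = True`) so that later tasks can only draw from the remaining candidate pool, which keeps earlier tasks' semantic space intact.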