🤖 AI Summary
To address the bottleneck that knowledge in deep neural networks is rigidly encoded in weights and therefore hard to edit dynamically, this paper proposes a visual memory framework that decouples image classification into similarity matching with a pretrained embedding and nearest-neighbor retrieval from an external memory bank. Methodologically, it introduces the first real-time knowledge insertion/deletion, unlearning, and memory-pruning capabilities for deep vision models, establishing an interpretable decision mechanism that supports direct intervention and scales from individual samples to billion-scale memory. Core technical innovations include a lightweight memory database architecture, efficient approximate nearest-neighbor (ANN) search, and differentiable memory update strategies. Evaluated across multi-scale benchmarks, the framework achieves high-accuracy classification while enabling decision-attribution visualization and millisecond-level sample editing, significantly enhancing model controllability, adaptability, and maintainability.
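The decoupling described above can be illustrated with a minimal sketch. This is not the paper's code: `VisualMemory`, `embed_fn`, and the method names are illustrative assumptions, and the exact cosine-similarity search below stands in for the ANN index a billion-scale memory would require.

```python
import math

class VisualMemory:
    """Toy visual memory: a frozen embedding plus a searchable database of labeled vectors."""

    def __init__(self, embed_fn, k=5):
        self.embed_fn = embed_fn      # pretrained, frozen image -> vector map (assumed given)
        self.k = k                    # number of neighbors consulted per decision
        self.entries = []             # list of (unit vector, label) pairs

    @staticmethod
    def _normalize(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v]

    def insert(self, image, label):
        """Adding knowledge is a database write, not a training run."""
        self.entries.append((self._normalize(self.embed_fn(image)), label))

    def delete(self, index):
        """Unlearning a sample amounts to removing its row from memory."""
        self.entries.pop(index)

    def classify(self, image):
        """Exact cosine-similarity search; at scale this would be swapped for ANN search."""
        q = self._normalize(self.embed_fn(image))
        scored = sorted(
            ((sum(a * b for a, b in zip(v, q)), label) for v, label in self.entries),
            reverse=True,
        )[: self.k]
        votes = {}
        for sim, label in scored:
            votes[label] = votes.get(label, 0.0) + sim
        return max(votes, key=votes.get)
```

Because predictions come from explicit memory rows, editing the model's knowledge is an `insert` or `delete` on the database rather than a retraining job, which is what makes millisecond-level sample editing possible.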
📝 Abstract
Training a neural network is a monolithic endeavor, akin to carving knowledge into stone: once the process is completed, editing the knowledge in a network is nearly impossible, since all information is distributed across the network's weights. We here explore a simple, compelling alternative by marrying the representational power of deep neural networks with the flexibility of a database. Decomposing the task of image classification into image similarity (from a pre-trained embedding) and search (via fast nearest neighbor retrieval from a knowledge database), we build a simple and flexible visual memory that has the following key capabilities: (1.) The ability to flexibly add data across scales: from individual samples all the way to entire classes and billion-scale data; (2.) The ability to remove data through unlearning and memory pruning; (3.) An interpretable decision-mechanism on which we can intervene to control its behavior. Taken together, these capabilities comprehensively demonstrate the benefits of an explicit visual memory. We hope that it might contribute to a conversation on how knowledge should be represented in deep vision models -- beyond carving it in "stone" weights.
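Capability (3.), the interpretable decision mechanism, follows from the retrieval formulation: the neighbors that voted for a prediction are the explanation, and intervening means editing those memories. The sketch below is a hypothetical illustration (the function names and the `(similarity, label, sample_id)` neighbor format are assumptions, not the paper's API).

```python
def predict_with_evidence(neighbors, k=3):
    """Predict from retrieved neighbors and return the memory entries that support the decision.

    neighbors: list of (similarity, label, sample_id) tuples from the memory search.
    """
    top = sorted(neighbors, reverse=True)[:k]          # highest-similarity neighbors first
    votes = {}
    for sim, label, _ in top:
        votes[label] = votes.get(label, 0.0) + sim     # similarity-weighted voting
    winner = max(votes, key=votes.get)
    evidence = [sid for sim, label, sid in top if label == winner]
    return winner, evidence

def intervene(neighbors, bad_ids):
    """Memory pruning as intervention: drop entries traced to a bad source, then re-decide."""
    return [n for n in neighbors if n[2] not in bad_ids]
```

If attribution reveals that a decision rests on mislabeled or unwanted samples, pruning exactly those entries changes the model's behavior immediately, with no gradient updates involved.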