ReCCur: A Recursive Corner-Case Curation Framework for Robust Vision-Language Understanding in Open and Edge Scenarios

📅 2026-01-06
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the failure of vision-language understanding in open and edge scenarios caused by rare or extreme corner cases. To this end, the authors propose a computationally efficient recursive multi-agent framework that integrates vision-language models with multiple encoders (CLIP, DINOv2, and BEiT). The approach enables automatic purification of noisy data and fine-grained, auditable annotation through tri-modal consistency filtering, region-to-semantic chain verification, mixture-of-experts knowledge distillation, and region-wise adversarial evidence labeling. With minimal human intervention, the system substantially enhances data purity and class separability while remaining deployable on consumer-grade GPUs, thereby providing a high-accuracy, interpretable foundation for downstream tasks.

📝 Abstract
Corner cases are rare or extreme scenarios that drive real-world failures, but they are difficult to curate at scale: web data are noisy, labels are brittle, and edge deployments preclude large retraining. We present ReCCur (Recursive Corner-Case Curation), a low-compute framework that converts noisy web imagery into auditable fine-grained labels via a multi-agent recursive pipeline. First, large-scale data acquisition and filtering expands a domain vocabulary with a vision-language model (VLM), crawls the web, and enforces tri-modal (image, description, keyword) consistency with light human spot checks to yield refined candidates. Next, mixture-of-experts knowledge distillation uses complementary encoders (e.g., CLIP, DINOv2, BEiT) for kNN voting with dual-confidence activation and uncertainty sampling, converging to a high-precision set. Finally, region-evidence VLM adversarial labeling pairs a proposer (multi-granularity regions and semantic cues) with a validator (global and local chained consistency) to produce explainable labels and close the loop. On realistic corner-case scenarios (e.g., flooded-car inspection), ReCCur runs on consumer-grade GPUs, steadily improves purity and separability, and requires minimal human supervision, providing a practical substrate for downstream training and evaluation under resource constraints. Code and dataset will be released.
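
The multi-encoder kNN voting step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the thresholds `min_share` and `min_agree`, and the reading of "dual-confidence activation" as "each agreeing encoder's vote share is high enough and the encoders agree with each other" are all assumptions made for illustration. Embeddings are plain Python lists standing in for CLIP, DINOv2, or BEiT features.

```python
from collections import Counter
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def knn_vote(query, bank, labels, k):
    """Return (majority label, vote share) among the k most similar bank items."""
    ranked = sorted(range(len(bank)),
                    key=lambda i: cosine(query, bank[i]),
                    reverse=True)
    votes = Counter(labels[i] for i in ranked[:k])
    label, count = votes.most_common(1)[0]
    return label, count / k


def ensemble_decision(query_by_encoder, bank_by_encoder, labels,
                      k=3, min_share=0.6, min_agree=1.0):
    """Combine per-encoder kNN votes (hypothetical decision rule).

    A sample is accepted only if (1) the fraction of encoders voting for
    the leading label is at least min_agree and (2) every encoder that
    voted for it did so with a vote share of at least min_share.
    Anything else is flagged for review.
    """
    results = [knn_vote(query_by_encoder[name], bank_by_encoder[name], labels, k)
               for name in query_by_encoder]
    label, n_agree = Counter(lbl for lbl, _ in results).most_common(1)[0]
    agree = n_agree / len(results)
    confident = all(share >= min_share for lbl, share in results if lbl == label)
    status = "accept" if agree >= min_agree and confident else "review"
    return label, status


# Toy reference set shared by two hypothetical encoders.
labels = ["flooded", "flooded", "dry", "dry"]
bank = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
banks = {"enc_a": bank, "enc_b": bank}

agreeing = ensemble_decision({"enc_a": [1.0, 0.05], "enc_b": [0.95, 0.0]}, banks, labels)
conflicting = ensemble_decision({"enc_a": [1.0, 0.05], "enc_b": [0.0, 1.0]}, banks, labels)
```

In this sketch, samples returned with status "review" are the uncertain ones, which is where light human spot checks or further rounds of the recursive pipeline would plausibly be applied.
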
Problem

Research questions and friction points this paper is trying to address.

corner cases
vision-language understanding
data curation
edge scenarios
open-world
Innovation

Methods, ideas, or system contributions that make the work stand out.

corner-case curation
vision-language model
multi-agent recursive pipeline
mixture-of-experts distillation
adversarial labeling
Yihan Wei
Nanyang Technological University
Shenghai Yuan
Nanyang Technological University
Tianchen Deng
Shanghai Jiao Tong University
Robotics, Computer Vision
Boyang Lou
Beijing University of Posts and Telecommunications
Enwen Hu
Beijing University of Posts and Telecommunications