Robot-Powered Data Flywheels: Deploying Robots in the Wild for Continual Data Collection and Foundation Model Adaptation

📅 2025-11-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Foundation models (FMs) generalize poorly in unstructured real-world scenarios (e.g., occluded or multilingual text), primarily because of the distributional shift between internet pretraining data and deployment environments. Method: We propose a robot-powered data flywheel framework that transforms embodied robots from FM consumers into autonomous data producers: while performing useful tasks in situ, robots collect visual-language data, automatically annotate it (e.g., matching images against the library catalog for vision-language-model-assisted labeling), and fine-tune the underlying vision-language model (VLM) in a closed loop. Contribution/Results: Deployed for two weeks in the East Asia Library, the robot scanned 2,103 shelves, improving VLM book-identification accuracy from 32.0% to 71.8% and substantially boosting multilingual OCR performance, while saving roughly 18.7 hours of manual annotation effort. The framework enables domain adaptation without human annotation and promotes domain-adjacent generalization.
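The catalog-based auto-labeling step can be sketched as follows. This is a minimal illustration assuming fuzzy string matching between noisy VLM/OCR spine text and catalog titles; the paper's actual matching procedure is not specified here, and all names in this snippet are hypothetical:

```python
import difflib

def auto_label(ocr_text, catalog_titles, threshold=0.6):
    """Match noisy VLM/OCR spine text against library catalog titles.

    Returns the best-matching catalog title as a pseudo-label,
    or None if no title is similar enough (no human annotation needed).
    """
    best_title, best_score = None, 0.0
    for title in catalog_titles:
        score = difflib.SequenceMatcher(
            None, ocr_text.lower(), title.lower()
        ).ratio()
        if score > best_score:
            best_title, best_score = title, score
    return best_title if best_score >= threshold else None

catalog = ["Introduction to Robotics", "Foundations of Machine Learning"]
print(auto_label("Intro to Robotcs", catalog))  # → Introduction to Robotics
```

The threshold trades label coverage against label noise: a low threshold labels more images but risks mismatches, while a high threshold keeps only confident pseudo-labels for fine-tuning.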

📝 Abstract
Foundation models (FMs) have unlocked powerful zero-shot capabilities in vision and language, yet their reliance on internet pretraining data leaves them brittle in unstructured, real-world settings. The messy, real-world data encountered during deployment (e.g. occluded or multilingual text) remains massively underrepresented in existing corpora. Robots, as embodied agents, are uniquely positioned to close this gap: they can act in physical environments to collect large-scale, real-world data that enriches FM training with precisely the examples current models lack. We introduce the Robot-Powered Data Flywheel, a framework that transforms robots from FM consumers into data generators. By deploying robots equipped with FMs in the wild, we enable a virtuous cycle: robots perform useful tasks while collecting real-world data that improves both domain-specific adaptation and domain-adjacent generalization. We instantiate this framework with Scanford, a mobile manipulator deployed in the East Asia Library for 2 weeks. Scanford autonomously scans shelves, identifies books using a vision-language model (VLM), and leverages the library catalog to label images without human annotation. This deployment both aids librarians and produces a dataset to finetune the underlying VLM, improving performance on the domain-specific in-the-wild library setting and on domain-adjacent multilingual OCR benchmarks. Using data collected from 2103 shelves, Scanford improves VLM performance on book identification from 32.0% to 71.8% and boosts domain-adjacent multilingual OCR from 24.8% to 46.6% (English) and 30.8% to 38.0% (Chinese), while saving ~18.7 hrs of human time. These results highlight how robot-powered data flywheels can both reduce human effort in real deployments and unlock new pathways for continually adapting FMs to the messiness of reality. More details are at: https://scanford-robot.github.io
Problem

Research questions and friction points this paper is trying to address.

Foundation models fail in real-world settings because the messy data encountered during deployment is underrepresented in pretraining corpora
Current models struggle with unstructured conditions such as occlusions and multilingual text
Robots, as embodied agents, can collect real-world data that improves both domain-specific adaptation and domain-adjacent generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Robots act as autonomous data producers, collecting real-world data that improves the foundation models they run
Deployed robots scan their environment and label the collected data automatically (e.g., via the library catalog)
This creates a continuous improvement cycle, a data flywheel, between robots and the models they use
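The improvement cycle described above can be sketched as a simple loop. This is an illustrative stub, not the paper's implementation: the robot, VLM, catalog, and fine-tuning step are all stand-in placeholders:

```python
class StubRobot:
    """Placeholder for the mobile manipulator scanning shelves."""
    def scan_shelves(self):
        return ["img_shelf_1", "img_shelf_2"]

class StubVLM:
    """Placeholder vision-language model; version counts fine-tune rounds."""
    def __init__(self, version=0):
        self.version = version
    def identify(self, image):
        return f"predicted-title-for-{image}"

class StubCatalog:
    """Placeholder library catalog used to pseudo-label predictions."""
    def match(self, prediction):
        return prediction.upper()  # pretend every prediction matches an entry

def finetune(vlm, dataset):
    """Stand-in for fine-tuning the VLM on the auto-labeled dataset."""
    return StubVLM(version=vlm.version + 1)

def data_flywheel(robot, vlm, catalog, rounds=2):
    """Scan -> identify -> auto-label from catalog -> fine-tune -> repeat."""
    dataset = []
    for _ in range(rounds):
        for image in robot.scan_shelves():
            label = catalog.match(vlm.identify(image))
            if label is not None:
                dataset.append((image, label))
        vlm = finetune(vlm, dataset)  # closed-loop adaptation, no human labels
    return vlm, dataset

vlm, data = data_flywheel(StubRobot(), StubVLM(), StubCatalog())
print(vlm.version, len(data))  # 2 4
```

Each pass through the loop both performs useful work (scanning shelves) and grows the training set, so the model that drives the next deployment round is strictly better informed about its own domain.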