🤖 AI Summary
Human-centric dense prediction tasks (depth estimation, surface normal estimation, and soft foreground segmentation) currently rely on very large models trained on massive real-world datasets, which brings high annotation and compute costs and raises concerns around consent, licensing, and fairness. This paper shows that lightweight models trained on much smaller, procedurally generated, high-fidelity synthetic datasets can match the accuracy of far larger fully supervised baselines on real images, at a fraction of the training and inference cost. The controllable synthesis pipeline yields pixel-perfect labels with strong guarantees on data provenance, usage rights, and user consent, and its explicit control over data diversity can be used to address unfairness in the trained models. Key contributions include: (1) a high-fidelity synthetic dataset for multi-task human-centric dense prediction; (2) an efficient, reproducible training framework; and (3) joint gains in accuracy, efficiency, fairness, and data compliance. Code, models, and the dataset are fully open-sourced.
📝 Abstract
The state of the art in human-centric computer vision achieves high accuracy and robustness across a diverse range of tasks. The most effective models in this domain have billions of parameters, thus requiring extremely large datasets, expensive training regimes, and compute-intensive inference. In this paper, we demonstrate that it is possible to train models on much smaller but high-fidelity synthetic datasets, with no loss in accuracy and higher efficiency. Using synthetic training data provides us with excellent levels of detail and perfect labels, while providing strong guarantees for data provenance, usage rights, and user consent. Procedural data synthesis also provides us with explicit control over data diversity, which we can use to address unfairness in the models we train. Extensive quantitative assessment on real input images demonstrates the accuracy of our models on three dense prediction tasks: depth estimation, surface normal estimation, and soft foreground segmentation. Our models require only a fraction of the training and inference cost of foundation models of similar accuracy. Our human-centric synthetic dataset and trained models are available at https://aka.ms/DAViD.
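To make the three dense prediction tasks concrete, the sketch below shows what their per-pixel outputs look like: depth is an unbounded scalar per pixel, surface normals are unit 3-vectors, and soft foreground segmentation is an alpha value in [0, 1]. This is a hypothetical NumPy illustration with random weights, not the paper's actual architecture; all names (`dense_predict`, the weight matrices) are invented for this example.

```python
import numpy as np

# Hypothetical sketch of a shared per-pixel feature extractor with three
# task heads, one per dense prediction task. Weights are random: this only
# illustrates the output shapes and value constraints of each task.
rng = np.random.default_rng(0)

H, W, C = 8, 8, 3          # toy image size (height, width, RGB channels)
F = 16                     # width of the shared feature representation

W_shared = rng.normal(size=(C, F))
W_depth  = rng.normal(size=(F, 1))
W_normal = rng.normal(size=(F, 3))
W_alpha  = rng.normal(size=(F, 1))

def dense_predict(image):
    """Per-pixel multi-task prediction: depth, surface normals, soft alpha."""
    feats = np.maximum(image @ W_shared, 0.0)            # ReLU features, (H, W, F)
    depth = feats @ W_depth                              # unbounded scalar depth, (H, W, 1)
    normals = feats @ W_normal                           # raw normals, (H, W, 3)
    normals = normals / (np.linalg.norm(normals, axis=-1, keepdims=True) + 1e-8)
    alpha = 1.0 / (1.0 + np.exp(-(feats @ W_alpha)))     # sigmoid -> soft mask in [0, 1]
    return depth, normals, alpha

image = rng.uniform(size=(H, W, C))
depth, normals, alpha = dense_predict(image)
```

With perfectly labeled synthetic data, all three targets (ground-truth depth, normal, and alpha maps) come directly from the renderer, which is what makes dense supervision at every pixel feasible without manual annotation.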