🤖 AI Summary
Deploying visual programming, in which large language models (LLMs) generate executable code for visual tasks such as visual question answering (VQA), is challenging on resource-constrained edge devices because of the high computational cost of LLMs and the high annotation cost of adapting them.
Method: This paper proposes a visual program generation framework tailored to small-scale language models (under 1B parameters) that requires no human-written program annotations. Its core idea is "skill templating": decoupling visual programs into reusable structural templates (higher-level skills) and instance-specific arguments, so that template-driven synthetic data augmentation can replace costly human annotation. The synthetic data is then used to distill program-generation ability into the small model.
Contribution/Results: Using only a relatively small amount of question/answer data and no human-generated program annotations, the method significantly reduces adaptation overhead. Experiments show that small language models trained this way generate high-quality visual programs on VQA benchmarks while offering much faster inference, establishing an efficient, practical pathway for visual programming on edge devices.
📝 Abstract
For users with limited computational resources, visual programming, i.e., prompting large language models (LLMs) to generate executable code for visual tasks like visual question answering (VQA), remains largely inaccessible. Even with techniques such as distillation, adapting visual programming to smaller models or specific datasets is still quite challenging due to high annotation costs. We propose a low-cost visual program distillation method that can be used for models with fewer than 1 billion parameters and requires no human-generated program annotations. We achieve this through synthetic data augmentation based on decoupling programs into higher-level skills, called templates, and their corresponding arguments. Experimental results show that, with a relatively small amount of question/answer data, small language models can generate high-quality visual programs with the added benefit of much faster inference.
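To make the template/argument decoupling concrete, here is a minimal sketch of how a visual program might be split into a reusable template and sampled arguments to synthesize training pairs. All names here (`SkillTemplate`, `image.find`, the question patterns) are illustrative assumptions, not the paper's actual API:

```python
# Hypothetical sketch of decoupling visual programs into skill templates
# and arguments, then synthesizing (question, program) training pairs.
import random
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class SkillTemplate:
    """A reusable program skeleton (a 'skill') with named argument slots."""
    name: str
    question_pattern: str  # natural-language question with {slots}
    program_pattern: str   # executable code with the same {slots}

    def instantiate(self, args: Dict[str, str]) -> Tuple[str, str]:
        # Fill the question and the program with identical arguments,
        # yielding one synthetic (question, program) training pair.
        return (self.question_pattern.format(**args),
                self.program_pattern.format(**args))


# One illustrative skill: counting objects of a given category.
COUNT_TEMPLATE = SkillTemplate(
    name="count_objects",
    question_pattern="How many {obj} are in the image?",
    program_pattern="patches = image.find('{obj}')\nanswer = len(patches)",
)


def synthesize_pairs(template: SkillTemplate, vocab: List[str],
                     n: int, seed: int = 0) -> List[Tuple[str, str]]:
    """Generate n synthetic training pairs by sampling arguments,
    replacing human program annotation with template-driven augmentation."""
    rng = random.Random(seed)
    return [template.instantiate({"obj": rng.choice(vocab)})
            for _ in range(n)]


if __name__ == "__main__":
    for question, program in synthesize_pairs(
            COUNT_TEMPLATE, ["cats", "cars", "chairs"], n=3):
        print(question)
        print(program)
```

Because the template fixes the program's structure, only the arguments vary across synthetic examples; a small model fine-tuned on such pairs learns to map questions to programs without ever seeing human-written program annotations.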