🤖 AI Summary
Conventional edge detectors in industrial computer vision are not robust to noise, material variability, and non-ideal imaging conditions. This paper proposes a language-guided generative framework for remnant contour detection in manufacturing, combining a conditional generative adversarial network (cGAN) with multimodal vision-language models (GPT-image-1 and Gemini 2.0 Flash) to generate and refine contours with the aim of CAD-level precision. Standardized prompts, crafted in a human-in-the-loop process, are applied through image-text guided synthesis, reducing reliance on handcrafted feature engineering. Evaluated on proprietary FabTrack datasets, the method improves contour fidelity, edge continuity, and geometric alignment while reducing manual tracing effort. Under standardized conditions, GPT-image-1 consistently outperforms Gemini 2.0 Flash in structural accuracy and perceptual quality.
📝 Abstract
Industrial computer vision systems often struggle with noise, material variability, and uncontrolled imaging conditions, limiting the effectiveness of classical edge detectors and handcrafted pipelines. In this work, we present a language-guided generative vision system for remnant contour detection in manufacturing, designed to achieve CAD-level precision. The system is organized into three stages: data acquisition and preprocessing, contour generation using a conditional GAN, and multimodal contour refinement through vision-language modeling, where standardized prompts are crafted in a human-in-the-loop process and applied through image-text guided synthesis. On proprietary FabTrack datasets, the proposed system improved contour fidelity, enhancing edge continuity and geometric alignment while reducing manual tracing. For the refinement stage, we benchmarked several vision-language models, including Google's Gemini 2.0 Flash, OpenAI's GPT-image-1 integrated within a VLM-guided workflow, and open-source baselines. Under standardized conditions, GPT-image-1 consistently outperformed Gemini 2.0 Flash in both structural accuracy and perceptual quality. These findings demonstrate the promise of VLM-guided generative workflows for advancing industrial computer vision beyond the limitations of classical pipelines.