Fine-tuning a vision-language model for fracture-surface morphology recognition

📅 2026-05-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

General-purpose vision-language models lack the domain-specific visual knowledge needed to recognize fracture-surface morphologies in materials science. To close this gap, the authors fine-tune Qwen3-VL-32B-Instruct into a specialist model for fractographic analysis. The training set comprises 13,168 fracture-surface images mined from the scientific literature, with morphology annotations generated by GPT-5.2-Reasoning (high) from each image and relevant excerpts of its source paper; it is supplemented with manually collected images of rare features and with rotation-based augmentation. On a benchmark of 100 manually annotated images, the fine-tuned model reaches a precision of 0.92, compared with 0.35 for the base Qwen3-VL-32B-Instruct, 0.58 for GPT-5.5-Reasoning (high), and 0.78 for Gemini 3.1 Pro-Reasoning (high). Dataset ablations indicate that both the manual collection of rare-feature images and the rotation augmentation improve recognition of less common fracture features. The study also discusses using the fine-tuned model together with proprietary models to combine fracture-specific visual accuracy with broader multimodal reasoning.
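As a rough illustration of what a precision figure like those quoted above measures, the sketch below computes micro-averaged precision over per-image sets of predicted morphology features. The source does not say how precision is aggregated across images or features, so the micro-averaging, the function name `micro_precision`, and the example feature names are all assumptions made for illustration.

```python
def micro_precision(predicted, reference):
    """Micro-averaged precision: correct predicted features / all predicted features.

    `predicted` and `reference` are parallel lists with one list of feature
    names per image. This aggregation is an assumption, not the paper's
    stated protocol.
    """
    true_pos = 0
    false_pos = 0
    for pred, ref in zip(predicted, reference):
        pred_set, ref_set = set(pred), set(ref)
        true_pos += len(pred_set & ref_set)   # predicted and annotated
        false_pos += len(pred_set - ref_set)  # predicted but not annotated
    total = true_pos + false_pos
    return true_pos / total if total else 0.0


# Hypothetical predictions and annotations for two images.
predicted = [["dimples", "cleavage facets"], ["fatigue striations"]]
reference = [["dimples"], ["fatigue striations", "secondary cracks"]]
score = micro_precision(predicted, reference)
```

Under this definition, a feature the model misses (here "secondary cracks") does not lower the score; only features it predicts that the annotator did not mark count against it.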

📝 Abstract

Vision-language models (VLMs) have shown strong potential for scientific image understanding, but general-purpose models often lack the domain-specific visual knowledge required for reliable materials characterization. In this work, we fine-tuned an open-source VLM (Qwen3-VL-32B-Instruct) for fracture-surface image analysis using a curated dataset of 13,168 open-source, literature-mined fracture-surface images. Morphology annotations were generated by GPT-5.2-Reasoning (high) from both the images and relevant excerpts of their source papers, and the dataset was further enriched with targeted manual collection and rotation-based augmentation. The resulting specialist model outperforms flagship proprietary multimodal models on a benchmark of 100 manually annotated images. It achieves a precision of 0.92, compared to 0.35 for the base Qwen3-VL-32B-Instruct, 0.58 for GPT-5.5-Reasoning (high), and 0.78 for Gemini 3.1 Pro-Reasoning (high). Dataset ablations show that manual collection of rare-feature images and augmentation via image rotation both improve recognition of less common fracture-morphology features. We further discuss integrated use of the fine-tuned model with proprietary models to combine fracture-specific visual accuracy with broader multimodal reasoning for autonomous fractography. Although focused on fracture-surface images, this work demonstrates how VLMs can be adapted, through targeted collection of images showing novel features and fine-tuning on them, to recognize those features and support downstream decision-making in autonomous microscopy workflows.
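The rotation-based augmentation mentioned in the abstract can be pictured with a minimal sketch. It assumes rotations by multiples of 90 degrees and assumes that morphology labels carry over unchanged to the rotated copies; the abstract does not specify the angles used or the pipeline, and `rotate_augment` and the feature names are hypothetical.

```python
import numpy as np


def rotate_augment(image, labels, quarter_turns=(1, 2, 3)):
    """Return the original sample plus rotated copies that share its labels.

    Assumes 90-degree steps (an illustrative choice, not the paper's stated
    setting) and that morphology labels do not depend on image orientation.
    """
    samples = [(image, list(labels))]
    for k in quarter_turns:
        # np.rot90 rotates counter-clockwise by k * 90 degrees in the image plane.
        rotated = np.rot90(image, k=k, axes=(0, 1))
        samples.append((rotated, list(labels)))
    return samples


# Hypothetical 4x6 grayscale "micrograph" with made-up feature labels.
image = np.arange(24, dtype=np.uint8).reshape(4, 6)
augmented = rotate_augment(image, ["dimples", "cleavage facets"])
```

Each rare-feature image yields four training samples this way, which is one plausible reason rotation would help most for the less common morphologies.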

Problem

Research questions and friction points this paper is trying to address.

fracture-surface morphology

vision-language models

domain-specific visual knowledge

materials characterization

scientific image understanding

Innovation

Methods, ideas, or system contributions that make the work stand out.

vision-language model

fine-tuning

fracture-surface morphology