🤖 AI Summary
This study addresses the challenge of automating X-ray ptychography analysis in low-data regimes. We introduce PtychoBench, the first multimodal, multi-task benchmark for ptychography, and systematically compare two foundation-model adaptation paradigms: supervised fine-tuning (SFT) and in-context learning (ICL). Our empirical analysis reveals that the optimal strategy depends on task modality: for visual tasks, combining SFT with ICL achieves a Micro-F1 of 0.728; for textual tasks, ICL on a large base model outperforms a strong fine-tuned "super-expert" model (Micro-F1 0.847 vs. 0.839), demonstrating greater robustness and exposing contextual interference as a consistent limitation of fine-tuned models. These findings establish a "task-dependent adaptation path" paradigm, providing a reproducible methodology and empirical framework for efficiently adapting scientific AI agents in data-scarce settings.
📝 Abstract
The automation of workflows in advanced microscopy is a key goal, and foundation models such as Large Language Models (LLMs) and Vision-Language Models (VLMs) show great potential for it. However, adapting these general-purpose models to specialized scientific tasks is critical, and the optimal domain adaptation strategy is often unclear. To address this, we introduce PtychoBench, a new multi-modal, multi-task benchmark for ptychographic analysis. Using this benchmark, we systematically compare two specialization strategies: Supervised Fine-Tuning (SFT) and In-Context Learning (ICL). We evaluate these strategies on a visual artifact detection task with VLMs and a textual parameter recommendation task with LLMs in a data-scarce regime. Our findings reveal that the optimal specialization pathway is task-dependent. For the visual task, SFT and ICL are highly complementary: a fine-tuned model guided by context-aware examples achieves the highest mean performance (Micro-F1 of 0.728). Conversely, for the textual task, ICL on a large base model is the superior strategy, reaching a peak Micro-F1 of 0.847 and outperforming a powerful "super-expert" SFT model (0-shot Micro-F1 of 0.839). We also confirm the superiority of context-aware prompting and identify a consistent contextual interference phenomenon in fine-tuned models. Benchmarked against strong baselines, including GPT-4o and a DINOv3-based classifier, these results offer a key observation for AI in science: the optimal specialization path depends on task modality, providing a clear framework for developing more effective scientific agentic systems.