🤖 AI Summary
Efficient organization and semantic analysis of massive image collections remain challenging in digital forensics. Method: This paper proposes an automated framework integrating image clustering, image captioning, and large language models (LLMs). It groups images with K-means clustering, generates base captions with Azure AI Vision, and compares three description generation methods: TF-IDF weighting, template-based filling, and LLM-based generation. It also systematically evaluates image sampling strategies and prompting techniques (standard prompting versus chain-of-thought prompting). Contribution/Results: Description quality is assessed with two metrics, semantic similarity and coverage. Experiments show that descriptions derived from only 20 sampled images per cluster perform comparably to descriptions built from all cluster images while substantially reducing computational cost; only stratified sampling shows modest degradation. LLM-based methods consistently outperform the TF-IDF baseline, and standard prompting outperforms chain-of-thought prompting for this task.
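To make the TF-IDF baseline mentioned above concrete, the following is a minimal sketch of how a cluster could be described by its highest-weighted caption terms. It is not the paper's implementation: the pooling of all captions in a cluster into one document, the smoothed IDF formula, the small stopword list, and the names `tokenize` and `tfidf_top_terms` are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Assumption: a tiny illustrative stopword list, not taken from the paper.
STOPWORDS = {"a", "an", "the", "on", "in", "of", "with", "and"}


def tokenize(caption):
    """Lowercase a caption and keep alphabetic tokens that are not stopwords."""
    return [t for t in re.findall(r"[a-z]+", caption.lower()) if t not in STOPWORDS]


def tfidf_top_terms(cluster_captions, k=3):
    """Return the k highest TF-IDF terms per cluster.

    cluster_captions maps a cluster id to a list of caption strings.
    Assumption: each cluster's captions are pooled into a single document.
    """
    docs = {
        cid: Counter(t for cap in caps for t in tokenize(cap))
        for cid, caps in cluster_captions.items()
    }
    n_docs = len(docs)

    # Document frequency: in how many clusters does each term occur?
    df = Counter()
    for counts in docs.values():
        df.update(counts.keys())

    top = {}
    for cid, counts in docs.items():
        total = sum(counts.values()) or 1
        # Smoothed IDF (an assumed variant): log((1 + N) / (1 + df)) + 1.
        scores = {
            term: (count / total) * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in counts.items()
        }
        # Sort by descending score, then alphabetically so ties are deterministic.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        top[cid] = ranked[:k]
    return top


# Hypothetical captions standing in for captioning-API output.
example = {
    0: ["a dog running on grass", "a dog playing with a ball"],
    1: ["a car parked on a street", "a red car on the road"],
}
print(tfidf_top_terms(example, k=2))
```

A keyword list of this kind is cheap to compute but carries no sentence structure, which is one plausible reason a baseline like this would trail LLM-written descriptions.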
📝 Abstract
The rapid increase in digital image creation and retention presents substantial challenges during legal discovery, digital archiving, and content management. Corporations and legal teams must organize, analyze, and extract meaningful insights from large image collections under strict time pressures, making manual review impractical and costly. These demands have intensified interest in automated methods that can efficiently organize and describe large-scale image datasets. This paper presents a systematic investigation of automated cluster description generation through the integration of image clustering, image captioning, and large language models (LLMs). We apply K-means clustering to group images into 20 visually coherent clusters and generate base captions using the Azure AI Vision API. We then evaluate three critical dimensions of the cluster description process: (1) image sampling strategies, comparing random, centroid-based, stratified, hybrid, and density-based sampling against using all cluster images; (2) prompting techniques, contrasting standard prompting with chain-of-thought prompting; and (3) description generation methods, comparing LLM-based generation with traditional TF-IDF and template-based approaches. We assess description quality using semantic similarity and coverage metrics. Results show that strategic sampling with 20 images per cluster performs comparably to exhaustive inclusion while significantly reducing computational cost, with only stratified sampling showing modest degradation. LLM-based methods consistently outperform TF-IDF baselines, and standard prompts outperform chain-of-thought prompts for this task. These findings provide practical guidance for deploying scalable, accurate cluster description systems that support high-volume workflows in legal discovery and other domains requiring automated organization of large image collections.
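Two of the sampling strategies compared above, random and centroid-based, can be sketched as follows. This is a hedged illustration only: the abstract does not specify the image features, distance measure, or tie-breaking, so the use of plain feature vectors, Euclidean distance, and the function names `cluster_centroid`, `centroid_sample`, and `random_sample` are assumptions.

```python
import math
import random


def cluster_centroid(vectors):
    """Component-wise mean of the feature vectors in one cluster."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def centroid_sample(vectors, n):
    """Indices of the n vectors closest to the cluster centroid.

    Assumption: Euclidean distance in the feature space used for clustering.
    """
    center = cluster_centroid(vectors)
    order = sorted(range(len(vectors)), key=lambda i: math.dist(vectors[i], center))
    return order[:n]


def random_sample(vectors, n, seed=0):
    """Indices of n vectors drawn uniformly without replacement (seeded)."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(len(vectors)), min(n, len(vectors))))


# Hypothetical 2-D features for five images in one cluster.
features = [[0.0, 0.0], [2.0, 0.0], [4.0, 0.0], [10.0, 0.0], [4.0, 1.0]]
print(centroid_sample(features, 2))  # [2, 4]
print(random_sample(features, 3))
```

In a pipeline like the one described, the captions of the selected images (20 per cluster in the reported experiments) would then be passed to the description generator in place of the captions of every image in the cluster.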