Leveraging Machine Learning and Large Language Models for Automated Image Clustering and Description in Legal Discovery

📅 2025-12-08
🤖 AI Summary
Problem: Efficiently organizing and semantically analyzing massive image collections remains challenging in legal discovery. Method: This paper proposes an automated framework that integrates image clustering, caption generation, and large language models (LLMs). It applies K-means clustering, generates initial captions with Azure AI Vision, and compares three description strategies: TF-IDF weighting, template-based filling, and LLM-based generation. It systematically evaluates sampling scale (20 images per cluster suffices), prompting techniques (standard prompting outperforms chain-of-thought), and generation methods. Contribution/Results: Description quality is assessed with two metrics, semantic similarity and coverage. Experiments show that descriptions derived from only 20 sampled images per cluster perform comparably to descriptions built from all cluster images, substantially reducing computational cost. LLM-generated descriptions also consistently outperform the TF-IDF baseline, supporting lightweight sampling combined with standard prompting.
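To make the TF-IDF and template-based baselines concrete, here is a minimal sketch that treats each cluster's captions as one document, ranks terms by TF-IDF, and drops the top terms into a fixed sentence. The stopword list, the smoothed IDF formula, the template wording, and the function names are all illustrative assumptions; the paper's actual baseline configuration is not given in this summary.

```python
import math
import re
from collections import Counter

# Assumed, deliberately tiny stopword list for illustration only.
STOPWORDS = {"a", "an", "the", "of", "on", "in", "with", "and", "at", "to", "is"}


def tokenize(caption):
    """Lowercase a caption and keep alphabetic tokens that are not stopwords."""
    return [t for t in re.findall(r"[a-z]+", caption.lower()) if t not in STOPWORDS]


def top_terms_per_cluster(cluster_captions, n_terms=3):
    """Treat each cluster's captions as one document and rank its terms by TF-IDF."""
    term_counts = {
        cid: Counter(tok for cap in caps for tok in tokenize(cap))
        for cid, caps in cluster_captions.items()
    }
    n_clusters = len(term_counts)
    # Document frequency: number of clusters in which each term appears.
    doc_freq = Counter(term for counts in term_counts.values() for term in counts)
    ranked = {}
    for cid, counts in term_counts.items():
        total = sum(counts.values())
        scores = {
            # Term frequency times a smoothed inverse document frequency (assumed variant).
            term: (count / total) * (math.log((1 + n_clusters) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
        # Sort by descending score, then alphabetically so ties are deterministic.
        ordered = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        ranked[cid] = [term for term, _ in ordered][:n_terms]
    return ranked


def fill_template(terms):
    """Hypothetical template: slot the top terms into a fixed sentence."""
    return "Images mainly showing " + ", ".join(terms) + "."


if __name__ == "__main__":
    captions = {
        "c0": ["a dog running on grass", "a dog playing with a ball", "a brown dog in a park"],
        "c1": ["a document with text", "a scanned document on a table", "a page of text"],
    }
    for cid, terms in top_terms_per_cluster(captions).items():
        print(cid, fill_template(terms))
```

A baseline like this can only reuse words that already occur in the captions, which is one plausible reason an LLM that can abstract over them would score higher.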

📝 Abstract
The rapid increase in digital image creation and retention presents substantial challenges during legal discovery, digital archiving, and content management. Corporations and legal teams must organize, analyze, and extract meaningful insights from large image collections under strict time pressures, making manual review impractical and costly. These demands have intensified interest in automated methods that can efficiently organize and describe large-scale image datasets. This paper presents a systematic investigation of automated cluster description generation through the integration of image clustering, image captioning, and large language models (LLMs). We apply K-means clustering to group images into 20 visually coherent clusters and generate base captions using the Azure AI Vision API. We then evaluate three critical dimensions of the cluster description process: (1) image sampling strategies, comparing random, centroid-based, stratified, hybrid, and density-based sampling against using all cluster images; (2) prompting techniques, contrasting standard prompting with chain-of-thought prompting; and (3) description generation methods, comparing LLM-based generation with traditional TF-IDF and template-based approaches. We assess description quality using semantic similarity and coverage metrics. Results show that strategic sampling with 20 images per cluster performs comparably to exhaustive inclusion while significantly reducing computational cost, with only stratified sampling showing modest degradation. LLM-based methods consistently outperform TF-IDF baselines, and standard prompts outperform chain-of-thought prompts for this task. These findings provide practical guidance for deploying scalable, accurate cluster description systems that support high-volume workflows in legal discovery and other domains requiring automated organization of large image collections.
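Two of the sampling strategies named in the abstract, random and centroid-based, can be sketched as follows. This assumes each image in a cluster is represented by a numeric feature vector and that centroid-based sampling means keeping the images closest to the cluster mean; the function names and the toy vectors are hypothetical, and the paper's exact procedure may differ.

```python
import math
import random


def centroid(vectors):
    """Component-wise mean of the feature vectors in one cluster."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def centroid_sample(vectors, k=20):
    """Indices of the k vectors closest (Euclidean distance) to the cluster centroid."""
    center = centroid(vectors)
    order = sorted(range(len(vectors)), key=lambda i: math.dist(vectors[i], center))
    return order[:k]


def random_sample(vectors, k=20, seed=0):
    """Indices of k vectors drawn uniformly without replacement (seeded for repeatability)."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(len(vectors)), min(k, len(vectors))))


if __name__ == "__main__":
    # Toy 2-D "embeddings"; index 2 is an outlier far from the rest.
    cluster = [[0, 0], [1, 1], [10, 10], [1, 0], [0, 1]]
    print("centroid-based:", centroid_sample(cluster, k=3))
    print("random:", random_sample(cluster, k=3))
```

The captions of the selected images, rather than all captions in the cluster, would then be passed to the description step, which is where the reduction in computational cost comes from.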
Problem

Research questions and friction points this paper is trying to address.

Automates organization and description of large legal image collections
Evaluates sampling strategies and prompting techniques for image clustering
Develops scalable methods for legal discovery workflows using AI
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates image clustering, captioning, and large language models
Evaluates sampling strategies, prompting techniques, and generation methods
Uses strategic sampling and standard prompts for efficient descriptions
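The difference between the two prompting styles compared in the paper can be illustrated with a pair of prompt builders. The wording of both prompts is hypothetical, written only to show the structural difference; the authors' actual prompts are not reproduced in this summary, and no LLM call is made here.

```python
def _format_captions(captions):
    """Render sampled captions as a bulleted block for inclusion in a prompt."""
    return "\n".join(f"- {caption}" for caption in captions)


def standard_prompt(captions):
    """Hypothetical standard prompt: ask directly for the cluster description."""
    return (
        "The following captions describe images from one cluster:\n"
        f"{_format_captions(captions)}\n"
        "Write one concise sentence describing what these images have in common."
    )


def chain_of_thought_prompt(captions):
    """Hypothetical chain-of-thought prompt: ask for intermediate reasoning first."""
    return (
        "The following captions describe images from one cluster:\n"
        f"{_format_captions(captions)}\n"
        "Think step by step: first list the recurring objects, scenes, and themes, "
        "then write one concise sentence describing what these images have in common."
    )


if __name__ == "__main__":
    sampled = ["a dog running on grass", "a brown dog in a park"]
    print(standard_prompt(sampled))
    print()
    print(chain_of_thought_prompt(sampled))
```

Either string would be sent to the LLM together with the captions of the sampled images; the reported result is that the shorter, direct form produced better descriptions for this task.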
Qiang Mao
Legal Technology & Data Analytics, Ankura Consulting Group, LLC, Washington, D.C. USA
Fusheng Wei
Legal Technology & Data Analytics, Ankura Consulting Group, LLC, Washington, D.C. USA
Robert Neary
Legal Technology & Data Analytics, Ankura Consulting Group, LLC, Washington, D.C. USA
Charles Wang
Professor/Director, Center for Genomics, Loma Linda University
Han Qin
Ankura Consulting Group, LLC.
Geospatial, AI, Legal
Jianping Zhang
Legal Technology & Data Analytics, Ankura Consulting Group, LLC, Washington, D.C. USA
Nathaniel Huber-Fliflet
Legal Technology & Data Analytics, Ankura Consulting Group, LLC, London, UK