VLAgeBench: Benchmarking Large Vision-Language Models for Zero-Shot Human Age Estimation

📅 2026-03-26
🤖 AI Summary
This study investigates the feasibility of large vision-language models (LVLMs) for face age estimation under zero-shot settings without labeled training data. We systematically evaluate state-of-the-art models—including GPT-4o, Claude 3.5 Sonnet, and LLaMA 3.2 Vision—on the UTKFace and FG-NET datasets using eight metrics such as mean absolute error (MAE), coefficient of determination (R²), concordance correlation coefficient (CCC), and accuracy within ±5 years. Our analysis further examines the impact of image quality and demographic subgroups. We establish the first reproducible benchmark for zero-shot face age estimation, revealing both the promising capabilities and fairness challenges of general-purpose multimodal models in biometric applications. Results demonstrate that LVLMs can achieve competitive performance in zero-shot scenarios, yet remain constrained by prompt sensitivity, computational cost, and disparities across demographic groups.
📝 Abstract
Human age estimation from facial images is a challenging computer vision task with significant applications in biometrics, healthcare, and human-computer interaction. While traditional deep learning approaches require extensive labeled datasets and domain-specific training, recent advances in large vision-language models (LVLMs) offer the potential for zero-shot age estimation. This study presents a comprehensive zero-shot evaluation of state-of-the-art LVLMs for facial age estimation, a task traditionally dominated by domain-specific convolutional networks and supervised learning. We assess GPT-4o, Claude 3.5 Sonnet, and LLaMA 3.2 Vision on two benchmark datasets, UTKFace and FG-NET, without any fine-tuning or task-specific adaptation. Using eight evaluation metrics (MAE, MSE, RMSE, MAPE, MBE, $R^2$, CCC, and $\pm$5-year accuracy), we demonstrate that general-purpose LVLMs can deliver competitive performance under strict zero-shot inference. We also observe performance disparities linked to image quality and demographic subgroups, underscoring the need for fairness-aware multimodal inference. This work introduces a reproducible benchmark, positions LVLMs as promising tools for real-world applications in forensic science, healthcare monitoring, and human-computer interaction, and highlights remaining challenges related to prompt sensitivity, interpretability, computational cost, and demographic fairness.
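The eight evaluation metrics named in the abstract are all standard regression/agreement measures. As a minimal sketch (our own NumPy implementation for illustration, not code released with the paper), they could be computed as:

```python
# Hypothetical sketch of the eight metrics used in the benchmark:
# MAE, MSE, RMSE, MAPE, MBE, R^2, CCC, and ±5-year accuracy.
import numpy as np

def age_metrics(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    mae = np.mean(np.abs(err))                       # mean absolute error
    mse = np.mean(err ** 2)                          # mean squared error
    rmse = np.sqrt(mse)                              # root mean squared error
    mape = np.mean(np.abs(err) / y_true) * 100       # percent error; assumes ages > 0
    mbe = np.mean(err)                               # mean bias error (signed)
    # Coefficient of determination.
    r2 = 1 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    # Concordance correlation coefficient (Lin, 1989), population statistics.
    cov = np.cov(y_true, y_pred, bias=True)[0, 1]
    ccc = 2 * cov / (y_true.var() + y_pred.var()
                     + (y_true.mean() - y_pred.mean()) ** 2)
    acc5 = np.mean(np.abs(err) <= 5)                 # fraction within ±5 years
    return {"mae": mae, "mse": mse, "rmse": rmse, "mape": mape,
            "mbe": mbe, "r2": r2, "ccc": ccc, "acc5": acc5}
```

Unlike $R^2$, the CCC penalizes both scale and location shifts between predictions and ground truth, which is why the paper reports both.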
Problem

Research questions and friction points this paper is trying to address.

zero-shot age estimation
large vision-language models
facial age estimation
demographic fairness
biometric inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

zero-shot age estimation
large vision-language models
biometric fairness
VLAgeBench
multimodal inference
Rakib Hossain Sajib
Department of Computer Science and Engineering, Begum Rokeya University, Rangpur, Bangladesh
Md Kishor Morol
Assistant Professor of CS
Explainable AI, Deep Learning, Natural Language Processing, Medical Image Processing
Rajan Das Gupta
B.Sc in CSE (AIUB), M.Sc in CS (JU)
Health Informatics, AI in Healthcare, Computer Vision, LLM, NLP
Mohammad Sakib Mahmood
EliteLab.AI, Queens, New York, United States
Shuvra Smaran Das
EliteLab.AI, Queens, New York, United States