Can Vision-Language Models Understand Construction Workers? An Exploratory Study

📅 2026-01-15
🏛️ IEEE Access
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the challenge of scarce labeled data in construction site monitoring by systematically evaluating the zero-shot and few-shot capabilities of general-purpose vision-language models (GPT-4o, Florence 2, and LLaVA-1.5) in recognizing human behaviors and emotions from static construction site images. The models are benchmarked on a dataset of 1,000 images through a standardized inference pipeline and multiple metrics, including F1-score, accuracy, and confusion matrices. Results show that GPT-4o significantly outperforms the others, achieving F1-scores of 0.756 for behavior recognition and 0.712 for emotion recognition. However, all models struggle to distinguish semantically similar categories, highlighting both the potential and the limitations of off-the-shelf vision-language models in domain-specific, safety-critical environments such as construction sites.
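
The paper's standardized inference pipeline is not reproduced here, but a minimal zero-shot sketch, assuming the official OpenAI Python SDK and a prompt of our own devising, might look like the following. Only two of the ten action categories are named in the abstract; the rest of the label list is a placeholder.

```python
import base64
from openai import OpenAI  # official OpenAI Python SDK (v1+)

# Only these two action labels are named in the abstract; the remaining
# eight categories in the paper's taxonomy are not listed here.
ACTIONS = [
    "Collaborating in teams",
    "Communicating with supervisors",
    # ... eight further action categories defined in the paper
]

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def classify_action(image_path: str) -> str:
    """Zero-shot action label for one static site image via GPT-4o."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")
    resp = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,  # deterministic decoding for reproducible benchmarking
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Classify the construction worker's action in this "
                         "image. Reply with exactly one label from: "
                         + "; ".join(ACTIONS)},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content.strip()
```

The same loop would be repeated per model and per task (action, emotion), with the returned string mapped back onto the fixed category set before scoring.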

📝 Abstract
As robots become increasingly integrated into construction workflows, their ability to interpret and respond to human behavior will be essential for safe and effective collaboration. Vision-Language Models (VLMs) have emerged as a promising tool for visual understanding tasks and offer the potential to recognize human behaviors without extensive domain-specific training. This capability makes them particularly appealing in the construction domain, where labeled data is scarce and monitoring worker actions and emotional states is critical for safety and productivity. In this study, we evaluate the performance of three leading VLMs (GPT-4o, Florence 2, and LLaVA-1.5) in detecting construction worker actions and emotions from static site images. Using a curated dataset of 1,000 images annotated across ten action and ten emotion categories, we assess each model’s outputs through standardized inference pipelines and multiple evaluation metrics. GPT-4o consistently achieved the highest scores on both tasks, with an average F1-score of 0.756 and accuracy of 0.799 in action recognition, and an F1-score of 0.712 and accuracy of 0.773 in emotion recognition. Florence 2 performed moderately, with F1-scores of 0.497 (action) and 0.414 (emotion), while LLaVA-1.5 showed the lowest overall performance (F1-scores of 0.466 for action and 0.461 for emotion). Confusion matrix analyses revealed that all models struggled to distinguish semantically close categories, such as “Collaborating in teams” versus “Communicating with supervisors” or “Focused” versus “Determined”, highlighting the limitations of current VLMs on visually nuanced, domain-specific tasks. While the results indicate that general-purpose VLMs can offer a baseline capability for human behavior recognition in construction environments, further improvements, such as domain adaptation, temporal modeling, or multimodal sensing, may be needed for real-world reliability. This study provides an initial benchmark and practical insights for deploying human-aware AI systems in complex, safety-critical settings.
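
The reported metrics map directly onto standard scikit-learn calls. A sketch of the per-task scoring follows, assuming macro-averaged F1 over the ten labels; the averaging scheme is our assumption, as the abstract only says "average F1-score".

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score


def score_task(y_true: list[str], y_pred: list[str], labels: list[str]) -> dict:
    """Accuracy, F1, and confusion matrix for one task (action or emotion)."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        # Macro averaging weights all ten categories equally; the paper's
        # "average F1-score" may use a different scheme.
        "f1": f1_score(y_true, y_pred, average="macro", labels=labels),
        # Rows are ground-truth labels, columns are predictions; large
        # off-diagonal cells between close pairs such as "Focused" vs.
        # "Determined" correspond to the confusions discussed above.
        "confusion_matrix": confusion_matrix(y_true, y_pred, labels=labels),
    }
```

Inspecting the confusion matrix row-by-row is what surfaces the semantically close pairs the abstract highlights, since high accuracy alone can mask systematic swaps between two neighboring categories.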
Problem

Research questions and friction points this paper is trying to address.

Vision-Language Models
construction workers
action recognition
emotion recognition
human behavior understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language Models
construction worker behavior
emotion recognition
zero-shot inference
human-robot collaboration