VLCE: A Knowledge-Enhanced Framework for Image Description in Disaster Assessment

📅 2025-09-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing disaster image analysis methods produce only coarse-grained labels or segmentation masks, limiting deep situational understanding. To address this, we propose a vision-language framework for generating rich, context-aware disaster image descriptions by integrating external semantic knowledge. The framework employs a dual-path architecture, a ResNet50-based CNN-LSTM for satellite imagery and a ViT for UAV imagery, each tailored to its sensor modality. ConceptNet and WordNet are incorporated to broaden lexical coverage and improve descriptive accuracy. Semantic alignment and caption informativeness are evaluated with CLIPScore and InfoMetIC, respectively. On the disaster captioning task, the framework attains up to 95.33% on InfoMetIC, substantially outperforming strong baselines including LLaVA and QwenVL while maintaining competitive semantic alignment. The approach delivers more accurate, operationally actionable, and contextually grounded automated disaster narratives.
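The dual-path idea in the summary can be sketched as a simple modality dispatcher: satellite frames go to a CNN-LSTM captioner and UAV frames to a ViT captioner. This is a minimal illustration only; the captioner internals are stubbed out, and the function names are hypothetical, not taken from the paper's code.

```python
# Hedged sketch of a dual-path captioning pipeline. The two model stubs
# stand in for the ResNet50 CNN-LSTM (xBD satellite data) and the ViT
# (RescueNet UAV data) described in the summary; names are illustrative.

def cnn_lstm_caption(image_id):
    # Stand-in for the ResNet50-backed CNN-LSTM satellite captioner.
    return f"satellite caption for {image_id}"

def vit_caption(image_id):
    # Stand-in for the ViT-based UAV captioner.
    return f"uav caption for {image_id}"

def describe(image_id, modality):
    """Route an image to the captioner matching its sensor modality."""
    if modality == "satellite":
        return cnn_lstm_caption(image_id)
    if modality == "uav":
        return vit_caption(image_id)
    raise ValueError(f"unknown modality: {modality}")
```

The routing step is the point: each modality gets an architecture suited to its imagery rather than one shared model.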

📝 Abstract
Immediate damage assessment is essential after natural catastrophes, yet conventional manual evaluation methods are slow and hazardous. Although satellite and unmanned aerial vehicle (UAV) imagery offers broad views of affected regions, current computer vision approaches generally yield only classification labels or segmentation masks, limiting their capacity to deliver thorough situational understanding. We introduce the Vision Language Caption Enhancer (VLCE), a multimodal system designed to produce comprehensive, contextually informed descriptions of disaster imagery. VLCE employs a dual-architecture approach: a CNN-LSTM model with a ResNet50 backbone pretrained on EuroSAT satellite imagery for the xBD dataset, and a Vision Transformer (ViT) model pretrained on UAV imagery for the RescueNet dataset. Both systems draw on external semantic knowledge from ConceptNet and WordNet to expand vocabulary coverage and improve description accuracy. We compare VLCE against leading vision-language models (LLaVA and QwenVL) using CLIPScore for semantic alignment and InfoMetIC for caption informativeness. Experimental findings indicate that VLCE markedly surpasses the baseline models, attaining a maximum of 95.33% on InfoMetIC while preserving competitive semantic alignment. Our dual-architecture system demonstrates significant potential for improving disaster damage assessment by automating the production of actionable, information-dense descriptions from satellite and drone imagery.
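The CLIPScore metric mentioned in the abstract has a simple published form: a reference-free score computed as a scaled, clipped cosine similarity between the image and caption embeddings, CLIPScore = w · max(cos(image, text), 0) with w = 2.5 (Hessel et al., 2021). A minimal sketch on toy embedding vectors (the real metric uses CLIP's learned encoders to produce the embeddings):

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def clipscore(image_emb, text_emb, w=2.5):
    # CLIPScore = w * max(cos(image, text), 0); negative similarity
    # is clipped to zero so the score is non-negative.
    return w * max(cosine(image_emb, text_emb), 0.0)
```

For example, identical embeddings score the maximum of 2.5, while orthogonal or opposed embeddings score 0.0.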
Problem

Research questions and friction points this paper is trying to address.

Automating damage assessment from disaster imagery
Enhancing image descriptions with external knowledge
Generating comprehensive captions for satellite and drone photos
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-architecture CNN-LSTM and ViT models
External semantic knowledge from ConceptNet and WordNet
Automated generation of information-dense disaster descriptions
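The knowledge-enhancement idea behind the second bullet can be illustrated with a toy vocabulary-expansion step: caption tokens are enriched with related terms from an external resource. Here a small hand-made synonym map stands in for real ConceptNet/WordNet lookups; both the map entries and the function name are hypothetical illustrations, not the paper's implementation.

```python
# Toy stand-in for ConceptNet/WordNet lookups: a hand-made relation map.
# The real framework queries those knowledge bases; this dict is illustrative.
SYNONYMS = {
    "flood": {"inundation", "deluge"},
    "damage": {"destruction", "devastation"},
}

def expand_vocabulary(tokens):
    """Return the caption vocabulary enriched with known related terms."""
    expanded = set(tokens)
    for token in tokens:
        expanded |= SYNONYMS.get(token, set())
    return expanded
```

Tokens without a knowledge-base entry pass through unchanged, so expansion only ever adds candidate words to the captioner's working vocabulary.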