🤖 AI Summary
Climate change accelerates the degradation of cultural heritage sites, yet conventional single-modal monitoring fails to capture the complex interplay between environmental stressors and material deterioration. To address this, the authors propose a lightweight multimodal fusion architecture that combines environmental sensor data (temperature and humidity) with visual imagery to assess degradation under few-shot conditions. The approach simplifies PerceiverIO to a 64-dimensional latent space to mitigate overfitting and introduces an Adaptive Barlow Twins loss that encourages cross-modal complementarity while suppressing redundancy. A systematic hyperparameter search tunes the strength of cross-modal alignment. Evaluated on the Strasbourg Cathedral dataset, the method achieves 76.9% accuracy, a 43% improvement over standard multimodal baselines, and exceeds the sensor-only and image-only models by 15.4 and 30.7 percentage points, respectively, suggesting its usefulness for intelligent heritage conservation.
📝 Abstract
Cultural heritage sites face accelerating degradation due to climate change, yet traditional monitoring relies on unimodal analysis (visual inspection or environmental sensors alone) that fails to capture the complex interplay between environmental stressors and material deterioration. We propose a lightweight multimodal architecture that fuses sensor data (temperature, humidity) with visual imagery to predict degradation severity at heritage sites. Our approach adapts PerceiverIO with two key innovations: (1) simplified encoders (64D latent space) that prevent overfitting on small datasets (n = 37 training samples), and (2) an Adaptive Barlow Twins loss that encourages modality complementarity rather than redundancy. On data from Strasbourg Cathedral, our model achieves 76.9% accuracy, a 43% improvement over standard multimodal architectures (VisualBERT, Transformer) and 25% over vanilla PerceiverIO. Ablation studies show that the sensor-only model achieves 61.5% and the image-only model reaches 46.2%, confirming successful multimodal synergy. A systematic hyperparameter study identifies an optimal moderate correlation target (τ = 0.3) that balances alignment and complementarity, achieving 69.2% accuracy compared to other τ values (τ = 0.1/0.5/0.7: 53.8%; τ = 0.9: 61.5%). This work demonstrates that architectural simplicity combined with contrastive regularization enables effective multimodal learning in data-scarce heritage monitoring contexts, providing a foundation for AI-driven conservation decision support systems.
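The abstract does not give the exact form of the Adaptive Barlow Twins loss, so the following is only a minimal sketch of one plausible reading: the standard Barlow Twins objective with the diagonal of the cross-correlation matrix pulled toward a target τ instead of 1. The function name `adaptive_barlow_twins_loss`, the `off_diag_weight` parameter and its default, and the use of NumPy are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def adaptive_barlow_twins_loss(z_sensor, z_image, tau=0.3,
                               off_diag_weight=5e-3, eps=1e-8):
    """Hypothetical tau-targeted Barlow Twins loss (illustrative only).

    z_sensor, z_image: arrays of shape (batch, dim), one embedding per
    modality for the same samples. tau is the target correlation for
    matching feature dimensions (the abstract reports tau = 0.3 as best).
    """
    z_sensor = np.asarray(z_sensor, dtype=np.float64)
    z_image = np.asarray(z_image, dtype=np.float64)
    if z_sensor.ndim != 2 or z_sensor.shape != z_image.shape:
        raise ValueError("expected two arrays of identical shape (batch, dim)")

    n = z_sensor.shape[0]
    # Standardize every feature dimension over the batch.
    zs = (z_sensor - z_sensor.mean(axis=0)) / (z_sensor.std(axis=0) + eps)
    zi = (z_image - z_image.mean(axis=0)) / (z_image.std(axis=0) + eps)

    # Cross-correlation matrix between the two modalities, shape (dim, dim).
    c = zs.T @ zi / n
    diag = np.diag(c)

    # Assumption: matching dimensions are pulled toward tau rather than 1,
    # so the modalities align only partially and can stay complementary.
    on_diag = np.sum((diag - tau) ** 2)
    # Redundancy reduction: penalize correlation between different dimensions.
    off_diag = np.sum(c ** 2) - np.sum(diag ** 2)
    return float(on_diag + off_diag_weight * off_diag)


# Usage with random stand-ins for 64-dimensional embeddings of 37 samples.
rng = np.random.default_rng(0)
sensor_emb = rng.standard_normal((37, 64))
image_emb = rng.standard_normal((37, 64))
loss = adaptive_barlow_twins_loss(sensor_emb, image_emb, tau=0.3)
```

Under this reading, τ = 1 recovers the usual Barlow Twins alignment term, while smaller values tolerate partially decorrelated modalities. Whether the paper's loss takes this form, or adapts τ in some other way, cannot be determined from the abstract alone.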