🤖 AI Summary
To address insufficient robustness in multi-scene recognition under complex marine environments, this paper proposes a lightweight multimodal AI framework. Methodologically, we introduce the first marine-oriented multimodal semantic fusion mechanism, jointly leveraging visual features, textual scene descriptions, and classification vectors generated by multimodal large language models (MLLMs); additionally, we incorporate activation-aware weight quantization (AWQ) to achieve cross-modal feature alignment and efficient model compression. Experimental results demonstrate that the framework achieves 98.0% accuracy on marine multi-scene recognition—outperforming state-of-the-art (SOTA) models by 3.5 percentage points—while compressing the model size to 68.75 MB with only a 0.5% accuracy drop. This substantial reduction in computational overhead significantly facilitates edge deployment, providing critical technical support for intelligent marine robots in environmental monitoring, ecological conservation, and emergency response applications.
📝 Abstract
Maritime Multi-Scene Recognition is crucial for enhancing the capabilities of intelligent marine robotics, particularly in applications such as marine conservation, environmental monitoring, and disaster response. However, this task presents significant challenges due to environmental interference, where marine conditions degrade image quality, and the complexity of maritime scenes, which requires deeper reasoning for accurate recognition. Pure vision models alone are insufficient to address these issues. To overcome these limitations, we propose a novel multimodal Artificial Intelligence (AI) framework that integrates image data, textual descriptions and classification vectors generated by a Multimodal Large Language Model (MLLM), to provide richer semantic understanding and improve recognition accuracy. Our framework employs an efficient multimodal fusion mechanism to further enhance model robustness and adaptability in complex maritime environments. Experimental results show that our model achieves 98$%$ accuracy, surpassing previous SOTA models by 3.5$%$. To optimize deployment on resource-constrained platforms, we adopt activation-aware weight quantization (AWQ) as a lightweight technique, reducing the model size to 68.75MB with only a 0.5$%$ accuracy drop while significantly lowering computational overhead. This work provides a high-performance solution for real-time maritime scene recognition, enabling Autonomous Surface Vehicles (ASVs) to support environmental monitoring and disaster response in resource-limited settings.