🤖 AI Summary
Problem: The absence of standardized, quantitative benchmarks for evaluating AI-based security analysis in IoT/IIoT hinders rigorous comparison of model effectiveness. Method: This paper proposes the first end-to-end attack analysis and mitigation framework integrating machine learning and large language models (LLMs), featuring structured role-play prompt engineering, retrieval-augmented generation (RAG), and an ensemble of judge LLMs that independently assess the responses. Contribution/Results: It establishes the first quantitative evaluation metric system targeting attack identification, behavioral analysis, and mitigation recommendation. Experiments on Edge-IIoTset and CICIoT2023 show that Random Forest is the best-performing detection model, while ChatGPT-o3 outperforms DeepSeek-R1 in attack analysis and mitigation recommendation. This work introduces a reproducible, scalable paradigm for AI-driven IIoT security assessment.
📝 Abstract
The Internet of Things has expanded rapidly, transforming communication and operations across industries but also enlarging the attack surface and increasing the number of security breaches. Artificial Intelligence plays a key role in securing IoT, enabling attack detection, attack behavior analysis, and mitigation suggestion. Despite advancements, evaluations remain purely qualitative, and the lack of a standardized, objective benchmark for quantitatively measuring AI-based attack analysis and mitigation hinders consistent assessment of model effectiveness. In this work, we propose a hybrid framework combining Machine Learning (ML) for multi-class attack detection with Large Language Models (LLMs) for attack behavior analysis and mitigation suggestion. After benchmarking several ML and Deep Learning (DL) classifiers on the Edge-IIoTset and CICIoT2023 datasets, we apply structured role-play prompt engineering with Retrieval-Augmented Generation (RAG) to guide ChatGPT-o3 and DeepSeek-R1 in producing detailed, context-aware responses. We introduce novel evaluation metrics for quantitative assessment, which guide both our own evaluation and an ensemble of judge LLMs, namely ChatGPT-4o, DeepSeek-V3, Mixtral 8x7B Instruct, Gemini 2.5 Flash, Meta Llama 4, TII Falcon H1 34B Instruct, xAI Grok 3, and Claude 4 Sonnet, that independently evaluate the responses. Results show that Random Forest is the best detection model and that ChatGPT-o3 outperforms DeepSeek-R1 in attack analysis and mitigation.
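The judge-ensemble step can be illustrated with a minimal sketch. The abstract does not state how the judges' independent scores are combined, so the plain averaging below is an assumption, as are the function name `aggregate_judge_scores`, the criterion names, the 0-10 scale, and the judge and model labels. The numbers are placeholders, not results from the paper.

```python
from statistics import mean


def aggregate_judge_scores(scores):
    """Average independent judge scores per candidate model and criterion.

    `scores` maps judge -> model -> criterion -> numeric score.
    Returns model -> criterion -> mean score across judges.
    Plain averaging is an assumption, not the paper's stated method.
    """
    collected = {}
    for judge_scores in scores.values():
        for model, per_criterion in judge_scores.items():
            for criterion, value in per_criterion.items():
                collected.setdefault(model, {}).setdefault(criterion, []).append(value)
    return {
        model: {criterion: mean(values) for criterion, values in per_criterion.items()}
        for model, per_criterion in collected.items()
    }


# Placeholder scores on an assumed 0-10 scale (hypothetical judges and models).
placeholder_scores = {
    "judge_1": {
        "model_a": {"analysis": 8, "mitigation": 7},
        "model_b": {"analysis": 6, "mitigation": 7},
    },
    "judge_2": {
        "model_a": {"analysis": 9, "mitigation": 8},
        "model_b": {"analysis": 7, "mitigation": 6},
    },
}

averaged = aggregate_judge_scores(placeholder_scores)
for model, per_criterion in averaged.items():
    print(model, per_criterion)
```

Each judge's scores are kept separate until the final averaging step, which mirrors the independent evaluation described in the abstract. A median or a weighted mean could be substituted if some judges are considered more reliable.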