🤖 AI Summary
AI-powered roadside cameras in vehicle-infrastructure cooperative systems pose privacy risks—e.g., re-identifying individuals via clothing or other visual cues. Method: We propose a semantic-level privacy-preserving paradigm that converts sensitive visual inputs into semantically equivalent, non-re-identifiable textual descriptions in real time. Our approach introduces a novel hierarchical text generation framework integrating feedback-driven reinforcement learning with vision-language models (VLMs), incorporating semantic alignment constraints and multi-stage policy optimization to overcome the limitations of conventional pixel-level anonymization. Contribution/Results: Experiments demonstrate a 77% increase in lexical diversity (unique word count) and ~50% higher detail density in generated descriptions. Crucially, the method fully eliminates re-identifiable features—including faces and apparel—while preserving performance on downstream tasks such as traffic violation detection, thereby advancing privacy protection from “unobservable” to “unidentifiable.”
📝 Abstract
Connected and Autonomous Vehicles (CAVs) rely on a range of devices that often process privacy-sensitive data. Among these, roadside units play a critical role, particularly through the use of AI-equipped (AIE) cameras for applications such as violation detection. However, the privacy risks associated with captured imagery remain a major concern, as such data can be misused for identity theft, profiling, or unauthorized commercial purposes. Although traditional techniques such as face blurring and obfuscation have been applied to mitigate these risks, privacy remains vulnerable: individuals can still be tracked via other features, such as their clothing. This paper introduces a novel privacy-preserving framework that leverages feedback-based reinforcement learning (RL) and vision-language models (VLMs) to protect sensitive visual information captured by AIE cameras. The main idea is to convert images into semantically equivalent textual descriptions, retaining scene-relevant information while preserving visual privacy. A hierarchical RL strategy iteratively refines the generated text, improving both semantic accuracy and privacy. Evaluation results show significant improvements in both privacy protection and textual quality, with the Unique Word Count increasing by approximately 77% and Detail Density by around 50% compared to existing approaches.
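To make the feedback-refinement idea concrete, here is a deliberately minimal toy sketch (not the paper's implementation): a caption produced by a hypothetical VLM is iteratively rewritten so that re-identifiable cues (faces, clothing) are replaced with generic terms while the scene semantics needed for violation detection survive. The `REIDENTIFIABLE` lexicon, the `privacy_reward` function, and the greedy one-edit-per-step loop are all illustrative stand-ins for the paper's hierarchical RL policy and reward model.

```python
# Toy illustration of feedback-driven caption refinement.
# All names below (REIDENTIFIABLE, privacy_reward, refine) are hypothetical
# stand-ins, not part of the paper's actual framework.

# Hypothetical lexicon: surface forms that could allow re-identification,
# mapped to privacy-neutral replacements.
REIDENTIFIABLE = {
    "red jacket": "coat",
    "bearded man": "pedestrian",
    "blonde woman": "pedestrian",
}

def privacy_reward(caption: str) -> float:
    """Reward = 1 minus the fraction of re-identifiable phrases present."""
    hits = sum(1 for phrase in REIDENTIFIABLE if phrase in caption)
    return 1.0 - hits / max(len(REIDENTIFIABLE), 1)

def refine(caption: str, max_steps: int = 5) -> str:
    """Greedy feedback loop: while the privacy reward is below 1.0,
    apply one corrective edit per step (a stand-in for an RL policy update)."""
    for _ in range(max_steps):
        if privacy_reward(caption) >= 1.0:
            break
        for phrase, generic in REIDENTIFIABLE.items():
            if phrase in caption:
                caption = caption.replace(phrase, generic)
                break  # one edit per feedback step
    return caption

raw = "A bearded man in a red jacket crosses against a red light."
print(refine(raw))
# → A pedestrian in a coat crosses against a red light.
```

Note that the downstream-task signal is preserved by construction here: the violation phrase ("crosses against a red light") is untouched, while both re-identifiable attributes are neutralized. In the actual framework, a learned reward model would play the role of this hand-written lexicon.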