🤖 AI Summary
This study evaluates the real-world effectiveness of platform-level parental control mechanisms in mainstream conversational assistants used by minors, focusing on failures in high-risk content detection and parental notification. Using a two-stage protocol to construct a category-balanced dialogue corpus, the authors replay prompts through real child accounts, combining API-level iterative prompting with consumer-UI interactions while monitoring for parental alerts across seven risk categories. Leveraging PAIR-style prompt optimization, automated evaluators, and targeted human review, the study reports multidimensional metrics: notification rate, leak-through (false-negative) rate, over-blocking rate, and UI intervention rate. Findings reveal that current systems entirely fail to trigger parental alerts for high-risk content such as privacy violence, fraud, hate speech, and malware, while frequently over-blocking benign educational content without notifying parents, exposing a critical disconnect between backend moderation policies and parental visibility.
📝 Abstract
We evaluate how effectively platform-level parental controls moderate a mainstream conversational assistant used by minors. Our two-phase protocol first builds a category-balanced conversation corpus via PAIR-style iterative prompt refinement over the API, then has trained human agents replay and refine those prompts in the consumer UI using a designated child account while monitoring the linked parent inbox for alerts. We focus on seven risk areas -- physical harm, pornography, privacy violence, health consultation, fraud, hate speech, and malware -- and quantify four outcomes: Notification Rate (NR), Leak-Through Rate (LR), Overblocking Rate (OBR), and UI Intervention Rate (UIR). Using an automated judge (with targeted human audit) and comparing the current backend to legacy variants (GPT-4.1/4o), we find that notifications are selective rather than comprehensive: privacy violence, fraud, hate speech, and malware triggered no parental alerts in our runs, whereas physical harm (highest), pornography, and some health queries produced intermittent alerts. The current backend shows lower leak-through than legacy models, yet overblocking of benign, educational queries near sensitive topics remains common and is not surfaced to parents, revealing a policy-product gap between on-screen safeguards and parent-facing telemetry. We propose actionable fixes: broaden the notification taxonomy and make it configurable, couple visible safeguards to privacy-preserving parent summaries, and prefer calibrated, age-appropriate safe rewrites over blanket refusals.
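To make the four outcome metrics concrete, here is a minimal sketch of how they could be computed from per-trial labels. The `Trial` fields and the exact denominators (risky vs. benign prompts) are assumptions for illustration, not the paper's published definitions.

```python
from dataclasses import dataclass

@dataclass
class Trial:
    risky: bool           # prompt belongs to one of the seven risk categories
    leaked: bool          # harmful content reached the child-facing screen
    parent_alerted: bool  # an alert appeared in the linked parent inbox
    blocked: bool         # the UI refused or blocked the request
    ui_intervened: bool   # any visible UI safeguard fired (warning, rewrite, block)

def rates(trials: list[Trial]) -> dict[str, float]:
    """Compute NR, LR, OBR, and UIR over a labeled run log (assumed formulas)."""
    risky = [t for t in trials if t.risky]
    benign = [t for t in trials if not t.risky]
    return {
        # Notification Rate: risky prompts that triggered a parental alert
        "NR": sum(t.parent_alerted for t in risky) / len(risky),
        # Leak-Through Rate: risky prompts whose harmful content got through
        "LR": sum(t.leaked for t in risky) / len(risky),
        # Overblocking Rate: benign (e.g., educational) prompts that were blocked
        "OBR": sum(t.blocked for t in benign) / len(benign),
        # UI Intervention Rate: risky prompts with any visible on-screen safeguard
        "UIR": sum(t.ui_intervened for t in risky) / len(risky),
    }
```

Under these definitions, the paper's headline finding corresponds to categories where LR or UIR is nonzero while NR stays at zero, and OBR captures blocked benign queries that parents never see.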