Learning When to Ask: Simulation-Trained Humanoids for Mental-Health Diagnosis

📅 2025-11-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Humanoid robots face significant challenges in psychological screening for depression and PTSD, including suboptimal dialogue timing, difficulty modeling nonverbal cues (e.g., gaze, pauses, echoing), multimodal perception of speech, facial expression, and posture, high hardware experimentation costs, and poor generalizability. Method: We propose the first agent-centric virtual simulation training framework, built on Unreal Engine MetaHuman, featuring 276 synchronized multimodal virtual patients. Dialogue turn-taking and trust evolution are formulated as optimizable control variables. Our framework integrates an uncertainty-aware turn manager, safety shielding, clinical protocols (PHQ-8/PCL-C), counterfactual nonverbal perturbations, modality-dropout robust training, and TD3 reinforcement learning. Results: The TD3 controller significantly outperforms PPO and CEM, achieving near-zero turn overlap, high turn-transition alignment accuracy, fewer clarification requests, and shorter waiting times. It is robust to renderer substitution and modality loss, and generalizes well to held-out patient cohorts.
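The modality-dropout robust training mentioned above can be illustrated with a minimal sketch: during training, whole input channels (speech, face, pose) are randomly zeroed so the policy learns to cope with a lost sensor. The function name, dropout probability, and feature layout below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def modality_dropout(features, p_drop=0.3, rng=rng):
    """Randomly zero out entire modalities (hypothetical sketch).

    `features` maps a modality name to its feature vector; dropping a
    whole vector simulates a failed camera or muted microphone."""
    out = {}
    for name, vec in features.items():
        if rng.random() < p_drop:
            out[name] = np.zeros_like(vec)  # channel lost this step
        else:
            out[name] = vec
    # Guarantee at least one modality survives each step.
    if all(not v.any() for v in out.values()):
        keep = rng.choice(list(features))
        out[keep] = features[keep]
    return out

obs = {"speech": np.ones(4), "face": np.ones(3), "pose": np.ones(6)}
dropped = modality_dropout(obs)
```

Training against such perturbed observations is one standard way to obtain the robustness to modality loss reported in the results.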

📝 Abstract
Testing humanoid robots with users is slow, causes wear, and limits iteration and diversity. Yet screening agents must master conversational timing, prosody, backchannels, and what to attend to in faces and speech for Depression and PTSD. Most simulators omit policy learning with nonverbal dynamics; many controllers chase task accuracy while underweighting trust, pacing, and rapport. We virtualise the humanoid as a conversational agent to train without hardware burden. Our agent-centred, simulation-first pipeline turns interview data into 276 Unreal Engine MetaHuman patients with synchronised speech, gaze/face, and head-torso poses, plus PHQ-8 and PCL-C flows. A perception-fusion-policy loop decides what and when to speak, when to backchannel, and how to avoid interruptions, under a safety shield. Training uses counterfactual replay (bounded nonverbal perturbations) and an uncertainty-aware turn manager that probes to reduce diagnostic ambiguity. Results are simulation-only; the humanoid is the transfer target. In comparing three controllers, a custom TD3 (Twin Delayed DDPG) outperformed PPO and CEM, achieving near-ceiling coverage with steadier pace at comparable rewards. Decision-quality analyses show negligible turn overlap, aligned cut timing, fewer clarification prompts, and shorter waits. Performance stays stable under modality dropout and a renderer swap, and rankings hold on a held-out patient split. Contributions: (1) an agent-centred simulator that turns interviews into 276 interactive patients with bounded nonverbal counterfactuals; (2) a safe learning loop that treats timing and rapport as first-class control variables; (3) a comparative study (TD3 vs PPO/CEM) with clear gains in completeness and social timing; and (4) ablations and robustness analyses explaining the gains and enabling clinician-supervised humanoid pilots.
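The abstract's uncertainty-aware turn manager "probes to reduce diagnostic ambiguity". A minimal sketch of that idea, assuming the agent keeps a belief distribution over severity levels: when the belief's entropy is high, the agent asks a clarifying probe instead of advancing the protocol. The threshold, action names, and pacing floor are invented for illustration.

```python
import math

def entropy(p):
    """Shannon entropy (nats) of a discrete distribution."""
    return -sum(q * math.log(q) for q in p if q > 0)

def turn_action(belief, probe_threshold=1.0, floor_ms=800):
    """Pick the next dialogue move from the diagnostic belief.

    High entropy -> ask a clarifying probe to reduce ambiguity;
    low entropy -> advance to the next protocol item.
    Returns (action, minimum wait before speaking, in ms)."""
    if entropy(belief) > probe_threshold:
        return "clarifying_probe", floor_ms
    return "next_item", floor_ms
```

For example, a near-uniform belief `[0.25, 0.25, 0.25, 0.25]` has entropy ln 4 ≈ 1.39 and triggers a probe, while a peaked belief like `[0.9, 0.05, 0.03, 0.02]` lets the interview move on.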
Problem

Research questions and friction points this paper is trying to address.

How to train humanoid robots for mental-health diagnosis in simulation, without hardware wear and cost constraints.
How a conversational agent can master nonverbal cues and conversational timing for patient interaction.
Which controller best balances diagnostic completeness and social rapport on virtual patients.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Virtualizes the humanoid as a conversational agent for hardware-free training.
Uses a perception-fusion-policy loop with a safety shield for interaction decisions.
Trains with counterfactual replay and an uncertainty-aware turn manager.