🤖 AI Summary
Hearing-impaired children face significant speech perception challenges in classrooms due to background noise, overlapping multi-talker speech, and strong reverberation.
Method: We propose a real-time binaural speech separation method tailored for pediatric hearing aids. The approach adapts the MIMO-TasNet architecture with a spatially aware module that preserves binaural cues, and we construct a high-fidelity classroom acoustic simulation dataset for training and evaluation. A transfer learning strategy, pretraining on adult speech followed by data-efficient fine-tuning on child speech, achieves comparable gains with only 50% of the classroom training data.
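The summary describes a pretrain-then-fine-tune recipe rather than releasing code, so the following is only a minimal PyTorch sketch of that recipe. The `TinySeparator` stand-in, the frozen-encoder choice, the learning rate, and the checkpoint name are illustrative assumptions, not details from the paper; the real MIMO-TasNet uses a learned filterbank encoder, a temporal convolutional mask estimator, and a waveform decoder per ear.

```python
import torch
import torch.nn as nn

class TinySeparator(nn.Module):
    """Stand-in with the encoder -> mask estimator -> decoder layout of
    TasNet-style models. NOT the paper's architecture; shapes are illustrative."""
    def __init__(self, n_filt=64, k=16):
        super().__init__()
        self.encoder = nn.Conv1d(2, n_filt, k, stride=k // 2, bias=False)
        self.masker = nn.Sequential(
            nn.Conv1d(n_filt, n_filt, 3, padding=1), nn.PReLU(),
            nn.Conv1d(n_filt, 2 * n_filt, 1), nn.Sigmoid(),  # masks for 2 talkers
        )
        self.decoder = nn.ConvTranspose1d(n_filt, 2, k, stride=k // 2, bias=False)

    def forward(self, mix):                       # mix: (batch, 2 ears, samples)
        feats = self.encoder(mix)
        m1, m2 = self.masker(feats).chunk(2, dim=1)
        # Decode each masked representation back to a binaural waveform.
        return torch.stack([self.decoder(feats * m1),
                            self.decoder(feats * m2)], dim=1)

model = TinySeparator()
# model.load_state_dict(torch.load("adult_pretrained.pt"))  # hypothetical checkpoint

# Fine-tune on child speech: freeze the filterbank, update the rest at a low LR.
for p in model.encoder.parameters():
    p.requires_grad = False
opt = torch.optim.Adam((p for p in model.parameters() if p.requires_grad), lr=1e-4)

mix = torch.randn(4, 2, 16000)                    # dummy 1 s binaural mixtures
tgt = torch.randn(4, 2, 2, 16000)                 # dummy (talker, ear) targets
est = model(mix)
loss = nn.functional.mse_loss(est, tgt[..., :est.shape[-1]])  # real loss: -SI-SNR
loss.backward()
opt.step()
opt.zero_grad()
```

Freezing the low-level filterbank while adapting the mask estimator is one common transfer choice; the paper does not specify which layers it updates during fine-tuning.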
Results: Experiments show substantial improvements over baselines under both realistic classroom noise and diffuse babble noise, with consistent gains in separation quality (SI-SNRi), intelligibility (STOI), and binaural spatial cue fidelity. The model also remains robust across speaker distances, including distances unseen during training. To our knowledge, this is the first low-latency, high-fidelity speech separation solution designed explicitly for children’s hearing aids.
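For reference, SI-SNRi is the improvement in scale-invariant signal-to-noise ratio over the unprocessed mixture. The sketch below is the standard definition, not the paper's evaluation code; STOI is usually computed with an off-the-shelf implementation such as the pystoi package.

```python
import numpy as np

def si_snr(est: np.ndarray, ref: np.ndarray, eps: float = 1e-8) -> float:
    """Scale-invariant SNR (dB) between an estimated and a reference signal."""
    est, ref = est - est.mean(), ref - ref.mean()
    s_target = (est @ ref) / (ref @ ref + eps) * ref  # projection onto reference
    e_noise = est - s_target                          # everything else is error
    return float(10 * np.log10((s_target @ s_target) / (e_noise @ e_noise + eps)))

def si_snri(est: np.ndarray, mix: np.ndarray, ref: np.ndarray) -> float:
    """SI-SNRi: gain of the separated estimate over the raw mixture."""
    return si_snr(est, ref) - si_snr(mix, ref)
```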
📝 Abstract
Classroom environments are particularly challenging for children with hearing impairments: background noise, multiple talkers, and reverberation all degrade speech perception. These difficulties are greater for children than for adults, yet most deep learning speech separation models for assistive devices are developed using adult voices in simplified, low-reverberation conditions. This overlooks both the higher spectral similarity of children's voices, which weakens separation cues, and the acoustic complexity of real classrooms. We address this gap using MIMO-TasNet, a compact, low-latency, multi-channel architecture suited for real-time deployment in bilateral hearing aids or cochlear implants. We simulated naturalistic classroom scenes with moving child-child and child-adult talker pairs under varying noise and distance conditions. Our training strategies tested how well the model adapts to children's speech through spatial cues. Models trained on adult speech, models trained on classroom data, and fine-tuned variants were compared to assess data-efficient adaptation. Results show that adult-trained models perform well in clean scenes, but classroom-specific training greatly improves separation quality. Fine-tuning with only half the classroom data achieved comparable gains, confirming efficient transfer learning. Training with diffuse babble noise further enhanced robustness, and the model preserved spatial awareness while generalizing to unseen distances. These findings demonstrate that spatially aware architectures combined with targeted adaptation can improve speech accessibility for children in noisy classrooms, supporting future on-device assistive technologies.
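As a rough illustration of how scenes like these can be generated, the sketch below builds one static binaural classroom snapshot with the pyroomacoustics image-source model. Room geometry, absorption, source positions, and the white-noise "speech" are all assumptions rather than the paper's configuration, and the paper's moving talkers would additionally require time-varying impulse responses (e.g., block-wise RIR updates).

```python
import numpy as np
import pyroomacoustics as pra

fs = 16000
rng = np.random.default_rng(0)

# Placeholder source signals; in practice these would be child/adult speech
# recordings. White noise keeps the sketch self-contained.
child = rng.standard_normal(4 * fs)
adult = rng.standard_normal(4 * fs)

# A reverberant shoebox "classroom" (dimensions and absorption are assumptions).
room = pra.ShoeBox(
    [9.0, 7.0, 3.0], fs=fs,
    materials=pra.Material(energy_absorption=0.25), max_order=17,
)
room.add_source([2.5, 3.5, 1.2], signal=child)   # seated child talker
room.add_source([6.5, 5.0, 1.7], signal=adult)   # standing adult talker

# Two microphones ~16 cm apart stand in for the ears of a bilateral device.
ears = np.array([[4.5, 4.5], [3.42, 3.58], [1.25, 1.25]])  # (xyz, n_mics)
room.add_microphone_array(pra.MicrophoneArray(ears, fs))

room.simulate()                                   # convolve sources with RIRs
binaural_mix = room.mic_array.signals             # shape: (2, n_samples)
```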