🤖 AI Summary
This study addresses the dual challenges of insufficient affective engagement and lack of personalization in second language (L2) learning. We propose an immersive, learner-centered methodology that leverages personal memory and affective expression: learners' self-recorded videos serve as input; WhisperX enables precise speech-text alignment; keyframes are sampled to extract visual scene features; and Style-BERT-VITS2 synthesizes stylistically appropriate, emotionally congruent spoken questions aligned with visual affect. To our knowledge, this is the first work to integrate personal memory media with fine-grained emotional speech synthesis into a "vision-affect-language" co-adaptive interaction paradigm. Experimental results demonstrate significant improvements in learners' emotional resonance and willingness to engage in oral interaction, empirically validating the efficacy of affective, memory-driven design in L2 acquisition.
📝 Abstract
We present Re:Member, a system that explores how emotionally expressive, memory-grounded interaction can support more engaging second language (L2) learning. By drawing on users' personal videos and generating stylized spoken questions in the target language, Re:Member is designed to encourage affective recall and conversational engagement. The system aligns emotional tone with visual context, using expressive speech styles such as whispers or late-night tones to evoke specific moods. It combines WhisperX-based transcript alignment, 3-frame visual sampling, and Style-BERT-VITS2 for emotional synthesis within a modular generation pipeline. Designed as a stylized interaction probe, Re:Member highlights the role of affect and personal media in learner-centered educational technologies.
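The 3-frame visual sampling step mentioned above can be sketched as a small index-selection helper. This is an illustrative sketch only: the paper does not specify its sampling strategy, so the evenly-spaced-segment approach and the name `sample_keyframe_indices` are assumptions.

```python
def sample_keyframe_indices(total_frames: int, k: int = 3) -> list[int]:
    """Pick k representative frame indices from a clip of total_frames.

    Hypothetical helper: takes the center of each of k equal segments,
    so samples are spread across the clip and avoid its edges.
    """
    if total_frames <= 0:
        return []
    if total_frames <= k:
        # Clip shorter than k frames: use every frame.
        return list(range(total_frames))
    step = total_frames / k
    return [int(step * (i + 0.5)) for i in range(k)]
```

The returned indices would then be passed to a frame extractor (e.g. an OpenCV `VideoCapture` loop) before visual feature extraction; that downstream wiring is likewise an assumption about the pipeline's internals.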