Human-1 by Josh Talks: A Full-Duplex Conversational Modeling Framework in Hindi using Real-World Conversations

📅 2026-04-25
📈 Citations: 0
Influential: 0

🤖 AI Summary
Full-duplex spoken dialogue systems, which can model natural conversational phenomena such as interruptions, overlaps, and backchannels, remain largely unexplored for Indian languages such as Hindi. Building on the Moshi architecture and 26,000 hours of real-world spontaneous dialogue recorded from 14,695 speakers with separate speaker channels, we present the first open, reproducible full-duplex dialogue system for Hindi. We replace the original English tokenizer with a custom Hindi tokenizer, reinitializing the text-vocabulary-dependent parameters while retaining the pretrained audio components, and train in two stages: large-scale pre-training followed by fine-tuning on 1,000 hours of conversational data. Evaluated through prompted dialogue continuation with both automatic metrics and human judgments, the model generates natural and meaningful full-duplex conversational behaviour in Hindi, a first step toward real-time spoken dialogue systems for Hindi and other Indian languages.

📝 Abstract
Full-duplex spoken dialogue systems can model natural conversational behaviours such as interruptions, overlaps, and backchannels, yet such systems remain largely unexplored for Indian languages. We present the first open, reproducible full-duplex spoken dialogue system for Hindi by adapting Moshi, a state-of-the-art duplex speech architecture, using a custom Hindi tokeniser and training on 26,000 hours of real spontaneous conversations collected from 14,695 speakers with separate speaker channels, enabling direct learning of turn-taking and overlap patterns from natural interactions. To support Hindi text generation, we replace the original English tokeniser and reinitialise text-vocabulary-dependent parameters while retaining the pre-trained audio components. We propose a two-stage training recipe -- large-scale pre-training followed by fine-tuning on 1,000 hours of conversational data. Evaluation through the prompted dialogue continuation paradigm with both automatic metrics and human judgments demonstrates that the resulting model generates natural and meaningful full-duplex conversational behaviour in Hindi. This work serves as a first step toward real-time duplex spoken dialogue systems for Hindi and other Indian languages.
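
The tokeniser swap described in the abstract can be pictured with a minimal sketch. It is not the authors' code: the parameter names (`text_embedding`, `text_output_head`, `audio_embedding`), the toy sizes, and the Gaussian initialisation with standard deviation 0.02 are all assumptions made for illustration. The idea shown is only that tables whose row count depends on the text vocabulary are rebuilt for the new vocabulary size, while every other parameter is carried over unchanged.

```python
import random


def adapt_for_new_tokenizer(state, text_param_names, new_vocab_size, seed=0):
    """Return a new parameter dict for a model whose text tokeniser was replaced.

    state: dict mapping a parameter name to a list of rows (lists of floats).
    text_param_names: names of tables with one row per text-vocabulary entry
        (hypothetical names; a real checkpoint would use its own).
    new_vocab_size: number of entries in the replacement tokeniser.
    """
    rng = random.Random(seed)
    adapted = {}
    for name, rows in state.items():
        if name in text_param_names:
            # Old rows are indexed by old token ids, so they cannot be reused:
            # build a fresh table with one randomly initialised row per new token.
            width = len(rows[0])
            adapted[name] = [
                [rng.gauss(0.0, 0.02) for _ in range(width)]
                for _ in range(new_vocab_size)
            ]
        else:
            # Parameters that do not depend on the text vocabulary
            # (here, the audio table) are retained as they are.
            adapted[name] = rows
    return adapted


# Toy checkpoint: a 3-token text vocabulary and a 2-entry audio table.
old_state = {
    "text_embedding": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    "text_output_head": [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]],
    "audio_embedding": [[0.7, 0.8], [0.9, 1.0]],
}
new_state = adapt_for_new_tokenizer(
    old_state,
    text_param_names={"text_embedding", "text_output_head"},
    new_vocab_size=5,
)
```

In this toy setup the two text tables grow from 3 to 5 rows with fresh values, and the audio table is the same object as before. In a real framework the same split would be applied to the checkpoint's tensors before the pre-training stage begins.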
Problem

Research questions and friction points this paper is trying to address.

full-duplex
spoken dialogue systems
Hindi
conversational modeling
Indian languages
Innovation

Methods, ideas, or system contributions that make the work stand out.

full-duplex dialogue
Hindi spoken language
real-world conversations
turn-taking modeling
two-stage training