Bhasha-Rupantarika: Algorithm-Hardware Co-design Approach for Multilingual Neural Machine Translation

📅 2025-10-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Deploying multilingual neural machine translation (NMT) on resource-constrained devices, such as IoT endpoints, remains challenging, especially for bidirectional translation between Indian and international languages. Method: This paper proposes an algorithm-hardware co-optimized lightweight NMT system, featuring quantization-aware training combined with sub-eight-bit mixed-precision quantization (FP4/INT4/FP8/INT8), tightly integrated with an FPGA architecture for hardware-accelerated inference. Contribution/Results: The system achieves a 4.1× reduction in model size and a 4.2× speedup in inference latency, delivering 66 tokens/s throughput. FPGA resource utilization also improves significantly: LUT usage decreases by 1.96× and flip-flop usage by 1.65×. Compared to the OPU and HPTA baselines, throughput increases by 2.2–4.6×, enabling real-time performance and resource efficiency under ultra-low power constraints.
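Conceptually, quantization-aware training of this kind inserts simulated low-precision rounding into the forward pass so the model learns to tolerate it. The sketch below is a minimal illustration of that idea and not the authors' implementation; the symmetric INT4 scheme, the straight-through estimator, and all names and shapes are assumptions.

```python
# Minimal QAT sketch (illustrative, not the paper's code): symmetric fake
# quantization with a straight-through estimator for sub-8-bit weights.
import torch

def fake_quantize(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Round weights to a symmetric INT<bits> grid in the forward pass while
    letting gradients flow through unchanged (straight-through estimator)."""
    qmax = 2 ** (bits - 1) - 1                         # e.g. 7 for INT4
    scale = w.detach().abs().max().clamp(min=1e-8) / qmax
    w_q = torch.clamp(torch.round(w / scale), -qmax, qmax) * scale
    return w + (w_q - w).detach()                      # forward: w_q, backward: identity

# Hypothetical usage inside a linear layer during training:
x = torch.randn(2, 16)
weight = torch.nn.Parameter(torch.randn(8, 16))
y = torch.nn.functional.linear(x, fake_quantize(weight, bits=4))
y.sum().backward()                                     # gradients still reach `weight`
```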

📝 Abstract
This paper introduces Bhasha-Rupantarika, a lightweight and efficient multilingual translation system tailored through algorithm-hardware co-design for resource-limited settings. The method investigates model deployment at sub-octet precision levels (FP8, INT8, INT4, and FP4); experimental results indicate a 4.1x reduction in model size (FP4) and a 4.2x speedup in inference, corresponding to an increased throughput of 66 tokens/s (a 4.8x improvement). This underscores the importance of ultra-low precision quantization for real-time deployment on IoT devices using FPGA accelerators, achieving performance in line with expectations. Our evaluation covers bidirectional translation between Indian and international languages, showcasing adaptability to low-resource linguistic contexts. The FPGA deployment demonstrated a 1.96x reduction in LUTs and a 1.65x decrease in FFs, resulting in a 2.2x throughput improvement over OPU and a 4.6x improvement over HPTA. Overall, the evaluation provides a viable solution based on quantization-aware translation and hardware efficiency, suitable for deployable multilingual AI systems. The complete code [https://github.com/mukullokhande99/Bhasha-Rupantarika/] and dataset are publicly available for reproducibility, facilitating rapid integration and further development by researchers.
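On the FP4 end of the precision range, a common 4-bit floating-point layout is E2M1, whose positive magnitudes are {0, 0.5, 1, 1.5, 2, 3, 4, 6}. The snippet below sketches nearest-value FP4 rounding with a per-tensor scale as a plain illustration; the paper's exact FP4 variant and scaling granularity are not specified here, so this should not be read as the authors' scheme.

```python
# FP4 (E2M1) rounding sketch, assuming a per-tensor scale (illustrative only).
import numpy as np

FP4_E2M1_LEVELS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4(w: np.ndarray) -> np.ndarray:
    """Map each weight to the nearest representable FP4 (E2M1) value."""
    scale = max(np.abs(w).max() / FP4_E2M1_LEVELS[-1], 1e-12)  # map |w|max to 6.0
    scaled = np.abs(w) / scale
    idx = np.abs(scaled[..., None] - FP4_E2M1_LEVELS).argmin(axis=-1)
    return np.sign(w) * FP4_E2M1_LEVELS[idx] * scale

w = np.random.randn(4, 4).astype(np.float32)
print(quantize_fp4(w))   # only 4-bit-representable magnitudes, rescaled
```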
Problem

Research questions and friction points this paper is trying to address.

Develops efficient multilingual translation for resource-limited IoT devices
Investigates ultra-low precision quantization to reduce model size
Optimizes hardware deployment using FPGA accelerators for speed
Innovation

Methods, ideas, or system contributions that make the work stand out.

Algorithm-hardware co-design for efficient multilingual translation
Ultra-low precision quantization reduces model size and speeds inference (see the sizing sketch after this list)
FPGA deployment optimizes resource use for IoT devices
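As a back-of-the-envelope check on the model-size claim referenced above: storing weights in 4 bits instead of 16 gives close to a 4x reduction once quantization scales are included. The parameter and channel counts below are placeholders, not figures from the paper.

```python
# Rough sizing sketch with hypothetical counts (not the paper's model):
# FP16 weight storage vs. FP4 weights plus one FP16 scale per output channel.
params = 75_000_000      # placeholder parameter count
channels = 50_000        # placeholder number of quantized output channels
fp16_bytes = params * 2                     # 16 bits per weight
fp4_bytes = params * 0.5 + channels * 2     # 4 bits per weight + FP16 scales
print(f"FP16: {fp16_bytes / 1e6:.1f} MB, FP4: {fp4_bytes / 1e6:.1f} MB, "
      f"ratio: {fp16_bytes / fp4_bytes:.2f}x")
```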
Mukul Lokhande
NSDCS Research Group, Dept. of Electrical Engineering, Indian Institute of Technology Indore, India
Tanushree Dewangan
NSDCS Research Group, Dept. of Electrical Engineering, Indian Institute of Technology Indore, India
Mohd Sharik Mansoori
NSDCS Research Group, Dept. of Electrical Engineering, Indian Institute of Technology Indore, India
Tejas Chaudhari
NSDCS Research Group, Dept. of Electrical Engineering, Indian Institute of Technology Indore, India
Akarsh J.
NSDCS Research Group, Dept. of Electrical Engineering, Indian Institute of Technology Indore, India
Damayanti Lokhande
Independent Researcher
Adam Teman
Bar Ilan University
Embedded Memories, Energy Efficient Circuit Design, Domain-Specific Architectures, RISC-V, Physical Design
Santosh Kumar Vishvakarma
NSDCS Research Group, Dept. of Electrical Engineering, Indian Institute of Technology Indore, India