Benchmarking Local Language Models for Social Robots using Edge Devices

📅 2026-05-04

🤖 AI Summary

This study addresses the lack of systematic evaluation of language models deployed on edge devices for educational social robots, where balancing response latency, privacy preservation, and pedagogical efficacy remains challenging. The authors present a benchmarking framework tailored to educational robotics, locally deploying 25 open-source models on resource-constrained hardware such as the Raspberry Pi. Performance is assessed across three dimensions: inference efficiency, general knowledge (via a subset of MMLU), and teaching effectiveness (using LLM-based automated scoring validated by human raters). Based on the results, a three-tier local inference architecture is proposed to accommodate stringent resource constraints. Throughput and energy efficiency vary by more than an order of magnitude across models. Granite4 Tiny Hybrid (7B) offers a strong overall balance, achieving 2.5 tokens/s, 0.90 tokens/J, and 54.6% accuracy on MMLU. On four representative models, automated scores agree closely with human ratings (Pearson r = 0.967, n = 4).
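The two efficiency metrics quoted above can be related with a little arithmetic. The following is a minimal sketch, not the authors' measurement code: the `RunMeasurement` fields and function names are hypothetical, and it assumes throughput and energy are measured over the same generation interval. Under that assumption, dividing tokens/s by tokens/J gives the implied mean power draw, roughly 2.5 / 0.90 ≈ 2.8 W for the reported Granite4 Tiny Hybrid figures.

```python
from dataclasses import dataclass


@dataclass
class RunMeasurement:
    """One generation run. All fields are hypothetical inputs for illustration."""
    tokens_generated: int
    wall_time_s: float   # wall-clock duration of generation, in seconds
    mean_power_w: float  # average power draw over the same interval, in watts


def tokens_per_second(run: RunMeasurement) -> float:
    """Throughput: generated tokens divided by elapsed time."""
    return run.tokens_generated / run.wall_time_s


def tokens_per_joule(run: RunMeasurement) -> float:
    """Energy efficiency: generated tokens divided by energy (power x time)."""
    energy_j = run.mean_power_w * run.wall_time_s
    return run.tokens_generated / energy_j


def implied_mean_power_w(tps: float, tpj: float) -> float:
    """Mean power implied by the two metrics: (tokens/s) / (tokens/J) = J/s = W.

    Valid only if both metrics were measured over the same interval (an assumption).
    """
    return tps / tpj


if __name__ == "__main__":
    # Figures reported in the summary for Granite4 Tiny Hybrid (7B).
    print(f"implied power: {implied_mean_power_w(2.5, 0.90):.2f} W")
```

The implied power is only a consistency check on the reported numbers; the source does not state how power was measured.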

📝 Abstract

Social-educational robots designed for socially interactive pedagogical support, such as the Robot Study Companion (RSC), rely on responsive, privacy-preserving interaction despite severely limited compute. However, systematic benchmarking of language models for edge computing in pedagogical applications is lacking. This paper benchmarks 25 open-source language models for local deployment on edge hardware. We evaluate each model across three dimensions: inference efficiency (tokens per second, energy consumption), general knowledge (a six-category MMLU subset), and teaching effectiveness (LLM-rated pedagogical quality, validated against five independent human raters). The Raspberry Pi (RPi) 4 serves as the primary platform, with additional comparisons on the RPi 5 and a laptop GPU. Results reveal pronounced trade-offs: throughput and energy efficiency vary by over an order of magnitude across models, MMLU accuracy ranges from near-random to 57.2%, and teaching effectiveness does not correlate monotonically with either metric. Among the evaluated models, Granite4 Tiny Hybrid (7B) achieves a strong overall balance, reaching 2.5 tokens per second, 0.90 tokens per joule, and 54.6% MMLU accuracy; high MMLU accuracy does not appear necessary for strong teaching scores. Human validation on four representative models preserved the automated rank ordering (Pearson r = 0.967, n = 4). Based on these findings, we propose a three-tier local inference architecture for the RSC that balances responsiveness and accuracy on resource-constrained hardware.
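The human-validation step reports a Pearson correlation and a preserved rank ordering over four models. A minimal sketch of both checks follows. The score lists are made-up placeholders, not the paper's data, and the helper names are hypothetical. With only four points, a high r is suggestive rather than conclusive, which is why the rank-order check is shown alongside it.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / (spread_x * spread_y)


def same_rank_order(x: Sequence[float], y: Sequence[float]) -> bool:
    """True if sorting by x and sorting by y order the items identically."""
    order_x = sorted(range(len(x)), key=lambda i: x[i])
    order_y = sorted(range(len(y)), key=lambda i: y[i])
    return order_x == order_y


if __name__ == "__main__":
    # Placeholder scores for four models (illustrative only, not from the paper).
    automated = [2.0, 3.0, 4.0, 5.0]
    human = [2.5, 3.0, 4.5, 4.8]
    print(f"r = {pearson_r(automated, human):.3f}")
    print(f"rank order preserved: {same_rank_order(automated, human)}")
```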

Problem

Research questions and friction points this paper is trying to address.

local language models

edge computing

social robots

pedagogical applications

benchmarking

Innovation

Methods, ideas, or system contributions that make the work stand out.

edge AI

language model benchmarking

social robots

on-device inference

pedagogical effectiveness
