When the Code Autopilot Breaks: Why LLMs Falter in Embedded Machine Learning

📅 2025-09-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the prevalent “silent failures” and unpredictable behavior of large language models (LLMs) when auto-generating code for embedded machine learning (ML) workflows. The authors propose a closed-loop evaluation framework covering data preprocessing, model conversion, and on-device inference code generation. Through a multi-model empirical analysis, they introduce the first failure taxonomy for LLM-generated code in embedded ML, identifying systemic fragility arising from prompt-format bias, implicit structural assumptions encoded in LLMs, and blind spots in compilation- and runtime-level validation. Key failure patterns include format-misleading parsing errors and “compilable-yet-functionally-broken” runtime errors, both largely undetectable by conventional verification methods. The findings provide theoretical foundations and practical guidelines for enhancing the reliability, traceability, and robustness of LLM-driven embedded ML systems.
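To make the “compilable-yet-functionally-broken” pattern concrete, here is a minimal hypothetical sketch (not taken from the paper): preprocessing code that parses and runs without any exception, but encodes an implicit structural assumption, channels-first (CHW) layout, while the pipeline actually supplies channels-last (HWC) data. Because the arrays still broadcast, nothing fails at compile or run time; the output is just silently wrong.

```python
import numpy as np

def preprocess(image: np.ndarray) -> np.ndarray:
    """Hypothetical LLM-generated preprocessing: normalizes per channel
    assuming CHW layout (channel on axis 0). Runs cleanly on HWC input
    too, because the reduced shapes still broadcast."""
    mean = image.mean(axis=(1, 2), keepdims=True)
    std = image.std(axis=(1, 2), keepdims=True) + 1e-6
    return (image - mean) / std

# The pipeline actually provides HWC data: no exception is raised,
# but statistics are computed over the wrong axes.
hwc = np.random.rand(224, 224, 3).astype(np.float32)
out = preprocess(hwc)  # silently mis-normalized

# What the code intended: normalize per channel on CHW data.
correct = preprocess(hwc.transpose(2, 0, 1)).transpose(1, 2, 0)
print(np.allclose(out, correct))  # False: functionally broken, yet "valid"
```

A compile gate or a shape check would pass this code; only a functional probe comparing outputs against a reference catches the defect, which is the kind of blind spot the taxonomy targets.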

📝 Abstract
Large Language Models (LLMs) are increasingly used to automate software generation in embedded machine learning workflows, yet their outputs often fail silently or behave unpredictably. This article presents an empirical investigation of failure modes in LLM-powered ML pipelines, based on an autopilot framework that orchestrates data preprocessing, model conversion, and on-device inference code generation. We show how prompt format, model behavior, and structural assumptions influence both success rates and failure characteristics, often in ways that standard validation pipelines fail to detect. Our analysis reveals a diverse set of error-prone behaviors, including format-induced misinterpretations and runtime-disruptive code that compiles but breaks downstream. We derive a taxonomy of failure categories and analyze errors across multiple LLMs, highlighting common root causes and systemic fragilities. Though grounded in specific devices, our study reveals broader challenges in LLM-based code generation. We conclude by discussing directions for improving reliability and traceability in LLM-powered embedded ML systems.
Problem

Research questions and friction points this paper is trying to address.

Investigating failure modes in LLM-powered embedded ML pipelines
Analyzing how prompt format and model assumptions cause silent failures
Identifying error-prone behaviors that evade standard validation methods
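As a concrete, hypothetical illustration of a format-induced silent failure (my sketch, not the paper's harness): an extractor that assumes LLM responses always use a ```python-labeled fence returns an empty string, rather than raising, when the model answers with an unlabeled fence, so downstream stages receive no code and no error signal.

```python
import re

def extract_code(response: str) -> str:
    """Naive extractor assuming a ```python-labeled fence.
    Returns "" (silently) when the format assumption is violated."""
    m = re.search(r"```python\n(.*?)```", response, re.DOTALL)
    return m.group(1) if m else ""

labeled = "Here you go:\n```python\nx = 1\n```"
unlabeled = "Here you go:\n```\nx = 1\n```"
print(repr(extract_code(labeled)))    # 'x = 1\n'
print(repr(extract_code(unlabeled)))  # '' -- format mismatch, no error raised
```

The empty-string return is exactly the behavior that evades standard validation: the pipeline proceeds with a vacuous artifact instead of surfacing a parse failure.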
Innovation

Methods, ideas, or system contributions that make the work stand out.

Empirical investigation of LLM failure modes
Autopilot framework orchestrating ML workflows
Taxonomy of error categories across LLMs
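The closed-loop idea behind the autopilot framework can be sketched as a minimal validation harness (a hypothetical simplification, not the authors' implementation): each generated artifact must clear a compile-level gate and a runtime functional probe before the pipeline advances, so that code which merely compiles cannot slip through.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

def closed_loop_check(generated_code: str, probe: str) -> str:
    """Classify one generated snippet as 'compile_error', 'runtime_error',
    'wrong_output', or 'ok'. Hypothetical compile-then-execute gate."""
    src = Path(tempfile.mkdtemp()) / "candidate.py"
    src.write_text(generated_code + "\n" + probe)
    # Stage 1: compile-level gate (syntax only).
    try:
        compile(src.read_text(), str(src), "exec")
    except SyntaxError:
        return "compile_error"
    # Stage 2: runtime-level gate with a functional probe.
    result = subprocess.run([sys.executable, str(src)],
                            capture_output=True, text=True, timeout=30)
    if result.returncode != 0:
        return "runtime_error"
    return "ok" if result.stdout.strip() == "PASS" else "wrong_output"

# A snippet that compiles and runs, but uses the wrong constant:
broken = "def to_celsius(f):\n    return (f - 32) * 5 / 8\n"
probe = "print('PASS' if to_celsius(212) == 100 else 'FAIL')"
print(closed_loop_check(broken, probe))  # wrong_output
```

The key design point is the third verdict: without the functional probe, `broken` would be indistinguishable from correct code, which is precisely the “compilable-yet-functionally-broken” category the taxonomy isolates.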