VehicleMemBench: An Executable Benchmark for Multi-User Long-Term Memory in In-Vehicle Agents

📅 2026-03-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of existing in-vehicle agent benchmarks, which are confined to single-user, static question-answering and cannot assess long-term memory and decision-making under multi-user preference conflicts and changing behavioral patterns. To bridge this gap, the authors propose VehicleMemBench, the first executable long-term memory benchmark tailored to in-vehicle multi-user scenarios. Built on a simulated environment, it provides 23 tool modules, and each sample contains over 80 historical memory events. Evaluation is objective and reproducible because the post-action environment state is compared automatically against a target state. Experiments show that state-of-the-art models execute direct commands well but struggle when user preferences evolve, and that even advanced memory systems struggle with the domain-specific memory requirements of this environment.

📝 Abstract
With the growing demand for intelligent in-vehicle experiences, vehicle-based agents are evolving from simple assistants to long-term companions. This evolution requires agents to continuously model multi-user preferences and make reliable decisions in the face of inter-user preference conflicts and changing habits over time. However, existing benchmarks are largely limited to single-user, static question-answer settings, failing to capture the temporal evolution of preferences and the multi-user, tool-interactive nature of real vehicle environments. To address this gap, we introduce VehicleMemBench, a multi-user long-context memory benchmark built on an executable in-vehicle simulation environment. The benchmark evaluates tool use and memory by comparing the post-action environment state with a predefined target state, enabling objective and reproducible evaluation without LLM-based or human scoring. VehicleMemBench includes 23 tool modules, and each sample contains over 80 historical memory events. Experiments show that powerful models perform well on direct instruction tasks but struggle in scenarios involving memory evolution, particularly when user preferences change dynamically. Even advanced memory systems struggle to handle domain-specific memory requirements in this environment. These findings highlight the need for more robust and specialized memory management mechanisms to support long-term adaptive decision-making in real-world in-vehicle systems. To facilitate future research, we release the data and code.
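The state-comparison idea described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the nested-dict state layout, the tool names `set_temperature` and `set_seat_heating`, and the helpers `run_tool_calls` and `state_diff` are hypothetical and are not taken from the released VehicleMemBench code.

```python
import copy

# Hypothetical tools: each one mutates a plain-dict environment state.
# Names and signatures are illustrative, not the benchmark's actual API.
def set_temperature(state, zone, celsius):
    state["climate"][zone] = celsius

def set_seat_heating(state, seat, level):
    state["seat_heating"][seat] = level

TOOLS = {
    "set_temperature": set_temperature,
    "set_seat_heating": set_seat_heating,
}

def run_tool_calls(initial_state, calls):
    """Apply a list of (tool_name, kwargs) calls to a copy of the state."""
    state = copy.deepcopy(initial_state)
    for name, kwargs in calls:
        TOOLS[name](state, **kwargs)
    return state

def state_diff(actual, target, prefix=""):
    """Return the dotted paths at which `actual` differs from `target`."""
    mismatches = []
    for key in sorted(set(actual) | set(target)):
        path = f"{prefix}{key}"
        a, t = actual.get(key), target.get(key)
        if isinstance(a, dict) and isinstance(t, dict):
            mismatches += state_diff(a, t, path + ".")
        elif a != t:
            mismatches.append(path)
    return mismatches

# Assumed scenario: the remembered preference is 20 °C plus seat heating
# level 2 for the driver, but the agent only issues the temperature call.
initial = {
    "climate": {"driver": 22, "passenger": 22},
    "seat_heating": {"driver": 0, "passenger": 0},
}
target = {
    "climate": {"driver": 20, "passenger": 22},
    "seat_heating": {"driver": 2, "passenger": 0},
}
agent_calls = [("set_temperature", {"zone": "driver", "celsius": 20})]

final = run_tool_calls(initial, agent_calls)
mismatches = state_diff(final, target)
success = not mismatches  # exact state match counts as a pass
```

In this sketch a sample passes only when the final state matches the target exactly, so no LLM judge or human rater is involved; the list of mismatched paths shows which remembered preference the agent failed to apply.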
Problem

Research questions and friction points this paper is trying to address.

multi-user memory
long-term memory
in-vehicle agents
preference evolution
executable benchmark
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-user memory
executable benchmark
in-vehicle agents
long-term memory
tool-use evaluation
Yuhao Chen
University of Science and Technology of China
Large Language Model
Yi Xu
University of Science and Technology of China
Xinyun Ding
iFLYTEK Research
Xiang Fang
iFLYTEK Research
Shuochen Liu
University of Science and Technology of China
Large Language Model
Luxi Lin
Xiamen University
Qingyu Zhang
Institute of Software, Chinese Academy of Sciences
Ya Li
iFLYTEK Research
Quan Liu
University of Science and Technology of China
Tong Xu
Professor, University of Science and Technology of China
Data Mining