🤖 AI Summary
Current agent evaluation methods predominantly focus on performance ceilings in static environments, failing to capture robustness in real-world dynamic scenarios—particularly in task scheduling, active exploration, and continual learning. To address this gap, this work proposes EvoEnv, the first dynamic, multi-dimensional evaluation framework tailored to realistic operational settings. EvoEnv simulates "trainee" agents continuously exploring and learning within streaming, uncertain environments, assessing capabilities along three dimensions: context-aware scheduling, active information acquisition, and policy generalization. The framework integrates multimodal large language models, dynamic task generation, active exploration mechanisms, and experience distillation to construct a scalable evaluation environment. Experiments demonstrate that state-of-the-art agents suffer significant performance degradation under dynamic conditions, underscoring EvoEnv's effectiveness and necessity for evaluating real-world deployment reliability.
📝 Abstract
The rapid evolution of Multi-modal Large Language Models (MLLMs) has advanced workflow automation; however, existing research mainly targets performance upper bounds in static environments, overlooking robustness for stochastic real-world deployment. We identify three key challenges: dynamic task scheduling, active exploration under uncertainty, and continuous learning from experience. To bridge this gap, we introduce \method{}, a dynamic evaluation environment that simulates a "trainee" agent continuously exploring a novel setting. Unlike traditional benchmarks, \method{} evaluates agents along three dimensions: (1) context-aware scheduling for streaming tasks with varying priorities; (2) prudent information acquisition to reduce hallucination via active exploration; and (3) continuous evolution by distilling generalized strategies from rule-based, dynamically generated tasks. Experiments show that cutting-edge agents exhibit significant deficiencies in dynamic environments, especially in active exploration and continual learning. Our work establishes a framework for assessing agent reliability, shifting evaluation from static tests to realistic, production-oriented scenarios. Our code is available at https://github.com/KnowledgeXLab/EvoEnv.