ProActor: Timing-Aware Reinforcement Learning for Proactive Task Scheduling Agents

📅 2026-05-24

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the lack of a generalizable end-to-end framework for measuring and optimizing the proactive behavior of task-oriented agents, specifically their ability to trigger software actions at appropriate moments. The authors propose ProActor, a unified framework for conversational task scheduling that combines RULER-based rewards scored against proactiveness rubrics with stage-aware composite rewards, so that timing quality and reference action alignment can be optimized together. They also introduce a domain-agnostic automated annotation pipeline that generates full opportunity time windows rather than rigid point labels. Training uses GRPO reinforcement learning with LoRA fine-tuning of a 4-bit quantized Qwen2.5-14B model. It runs on ART-F, an adaptive framework that pairs request-adaptive inference clusters with DDP-based training on single-node multi-GPU systems and yields 4-8x speedups. On two newly auto-annotated datasets, the approach significantly improves proactive timing while keeping action consistency comparable to state-of-the-art baselines.
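As a rough illustration of the GRPO step mentioned in the summary, the sketch below computes group-relative advantages: each sampled response is scored against the other responses drawn for the same prompt, so no separate value model is needed. The function name, the `eps` constant, and the use of the population standard deviation are assumptions made for this example and are not taken from the paper.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of rollouts sampled for the same prompt.

    Illustrative sketch only: each advantage is (reward - group mean) divided by
    (group standard deviation + eps). Using the population standard deviation
    is an assumption of this example.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every rollout gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four rollouts of one prompt.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Rollouts that score above the group mean receive positive advantages and the rest receive negative ones. If all rollouts in a group get the same reward, every advantage is zero and that group contributes no learning signal.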

📝 Abstract

Proactive task-oriented agents must autonomously anticipate user needs, identify actionable opportunities, and trigger software actions at appropriate moments, a fundamental shift from reactive systems that await explicit instructions. However, existing approaches lack generalizable end-to-end solutions for measuring and optimizing such anticipatory behaviors. This paper introduces ProActor, a unified framework for conversational task scheduling that integrates: (1) a domain-agnostic automated annotation methodology that enables scalable proactiveness reinforcement learning (RL) by generating full opportunity time windows instead of rigid point labels, (2) systematic proactiveness metrics capturing both timing quality and reference action alignment, and (3) RL optimization using GRPO with various reward designs. Our insight is that RULER-based rewards with proactiveness rubrics are crucial for improving timing quality, and that proactiveness optimization enabled by stage-aware composite rewards is key to balancing timing quality and reference action alignment. Timing-aware RL requires extensive exploration, which demands efficient infrastructure. We develop ART-F, an adaptive framework combining request-adaptive inference clusters with DDP-based training on single-node multi-GPU systems, enabling LoRA training of 4-bit Qwen2.5-14B-ProActor-Q4 with 4-8x speedups. Experiments on two newly auto-annotated datasets demonstrate significant improvements in proactive timing while maintaining action consistency comparable to state-of-the-art (SOTA) baselines. Ablations validate the effectiveness of distinct composite reward variations.
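To make the contrast between opportunity time windows and point labels concrete, here is a minimal sketch of a window-based timing score and a weighted composite reward. Everything in it is hypothetical: the `OpportunityWindow` type, the geometric decay outside the window, and the fixed weight `w_timing` are illustrative choices, not the paper's metrics or its stage-aware reward design.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OpportunityWindow:
    """Hypothetical annotation: dialogue turns at which acting is appropriate."""
    start: int  # first acceptable turn (inclusive)
    end: int    # last acceptable turn (inclusive)


def timing_score(trigger_turn: Optional[int], window: OpportunityWindow,
                 decay: float = 0.5) -> float:
    """Return a score in [0, 1] for when the agent triggered its action.

    Any turn inside the window earns full credit, which a single point label
    cannot express. Outside the window the score decays geometrically with
    the distance to the nearest window edge (an assumed shape).
    """
    if trigger_turn is None:  # the agent never acted
        return 0.0
    if window.start <= trigger_turn <= window.end:
        return 1.0
    if trigger_turn < window.start:
        distance = window.start - trigger_turn  # acted too early
    else:
        distance = trigger_turn - window.end    # acted too late
    return decay ** distance


def composite_reward(timing: float, action_match: float,
                     w_timing: float = 0.5) -> float:
    """Blend timing quality with reference action alignment (assumed weighting)."""
    return w_timing * timing + (1.0 - w_timing) * action_match


window = OpportunityWindow(start=2, end=4)
on_time = timing_score(3, window)   # inside the window
too_late = timing_score(6, window)  # two turns past the window end
reward = composite_reward(on_time, action_match=0.0)
```

Under a point label at turn 3, a trigger at turn 2 or 4 would count as a miss, while the window treats all three turns as equally good. A single fixed weight is the simplest way to trade timing against action alignment. The abstract describes stage-aware composite rewards, and this sketch does not attempt to reproduce them.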

Problem

Research questions and friction points this paper is trying to address.

proactive task scheduling

timing-aware reinforcement learning

anticipatory behavior

opportunity time windows

proactiveness metrics

Innovation

Methods, ideas, or system contributions that make the work stand out.

Proactive Task Scheduling

Timing-Aware Reinforcement Learning

Automated Annotation