Towards bandit-based prompt-tuning for in-the-wild foundation agents

📅 2025-02-10
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
In offline reinforcement learning, Prompting Decision Transformers (PDTs) suffer from weak task discrimination due to uniform trajectory prompt sampling during pretraining. Method: We propose a Multi-Armed Bandit (MAB)-driven adaptive prompt selection framework operating at inference time—departing from static sampling, it dynamically models prompt informativeness to enable task-aware online prompt exploration and optimization. Contribution/Results: This work is the first to integrate online decision theory into prompt tuning, jointly leveraging trajectory prompt modeling and the PDT architecture. On multi-task benchmarks, it achieves a +12.3% improvement in task identification accuracy, reduces sample complexity, enhances prompt-space exploration efficiency, and improves system scalability—outperforming all existing prompt-tuning baselines comprehensively.

Technology Category

Application Category

📝 Abstract
Prompting has emerged as the dominant paradigm for adapting large, pre-trained transformer-based models to downstream tasks. The Prompting Decision Transformer (PDT) enables large-scale, multi-task offline reinforcement learning pre-training by leveraging stochastic trajectory prompts to identify the target task. However, these prompts are sampled uniformly from expert demonstrations, overlooking a critical limitation: Not all prompts are equally informative for differentiating between tasks. To address this, we propose an inference time bandit-based prompt-tuning framework that explores and optimizes trajectory prompt selection to enhance task performance. Our experiments indicate not only clear performance gains due to bandit-based prompt-tuning, but also better sample complexity, scalability, and prompt space exploration compared to prompt-tuning baselines.
Problem

Research questions and friction points this paper is trying to address.

Optimizing trajectory prompt selection
Enhancing task performance with bandit-based tuning
Improving sample complexity and scalability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bandit-based prompt-tuning framework
Optimizes trajectory prompt selection
Enhances task performance significantly
🔎 Similar Papers
No similar papers found.