Common Benchmarks Undervalue the Generalization Power of Programmatic Policies

šŸ“… 2025-06-17
šŸ“ˆ Citations: 0
✨ Influential: 0
šŸ¤– AI Summary
Commonly used out-of-distribution (OOD) generalization benchmarks for sequential decision-making undervalue programmatic policies, largely because they lack tasks that demand algorithmic constructs such as stacks, while the training pipelines used in prior comparisons implicitly disadvantage neural policies. Method: The authors re-examine the experiments of four papers from the literature and show that simple changes to the neural training pipeline, namely smaller architectures that consume the same sparse observations given to programmatic policies and reward functions that encourage safer behavior, are enough for neural policies to generalize out of distribution. Contribution/Results: With these changes, neural policies match the OOD generalization of programmatic policies on the re-examined benchmarks, challenging the common conclusion that programmatic policies generalize better. The paper argues for evaluations focused on generalization mechanisms rather than model type, and for new benchmarks built around concepts, such as explicit data structures, where programmatic representations have a genuine advantage.
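One pipeline change the summary mentions is replacing large networks over dense inputs with small networks over the same sparse, symbolic observations given to programmatic policies. The sketch below is a minimal illustration of that idea only; the feature count, action set, and random weights are assumptions made for the example, not details taken from the paper.

```python
# Minimal sketch (assumed setup): a deliberately small policy network over a
# sparse, symbolic observation, rather than a large network over raw inputs.
import numpy as np

rng = np.random.default_rng(0)

N_FEATURES = 8   # e.g., distances to obstacles and goal-direction flags (assumed)
N_ACTIONS = 4    # e.g., up / down / left / right (assumed)
HIDDEN = 16      # intentionally tiny hidden layer

# Randomly initialised weights stand in for whatever the training pipeline
# (e.g., a policy-gradient method) would actually learn.
W1 = rng.normal(scale=0.1, size=(N_FEATURES, HIDDEN))
b1 = np.zeros(HIDDEN)
W2 = rng.normal(scale=0.1, size=(HIDDEN, N_ACTIONS))
b2 = np.zeros(N_ACTIONS)

def act(sparse_obs: np.ndarray) -> int:
    """Map a sparse symbolic observation to a discrete action."""
    h = np.tanh(sparse_obs @ W1 + b1)
    logits = h @ W2 + b2
    return int(np.argmax(logits))

# Usage: a single decision on a made-up observation vector.
obs = rng.normal(size=N_FEATURES)
print(act(obs))
```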

šŸ“ Abstract
Algorithms for learning programmatic representations for sequential decision-making problems are often evaluated on out-of-distribution (OOD) problems, with the common conclusion that programmatic policies generalize better than neural policies on OOD problems. In this position paper, we argue that commonly used benchmarks undervalue the generalization capabilities of programmatic representations. We analyze the experiments of four papers from the literature and show that neural policies, which were shown not to generalize, can generalize as effectively as programmatic policies on OOD problems. This is achieved with simple changes in the neural policies' training pipeline. Namely, we show that simpler neural architectures with the same type of sparse observation used with programmatic policies can help attain OOD generalization. Another modification we have shown to be effective is the use of reward functions that allow for safer policies (e.g., agents that drive slowly can generalize better). Also, we argue for creating benchmark problems highlighting concepts needed for OOD generalization that may challenge neural policies but align with programmatic representations, such as tasks requiring algorithmic constructs like stacks.
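To make the abstract's "safer reward" point concrete, here is a minimal sketch of a shaped reward for a hypothetical driving-style task in which slower, cautious behavior keeps more reward; the interface, coefficients, and speed threshold are illustrative assumptions, not the paper's actual reward functions.

```python
# Hypothetical sketch of a safety-oriented shaped reward: progress is rewarded,
# but fast driving and crashes are penalised, nudging learning toward cautious
# policies that tend to transfer better to unseen problems.
def shaped_reward(progress: float, speed: float, crashed: bool,
                  speed_limit: float = 0.5,
                  speed_penalty: float = 0.2,
                  crash_penalty: float = 10.0) -> float:
    """Reward forward progress while discouraging speeding and crashing."""
    reward = progress
    if speed > speed_limit:
        reward -= speed_penalty * (speed - speed_limit)
    if crashed:
        reward -= crash_penalty
    return reward

# Usage: identical progress, different speeds.
print(shaped_reward(progress=1.0, speed=0.9, crashed=False))  # 0.92: fast driving is penalised
print(shaped_reward(progress=1.0, speed=0.3, crashed=False))  # 1.0: slow driving keeps full reward
```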
Problem

Research questions and friction points this paper is trying to address.

Common benchmarks underestimate programmatic policy generalization
Neural policies can match programmatic OOD generalization with adjustments
New benchmarks should highlight algorithmic concepts that challenge neural policies but align with programmatic representations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Simpler neural architectures with sparse observations enhance OOD generalization
Reward functions that encourage safer behavior yield more generalizable policies
Benchmarks requiring algorithmic constructs such as stacks favor programmatic representations (see the sketch after this list)
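As a concrete illustration of the kind of benchmark the paper advocates, the sketch below shows a toy task in which the agent must retrace an arbitrarily long path, something a programmatic policy with an explicit stack handles for any length. The task itself is a made-up example for illustration, not one of the paper's proposed benchmarks.

```python
# Toy task (assumed): after walking an arbitrarily long outbound path, the agent
# must return to its start. A programmatic policy with an explicit stack
# generalizes to any path length, including lengths never seen in training.
OPPOSITE = {"up": "down", "down": "up", "left": "right", "right": "left"}

def retrace(outbound_moves: list[str]) -> list[str]:
    """Push each outbound move onto a stack, then pop to emit the return path."""
    stack: list[str] = []
    for move in outbound_moves:
        stack.append(move)
    return_path: list[str] = []
    while stack:
        return_path.append(OPPOSITE[stack.pop()])
    return return_path

# Works for paths far longer than anything seen "in distribution".
print(retrace(["up", "up", "right"]))   # ['left', 'down', 'down']
print(retrace(["right"] * 1000)[:3])    # ['left', 'left', 'left']
```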
Authors
Amirhossein Rajabpour
Amii, Department of Computing Science, University of Alberta
Kiarash Aghakasiri
Amii, Department of Computing Science, University of Alberta
Sandra Zilles
Professor, Computer Science, University of Regina
Computational Learning Theory, Artificial Intelligence
Levi H. S. Lelis
Amii, Department of Computing Science, University of Alberta