Common Benchmarks Undervalue the Generalization Power of Programmatic Policies

šŸ“… 2025-06-17
šŸ“ˆ Citations: 0
✨ Influential: 0
šŸ¤– AI Summary
Commonly used out-of-distribution (OOD) generalization benchmarks for sequential decision-making undervalue programmatic policies, largely because they lack tasks that demand algorithmic constructs such as stacks, while the training pipelines used in prior comparisons implicitly disadvantage neural policies. Method: The authors re-examine the experiments of four papers from the literature and show that simple changes to the neural training pipeline, namely smaller architectures that consume the same sparse observations given to programmatic policies and reward functions that encourage safer behavior, are enough for neural policies to generalize out of distribution. Contribution/Results: With these changes, neural policies match the OOD generalization of programmatic policies on the re-examined benchmarks, challenging the common conclusion that programmatic policies generalize better. The paper argues for evaluations focused on generalization mechanisms rather than model type, and for new benchmarks built around concepts, such as explicit data structures, where programmatic representations have a genuine advantage.
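One pipeline change the summary mentions is replacing large networks over dense inputs with small networks over the same sparse, symbolic observations given to programmatic policies. The sketch below is a minimal illustration of that idea only; the feature count, action set, and random weights are assumptions made for the example, not details taken from the paper.

```python
# Minimal sketch (assumed setup): a deliberately small policy network over a
# sparse, symbolic observation, rather than a large network over raw inputs.
import numpy as np

rng = np.random.default_rng(0)

N_FEATURES = 8   # e.g., distances to obstacles and goal-direction flags (assumed)
N_ACTIONS = 4    # e.g., up / down / left / right (assumed)
HIDDEN = 16      # intentionally tiny hidden layer

# Randomly initialised weights stand in for whatever the training pipeline
# (e.g., a policy-gradient method) would actually learn.
W1 = rng.normal(scale=0.1, size=(N_FEATURES, HIDDEN))
b1 = np.zeros(HIDDEN)
W2 = rng.normal(scale=0.1, size=(HIDDEN, N_ACTIONS))
b2 = np.zeros(N_ACTIONS)

def act(sparse_obs: np.ndarray) -> int:
    """Map a sparse symbolic observation to a discrete action."""
    h = np.tanh(sparse_obs @ W1 + b1)
    logits = h @ W2 + b2
    return int(np.argmax(logits))

# Usage: a single decision on a made-up observation vector.
obs = rng.normal(size=N_FEATURES)
print(act(obs))
```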

šŸ“ Abstract
Algorithms for learning programmatic representations for sequential decision-making problems are often evaluated on out-of-distribution (OOD) problems, with the common conclusion that programmatic policies generalize better than neural policies on OOD problems. In this position paper, we argue that commonly used benchmarks undervalue the generalization capabilities of programmatic representations. We analyze the experiments of four papers from the literature and show that neural policies, which were shown not to generalize, can generalize as effectively as programmatic policies on OOD problems. This is achieved with simple changes in the neural policies' training pipeline. Namely, we show that simpler neural architectures with the same type of sparse observation used with programmatic policies can help attain OOD generalization. Another modification we have shown to be effective is the use of reward functions that allow for safer policies (e.g., agents that drive slowly can generalize better). Also, we argue for creating benchmark problems highlighting concepts needed for OOD generalization that may challenge neural policies but align with programmatic representations, such as tasks requiring algorithmic constructs like stacks.
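To make the abstract's "safer reward" point concrete, here is a minimal sketch of a shaped reward for a hypothetical driving-style task in which slower, cautious behavior keeps more reward; the interface, coefficients, and speed threshold are illustrative assumptions, not the paper's actual reward functions.

```python
# Hypothetical sketch of a safety-oriented shaped reward: progress is rewarded,
# but fast driving and crashes are penalised, nudging learning toward cautious
# policies that tend to transfer better to unseen problems.
def shaped_reward(progress: float, speed: float, crashed: bool,
                  speed_limit: float = 0.5,
                  speed_penalty: float = 0.2,
                  crash_penalty: float = 10.0) -> float:
    """Reward forward progress while discouraging speeding and crashing."""
    reward = progress
    if speed > speed_limit:
        reward -= speed_penalty * (speed - speed_limit)
    if crashed:
        reward -= crash_penalty
    return reward

# Usage: identical progress, different speeds.
print(shaped_reward(progress=1.0, speed=0.9, crashed=False))  # 0.92: fast driving is penalised
print(shaped_reward(progress=1.0, speed=0.3, crashed=False))  # 1.0: slow driving keeps full reward
```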
Problem

Research questions and friction points this paper is trying to address.

Common benchmarks underestimate programmatic policy generalization
Neural policies can match programmatic OOD generalization with adjustments
New benchmarks should highlight algorithmic concepts that challenge neural policies but align with programmatic representations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Simpler neural architectures with sparse observations enhance OOD generalization
Reward functions that encourage safer behavior yield more generalizable policies
Benchmarks requiring algorithmic constructs such as stacks favor programmatic representations (see the sketch after this list)
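As a concrete illustration of the kind of benchmark the paper advocates, the sketch below shows a toy task in which the agent must retrace an arbitrarily long path, something a programmatic policy with an explicit stack handles for any length. The task itself is a made-up example for illustration, not one of the paper's proposed benchmarks.

```python
# Toy task (assumed): after walking an arbitrarily long outbound path, the agent
# must return to its start. A programmatic policy with an explicit stack
# generalizes to any path length, including lengths never seen in training.
OPPOSITE = {"up": "down", "down": "up", "left": "right", "right": "left"}

def retrace(outbound_moves: list[str]) -> list[str]:
    """Push each outbound move onto a stack, then pop to emit the return path."""
    stack: list[str] = []
    for move in outbound_moves:
        stack.append(move)
    return_path: list[str] = []
    while stack:
        return_path.append(OPPOSITE[stack.pop()])
    return return_path

# Works for paths far longer than anything seen "in distribution".
print(retrace(["up", "up", "right"]))   # ['left', 'down', 'down']
print(retrace(["right"] * 1000)[:3])    # ['left', 'left', 'left']
```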
Authors
Amirhossein Rajabpour
Amii, Department of Computing Science, University of Alberta
Kiarash Aghakasiri
Amii, Department of Computing Science, University of Alberta
Sandra Zilles
Professor, Computer Science, University of Regina
Computational Learning Theory, Artificial Intelligence
Levi H. S. Lelis
Amii, Department of Computing Science, University of Alberta