🤖 AI Summary
Existing vision-language models exhibit limited performance on multi-step, complex reasoning tasks, primarily because conventional reward mechanisms provide only a coarse-grained, binary global score and lack fine-grained, verifiable feedback on subproblem correctness.
Method: We propose StructVRM, a structured and verifiable reward model that combines semantic parsing with verification of mathematical and logical equivalence to assign formally verifiable partial scores at the subproblem level, overcoming the limitations of rigid string matching and global scoring. Its model-based, end-to-end reward modeling framework enables fine-grained optimization of reasoning paths during reinforcement learning.
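To make the idea concrete, here is a minimal toy sketch of structured, equivalence-based partial-credit scoring. The function names (`parse_value`, `equivalent`, `structured_reward`) are hypothetical illustrations, and exact-`Fraction` parsing stands in for the paper's model-based semantic parser and verifier; the real StructVRM verifier is a trained model, not a rule-based parser.

```python
from fractions import Fraction

def parse_value(ans: str):
    """Normalize a free-form numeric answer (toy stand-in for
    StructVRM's semantic parsing, which is model-based)."""
    try:
        return Fraction(ans.strip().replace(" ", ""))
    except (ValueError, ZeroDivisionError):
        return None

def equivalent(pred: str, gold: str) -> bool:
    """Mathematical equivalence rather than string matching:
    '0.5' and '1/2' count as the same answer."""
    p, g = parse_value(pred), parse_value(gold)
    return p is not None and p == g

def structured_reward(preds, golds):
    """Sub-question-level reward: each sub-answer is verified
    independently, and the total is the fraction correct."""
    per_sub = [1.0 if equivalent(p, g) else 0.0
               for p, g in zip(preds, golds)]
    return sum(per_sub) / len(per_sub), per_sub

reward, per_sub = structured_reward(["1/2", "3.0", "7"],
                                    ["0.5", "3", "8"])
# reward = 2/3: two of three sub-questions verified as equivalent,
# whereas a binary global score would assign 0 to the whole response.
```

The per-sub-question scores are what give the reinforcement-learning signal its granularity: a response that solves two of three parts is rewarded more than one that solves none, which a single global binary score cannot express.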
Contribution/Results: Evaluated on 12 public multimodal benchmarks, Seed-StructVRM achieves state-of-the-art performance on 6 of them and significantly outperforms prior methods on our newly constructed, high-difficulty STEM-Bench. To our knowledge, this is the first work to realize structured, verifiable reward alignment for multimodal reasoning processes.
📝 Abstract
Existing vision-language models often struggle with complex, multi-question reasoning tasks in which partial correctness is crucial for effective learning. Traditional reward mechanisms, which provide a single binary score for an entire response, are too coarse to guide models through intricate problems with multiple sub-parts. To address this, we introduce StructVRM, a method that aligns multimodal reasoning with Structured and Verifiable Reward Models. At its core is a model-based verifier trained to provide fine-grained, sub-question-level feedback, assessing semantic and mathematical equivalence rather than relying on rigid string matching. This enables nuanced, partial-credit scoring in previously intractable problem formats. Extensive experiments demonstrate the effectiveness of StructVRM. Our trained model, Seed-StructVRM, achieves state-of-the-art performance on six out of twelve public multimodal benchmarks and on our newly curated, high-difficulty STEM-Bench. The success of StructVRM validates that training with structured, verifiable rewards is a highly effective approach for advancing the capabilities of multimodal models in complex, real-world reasoning domains.