🤖 AI Summary
Existing stereo resources each provide only some of the variables that determine binocular geometry: few offer multi-baseline configurations, calibrated intrinsics, dense metric depth, and per-frame poses together for the same controlled scene. This limits how well generative models can be evaluated under varying baseline conditions. To address this gap, this work proposes StereoGenBench, a synthetic multi-camera benchmark built on Unreal Engine. Each scene is rendered with a rigid six-camera lateral array that yields up to 15 calibrated stereo pairs, with adjacent baselines ranging from inter-pupillary distance to wide configurations and focal length sampled independently. Every view is released with RGB images, metric depth, intrinsic parameters, per-pair baselines, and per-frame camera poses. To the authors' knowledge, resources commonly used for stereo generation evaluation do not offer this combination in a single source, so StereoGenBench makes baseline sensitivity and target-camera consistency measurable under matched scene content. Two evaluation families cover the narrow- and wide-baseline regimes, and a train-only family provides broader all-pairs coverage. The dataset is hosted on Hugging Face and is released together with evaluation code, reference results, Croissant metadata, and generation code/configuration to support systematic research in stereo generation and geometric estimation.
📝 Abstract
Stereo image and video generation, stereo geometry estimation, and condition-controlled view synthesis require paired data in which the variables that determine binocular geometry -- camera baseline, intrinsics, scene depth, and camera motion -- are known and controllable. Existing stereo resources provide subsets of these variables, but resources commonly used for stereo generation evaluation do not, to our knowledge, provide scene-paired, calibrated multi-baseline right-view ground truth with jointly recorded intrinsics, dense metric depth, and per-frame poses in a single controlled source. We introduce StereoGenBench, a synthetic Unreal Engine benchmark designed to make baseline-regime sensitivity and target-camera consistency measurable under matched scene content. Each scene is rendered with a rigid six-camera lateral array, yielding up to 15 calibrated view pairs; adjacent baselines are sampled from inter-pupillary to wide-baseline regimes; focal length is sampled independently; and every view is released with RGB, metric depth, intrinsics, per-pair baselines, and per-frame poses. The splits include two evaluation families for narrow and wide baseline regimes and a train-only family for broader all-pairs coverage. We release the dataset, evaluation code, reference results, Croissant metadata, and generation code/configuration for extension with compatible assets. The dataset is available at https://huggingface.co/datasets/stereo-dataset/stereo-dataset
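The pair count in the abstract follows from simple combinatorics: six cameras give C(6, 2) = 15 unordered view pairs, and for a collinear array the baseline of any pair is the sum of the adjacent baselines between its two cameras. The sketch below illustrates this. It is not taken from the released code: the function names, the adjacent-gap values, and the assumption of collinear cameras with parallel optical axes (so that disparity is focal length times baseline over depth) are all illustrative assumptions.

```python
from itertools import combinations


def pair_baselines(adjacent_gaps_m):
    """Map each camera pair (i, j), i < j, to its baseline in metres.

    Assumes a collinear lateral array, so a pair's baseline is the
    distance between the two camera positions along the array axis.
    """
    positions = [0.0]
    for gap in adjacent_gaps_m:
        positions.append(positions[-1] + gap)
    return {
        (i, j): positions[j] - positions[i]
        for i, j in combinations(range(len(positions)), 2)
    }


def disparity_px(focal_px, baseline_m, depth_m):
    """Horizontal disparity in pixels for a rectified, parallel-axis pair."""
    return focal_px * baseline_m / depth_m


# Hypothetical adjacent gaps (metres); the real sampling ranges are
# defined by the dataset's generation configuration, not here.
gaps = [0.065, 0.065, 0.13, 0.26, 0.52]
pairs = pair_baselines(gaps)
print(len(pairs))  # 15 pairs for six cameras
```

With these placeholder gaps, the same scene point at a fixed depth produces a different disparity for each of the 15 pairs, which is the kind of baseline-dependent behaviour the benchmark is designed to expose.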