VIAFormer: Voxel-Image Alignment Transformer for High-Fidelity Voxel Refinement

📅 2026-01-20
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the challenge of repairing severely corrupted or incomplete voxel models under the guidance of multi-view images. It proposes VIAFormer, a framework that fuses calibrated multi-view images with voxel data to achieve high-fidelity geometric reconstruction. The method introduces an image indexing mechanism with explicit 3D spatial localization to enable precise 2D-3D alignment, formulates a rectified flow optimization objective to learn direct repair trajectories, and employs a Hybrid Stream Transformer for cross-modal feature fusion. Experiments show that VIAFormer achieves state-of-the-art performance on both synthetically degraded voxels and voxels generated by real-world vision foundation models, improving geometric completeness and detail fidelity. Its integration into practical 3D content creation pipelines further demonstrates its real-world applicability.
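As a rough illustration of what a rectified-flow-style objective over "direct repair trajectories" can look like, the sketch below uses the generic textbook formulation: a straight line from a corrupted sample to its clean target, with a model trained to predict the constant velocity along that line. This is an assumption for illustration, not the paper's actual objective. The function names, the flat-list voxel representation, and the `oracle_velocity` stand-in for a trained network are all hypothetical.

```python
def interpolate(x0, x1, t):
    """Point on the straight line from x0 (corrupted) to x1 (clean) at time t in [0, 1]."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Constant velocity of the straight path, x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def flow_loss(predict_velocity, x0, x1, t):
    """Mean squared error between the predicted and the straight-path velocity at time t."""
    xt = interpolate(x0, x1, t)
    v_pred = predict_velocity(xt, t)
    v_true = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(v_true)


def euler_refine(predict_velocity, x0, steps=4):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed-size Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = predict_velocity(x, i * dt)
        x = [a + dt * b for a, b in zip(x, v)]
    return x


# Toy data: occupancy values for four voxels (hypothetical, flattened grid).
corrupted = [0.0, 1.0, 1.0, 0.0]
clean = [1.0, 1.0, 0.0, 0.0]


def oracle_velocity(x, t):
    """Hypothetical stand-in for a trained network: returns the exact velocity."""
    return target_velocity(corrupted, clean)


loss = flow_loss(oracle_velocity, corrupted, clean, 0.5)
refined = euler_refine(oracle_velocity, corrupted)
```

With a perfect velocity predictor the loss is zero and a few Euler steps carry the corrupted input straight to the clean target, which is the intuition behind learning a direct trajectory instead of denoising from pure noise.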

📝 Abstract
We propose VIAFormer, a Voxel-Image Alignment Transformer model designed for Multi-view Conditioned Voxel Refinement: the task of repairing incomplete, noisy voxels using calibrated multi-view images as guidance. Its effectiveness stems from a synergistic design: an Image Index that provides explicit 3D spatial grounding for 2D image tokens, a Correctional Flow objective that learns a direct voxel-refinement trajectory, and a Hybrid Stream Transformer that enables robust cross-modal fusion. Experiments show that VIAFormer establishes a new state of the art in correcting both severe synthetic corruptions and realistic artifacts in voxel shapes obtained from powerful Vision Foundation Models. Beyond benchmarking, we demonstrate VIAFormer as a practical and reliable bridge in real-world 3D creation pipelines, paving the way for voxel-based methods to thrive in the large-model, big-data wave.
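One way to picture "explicit 3D spatial grounding for 2D image tokens" with calibrated views is to project each voxel center into an image and record which patch token it lands on. The sketch below does this with a standard pinhole camera model. It is an illustrative assumption and not the paper's Image Index: the grid extent, the intrinsics `K`, the extrinsics `R` and `t`, the 224-pixel image, and the 16-pixel patch size are hypothetical values chosen for the example.

```python
def voxel_center(index, grid_size, lo=-1.0, hi=1.0):
    """World-space center of voxel `index` in a cubic grid spanning [lo, hi]^3."""
    step = (hi - lo) / grid_size
    return [lo + (i + 0.5) * step for i in index]


def project(point, K, R, t):
    """Pinhole projection of a world point; returns (u, v, depth) or None if behind the camera."""
    # World -> camera coordinates: cam = R @ point + t
    cam = [sum(R[r][c] * point[c] for c in range(3)) + t[r] for r in range(3)]
    depth = cam[2]
    if depth <= 0:
        return None
    # Camera -> pixel coordinates (zero skew assumed)
    u = K[0][0] * cam[0] / depth + K[0][2]
    v = K[1][1] * cam[1] / depth + K[1][2]
    return u, v, depth


def patch_index(u, v, image_size, patch_size):
    """Row-major index of the square patch containing pixel (u, v), or None if off-image."""
    if not (0 <= u < image_size and 0 <= v < image_size):
        return None
    per_row = image_size // patch_size
    return int(v // patch_size) * per_row + int(u // patch_size)


# Hypothetical calibration: identity rotation, camera 3 units away along z.
K = [[224.0, 0.0, 112.0],
     [0.0, 224.0, 112.0],
     [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
t = [0.0, 0.0, 3.0]

center = voxel_center((2, 2, 0), grid_size=4)
u, v, depth = project(center, K, R, t)
token = patch_index(u, v, image_size=224, patch_size=16)
```

Pairing each voxel with the patch tokens it projects to, across all calibrated views, gives the image features a 3D address, which is the kind of correspondence a cross-modal transformer can then attend over.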
Problem

Research questions and friction points this paper is trying to address.

Voxel Refinement
Multi-view Images
3D Reconstruction
Noise Correction
Cross-modal Fusion
Innovation

Methods, ideas, or system contributions that make the work stand out.

Voxel-Image Alignment
Multi-view Conditioned Refinement
Correctional Flow
Hybrid Stream Transformer
3D Spatial Grounding