Coca-Splat: Collaborative Optimization for Camera Parameters and 3D Gaussians

πŸ“… 2025-04-01
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This paper addresses the challenging problem of scene reconstruction and novel view synthesis (NVS) from sparse, uncalibrated viewsβ€”where neither camera poses nor intrinsic parameters are known a priori. We propose an end-to-end unified framework that jointly optimizes 3D Gaussian splatting representations and full camera parameters. Our key contributions are: (1) treating camera parameters and 3D Gaussians as parallel, differentiable learnable queries; (2) introducing Camera-aware Multi-view Deformable Cross-Attention (CaMDFA) to strengthen geometry-rendering coupling; and (3) enforcing pose stability via Ray Reference Point (RayRef)-guided RQ decomposition constraints on camera extrinsics. Crucially, our method requires no initial pose estimates, external Structure-from-Motion (SfM), or pose priors. On RealEstate10K and ACID benchmarks, it significantly outperforms both pose-initialized and pose-free state-of-the-art methods, achieving superior 3D reconstruction accuracy and photorealistic novel view rendering quality.
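The RayRef idea summarized above starts from rays cast from each camera center through the 2D reference points. As a rough illustration of that first step only, the sketch below back-projects pixel coordinates into world-space rays under a standard pinhole model; the function name is hypothetical, and the paper's RQ-decomposition constraint on the resulting overdetermined system is not reproduced here.

```python
import numpy as np

def reference_rays(ref_points_2d, K, R, t):
    """Build rays from the camera center through 2D reference points
    (hypothetical helper illustrating the RayRef construction).

    ref_points_2d: (N, 2) pixel coordinates
    K: (3, 3) intrinsics; R: (3, 3) world->camera rotation; t: (3,) translation
    Returns the camera center (3,) and unit ray directions (N, 3), both in
    world coordinates.
    """
    center = -R.T @ t  # camera center in world coordinates
    # Lift pixels to homogeneous coordinates and back-project through K.
    homo = np.hstack([ref_points_2d, np.ones((len(ref_points_2d), 1))])
    dirs = (np.linalg.inv(K) @ homo.T).T @ R  # rotate camera-frame dirs to world
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    return center, dirs

# Usage: with an identity pose, a pixel at the principal point maps to the
# optical axis.
K = np.array([[100.0, 0.0, 64.0],
              [0.0, 100.0, 64.0],
              [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.zeros(3)
center, dirs = reference_rays(np.array([[64.0, 64.0]]), K, R, t)
```

Stacking such rays across views yields the overdetermined system of equations that, per the summary, the method constrains via RQ decomposition to stabilize the estimated extrinsics.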

πŸ“ Abstract
In this work, we introduce Coca-Splat, a novel approach to sparse-view, pose-free scene reconstruction and novel view synthesis (NVS) that jointly optimizes camera parameters with 3D Gaussians. Inspired by the deformable DEtection TRansformer (DETR), we design separate queries for 3D Gaussians and camera parameters and update them layer by layer through deformable Transformer layers, enabling joint optimization within a single network. This design performs better because accurately rendering views that closely approximate ground-truth images relies on precise estimates of both the 3D Gaussians and the camera parameters. In this design, the centers of the 3D Gaussians are projected onto each view by the camera parameters to obtain projected points, which serve as 2D reference points in deformable cross-attention. Through camera-aware multi-view deformable cross-attention (CaMDFA), 3D Gaussians and camera parameters are intrinsically connected by sharing these 2D reference points. Additionally, rays determined by the 2D reference points (RayRef), defined from the camera centers to the reference points, help model the relationship between 3D Gaussians and camera parameters through an RQ-decomposition of the overdetermined system of equations derived from the rays. Extensive evaluation shows that our approach outperforms previous methods, both pose-required and pose-free, on RealEstate10K and ACID under the same pose-free setting.
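The abstract's coupling step projects 3D Gaussian centers into each view to obtain the 2D reference points shared by the deformable cross-attention. A minimal sketch of that projection under a standard pinhole model is shown below; the function name and shapes are assumptions for illustration, and the learned queries and attention layers themselves are not shown.

```python
import numpy as np

def project_gaussian_centers(centers, K, R, t):
    """Project 3D Gaussian centers into one view to obtain 2D reference
    points (hypothetical helper sketching the projection step).

    centers: (N, 3) Gaussian centers in world coordinates
    K: (3, 3) intrinsics; R: (3, 3) world->camera rotation; t: (3,) translation
    Returns (N, 2) pixel coordinates.
    """
    cam = centers @ R.T + t            # world -> camera coordinates
    proj = cam @ K.T                   # apply intrinsics
    return proj[:, :2] / proj[:, 2:3]  # perspective divide

# Usage: two centers projected with an identity pose; a point on the optical
# axis lands at the principal point (64, 64).
K = np.array([[100.0, 0.0, 64.0],
              [0.0, 100.0, 64.0],
              [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.zeros(3)
centers = np.array([[0.0, 0.0, 2.0], [0.5, -0.5, 4.0]])
ref_points = project_gaussian_centers(centers, K, R, t)
# ref_points[0] is (64.0, 64.0)
```

Because both the Gaussian queries and the camera queries feed into these shared reference points, gradients from the rendering loss flow into both sets of parameters, which is the coupling the abstract describes.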
Problem

Research questions and friction points this paper is trying to address.

Jointly optimizes camera parameters and 3D Gaussians for reconstruction
Enables sparse view pose-free scene reconstruction and synthesis
Improves accuracy via camera-aware multi-view deformable cross-attention
Innovation

Methods, ideas, or system contributions that make the work stand out.

Joint optimization of camera parameters and 3D Gaussians
Deformable Transformer layers for layer-by-layer updates
Camera-aware multi-view deformable cross-attention (CaMDFA)
Jiamin Wu
Hong Kong University of Science and Technology, International Digital Economy Academy (IDEA)
Hongyang Li
International Digital Economy Academy (IDEA)
Xiaoke Jiang
Research@IDEA
Computer Vision, Industrial Vision, Computer Networking
Yuan Yao
Hong Kong University of Science and Technology
Lei Zhang
International Digital Economy Academy (IDEA)