AnyCam: Learning to Recover Camera Poses and Intrinsics from Casual Videos

📅 2025-03-30
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Problem: Joint estimation of camera pose and intrinsic parameters in arbitrary dynamic videos remains challenging: classical SfM/SLAM methods lack robustness, while data-driven approaches (e.g., Dust3r) rely on annotated data, suffer from dynamic-object interference, and require iterative test-time optimization. Method: The first end-to-end self-supervised framework for this task: (i) an uncertainty-weighted loss enables label-free training; (ii) a lightweight trajectory refinement module mitigates pose drift; and (iii) pre-trained depth and optical-flow networks support large-scale training on unannotated videos (e.g., from YouTube). Built on a transformer architecture, the model runs in a single feed-forward pass and jointly outputs accurate camera poses, intrinsics, and 4D point clouds. Contribution/Results: The method significantly outperforms state-of-the-art approaches on standard benchmarks, achieves roughly an order-of-magnitude inference speedup, and generalizes well to cluttered, dynamic real-world scenes.

📝 Abstract
Estimating camera motion and intrinsics from casual videos is a core challenge in computer vision. Traditional bundle-adjustment based methods, such as SfM and SLAM, struggle to perform reliably on arbitrary data. Although specialized SfM approaches have been developed for handling dynamic scenes, they either require intrinsics or computationally expensive test-time optimization and often fall short in performance. Recently, methods like Dust3r have reformulated the SfM problem in a more data-driven way. While such techniques show promising results, they are still 1) not robust towards dynamic objects and 2) require labeled data for supervised training. As an alternative, we propose AnyCam, a fast transformer model that directly estimates camera poses and intrinsics from a dynamic video sequence in feed-forward fashion. Our intuition is that such a network can learn strong priors over realistic camera poses. To scale up our training, we rely on an uncertainty-based loss formulation and pre-trained depth and flow networks instead of motion or trajectory supervision. This allows us to use diverse, unlabelled video datasets obtained mostly from YouTube. Additionally, we ensure that the predicted trajectory does not accumulate drift over time through a lightweight trajectory refinement step. We test AnyCam on established datasets, where it delivers accurate camera poses and intrinsics both qualitatively and quantitatively. Furthermore, even with trajectory refinement, AnyCam is significantly faster than existing works for SfM in dynamic settings. Finally, by combining camera information, uncertainty, and depth, our model can produce high-quality 4D pointclouds.
Problem

Research questions and friction points this paper is trying to address.

Estimating camera poses and intrinsics from casual videos
Handling dynamic scenes without labeled training data
Avoiding drift in camera trajectory over time
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer model directly estimates camera poses and intrinsics in a feed-forward pass
Uncertainty-based loss plus pre-trained depth and flow networks enable training on unlabelled video
Lightweight trajectory refinement prevents drift
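The paper does not spell out the uncertainty-based loss here, but a common heteroscedastic formulation that matches the description weights each per-pixel residual by a predicted uncertainty and adds a log penalty so the network cannot inflate uncertainty everywhere. A minimal sketch, assuming this standard form (the function name and the L1 residual are illustrative, not the authors' exact loss):

```python
import numpy as np

def uncertainty_weighted_loss(residual, sigma, eps=1e-6):
    """Heteroscedastic uncertainty weighting (illustrative, not AnyCam's exact loss).

    Pixels with high predicted uncertainty `sigma` contribute less to the data
    term, while the log(sigma) penalty keeps sigma from growing unboundedly.
    residual: per-pixel flow/reprojection error, shape (H, W)
    sigma:    per-pixel predicted uncertainty, shape (H, W), positive
    """
    sigma = np.maximum(sigma, eps)  # guard against division by zero
    return float(np.mean(np.abs(residual) / sigma + np.log(sigma)))

# Toy example: a dynamic-object region violates the static-scene model and
# produces large residuals; predicting high uncertainty there down-weights it.
residual = np.zeros((4, 4))
residual[:2, :2] = 5.0                  # moving-object pixels

uniform = uncertainty_weighted_loss(residual, np.ones((4, 4)))

adaptive_sigma = np.ones((4, 4))
adaptive_sigma[:2, :2] = 5.0            # high uncertainty on the dynamic region
adaptive = uncertainty_weighted_loss(residual, adaptive_sigma)

assert adaptive < uniform               # dynamic pixels no longer dominate the loss
```

This kind of weighting is what lets training proceed on unlabelled video: moving objects that break the rigid-scene assumption are soft-masked by the uncertainty instead of corrupting the pose gradient.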