🤖 AI Summary
This work proposes a data-efficient digital twin approach for millimeter-wave channel modeling that circumvents the high deployment costs of conventional methods relying on extensive measurements or hand-tuned material models. By leveraging a frozen vision-language model, the method extracts semantic embeddings from ordinary multi-view images and translates them into priors for electromagnetic material parameters. These priors are integrated with differentiable ray tracing—implemented via Sionna—and calibrated using only sparse channel measurements through gradient-based optimization. The framework enables cross-scenario transferability and, in three real-world environments, achieves accurate channel characterization with merely tens of probe measurements—reducing measurement requirements by an order of magnitude compared to purely data-driven baselines and decreasing median delay spread error by 59%.
📝 Abstract
Accurately modeling millimeter-wave (mmWave) propagation is essential for real-time AR and autonomous systems. Differentiable ray tracing offers a physics-grounded solution but still faces deployment challenges due to its over-reliance on exhaustive channel measurements or brittle, hand-tuned scene models for material properties. We present VisRFTwin, a scalable and data-efficient digital-twin framework that integrates vision-derived material priors with differentiable ray tracing. Multi-view images from commodity cameras are processed by a frozen Vision-Language Model to extract dense semantic embeddings, which are translated into initial estimates of permittivity and conductivity for scene surfaces. These priors initialize a Sionna-based differentiable ray tracer, which rapidly calibrates material parameters via gradient descent using only a few dozen sparse channel soundings. Once calibrated, the association between vision features and material parameters is retained, enabling fast transfer to new scenarios without repeated calibration. Evaluations across three real-world scenarios (office interiors, urban canyons, and dynamic public spaces) show that VisRFTwin reduces channel measurement needs by up to 10$\times$ while achieving a 59% lower median delay spread error than purely data-driven deep learning methods.
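The calibration loop described above can be illustrated with a minimal sketch. This is not the paper's implementation (which uses Sionna's full differentiable ray tracer over a 3D scene); instead, a toy single-reflection path-gain model stands in for the ray tracer, and finite-difference gradient descent refines a vision-derived permittivity prior against a handful of sparse probe measurements. All function names, the simplified Fresnel model, and the numbers are illustrative assumptions.

```python
import math

def predicted_gain(eps_r, incidence_cos=0.5):
    """Toy stand-in for a differentiable ray tracer: reflected power
    fraction of one wall bounce vs. relative permittivity eps_r,
    using a simplified lossless-dielectric Fresnel coefficient (TE).
    (Hypothetical model for illustration only.)"""
    root = math.sqrt(max(eps_r - (1.0 - incidence_cos ** 2), 1e-9))
    r = (incidence_cos - root) / (incidence_cos + root)
    return r * r

def calibrate(prior_eps, measurements, lr=2.0, steps=300, h=1e-4):
    """Gradient-descent calibration starting from a vision-derived prior,
    fitting sparse channel soundings. Gradients are taken by central
    finite differences to keep the sketch dependency-free."""
    eps = prior_eps
    loss = lambda e: sum((predicted_gain(e) - m) ** 2 for m in measurements)
    for _ in range(steps):
        grad = (loss(eps + h) - loss(eps - h)) / (2.0 * h)
        eps = max(eps - lr * grad, 1.0)  # physical constraint: eps_r >= 1
    return eps

# A few noisy "probe soundings" of a true material with eps_r = 4.0.
true_eps = 4.0
soundings = [predicted_gain(true_eps) + d for d in (-0.002, 0.0, 0.003)]

# Vision prior guesses eps_r = 3.0; calibration pulls it toward 4.0.
calibrated = calibrate(prior_eps=3.0, measurements=soundings)
```

The design point mirrored here is that the prior only needs to land in the right basin; the sparse measurements then supply the fine correction, which is why a few dozen soundings suffice in the full framework.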