HyperVision: A Channel-Adaptive Ground-Based Hyperspectral Vision Pre-trained Backbone

📅 2026-05-17

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the absence of a general-purpose pre-trained backbone for ground-based hyperspectral imaging. Progress has been held back by heterogeneous sensor spectral configurations, scarce and inconsistent annotations, limited dataset scale, and low scene diversity. The paper proposes HyperVision, presented as the first pre-trained backbone for ground-based hyperspectral vision, which uses a channel-adaptive dynamic embedding module to map heterogeneous inputs into a unified token space. Pre-training relies on multi-source pseudo-labels that combine SAM2 and HyperFree outputs, together with cross-modal knowledge distillation from a pre-trained RGB model into the hyperspectral backbone. Downstream adaptation fine-tunes only the task head and leaves the backbone parameters unchanged. On three downstream tasks (semantic segmentation, object tracking, and salient object detection), HyperVision reports state-of-the-art results against task-specific methods: up to a 16.3% relative improvement in Acc_M, a 2.1% relative gain in AUC, and a 35.5% reduction in MAE.
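The percentages in the summary are easiest to read as relative changes against a baseline method, which is how the abstract describes them. The sketch below shows that arithmetic. The helper name `relative_change` and every number in it are hypothetical and chosen only for illustration; none of them are results from the paper.

```python
def relative_change(baseline: float, new: float) -> float:
    """Signed relative change of `new` with respect to `baseline`, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (new - baseline) / abs(baseline) * 100.0


# Hypothetical values for illustration only; NOT taken from the paper.
baseline_mae, new_mae = 0.040, 0.030   # MAE: lower is better
baseline_auc, new_auc = 0.600, 0.612   # AUC: higher is better

# A negative change in MAE is a reduction, i.e. an improvement.
print(f"MAE change: {relative_change(baseline_mae, new_mae):+.1f}%")
# A positive change in AUC is a gain.
print(f"AUC change: {relative_change(baseline_auc, new_auc):+.1f}%")
```

Because the changes are relative, a small absolute difference on a metric that is already low (such as MAE) can correspond to a large percentage, so the three figures are not directly comparable with each other.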

📝 Abstract

While hyperspectral imaging provides rich spatial-spectral information across hundreds of narrow wavelength bands for precise material identification, ground-based hyperspectral pre-trained backbones remain absent, constrained by varying spectral configurations across sensors, the scarcity and inconsistency of labels, and the limited scale and scene diversity of existing datasets. To address these challenges and enable universal perception, we propose HyperVision, the first ground-based hyperspectral pre-trained backbone. First, to handle varying spectral configurations, HyperVision adopts a channel-adaptive dynamic embedding mechanism that maps heterogeneous inputs into a unified token space. Second, to address the scarcity and inconsistency of labels, we introduce a multi-source pseudo-labeling method that fuses semantic cues from two sources: spatial structures generated by SAM2 and fine-grained spectral material information extracted by HyperFree. Third, to compensate for limited dataset scale and enrich scene diversity, a cross-modal knowledge distillation mechanism transfers rich semantic representations from a pre-trained RGB vision model to our hyperspectral backbone. Pre-trained on a collection of 15k images from 26 diverse ground-based datasets, HyperVision demonstrates exceptional generalization. With head-only adaptation and no changes to the backbone parameters, it achieves state-of-the-art performance compared to task-specific methods across three downstream tasks under varying sensor configurations, yielding up to a 16.3% relative improvement in hyperspectral semantic segmentation $\mathrm{Acc}_{\mathrm{M}}$, a 2.1% relative gain in object tracking AUC, and a 35.5% reduction in salient object detection MAE. The source code and pre-trained model will be publicly available at https://github.com/lronkitty/HyperVision.
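The abstract does not spell out how the channel-adaptive dynamic embedding works. One common way to map inputs with different band counts into a fixed token size is to generate each band's projection weights from its center wavelength and then average the per-band projections. The sketch below illustrates that general idea under that assumption. It is not the authors' module, and `ChannelAdaptiveEmbedding`, `wavelength_encoding`, the sinusoidal encoding, and all sizes and wavelengths are hypothetical.

```python
import math
import random


def wavelength_encoding(wavelength_nm: float, dim: int) -> list[float]:
    """Sinusoidal encoding of a band's center wavelength (dim must be even)."""
    enc = []
    for i in range(dim // 2):
        freq = 1.0 / (10000.0 ** (2 * i / dim))
        enc.append(math.sin(wavelength_nm * freq))
        enc.append(math.cos(wavelength_nm * freq))
    return enc


class ChannelAdaptiveEmbedding:
    """Toy embedding (an assumption, not the paper's design): per-band
    projection weights are generated from the band's wavelength."""

    def __init__(self, patch_pixels: int, embed_dim: int,
                 enc_dim: int = 8, seed: int = 0):
        rng = random.Random(seed)
        self.patch_pixels = patch_pixels
        self.embed_dim = embed_dim
        self.enc_dim = enc_dim
        # Shared generator: maps a wavelength encoding to a flattened
        # (patch_pixels x embed_dim) weight matrix. Random here; it would
        # be learned in a real model.
        self.generator = [
            [rng.gauss(0.0, 0.02) for _ in range(patch_pixels * embed_dim)]
            for _ in range(enc_dim)
        ]

    def band_weights(self, wavelength_nm: float) -> list[list[float]]:
        """Return the (patch_pixels x embed_dim) weights for one band."""
        enc = wavelength_encoding(wavelength_nm, self.enc_dim)
        flat = [
            sum(enc[k] * self.generator[k][j] for k in range(self.enc_dim))
            for j in range(self.patch_pixels * self.embed_dim)
        ]
        d = self.embed_dim
        return [flat[p * d:(p + 1) * d] for p in range(self.patch_pixels)]

    def embed_patch(self, patch: list[list[float]],
                    wavelengths_nm: list[float]) -> list[float]:
        """patch[c][p] is pixel p of band c; returns one token of length embed_dim."""
        if len(patch) != len(wavelengths_nm):
            raise ValueError("need one wavelength per band")
        token = [0.0] * self.embed_dim
        for band, wavelength in zip(patch, wavelengths_nm):
            if len(band) != self.patch_pixels:
                raise ValueError("unexpected number of pixels in band")
            weights = self.band_weights(wavelength)
            for p, value in enumerate(band):
                for d in range(self.embed_dim):
                    token[d] += value * weights[p][d]
        # Average over bands so the token scale does not grow with band count.
        return [t / len(patch) for t in token]


# Two hypothetical sensors with different band counts share one embedding.
embed = ChannelAdaptiveEmbedding(patch_pixels=4, embed_dim=6)
rng = random.Random(1)
wl_16 = [470.0 + 10.0 * i for i in range(16)]   # hypothetical 16-band sensor
wl_25 = [600.0 + 15.0 * i for i in range(25)]   # hypothetical 25-band sensor
patch_16 = [[rng.random() for _ in range(4)] for _ in wl_16]
patch_25 = [[rng.random() for _ in range(4)] for _ in wl_25]

token_16 = embed.embed_patch(patch_16, wl_16)
token_25 = embed.embed_patch(patch_25, wl_25)
print(len(token_16), len(token_25))  # both equal embed_dim
```

The point of the design is that the only learned parameters live in the shared generator, so the same module accepts any number of bands as long as each band's wavelength is known. Whether HyperVision conditions on wavelength in this way is not stated in the abstract.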

Problem

Research questions and friction points this paper is trying to address.

hyperspectral imaging

pre-trained backbone

spectral configuration

label scarcity

dataset diversity

Innovation

Methods, ideas, or system contributions that make the work stand out.

channel-adaptive embedding

multi-source pseudo-labeling

cross-modal knowledge distillation