Towards Open-Ended Visual Scientific Discovery with Sparse Autoencoders

📅 2025-11-21

🤖 AI Summary

Scientific archives hold vast, heterogeneous data spanning disciplines such as ecology, genomics, and climate science, yet most existing methods extract structure only for predefined targets and therefore struggle to support open-ended discovery of unknown patterns. This paper asks whether sparse autoencoders (SAEs) can decompose vision foundation model representations, without supervision, into features that enable open-ended scientific discovery. In controlled rediscovery studies, the learned SAE features are tested for alignment with semantic concepts on a standard segmentation benchmark and compared against strong label-free alternatives on concept-alignment metrics. Applied to ecological imagery, the same procedure surfaces fine-grained anatomical structure without access to segmentation or part labels, with ground truth available to validate the findings. Although the experiments focus on vision, the method is domain-agnostic, and the authors present sparse decomposition as a practical, interpretable instrument for moving from confirmation toward data-driven discovery.

📝 Abstract

Scientific archives now contain hundreds of petabytes of data across genomics, ecology, climate, and molecular biology that could reveal undiscovered patterns if systematically analyzed at scale. Large-scale, weakly-supervised datasets in language and vision have driven the development of foundation models whose internal representations encode structure (patterns, co-occurrences and statistical regularities) beyond their training objectives. Most existing methods extract structure only for pre-specified targets; they excel at confirmation but do not support open-ended discovery of unknown patterns. We ask whether sparse autoencoders (SAEs) can enable open-ended feature discovery from foundation model representations. We evaluate this question in controlled rediscovery studies, where the learned SAE features are tested for alignment with semantic concepts on a standard segmentation benchmark and compared against strong label-free alternatives on concept-alignment metrics. Applied to ecological imagery, the same procedure surfaces fine-grained anatomical structure without access to segmentation or part labels, providing a scientific case study with ground-truth validation. While our experiments focus on vision with an ecology case study, the method is domain-agnostic and applicable to models in other sciences (e.g., proteins, genomics, weather). Our results indicate that sparse decomposition provides a practical instrument for exploring what scientific foundation models have learned, an important prerequisite for moving from confirmation to genuine discovery.
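
To make the idea of a sparse decomposition concrete, here is a minimal sketch of a generic sparse autoencoder applied to a single activation vector. It is not the authors' implementation: the ReLU encoder, the L1 sparsity penalty, the names `init_sae`, `sae_forward`, and `sae_loss`, and the toy dimensions are all assumptions chosen for illustration, and the sketch covers only the forward pass and loss, not training.

```python
import random

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def init_sae(d_model, n_features, seed=0):
    """Hypothetical initializer: small Gaussian weights, zero biases."""
    rng = random.Random(seed)
    W_enc = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(n_features)]
    W_dec = [[rng.gauss(0, 0.1) for _ in range(n_features)] for _ in range(d_model)]
    return W_enc, [0.0] * n_features, W_dec, [0.0] * d_model

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode x into a non-negative code f, then reconstruct x from f."""
    pre = [p + b for p, b in zip(matvec(W_enc, x), b_enc)]
    f = [max(0.0, p) for p in pre]  # ReLU keeps activations non-negative
    x_hat = [r + b for r, b in zip(matvec(W_dec, f), b_dec)]
    return f, x_hat

def sae_loss(x, x_hat, f, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that encourages sparse codes."""
    mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    return mse + l1_coeff * sum(abs(a) for a in f)

# Toy stand-in for one patch embedding from a vision foundation model.
rng = random.Random(1)
x = [rng.gauss(0, 1) for _ in range(8)]
params = init_sae(d_model=8, n_features=32)
f, x_hat = sae_forward(x, *params)
loss = sae_loss(x, x_hat, f)
```

The code `f` has more entries than the input (32 versus 8 here). Under this formulation, the sparsity penalty pushes most of them to zero so that each active entry can be inspected as a candidate feature.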

Problem

Research questions and friction points this paper is trying to address.

Enabling open-ended discovery of unknown patterns in scientific data

Extracting features from foundation models without pre-specified targets

Developing domain-agnostic methods for scientific discovery across disciplines

Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse autoencoders enable open-ended feature discovery

Method extracts features without pre-specified targets

Domain-agnostic approach applicable across scientific domains
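
Because features are extracted without pre-specified targets, any link to a known concept has to be established after the fact. The sketch below shows one plausible post-hoc check: threshold each feature's per-patch activations and score the overlap with a binary concept mask using intersection-over-union (IoU). The choice of IoU, the threshold, and the names `iou` and `best_feature_for_concept` are assumptions for illustration. The source does not say which concept-alignment metrics the paper uses.

```python
def iou(active, mask):
    """Intersection-over-union of two equal-length boolean sequences."""
    inter = sum(1 for a, m in zip(active, mask) if a and m)
    union = sum(1 for a, m in zip(active, mask) if a or m)
    return inter / union if union else 0.0

def best_feature_for_concept(feature_maps, mask, threshold=0.0):
    """Return (feature_id, score) for the feature whose active patches best match mask.

    feature_maps: dict mapping feature id -> list of per-patch activations.
    mask: list of 0/1 values marking patches that belong to the concept.
    """
    scores = {
        fid: iou([a > threshold for a in acts], mask)
        for fid, acts in feature_maps.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]

# Toy example with four image patches and three hypothetical SAE features.
feature_maps = {
    0: [0.9, 0.8, 0.0, 0.0],  # fires on exactly the concept patches
    1: [0.0, 0.2, 0.7, 0.0],  # partial overlap
    2: [0.1, 0.1, 0.1, 0.1],  # fires everywhere
}
concept_mask = [1, 1, 0, 0]
best_id, best_score = best_feature_for_concept(feature_maps, concept_mask)
```

In this toy case feature 0 matches the mask exactly, while the feature that fires on every patch scores only 0.5. This is why an overlap measure is more informative here than raw activation strength.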
