🤖 AI Summary
This study addresses the inverse problem of inferring centromere positions from Hi-C interaction maps, a task complicated by the variable number and size of the entities that make up each map. It introduces a Transformer-based architecture that exploits structure shared across interaction maps, such as the global alignment of localized patterns, while accommodating this variability. A custom simulator supplies abundant, computationally cheap synthetic data for training. Applied to genomes from a wide range of species and genome sizes, the method accurately recovers centromere locations. More broadly, the approach offers a data-driven way to solve inverse problems on interaction maps whose blocks vary in number and size.
📝 Abstract
Inference from interaction maps, such as centromere identification from genome-wide chromosome conformation capture techniques (notably Hi-C), can be formulated as a generic inverse problem: infer a set of parameters given a map that summarizes pairwise interactions between entities through blocks of variable number and size. In this work, we introduce a data-driven approach that leverages the structure shared between these maps, such as global alignment between localized patterns, while handling the variability in the number and size of entities that arises in real-world data. Our approach relies on a transformer architecture capable of handling such variability and on a custom simulator that generates abundant yet computationally cheap synthetic data for training. Applied to the problem of centromere localization, the method accurately recovers centromere positions across a wide range of species with varying genome sizes.
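To make the inverse-problem setup concrete, the following is a minimal toy sketch of a block-structured interaction map and its hidden parameters. It is not the authors' simulator, which the abstract does not describe. The function names (`sample_genome`, `simulate_contact_map`), the power-law decay within a block, and the Gaussian enrichment of between-block contacts near centromeres are all illustrative assumptions. Each block is taken to be one chromosome, and the forward model maps centromere positions to a map; the inverse problem is to recover those positions from the map alone.

```python
import numpy as np

def sample_genome(rng, max_chroms=8, min_size=10, max_size=40):
    """Draw a variable number of entities with variable sizes (in bins).

    Hypothetical helper: the ranges are arbitrary illustration values.
    """
    k = rng.integers(2, max_chroms + 1)                 # number of chromosomes
    sizes = rng.integers(min_size, max_size + 1, size=k)
    cens = rng.integers(1, sizes - 1)                   # one centromere bin per chromosome
    return sizes, cens

def simulate_contact_map(chrom_sizes, centromeres, decay=1.0, width=2.0):
    """Toy forward model: parameters (centromere bins) -> interaction map.

    Assumptions (not from the source): within-chromosome contacts decay as a
    power law of genomic distance; between-chromosome contacts are a weak
    background plus a Gaussian bump where both bins lie near a centromere.
    """
    chrom_sizes = np.asarray(chrom_sizes)
    offsets = np.concatenate(([0], np.cumsum(chrom_sizes)))
    n = int(offsets[-1])
    idx = np.arange(n)
    chrom_id = np.repeat(np.arange(len(chrom_sizes)), chrom_sizes)

    # Centromere positions in whole-map coordinates.
    cen_global = offsets[:-1] + np.asarray(centromeres)
    dist_to_cen = idx - cen_global[chrom_id]

    same = chrom_id[:, None] == chrom_id[None, :]
    cis = 1.0 / (1.0 + np.abs(idx[:, None] - idx[None, :])) ** decay
    bump = np.exp(-(dist_to_cen[:, None] ** 2 + dist_to_cen[None, :] ** 2)
                  / (2.0 * width ** 2))
    trans = 0.01 + 0.02 * bump

    return np.where(same, cis, trans), cen_global

# Each call yields a map of a different shape, which is the variability
# a learned model for this task would have to handle.
rng = np.random.default_rng(0)
sizes, cens = sample_genome(rng)
contact_map, cen_global = simulate_contact_map(sizes, cens)
```

In this toy model, the peak of every between-chromosome block sits at the pair of centromere bins, so the same localized pattern recurs across the map in aligned rows and columns. A training set for a learned model could be built by calling the two functions repeatedly and pairing each map with its `cen_global` vector.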