09-12, 12:00–12:15 (Europe/Berlin), Main conference room - MW 0350
Imaging-based spatial omics datasets present challenges in reliably segmenting single cells. Achieving accurate segmentation at single-cell resolution is crucial to unravelling multicellular mechanisms and understanding cell-cell communications in spatial omics studies. Despite the considerable progress and the variety of methods available for cell segmentation, challenges persist, including issues of over-segmentation, under-segmentation, and contamination from neighbouring cells. While combining multiple segmentation methods with distinct advantages has been proposed, it does not completely resolve these issues. Additionally, scalability remains an obstacle, particularly when applying these methods to larger tissues and gene panels in targeted studies.
Here we introduce Segger, a cell segmentation model designed for single-molecule resolved datasets, leveraging the co-occurrence of nuclear and cytoplasmic molecules (e.g., transcripts). It employs a heterogeneous graph structure on molecules and nuclei, integrating fixed-radius nearest neighbor graphs for nuclei and molecules, along with edges connecting transcripts to nuclei based on spatial proximity. A heterogeneous graph neural network (GNN) is then used to propagate information across these edges to learn the association of molecules with nuclei. Post-training, the model refines cell borders by regrouping transcripts based on confidence levels, overcoming issues like nucleus-less cells or overlapping cells. Benchmarks on 10x Xenium and MERSCOPE technologies reveal Segger's superiority in accuracy and efficiency over other segmentation methods, such as Baysor, Cellpose, BIDCell, and simple nuclei-expansion. Segger can be pre-trained on one or more datasets and fine-tuned with new data, even data acquired via different technologies. Its highly parallelizable nature allows for efficient training across multiple GPU machines, facilitated by recent graph learning techniques. Compared to other model-based methods like Baysor and BIDCell, Segger's training is orders of magnitude faster and its results are more accurate, making it ideal for integration into preprocessing pipelines for comprehensive spatial omics atlases.
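The post-training regrouping step can be pictured with a minimal Python sketch. It is not Segger's implementation: the function names, the 0.5 threshold, and the use of connected components to group leftover transcripts are all illustrative assumptions. Each transcript goes to its highest-scoring nucleus if the score clears the threshold, and unassigned transcripts that are linked to each other are grouped as candidate nucleus-less cells.

```python
def assign_transcripts(scores, threshold=0.5):
    """Map each transcript to its best-scoring nucleus, or None if below threshold.

    scores: {transcript_id: {nucleus_id: confidence}} (hypothetical model output).
    """
    assignment = {}
    for t, per_nucleus in scores.items():
        best = max(per_nucleus, key=per_nucleus.get, default=None)
        confident = best is not None and per_nucleus[best] >= threshold
        assignment[t] = best if confident else None
    return assignment


def group_unassigned(assignment, tx_edges):
    """Group unassigned transcripts into connected components (union-find)."""
    free = {t for t, n in assignment.items() if n is None}
    parent = {t: t for t in free}

    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]  # path halving
            t = parent[t]
        return t

    for a, b in tx_edges:
        if a in free and b in free:
            parent[find(a)] = find(b)

    groups = {}
    for t in free:
        groups.setdefault(find(t), set()).add(t)
    return sorted(sorted(g) for g in groups.values())


# Toy confidences for five transcripts and two nuclei (made-up numbers).
scores = {
    0: {"n0": 0.9, "n1": 0.2},
    1: {"n0": 0.4, "n1": 0.7},
    2: {"n0": 0.1},
    3: {},
    4: {"n1": 0.3},
}
tx_edges = [(2, 3), (0, 2), (1, 4)]  # transcript-to-transcript neighbours

assignment = assign_transcripts(scores)
nucleus_free_groups = group_unassigned(assignment, tx_edges)
print(assignment)
print(nucleus_free_groups)
```

In this toy input, transcripts 2 and 3 end up in one candidate nucleus-less group because they are neighbours and neither is confidently linked to a nucleus.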
Despite advancements in segmentation methods, significant gaps persist. Current methods effectively resolve overall cellular identities and heterogeneity but often misassign transcripts to neighboring cells, impacting the inference of cell-cell communication, a crucial aim of subcellular-resolution spatial transcriptomics. Moreover, image-based methods that use nuclei as anchors frequently fail to identify 'nucleus-less' cells, either because the cell has no nucleus or because the nucleus was not captured in the image.
To address these limitations, we introduce Segger, which leverages graph neural networks to accurately segment cells by incorporating transcript colocalization and nucleus image data, ensuring robust performance across diverse datasets and complex tissue environments. While existing methods are limited to transcripts only in the vicinity of nuclei, Segger extends this to include nucleus-free transcripts in proximity, greatly enhancing the coverage of assignable transcripts and cell recovery.
The first step in Segger is to create local heterogeneous graphs that cover both nuclei and nearby transcripts. Each graph has two types of nodes, transcripts and nuclei, and therefore three types of edges: transcript-to-transcript, transcript-to-nucleus, and nucleus-to-nucleus. Semantically separating these entities lets the heterogeneous graph neural network use different mechanisms to model different relationships. This helps capture not only transcript colocalization patterns but also the colocalization of certain types of nuclear and cytoplasmic transcripts, which is lacking in all previous methods. The graph structure derived from transcripts that do not overlap any nucleus also provides a definition for likely nucleus-free cells, a capability missing in other methods.
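The graph described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: nuclei are reduced to centroid points, neighbours are found by brute force, and the function names, edge-type keys, and radii are hypothetical rather than taken from Segger's code.

```python
from itertools import combinations
from math import dist


def radius_edges(points, radius):
    """Undirected fixed-radius edges between nodes of a single type."""
    return [
        (i, j)
        for i, j in combinations(range(len(points)), 2)
        if dist(points[i], points[j]) <= radius
    ]


def cross_edges(src, dst, radius):
    """Edges from each source node to every destination node within radius."""
    return [
        (i, j)
        for i, p in enumerate(src)
        for j, q in enumerate(dst)
        if dist(p, q) <= radius
    ]


def build_hetero_graph(transcripts, nuclei, r_tx, r_nuc, r_cross):
    """Return one edge list per (source type, relation, target type) triple."""
    return {
        ("transcript", "near", "transcript"): radius_edges(transcripts, r_tx),
        ("nucleus", "near", "nucleus"): radius_edges(nuclei, r_nuc),
        ("transcript", "near", "nucleus"): cross_edges(transcripts, nuclei, r_cross),
    }


# Toy coordinates: four transcripts and two nucleus centroids (made-up values).
transcripts = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (5.5, 5.0)]
nuclei = [(0.5, 0.0), (5.0, 5.5)]

graph = build_hetero_graph(transcripts, nuclei, r_tx=1.5, r_nuc=3.0, r_cross=1.0)
for edge_type, edges in graph.items():
    print(edge_type, edges)
```

Keeping one edge list per typed triple mirrors the idea that a heterogeneous GNN can apply a different message function to each relation. On real data, a spatial index would replace the brute-force distance checks.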
For a given nucleus, transcripts that overlap it provide positive connections, whereas transcripts that overlap other nuclei serve as negative examples, which allows a heterogeneous GNN to be trained. Segger's hyperparameters can be optimized using classical cross-validation with spatial mini-batching. The model is parallelizable and leverages GPU capabilities, facilitating application across multiple slides, with or without incorporating transcript embeddings learned from scRNA-seq or other modalities.
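One way to read this labelling scheme is sketched below. The assumptions are loud ones: nuclei are modelled as discs (centre, radius) instead of real segmentation masks, and `label_edges` and its inputs are hypothetical names, not Segger's API. A candidate transcript-to-nucleus edge is labelled 1 if the transcript lies inside that nucleus, 0 if it lies inside a different nucleus, and is left unlabelled otherwise.

```python
from math import dist


def containing_nucleus(point, nuclei):
    """Index of the first nucleus disc containing the point, or None."""
    for idx, (centre, radius) in enumerate(nuclei):
        if dist(point, centre) <= radius:
            return idx
    return None


def label_edges(transcripts, nuclei, candidate_edges):
    """Label candidate (transcript, nucleus) edges for supervised training."""
    labels = {}
    for t, n in candidate_edges:
        home = containing_nucleus(transcripts[t], nuclei)
        if home is None:
            continue  # transcript outside every nucleus: no supervision signal
        labels[(t, n)] = 1 if home == n else 0
    return labels


# Two disc-shaped nuclei and three transcripts (made-up coordinates).
nuclei = [((0.0, 0.0), 1.0), ((3.0, 0.0), 1.0)]
transcripts = [(0.2, 0.1), (2.8, 0.0), (1.5, 0.0)]
candidate_edges = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 0), (2, 1)]

labels = label_edges(transcripts, nuclei, candidate_edges)
print(labels)
```

The third transcript sits between the two nuclei, so its edges carry no label. Those are the edges a trained model would be asked to score.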
Segger uses deep learning frameworks such as PyTorch and PyTorch Geometric for efficient GNN implementation and takes advantage of GPUs for accelerated computation. This results in significantly faster performance compared to Baysor and other transcript-colocalization methods, making it practical for large sections with many transcripts. Additionally, Segger's framework allows for efficient data loading and processing, enhancing its usability for large-scale datasets. One of Segger's distinguishing features is its transferability: it can be trained on one tissue section and applied to another section with the same panel of transcripts, maintaining high accuracy and efficiency. This combination of speed, robustness, and comprehensive data integration makes Segger a superior choice for cell segmentation in subcellular spatial transcriptomics, advancing the field towards more precise and reliable spatial analysis.
No previous knowledge expected
Elyas Heidari is a PhD student specialising in AI in Oncology at the German Cancer Research Center (DKFZ) in Heidelberg. He completed his Bachelor's in Computer Science and Mathematics at Sharif University of Technology in Iran and holds a Master’s degree from the Department of Biosystems Science and Engineering at ETH Zurich. During his internship at EMBL-EBI, he focused on spatial omics and computational genomics. Elyas has developed packages such as MUVIS (R) and SageNet (Python) for biomedical data science and machine learning and is currently interested in developing computational methods for the analysis and integration of spatial omics data at scale.