scverse Conference 2024
Welcome to the first scverse conference!
Updates on scverse packages
Keynote presentation
Scirpy is a scverse core package for single-cell analysis of adaptive immune receptor repertoires (AIRR) in Python. A key functionality is the “clonotype clustering” step that aims to identify immune cells likely to recognize the same antigen or have a common ancestor. As dataset sizes have increased to millions of cells thanks to improvements in single-cell technologies, clonotype clustering can become a bottleneck in the analysis pipeline. This project aimed to make it feasible to analyze large-scale, single-cell AIRR datasets with scirpy.
There are two crucial steps to clonotype clustering: 1) computing distances between complementarity-determining region 3 (CDR3) sequences based on a similarity metric, e.g. Hamming distance; 2) identifying clusters based on the receptor configurations, i.e. identity/similarity between single-cells in terms of CDR3 sequences and V/J gene usage.
We integrated tcrdist3, a runtime-optimized implementation of the TCRdist metric, as an alternative CDR3-sequence similarity metric. The original implementation uses the Numba library for just-in-time compilation of Python code. To overcome previous limitations with large datasets, we adapted the original tcrdist3 implementation to operate with sparse matrices. We improved the performance of the Hamming distance calculation in scirpy with a similar Numba implementation. For the identification of the clusters, we reimplemented the existing algorithms, optimizing sparse matrix operations.
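The two steps described above can be illustrated with a minimal sketch. This is not scirpy's Numba implementation: it is plain Python, it ignores V/J gene usage and receptor configurations, and the function names, example sequences, and cutoff are assumptions chosen for illustration. It shows why a distance cutoff keeps the result sparse (only pairs within the cutoff are stored) and how clusters can then be read off as connected components.

```python
from collections import defaultdict
from itertools import combinations


def hamming_within(a: str, b: str, cutoff: int):
    """Return the Hamming distance if it is <= cutoff, otherwise None."""
    if len(a) != len(b):
        return None  # Hamming distance is only defined for equal lengths
    dist = 0
    for x, y in zip(a, b):
        if x != y:
            dist += 1
            if dist > cutoff:
                return None  # early exit: this pair is never stored
    return dist


def sparse_distances(seqs, cutoff):
    """Map (i, j) with i < j to a distance, keeping only pairs within cutoff."""
    by_length = defaultdict(list)
    for idx, s in enumerate(seqs):
        by_length[len(s)].append(idx)  # only equal-length pairs are compared
    out = {}
    for idxs in by_length.values():
        for i, j in combinations(idxs, 2):
            d = hamming_within(seqs[i], seqs[j], cutoff)
            if d is not None:
                out[(i, j)] = d
    return out


def clusters(n, edges):
    """Connected components over the stored pairs, using union-find."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)
    groups = defaultdict(list)
    for i in range(n):
        groups[find(i)].append(i)
    return sorted(groups.values())


# Hypothetical CDR3 amino acid sequences, for illustration only.
cdr3 = ["CASSLGTDTQYF", "CASSLGADTQYF", "CASSPGADTQYF",
        "CASSIRSSYEQYF", "CAWSVGQGNTEAFF"]
dists = sparse_distances(cdr3, cutoff=1)
clonotype_clusters = clusters(len(cdr3), dists)
```

With a cutoff of 1, the first three sequences end up in one cluster through chained single mismatches, even though the first and third differ at two positions.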
For 1 million T cells with 64 CPU cores, the improved version of the Hamming distance calculation achieved a speedup of ~20.8 (43 seconds vs. 893 seconds). This calculation took 210 seconds with the TCRdist method. Tested with only 140 thousand T cells due to memory constraints of the old version, the improved version of the cluster identification achieved a speedup of ~29.9 (8 seconds vs. 239 seconds) with 8 CPU cores. We successfully tested the improved version on up to 8 million T cells, which was infeasible using the previous implementations due to memory and/or runtime constraints.
These enhancements significantly improve scirpy’s capability to analyze large-scale single-cell RNA sequencing data, facilitating more efficient and scalable clonotype clustering. Future work to increase scirpy’s performance will focus on providing GPU support.
In the past few years, we have witnessed the exciting development of several single-cell epigenomics methods such as scATAC-seq, scBS-seq or scCUTnRUN/CUTnTag. These new technologies have the potential to enable a quantitative analysis of the epigenetic landscape of cells and resolve tissue heterogeneity. However, despite the rising popularity of single-cell epigenomic protocols, issues related to data analysis could limit their broad adoption among biologists: 1) lack of quality control, normalization, and downstream analysis methods tailored to epigenomic data, 2) lack of tools that extend beyond droplet-based scATAC-seq methods, and 3) the need for significant scripting/programming knowledge. We introduce the Single-Cell Informatics (sincei) toolkit, which tackles the above challenges. sincei provides an easy-to-use, command-line interface for the exploration of data from a wide range of single-cell epigenomics and transcriptomics protocols directly from BAM files. It adopts bulk-epigenomics analysis standards into single-cell genomics workflows and simplifies data integration. sincei is available open-source at https://github.com/bhardwaj-lab/sincei.
Unleashing the potential of multiplexed imaging experiments
anndataR aims to make it easy to work with anndata files in R, either by converting them to a SingleCellExperiment or Seurat object, or by interacting with them directly.
Existing single-cell tools and packages most often live in one of three pre-existing ecosystems. The underlying data structures and on-disk formats differ between these frameworks, making smooth interoperability difficult.
Community-based efforts exist to make this conversion process easier, but they rely on reticulate, and while they help, the process is far from foolproof: due to the differences between the data structures and philosophies underlying the frameworks, lossless conversion is not possible in all cases.
That is where anndataR comes in: it aims to both make anndata a first-class citizen in R, and to provide conversion functions to and from SingleCellExperiment and Seurat classes. Using anndataR, you can read and write h5ad files in R without using reticulate and access anndata slots.
Keynote presentation
Talks
Intra-tumoral heterogeneity contributes to low survival rates in pancreatic ductal adenocarcinoma (PDAC). Using in situ sequencing and an interdisciplinary analysis approach, we will map tumor subtypes, immune cells, and fibroblasts in a large cohort of PDAC patients to study cellular processes in primary tumors and metastases and decipher the impact of tumor architecture on treatment response.
Colorectal cancer (CRC) features molecular heterogeneity and differing immune cell infiltration patterns, affecting how patients respond to therapy. However, a comprehensive understanding of immune cell phenotypes across patient subgroups is missing. Using various scverse tools, we dissect the CRC tumor microenvironment by integrating 4.27 million single cells from 1670 samples and 650 patients across 49 studies (77 datasets) representing 7 billion expression values. These samples encompass the full spectrum of disease progression, from normal colon to polyps, primary tumors, and metastases, covering both early and advanced stages of CRC. Additionally, to address the limited availability of neutrophil single-cell data, we supplemented the atlas by analyzing samples from 12 CRC patients using a platform that captures cells with very low transcript counts. We present a high-resolution view of CRC with 62 major cell types or states, revealing different cell-type composition patterns in CRC subtypes. The atlas allows for refined tumor classification and patient stratification into four immune phenotypes: immune desert, myeloid, B cell, and T cell subtypes. These findings could have important implications for developing improved cancer immunotherapy approaches in CRC.
https://crc.icbi.at/
Background. Integration analysis not only offers a panoramic view by making annotation consistent across studies, but also amplifies the statistical power that is restricted by sample size in individual studies. Commonly used integration methods for scRNA-seq data often struggle with strong batch effects, which can distort or dilute the biological signals. Furthermore, conventional clustering built on improper integration can lead to impure clusters containing multiple cell types, compromising the purity and interpretability of the data.
Methods. To circumvent these issues, we employed a two-step solution that minimizes the need for data integration. In the first step, a combination of cell annotation techniques is used to identify high-level cellular compartments. In the second step, we mainly utilize SingleR augmented with curated references from pan-cancer studies in a hierarchical framework. The clustering-free second step, which we term deep phenotyping, is particularly advantageous for resolving cell states.
Results. We applied this computational framework to annotate 11 scRNA-seq datasets of patients treated with immune checkpoint inhibitors (ICI) with multiple timepoints. Altogether our dataset included longitudinally paired samples from 163 patients. We accurately portrayed the complex landscape of diverse cellular states in the tumor microenvironment (TME) at an individual patient level. Our analysis revealed consistent compositional changes in 19 cell subtypes following ICI treatment. We uncovered co-regulated cell communities within the TME, highlighting the coordinated interplay between adaptive and innate immune cells, as well as immune and non-immune components. Furthermore, we identified two distinct patient groups exhibiting tightly correlated cellular dynamics within the TME post-treatment. The first group, enriched for responders, displayed a marked expansion of naive lymphocytes, while the second group, predominantly composed of non-responders, showed an increased abundance of immune experienced/suppressive cell states. This dichotomy in TME dynamics offers a potential predictive biomarker for patient stratification and personalized therapeutic strategies.
Conclusions. Our study presents a comprehensive landscape of the cellular dynamics within the TME during ICI treatment, enabled by a powerful deep phenotyping approach, and showcases the importance of a systems-level understanding of TME dynamics in improving patient stratification and advancing personalized cancer immunotherapy.
Containerization has fueled a movement around virtualization that has impacted nearly the entire landscape of software development ecosystems. The ability to enable simple isolation of both environment and tool within a portable, lightweight package has completely reshaped many aspects of the software management paradigm. This overhaul is not limited to production workflows as low-friction virtualization has created established use cases for containerized management across development, execution, testing, interactive compute, distribution, and deployment. Even with these advancements, utilization of these conventions remains limited in some areas of research computing and bioinformatics, especially single cell biology.
Here, we present SCleeStacks - an effort to expand the accessibility of containerized workflows in the scverse. The goals of this project are to: 1) standardize and simplify the containerization and dependency landscape across the scverse ecosystem; 2) lower barriers to scverse environment sharing across teams and communities; 3) promote improvements to research software methods, publishing, and reproducibility; 4) foster upstream compatibility for development and execution of interdependent tool combinations; and 5) expand opportunities for use of scverse tools in higher-order execution frameworks such as Nextflow and Snakemake.
The central outcome in achieving these goals is the establishment of a repository of maintained, tool-level images across scverse ecosystem packages. As many uses of scverse tools focus on multi-modal data that require an intersection of package functions, a class of these images attempts to offer multi-layered tool stacks built around common workflows or data type modalities. Image structure follows common best practices in provisioning for security, performance, and compatibility through the Docker framework. Singularity is embraced for use in non-root or restricted HPC environments. A critical element of this implementation is a common design language to foster both contributions to the creation of images as well as their use. We expect a common application to be interactive computing during analysis and pipeline development. To address this, a specific emphasis is placed on documentation around workflows like IDE dev containers in tools such as VS Code, JetBrains, or Jupyter Notebooks. Where possible, the authors hope to sustain an open dialogue with upstream tool maintainers and the wider community, striving for collaboration that can elevate tool harmony and integration. Much like any community development effort, we expect the evolution of this resource to be guided by the needs of users and the field. With that, we hope the efforts behind this project offer something of value and can serve the growing needs of the scverse community.
Recent advances in spatial proteomics have enabled highly multiplexed profiling of tissues at single-cell resolution on an unprecedented scale. However, various technical challenges such as autofluorescence, the sheer size of the images, and batch effects across samples impede the automated and reproducible analysis of antibody-based imaging approaches. To address these issues, we developed spatialproteomics, a Python package designed to streamline the (pre-)processing of multiplexed imaging data. Spatialproteomics provides a unified API for various image processing algorithms, ranging from nuclei segmentation to cell type prediction, and supports data visualisation through an easy-to-use plotting interface. For scalability, spatialproteomics leverages xarray and dask, enabling lazy loading and parallel processing of larger-than-memory data. Finally, we ensure interoperability with existing scverse tools to facilitate spatial downstream analyses. In summary, spatialproteomics offers a fast and flexible alternative to rigid image preprocessing pipelines, thus facilitating rapid prototyping and custom preprocessing for highly multiplexed imaging data.
Heart attacks are one of the leading causes of death worldwide, affecting almost 3 million people annually. Several open questions regarding disease progression and post-infarction tissue remodeling can only be answered by analyzing intact tissues. A new generation of assays now makes it possible to measure RNA in patient material with high multiplexity and at subcellular resolution.
We have assembled one of the largest human datasets to date. The Xenium and MERFISH technologies allowed us to measure more than 50 samples from patients suffering acutely and chronically from the effects of myocardial infarction, as well as healthy control tissues. This wealth of new data at unprecedented resolution enables novel types of analysis. We approached this challenge with bespoke optimal transport algorithms that help us integrate single-cell RNA sequencing (scRNA-seq) and spatial transcriptomic datasets and analyze them at the cohort level.
Our approach allows us to identify tipping points in the disease trajectory that could potentially be used for therapeutic intervention. Crucially, our measurements are well-suited to record cellular communication events, identify spatial domains around the infarct border, and even decipher RNA localization patterns inside cells, areas of analysis that were closed off in more coarsely resolved assays.
We aim to illustrate what this new generation of spatial experiments can offer and share key learnings from processing these novel datasets.
Poster session - uneven numbers
Poster session - even numbers
scverse opening day 2 - housekeeping
DNA barcodes and their decoding are at the core of high-throughput molecular biology. Large-scale assays such as pooled CRISPR screens and their single-cell manifestations are pushing the boundaries of what is accurately feasible with current barcode decoding technology, as we continually demand the capability to identify more perturbations, more cells, more omics, and more insights in the same assay. Despite the critical nature of the decoding task, few methods exist that permit accurate, flexible and versatile decoding in the more demanding of these applications. The current state of the art consists of many tools for individual tasks, often with severe limitations in their applicability and often tied to inflexible vendor-specific solutions. Here, we propose a principled information theory-based barcode decoding software solution that addresses a wide range of decoding tasks in a unified framework, including demanding applications. Our approach can handle flexible error models, including those encountered in long-read sequencing such as Nanopore, and supports decoding arbitrary and non-trivial constructions of multiple barcodes, linkers, UMIs, and logical combinations thereof. Finally, we present a graphical user interface for clear, user-friendly specification of each barcode decoding task.
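To make the idea of probabilistic decoding concrete, here is a minimal sketch of posterior-based barcode assignment. It is not the software described in this abstract: it assumes a uniform prior over barcodes, equal-length reads, and a single independent substitution error rate, and every name and threshold in it is hypothetical. Real error models, such as those needed for Nanopore reads, would also cover insertions and deletions.

```python
import math


def log_likelihood(read: str, barcode: str, error_rate: float) -> float:
    """Log P(read | barcode) under independent substitution errors."""
    if len(read) != len(barcode):
        return float("-inf")  # indels are outside this toy model
    mismatches = sum(r != b for r, b in zip(read, barcode))
    matches = len(read) - mismatches
    # A substituted base is assumed equally likely to be any of the other three.
    return (matches * math.log(1 - error_rate)
            + mismatches * math.log(error_rate / 3))


def decode(read, barcodes, error_rate=0.01, min_posterior=0.95):
    """Return (barcode, posterior); barcode is None if the call is ambiguous."""
    logs = [log_likelihood(read, b, error_rate) for b in barcodes]
    top = max(logs)
    if top == float("-inf"):
        return None, 0.0
    # Normalize in log space to avoid underflow, assuming a uniform prior.
    weights = [math.exp(ll - top) for ll in logs]
    best = max(range(len(barcodes)), key=lambda i: logs[i])
    posterior = weights[best] / sum(weights)
    return (barcodes[best] if posterior >= min_posterior else None), posterior


# Hypothetical whitelist, for illustration only.
whitelist = ["ACGTACGT", "TTGACCAA", "ACGTTCGA"]
exact_call, exact_post = decode("ACGTACGT", whitelist)
ambiguous_call, ambiguous_post = decode("ACGTACGA", whitelist)
```

The second read is one mismatch away from two different barcodes, so its posterior is split roughly evenly and the sketch declines to assign it rather than picking the nearest match arbitrarily.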
In the era of single-cell screenings, RNA-based technologies have been the gold standard for the systematic study of cell state heterogeneity and differentiation dynamics of complex cellular systems. Recently, single-cell proteomics by mass spectrometry (scp-MS) has reached a maturity level where it can reliably measure the proteomes of thousands of individual cells in an unbiased manner.
Here, we leverage these technological advances for the characterization of a primary acute myeloid leukemia (AML) sample obtained from a relapsed patient. Our analysis not only recapitulates the known hierarchical cellular composition observed in AML but also suggests a new non-canonical differentiation avenue through which quiescent leukemic stem cells are able to reconstitute the terminal CD15+CD14+ blast pool, recognized as the main effectors of the disease and targets of commonly used therapeutics.
Through trajectory analysis, we show several key proteins exhibiting an asynergic expression profile across the developmental bifurcation leading to a heterogeneous blast reservoir not previously observed by measurements of mRNA transcripts alone.
In summary, in this work we demonstrate the capability of scp-MS to interrogate complex biological systems at single-cell resolution. We hypothesize a new developmental model in AML awaiting further experimental validation and underline the potential for integrative analysis employing different modalities of biological information in combination with scp-MS.
Reconstructing temporal cellular dynamics from static single-cell transcriptomics remains a major challenge. Methods based on RNA velocity are useful, but interpreting their results to learn new biology remains difficult. Here we present NeuroVelo, a method that couples learning of an optimal linear projection with non-linear Neural Ordinary Differential Equations. Unlike current methods, it uses dynamical systems theory to model biological processes over time, so NeuroVelo can identify which genes and mechanisms drive the temporal cellular dynamics. We benchmark NeuroVelo against several state-of-the-art methods using single-cell datasets, demonstrating that NeuroVelo has high predictive power and is superior to competing methods in identifying the mechanisms that drive cellular dynamics over time. We also show how we can use this method to infer gene regulatory networks that drive cell fate directly from the data.
Keynote presentation
Gastric cancer is a highly heterogeneous tumor, not only in its clinical and biological behavior but also histologically. Numerous classification attempts have been made in the past, but these have relied on histological analysis and bulk sequencing, potentially failing to fully capture the complexity of this disease. Therefore, we collected single-cell level spatial transcriptomics (scST) data from a large number of surgical cases. This gastric cancer atlas is unprecedented in scale for scST analysis, incorporating genomic mutation data and H&E-stained images, enabling more comprehensive analysis.
We used a deep neural network model to extract morphological features from pathological images and quantitatively integrated the three modalities of scST, genomic mutation information, and pathological images. This allowed for a comprehensive interpretation of how tumor cells with specific gene variation affect surrounding cells, form niches, and how these are reflected in histological variation. Using this atlas, we identified characteristic niches associated with drug response and elucidated the interactions between stromal cells and tumor cells as well as distinctive histology.
This study provides a new standard for detailed tumor characterization and potentially paves the way for advanced precision medicine in gastric cancer.
While immune-based therapies have transformed the treatment of many hematologic malignancies, T cell-based immunotherapies for acute myeloid leukemia (AML) have had limited success. However, the curative potential of donor T cells in allogeneic hematopoietic cell transplantation underscores the need to understand the mechanisms behind ineffective anti-leukemic T cell activity to advance these therapies. To investigate, we performed longitudinal analysis of single cell RNA-seq on pre- and serial post-treatment samples with paired CITE-seq and VDJ sequencing for 38 patients with newly diagnosed AML. These analyses reveal the immunosuppressive nature of T cells within the AML BM, the dynamic evolution of T cell clones across activation and exhaustion states, and the immunomodulatory impact of malignant AML cells on T cells. Finally, we identified key T cell intrinsic features affecting the relationship between T cell phenotype, repertoire, and disease status in the AML BM. We hypothesize that both malignant and non-malignant cells in the AML BM contribute to impaired T cell immunity, resulting in distinct T cell compositions across different disease states.
Keynote presentation
The rapidly growing field of spatial transcriptomics has provided new means to discover cellular states within their local microenvironments, thereby revealing the spatial structures that form the basis of organ function, development, and disease pathology (Sudupe, Laura, et al., 2023). Cells that are spatially close to each other are often deemed to be functionally related, emphasizing the need to consider spatial relationships to understand cellular roles in the tissue. However, despite significant advances in spatial molecular imaging technologies, such as MERFISH and 10X Xenium, current methodologies face considerable challenges in accurately deciphering spatial domains and cell niches. These challenges stem from the need for analytical frameworks capable of integrating these data dimensions to extract meaningful biological insights.
Traditional approaches in spatial omics have often adapted clustering algorithms initially designed for single-cell RNA sequencing, which take into account only the cell transcriptomic profiles and overlook spatial information. More recent developments have introduced spatially informed algorithms, based on deep learning methods (Liu, Teng, et al., 2024; Long, Yahui, et al., 2023) and Bayesian methods (Li, Zheng, and Xiang Zhou., 2022). However, these tools encounter challenges in robustness and generalizability across various tissue types. Moreover, they usually have multiple hyperparameters whose selection remains, to a large degree, user-dependent and non-trivial.
In this work, we introduce DeepSpaCE, a Deep learning-based Spatial Cell Explorer model that employs graph neural networks to address the challenge of identifying spatial domains and cell niches with high accuracy and robustness. We summarize the significance and novelty of our work as follows:
- Leveraging a GNN-based encoder-decoder architecture and bi-objective function, DeepSpaCE takes into account both cell transcriptomic data and spatial relationships among cells.
- DeepSpaCE requires only three hyperparameters. We either demonstrate the robustness of our pipeline to the choice of a hyperparameter or provide algorithmic solutions to find the optimum hyperparameter.
We applied DeepSpaCE to spatial single-cell data and showed the robustness of our method in accurately detecting the spatial domains and its superiority over existing methods. Additionally, we utilized DeepSpaCE to identify spatial domains in in-house kidney biopsies obtained using 10X Xenium technology. The results demonstrated the potential of DeepSpaCE to significantly enhance the accuracy of spatial domain detection in diverse tissue types, thereby providing a means for more precise and insightful spatial transcriptomic analyses.
References
Li, Zheng, and Xiang Zhou. "BASS: multi-scale and multi-sample analysis enables accurate cell type clustering and spatial domain detection in spatial transcriptomic studies." Genome biology 23.1 (2022): 168.
Liu, Teng, et al. "A comprehensive overview of graph neural network-based approaches to clustering for spatial transcriptomics." Computational and Structural Biotechnology Journal (2023).
Long, Yahui, et al. "Spatially informed clustering, integration, and deconvolution of spatial transcriptomics with GraphST." Nature Communications 14.1 (2023): 1155.
Sudupe, Laura, et al. "Spatial Transcriptomics Unveils Novel Potential Mechanisms of Disease in a MIcγ1 Multiple Myeloma in vivo Model." Blood 142 (2023): 4076.
Microscopy imaging routinely produces large datasets that capture the spatial composition and arrangement of millions of cells. Gaining biological insights from these data requires transforming images into descriptive features and integrating single cell information across datasets.
Here we introduce scPortrait, a computational framework to generate single-cell representations from raw microscopy images. scPortrait solves several challenges that come with scaling image operations such as stitching and segmentation to millions of cells. Out-of-core computation enables scPortrait to efficiently handle datasets in which individual images, for example covering whole microscopy slides, exceed available memory. By introducing an open file format that interfaces with OME-NGFF and scverse SpatialData, scPortrait facilitates the integration of newly recorded and publicly available datasets. To generate meaningful single-cell representations, scPortrait’s standardized data format directly enables training and applying the latest deep learning-based computer vision models.
We demonstrate the utility of scPortrait on several biological use cases including phenotype identification in image-based genetic screening and single-cell representation learning.
This live coding workshop is an introduction to preprocessing and the core analyses of single-cell sequencing. We will focus on the best practices while avoiding pitfalls that are commonly missed or leftover from earlier single-cell recommendations. We will briefly cover background correction, outlier filtering, transformation, and integration. The emphasis of this workshop will be annotation, differential expression, and pathway analysis. Each attendee will have access to a dataset and optional analysis environment.
Introduction to interactive AnnData analysis with cellxgene, and interactive SpatialData analysis with Napari, vitessce and more.
A small workshop on existing projection methods for high-dimensional data such as scRNA-seq, interactive plots in Python, and how to create documentation of your research that can be hosted online (including the interactive plots).
Accompanying the workshop, there will be a Google Colab notebook, a GitHub repository, and a webpage that participants can use as an active exercise and as documentation for future use.
The "Benchmarking Open Problems in Single Cell Analysis" workshop aims to address critical challenges in the field by fostering community engagement in the development of robust benchmarks. This 90-minute session will introduce participants to the core mission of Open Problems in Single-Cell Analysis, followed by an interactive session where we will build a new benchmark from scratch. As part of this tutorial, participants will not only learn about the technical aspects of setting up a benchmark within the Open Problems framework, but will also learn about essential best practices in benchmarking computational methods.
Biological data visualization is challenged by the growing complexity and size of datasets. While single-plot visualization methods struggle to capture the full picture of datasets, researchers turn to composable visualizations that are usually specialized to a domain, requiring familiarity with multiple visualization tools. To unify the creation of composable visualizations, we introduce a novel and intuitive general visualization paradigm termed "cross-layout", which integrates multiple plot types in a cross-like structure. This paradigm allows for a central main plot surrounded by secondary plots, each capable of layering additional features for enhanced context and understanding. To operationalize this paradigm, we present "Marsilea", a Python library designed for creating composable visualizations with ease. Marsilea is notable for its modularity, diverse plot types, and compatibility with various data formats. This talk will give attendees insights into composable visualizations, and they will learn how to use Marsilea to express different aspects of their single-cell or spatial omics data in a composable visualization. Marsilea is accessible to everyone with basic knowledge of Python and is open-sourced at https://github.com/Marsilea-viz/marsilea.
We will use Scanpy to discuss some of the basic pre-processing of single-cell data, and then transition to common trajectory methods. The major emphasis of the workshop will be on guidelines for trajectory detection - things a user should keep in mind when doing such analyses. We will provide example code (with Palantir and CellRank) and illustrations. All code used will be made publicly available.
Advancements in tools and data management ecosystems have led to a mature state for collaboration across research compute efforts. The refinement of code hosting, dependency management, DevOps, documentation, and automation has established a range of workflows that remove barriers and accelerate scientific endeavors across teams and communities. Still, many of these systems place an emphasis on the management and organization of code. The complexity and importance of code base management cannot be overstated, but there remains a comparable set of challenges around handling the evolution of a shared dataset. Much like shared code, shared data can be complex, iterative, transitory, and distributed. Collaborative efforts with shared data also retain the same needs for data traceability, reproducibility, and portability. Among all available collaborative resources, a common problem stands in the way of addressing these needs - data are large.
To confront this, our group has implemented an approach to using Data Version Control (DVC) for team science with bioinformatics applications and single-cell data. DVC is a data science toolset created by Iterative.ai for change tracking of inputs in machine learning workflows. At its core, this resource offers a generic framework for codified data versioning and provenance which sits on top of well-established tools. This key element of the DVC paradigm allows for the separation of data from metadata in change tracking. Object-linked metadata are managed through git while objects can be stored in a separate remote storage. This work offers a guide to the DVC framework that is intended to service requirements specific to the needs of bioinformatics and single-cell data. We show how to tailor tracking of data in a manner that can efficiently address challenges around the diverse set of intermediary single-cell outputs. We also place a particular emphasis on the need to bridge gaps in data hand-offs that may exist between hybrid collaborations of dry and wet lab scientists. Altogether, we hope this application of a lesser-known toolset offers an on-ramp to the adoption of stronger data management practices, enabling collaboration across single-cell biology and beyond.
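The separation of data from metadata described above can be sketched conceptually in a few lines of Python. This is not DVC's implementation or file format: the pointer file name, its JSON fields, and the `track` and `restore` helpers are all hypothetical, and a local directory stands in for remote storage. The point is only that the large object is stored under its content hash, while a small text pointer is what would be committed to git.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def track(data_file: Path, store: Path) -> Path:
    """Copy data into a content-addressed store and write a small pointer."""
    # Reads the whole file at once; a real tool would hash in chunks.
    digest = hashlib.sha256(data_file.read_bytes()).hexdigest()
    store.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(data_file, store / digest)  # large object, kept out of git
    pointer = data_file.with_name(data_file.name + ".pointer.json")
    pointer.write_text(json.dumps(
        {"path": data_file.name, "sha256": digest,
         "size": data_file.stat().st_size}, indent=2))
    return pointer  # small text file, suitable for version control


def restore(pointer: Path, store: Path) -> Path:
    """Recreate the data file next to its pointer from the store."""
    meta = json.loads(pointer.read_text())
    target = pointer.with_name(meta["path"])
    shutil.copyfile(store / meta["sha256"], target)
    return target


with tempfile.TemporaryDirectory() as tmp:
    workdir, store = Path(tmp) / "project", Path(tmp) / "remote"
    workdir.mkdir()
    data = workdir / "counts.h5ad"  # hypothetical file name
    data.write_bytes(b"placeholder bytes standing in for a large matrix")
    pointer = track(data, store)
    data.unlink()  # simulate a fresh checkout that has only the pointer
    restored_bytes = restore(pointer, store).read_bytes()
    pointer_fields = sorted(json.loads(pointer.read_text()))
```

Because the pointer records a content hash, any change to the data produces a different pointer, which is what lets ordinary git history double as a record of dataset versions.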
An overview of how to make your first contribution to the scverse community, even if you are not a developer, including: replying to queries on Discourse, contributing to the documentation and the best practices book, posting and using issues (even for problems you have already solved), and where to find the contributing guidelines.
As the field of single-cell RNA sequencing continues to evolve, researchers are increasingly interested in using these datasets to train foundational models for a wide range of applications. While training models on smaller datasets that fit into memory is relatively straightforward, scaling up beyond single machines presents significant technical challenges. In this workshop co-presented by TileDB and the Chan Zuckerberg Initiative, participants will receive hands-on experience training models on a large dataset comprising 70 million cells and learn about the key technologies and resources that make this possible.
Keynote presentation
Keynote presentation
Imaging-based spatial omics datasets present challenges in reliably segmenting single cells. Achieving accurate segmentation at single-cell resolution is crucial to unravelling multicellular mechanisms and understanding cell-cell communications in spatial omics studies. Despite the considerable progress and the variety of methods available for cell segmentation, challenges persist, including issues of over-segmentation, under-segmentation, and contamination from neighbouring cells. While combining multiple segmentation methods with distinct advantages has been proposed, it does not completely resolve these issues. Additionally, scalability remains an obstacle, particularly when applying these methods to larger tissues and gene panels in targeted studies.
Here we introduce Segger, a cell segmentation model designed for single-molecule resolved datasets, leveraging the co-occurrence of nucleic and cytoplasmic molecules (e.g., transcripts). It employs a heterogeneous graph structure on molecules and nuclei, integrating fixed-radius nearest neighbor graphs for nuclei and molecules, along with edges connecting transcripts to nuclei based on spatial proximity. A heterogeneous graph neural network (GNN) is then used to propagate information across these edges to learn the association of molecules with nuclei. Post-training, the model refines cell borders by regrouping transcripts based on confidence levels, overcoming issues like nucleus-less cells or overlapping cells. Benchmarks on 10X Xenium and MERSCOPE technologies reveal Segger's superiority in accuracy and efficiency over other segmentation methods, such as Baysor, Cellpose, BIDCell, and simple nuclei-expansion. Segger can be pre-trained on one or more datasets and fine-tuned with new data, even data acquired via different technologies. Its highly parallelizable nature allows for efficient training across multiple GPU machines, facilitated by recent graph learning techniques. Compared to other model-based methods like Baysor and BIDCell, Segger's training is orders of magnitude faster and more accurate, making it ideal for integration into preprocessing pipelines for comprehensive spatial omics atlases.
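The graph structure described above can be sketched in a few lines. This is an illustrative sketch only, not Segger's code: the function names, the edge-type keys, and the three radius parameters are assumptions, and the brute-force pairwise search stands in for the spatial index a real pipeline would need at scale.

```python
from itertools import combinations
from math import dist

Point = tuple[float, float]
Edge = tuple[int, int]


def radius_edges(points: list[Point], radius: float) -> list[Edge]:
    """Index pairs (i, j), i < j, of points within `radius` of each other."""
    return [
        (i, j)
        for (i, p), (j, q) in combinations(enumerate(points), 2)
        if dist(p, q) <= radius
    ]


def cross_edges(transcripts: list[Point], nuclei: list[Point], radius: float) -> list[Edge]:
    """(transcript index, nucleus index) pairs within `radius` of each other."""
    return [
        (t, n)
        for t, p in enumerate(transcripts)
        for n, q in enumerate(nuclei)
        if dist(p, q) <= radius
    ]


def build_hetero_graph(
    transcripts: list[Point],
    nuclei: list[Point],
    r_tx: float,
    r_nuc: float,
    r_cross: float,
) -> dict[tuple[str, str, str], list[Edge]]:
    """Collect the three edge sets, keyed by hypothetical edge-type triples."""
    return {
        ("tx", "near", "tx"): radius_edges(transcripts, r_tx),
        ("nuc", "near", "nuc"): radius_edges(nuclei, r_nuc),
        ("tx", "near", "nuc"): cross_edges(transcripts, nuclei, r_cross),
    }
```

A graph neural network would then pass messages along these typed edge lists; that step is outside the scope of this sketch.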
Spatial transcriptomics is revolutionizing the field of molecular biology by providing high-resolution insights into gene expression within the spatial context of tissues. This technology is crucial for annotating spatial domains (or niches), which allows researchers to understand the spatial organization of gene expression and its implications for tissue function and disease progression. Although many studies have already explored this area, current models lack versatility as they cannot run on different gene panels and must be retrained for each new task or sample.
We introduce Novae, a graph-based foundation model for spatial omics designed to overcome these limitations. Novae is a self-supervised model focused on extracting a representation of a cell within its niche context. Trained on a dataset of 21 million cells across 12 different tissues, Novae performs zero-shot inference on all gene panels, automatically correcting batch effects and creating a nested hierarchy of niches. It also supports various downstream tasks, including whole-genome expression inference, spatially variable gene analysis, and niche trajectory analysis. Overall, Novae offers a powerful and versatile tool for advancing our understanding of spatial transcriptomics and its applications in biomedical research.
Learn how to incorporate NVIDIA GPU acceleration in your day-to-day data analysis. This workshop will demonstrate how to use rapids-singlecell to accelerate traditional Scanpy and other scverse workflows. Attendees will gain practical insights into leveraging GPUs outside of deep learning, exploring the capabilities of RAPIDS and CuPy. Discover how these tools can enhance your package and improve your single-cell data analysis efficiency.
An introduction to the cookiecutter template and how to use it, followed by an introduction to the scverse ecosystem and how to get a package included in it.
The goal of this workshop is to demonstrate how spatial transcriptomics data can be used to extract subcellular insights about cell biology, primarily emphasizing analysis with Bento, a package in the scverse ecosystem that utilizes the SpatialData framework, enabling interoperability with Scanpy and Squidpy. The workshop will briefly introduce motivations and applications of spatial transcriptomics technologies for subcellular measurements. Then we will walk through how Bento models single-molecule resolution data, how it computes spatial features, and finally how these are utilized for functional analysis, i.e., annotating subcellular spatial patterns and domains, modeling gene colocalization, and measuring cell morphology. We will conclude with a short Q&A session for more open-ended discussion.
As both individual datasets and atlases grow, so too should the capacities of our data structures. Listen in to find out how AnnData is addressing big data and preparing for the future.
In order to use the best performing methods for each step of the single-cell analysis process, bioinformaticians need to use multiple ecosystems and programming languages. This is unfortunately not that straightforward. We will give an overview of the different levels of interoperability, and how it is possible to integrate them in a single workflow.
For package developers, making methods accessible is important. We will provide information on how to do this well on the package and method level.
Closing remarks