scverse Conference 2024

Training models on atlas-scale single-cell datasets
09-12, 09:00–10:00 (Europe/Berlin), Main conference room - MW 0350

As the field of single-cell RNA sequencing continues to evolve, researchers are increasingly interested in using these datasets to train foundation models for a wide range of applications. While training models on smaller datasets that fit into memory is relatively straightforward, scaling beyond a single machine presents significant technical challenges. In this workshop, co-presented by TileDB and the Chan Zuckerberg Initiative, participants will gain hands-on experience training models on a large dataset comprising 70 million cells and learn about the key technologies and resources that make this possible.


Participants should have some familiarity with AnnData and PyTorch. However, no prior experience with TileDB-SOMA is required. We will cover the following topics:

Section 1: CZ CELLxGENE Discover Census
The world's largest public resource providing standardized single-cell data to researchers worldwide.
Section 2: TileDB
The open-source data format and storage engine that enables efficient indexing and retrieval of large datasets stored on remote object stores like AWS S3.
Section 3: SOMA
A language-agnostic data model and API specifically designed for storing and querying single-cell data using TileDB's format.
Section 4: SOMA/Census PyTorch Loaders
Specialized PyTorch data loaders, optimized for memory-efficient training through TileDB-SOMA's support for out-of-core data access.
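
To make the idea of out-of-core access in Section 4 concrete, here is a toy sketch in plain Python. It does not use TileDB-SOMA, the Census, or PyTorch; the flat binary file stands in for an on-disk store, and the names `write_matrix`, `iter_batches`, `demo`, and `N_GENES` are hypothetical and chosen only for illustration. It shows the general pattern of streaming fixed-size batches from disk so that only one batch is held in memory at a time.

```python
import array
import os
import tempfile

N_GENES = 4  # hypothetical number of features (genes) per cell


def write_matrix(path, rows):
    """Write rows of float32 values to a flat binary file (a stand-in for an on-disk store)."""
    with open(path, "wb") as f:
        for row in rows:
            array.array("f", row).tofile(f)


def iter_batches(path, n_cols, batch_size):
    """Yield batches of rows, reading only one batch from disk at a time."""
    row_bytes = n_cols * array.array("f").itemsize
    with open(path, "rb") as f:
        while True:
            chunk = f.read(row_bytes * batch_size)
            if not chunk:
                break  # end of file: no more rows to stream
            values = array.array("f")
            values.frombytes(chunk)
            # Split the flat buffer back into rows of n_cols values each.
            yield [list(values[i:i + n_cols]) for i in range(0, len(values), n_cols)]


def demo():
    """Write a tiny 10 x N_GENES matrix to a temporary file and stream it back in batches of 4 rows."""
    rows = [[float(r * N_GENES + c) for c in range(N_GENES)] for r in range(10)]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "cells.bin")
        write_matrix(path, rows)
        return list(iter_batches(path, N_GENES, batch_size=4))


if __name__ == "__main__":
    for batch in demo():
        print(len(batch), "rows in batch")
```

With ten rows and a batch size of four, the generator yields batches of 4, 4, and 2 rows. A real training loop would consume each batch and discard it before reading the next, which is what keeps memory use independent of the total dataset size.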

By the end of the session, attendees will have a comprehensive understanding of how to access and train models on large single-cell datasets stored as SOMA experiments.

Instructions for Workshop Participants:

Set up your TileDB Cloud account and join the workshop namespace:

  1. Share your email with the TileDB team, either through this Google Form (https://forms.gle/xRDsZDh5nBLpQ3gQ9) or by emailing [email protected], to be added to the scverse-ml-workshop-2024 organization on TileDB Cloud.
  2. Accept the emailed invitation to join the organization. You'll be guided through signing up for a TileDB Cloud account if you have not already done so.
  3. Once logged in, check the upper left corner to make sure you are in the "scverse-ml-workshop-2024" namespace.

Prior Knowledge Expected: Previous knowledge expected

See also: First stable iteration of Census (SOMA) PyTorch loaders

Ryan is a Staff Software Engineer at TileDB, focused on scalable scRNA-seq data processing and storage using TileDB-SOMA. He has previously built software for distributed analysis of bulk and single-cell genomic data, in industry and as part of a CZI Human Cell Atlas grant at Mount Sinai School of Medicine, and holds a BSc in Mathematical & Computational Sciences from Stanford University.

Maximilian Lombardo is a Senior Product Applications Scientist at the Chan Zuckerberg Initiative, where he collaborates with the CELLxGENE team to engage the single-cell community and enhance the adoption of CELLxGENE tools. In his role, he focuses on educating users and driving the development of resources that support innovative single-cell research. Previously, Maximilian was a Data Scientist at Kallyope, contributing to the development of a gut-brain axis target discovery platform. His work involved integrating single-cell data with viral tracing techniques to identify novel therapeutic targets. Maximilian holds an MSc in Computational Science from the University of Amsterdam and an MA in Biotechnology from Columbia University.