Chris Tastad scverse Conference 2024

Chris Tastad

Senior Manager, Informatics
Cho Lab
Pathology, Molecular and Cell Based Medicine
Icahn School of Medicine at Mount Sinai

Sessions

09-10

15:01

3min

SCleeStacks: modality-organized images for containerized development, execution, and analysis

Chris Tastad

Containerization has fueled a movement around virtualization that has impacted nearly the entire landscape of software development ecosystems. The ability to isolate both environment and tool within a portable, lightweight package has reshaped many aspects of the software management paradigm. This overhaul is not limited to production workflows, as low-friction virtualization has established use cases for containerized management across development, execution, testing, interactive compute, distribution, and deployment. Even with these advancements, utilization of these conventions remains limited in some areas of research computing and bioinformatics, especially single cell biology.

Here, we present SCleeStacks - an effort to expand the accessibility of containerized workflows in the scverse. The goals of this project are to: 1) standardize and simplify the containerization and dependency landscape across the scverse ecosystem; 2) lower barriers to scverse environment sharing across teams and communities; 3) promote improvements to research software methods, publishing, and reproducibility; 4) foster upstream compatibility for development and execution of interdependent tool combinations; and 5) expand opportunities for use of scverse tools in higher-order execution frameworks such as Nextflow and Snakemake.

The central outcome in achieving these goals is the establishment of a repository of maintained, tool-level images across scverse ecosystem packages. As many uses of scverse tools focus on multi-modal data that require an intersection of package functions, a class of these images attempts to offer multi-layered tool stacks built around common workflows or data type modalities. Image structure follows common best practices in provisioning for security, performance, and compatibility through the Docker framework. Singularity is embraced for use in non-root or restricted HPC environments. A critical element of this implementation is a common design language to foster both contributions to the creation of images and their use. We expect a common application to be interactive computing during analysis and pipeline development. To address this, a specific emphasis is placed on documentation around workflows like IDE dev containers in tools such as VS Code, JetBrains, or Jupyter Notebooks. Where possible, the authors hope to sustain an open dialogue with upstream tool maintainers and the wider community, striving for collaboration that can elevate tool harmony and integration. Much like any community development effort, we expect the evolution of this resource to be guided by the needs of users and the field. With that, we hope the efforts behind this project offer something of value and can serve the growing needs of the scverse community.

Poster flash talk

Main conference room - MW 0350

09-12

09:00

60min

Data Version Control: the missing link in team science

Chris Tastad

Advancements in tools and data management ecosystems have led to a mature state for collaboration across research compute efforts. The refinement of code hosting, dependency management, DevOps, documentation, and automation has established a range of workflows that remove barriers and accelerate scientific endeavors across teams and communities. Still, many of these systems place an emphasis on the management and organization of code. The complexity and importance of code base management cannot be overstated, but there remains a comparable set of challenges around handling the evolution of a shared dataset. Much like shared code, shared data can be complex, iterative, transitory, and distributed. Collaborative efforts with shared data also retain the same needs for data traceability, reproducibility, and portability. Among all available collaborative resources, a common problem stands in the way of addressing these needs - data are large.

To confront this, our group has implemented an approach to using Data Version Control (DVC) for team science with bioinformatics applications and single cell data. DVC is a data science toolset created by Iterative.ai for tracking changes to inputs in machine learning workflows. At its core, this resource offers a generic framework for codified data versioning and provenance which sits on top of well-established tools. This key element of the DVC paradigm allows for the separation of data from metadata in change tracking. Object-linked metadata are managed through git while the objects themselves can be stored in separate remote storage. This work offers a guide to the DVC framework that is intended to serve requirements specific to bioinformatics and single cell data. We show how to tailor tracking of data in a manner that can efficiently address challenges around the diverse set of intermediary single cell outputs. We also place a particular emphasis on the need to bridge gaps in data hand-offs that may exist between hybrid collaborations of dry and wet lab scientists. Altogether, we hope this application of a lesser-known toolset offers an on-ramp to the adoption of stronger data management practices, enabling collaboration across single cell biology and beyond.
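The separation of data from metadata described above can be illustrated with a minimal sketch. This is not DVC's implementation or API. It is a toy, standard-library-only model of the general idea: a large file is hashed, the file itself is copied to a separate store keyed by that hash, and only a small pointer file (the kind of thing one would commit to git) stays next to the code. The function names, the `.pointer.json` suffix, the choice of SHA-256, and the `counts.h5ad` file name are all assumptions made for illustration.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files never need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def track(data_file: Path, store: Path) -> Path:
    """Copy the data into a hash-addressed store and write a small pointer file.

    Hypothetical helper: the pointer is what would be versioned in git,
    while `store` stands in for separate remote storage.
    """
    checksum = file_sha256(data_file)
    target = store / checksum[:2] / checksum[2:]
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():  # identical content is stored only once
        shutil.copy2(data_file, target)
    pointer = data_file.with_name(data_file.name + ".pointer.json")
    metadata = {
        "path": data_file.name,
        "sha256": checksum,
        "size": data_file.stat().st_size,
    }
    pointer.write_text(json.dumps(metadata, indent=2))
    return pointer


def restore(pointer: Path, store: Path) -> Path:
    """Rebuild the data file next to its pointer using only the recorded hash."""
    metadata = json.loads(pointer.read_text())
    checksum = metadata["sha256"]
    source = store / checksum[:2] / checksum[2:]
    destination = pointer.with_name(metadata["path"])
    shutil.copy2(source, destination)
    return destination


def demo() -> bool:
    """Track a placeholder file, delete it, and restore it from the store."""
    with tempfile.TemporaryDirectory() as tmp:
        workspace = Path(tmp) / "repo"
        store = Path(tmp) / "remote"
        workspace.mkdir()
        data = workspace / "counts.h5ad"  # placeholder name, not real data
        data.write_bytes(b"placeholder for a large single cell matrix")
        original = data.read_bytes()
        pointer = track(data, store)
        data.unlink()  # a collaborator would hold only the pointer
        restored = restore(pointer, store)
        return restored.read_bytes() == original


if __name__ == "__main__":
    print("round trip ok:", demo())
```

In this sketch a collaborator who receives only the pointer file can recover the exact bytes from the shared store, which is the property that makes versioning large datasets alongside code practical.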

Workshops

Workshop room II - MW 0234
