Data Version Control: the missing link in team science scverse Conference 2024

09-12, 09:00–10:00 (Europe/Berlin), Workshop room II - MW 0234

Advances in tools and data management ecosystems have brought collaboration across research computing efforts to a mature state. The refinement of code hosting, dependency management, DevOps, documentation, and automation has established a range of workflows that remove barriers and accelerate scientific endeavors across teams and communities. Still, many of these systems emphasize the management and organization of code. The complexity and importance of code base management cannot be overstated, but a comparable set of challenges remains around handling the evolution of a shared dataset. Much like shared code, shared data can be complex, iterative, transitory, and distributed. Collaborative efforts with shared data also have the same needs for traceability, reproducibility, and portability. Across all available collaborative resources, one common problem stands in the way of addressing these needs: data are large.

To confront this, our group has implemented an approach to using Data Version Control (DVC) for team science with bioinformatics applications and single cell data. DVC is a data science toolset created by Iterative.ai for tracking changes to inputs in machine learning workflows. At its core, it offers a generic framework for codified data versioning and provenance that sits on top of well-established tools. A key element of the DVC paradigm is the separation of data from metadata in change tracking: object-linked metadata are managed through git, while the objects themselves can be stored in separate remote storage. This work offers a guide to the DVC framework intended to serve the specific needs of bioinformatics and single cell data. We show how to tailor the tracking of data so that it efficiently addresses the challenges posed by the diverse set of intermediary single cell outputs. We also place particular emphasis on the need to bridge gaps in data hand-offs that may exist in hybrid collaborations between dry and wet lab scientists. Altogether, we hope this application of a lesser-known toolset offers an on-ramp to the adoption of stronger data management practices, enabling collaboration across single cell biology and beyond.
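The separation of metadata from data objects described above can be illustrated with a small, self-contained sketch. This is not DVC's implementation or API. It is a toy model, using only the Python standard library, of the general idea: a large file is stored under its content hash in a separate location, and only a small pointer file (the kind of thing that would be committed to git) stays with the project. The function names `track` and `restore`, the `.pointer.json` suffix, and the file name `counts.h5ad` are all hypothetical choices made for illustration.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so a large file is never read into memory at once."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def track(data_file: Path, remote_dir: Path) -> Path:
    """Copy the data into content-addressed storage and write a small pointer file.

    Hypothetical helper: the pointer file stands in for the lightweight
    metadata that would be versioned in git instead of the data itself.
    """
    checksum = file_md5(data_file)
    stored = remote_dir / checksum[:2] / checksum[2:]
    stored.parent.mkdir(parents=True, exist_ok=True)
    if not stored.exists():  # identical content is stored only once
        shutil.copy2(data_file, stored)
    pointer = data_file.with_name(data_file.name + ".pointer.json")
    metadata = {
        "path": data_file.name,
        "md5": checksum,
        "size": data_file.stat().st_size,
    }
    pointer.write_text(json.dumps(metadata, indent=2))
    return pointer


def restore(pointer: Path, remote_dir: Path) -> Path:
    """Recreate the data file next to its pointer from the stored object."""
    metadata = json.loads(pointer.read_text())
    checksum = metadata["md5"]
    stored = remote_dir / checksum[:2] / checksum[2:]
    target = pointer.with_name(metadata["path"])
    shutil.copy2(stored, target)
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        project = Path(tmp) / "project"
        remote = Path(tmp) / "remote"  # stand-in for separate remote storage
        project.mkdir()
        data = project / "counts.h5ad"  # placeholder name and contents
        data.write_bytes(b"placeholder single cell matrix")
        pointer = track(data, remote)
        data.unlink()  # a collaborator starts with only the pointer file
        restored = restore(pointer, remote)
        print(pointer.name, "->", restored.name)
```

In this sketch a collaborator who receives only the pointer file can recover the exact bytes it refers to, which is the property that makes hand-offs traceable. In real use the DVC command-line tool handles these steps, and its file formats and storage layout should be taken from its own documentation rather than from this model.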

An introduction to the use of Data Version Control (DVC) as a solution for managing shared datasets in collaborative bioinformatics research. We demonstrate how DVC's data versioning and provenance features can bridge gaps in data hand-offs between wet and dry lab scientists, promoting traceability, reproducibility, and portability in team science.

Prior Knowledge Expected: Previous knowledge expected

Chris Tastad

Senior Manager, Informatics
Cho Lab
Pathology, Molecular and Cell Based Medicine
Icahn School of Medicine at Mount Sinai

This speaker also appears in:

SCleeStacks: modality-organized images for containerized development, execution, and analysis
