Felix Petschko
Felix Petschko is currently a Master's student in Computer Science at the University of Innsbruck. He is conducting his Master's thesis research within the Computational Biomedicine Group at the Institute of Molecular Biology and the Digital Science Center (DiSC) at the University of Innsbruck, Austria.

Sessions
Scirpy is a scverse core package for single-cell analysis of adaptive immune receptor repertoires (AIRR) in Python. A key functionality is the “clonotype clustering” step, which aims to identify immune cells that are likely to recognize the same antigen or share a common ancestor. As dataset sizes have grown to millions of cells thanks to improvements in single-cell technologies, clonotype clustering can become a bottleneck in the analysis pipeline. This project aimed to make it feasible to analyze large-scale, single-cell AIRR datasets with scirpy.
There are two crucial steps to clonotype clustering: 1) computing distances between complementarity-determining region 3 (CDR3) sequences based on a similarity metric, e.g. Hamming distance; 2) identifying clusters based on the receptor configurations, i.e. identity/similarity between single-cells in terms of CDR3 sequences and V/J gene usage.
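The two steps above can be illustrated with a minimal pure-Python sketch. It is not scirpy's implementation: the `(cdr3, v_gene, j_gene)` tuple layout, the function names, and the choice to form clusters as connected components (single linkage via union-find) are assumptions made for illustration only.

```python
from itertools import combinations


def hamming(a, b):
    """Count mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def cluster_clonotypes(cells, cutoff=1):
    """Return one cluster label per cell.

    `cells` is a list of (cdr3, v_gene, j_gene) tuples (hypothetical layout).
    Two cells are linked when their V/J genes match and their CDR3
    sequences are within `cutoff` mismatches; clusters are the connected
    components of the resulting graph (an assumption of this sketch).
    """
    parent = list(range(len(cells)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in combinations(range(len(cells)), 2):
        seq_a, v_a, j_a = cells[a]
        seq_b, v_b, j_b = cells[b]
        # Step 2 condition: identical V/J gene usage.
        if (v_a, j_a) != (v_b, j_b) or len(seq_a) != len(seq_b):
            continue
        # Step 1: CDR3 sequence distance under the chosen metric.
        if hamming(seq_a, seq_b) <= cutoff:
            parent[find(a)] = find(b)

    # Relabel roots as consecutive integers in order of first appearance.
    relabel = {}
    return [relabel.setdefault(find(i), len(relabel)) for i in range(len(cells))]
```

This all-pairs loop is quadratic in the number of cells, which is the kind of cost that becomes a bottleneck at the dataset sizes described here.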
We integrated tcrdist3, a runtime-optimized implementation of the TCRdist metric, as an alternative CDR3-sequence similarity metric. The original implementation uses the Numba library for just-in-time compilation of Python code. To overcome previous limitations with large datasets, we adapted the original tcrdist3 implementation to operate with sparse matrices. We also improved the performance of the Hamming distance calculation in scirpy with a similar Numba implementation. For the identification of the clusters, we reimplemented the existing algorithms using optimized sparse matrix operations.
For 1 million T cells with 64 CPU cores, the improved version of the Hamming distance calculation achieved a speedup of ~20.8-fold (43 seconds vs. 893 seconds). The same calculation took 210 seconds with the TCRdist metric. Because of memory constraints in the old version, cluster identification was benchmarked on only 140 thousand T cells; there, the improved version achieved a speedup of ~29.9-fold (8 seconds vs. 239 seconds) with 8 CPU cores. We successfully tested the improved version on up to 8 million T cells, which was infeasible using the previous implementations due to memory and/or runtime constraints.
These enhancements significantly improve scirpy’s capability to analyze large-scale single-cell RNA sequencing data, facilitating more efficient and scalable clonotype clustering. Future work to increase scirpy’s performance will focus on providing GPU support.