Associate Research Scientist, Yale University, New Haven, Connecticut, United States
Introduction/Rationale: Sequencing the Adaptive Immune Receptor Repertoire (AIRR) allows the characterization of immune states in health and disease, including infectious diseases, (auto)immune diseases, and cancer. Although numerous tools exist for reconstructing B and T cell receptor (BCR and TCR) sequences and inferring clonal relationships from AIRR sequencing (AIRR-seq) data, many lack scalability, efficient sample-level parallelization, or portability to high-performance computing environments. We addressed these limitations with nf-core/airrflow (https://nf-co.re/airrflow), a scalable and reproducible Nextflow-based workflow for processing bulk and single-cell AIRR-seq data. Since its initial release, we have expanded the workflow with new functionality, including BCR and TCR sequence embedding using large language models (LLMs), immunoglobulin (IG) loci genotyping, and support for the AIRR Community germline references.
Methods: nf-core/airrflow integrates tools from the Immcantation Framework (immcantation.org), following best practices for BCR and TCR data analysis. We recently expanded the workflow to include LLM-based sequence embedding with AMULETY, as well as IG loci genotyping and novel allele detection using TIgGER. The workflow also supports the newly released AIRR Community germline reference datasets hosted in the Open Germline Receptor Database (OGRDB).
Results: We demonstrate the applicability of nf-core/airrflow by genotyping and generating sequence embeddings for publicly available BCR sequencing datasets from individuals with autoimmune diseases, including systemic lupus erythematosus, type 1 diabetes, and rheumatoid arthritis.
Conclusion: nf-core/airrflow is a comprehensive and scalable workflow for AIRR-seq data analysis. It enables a wide range of applications in immune-mediated and infectious disease research and supports the reproducible analysis of increasingly large AIRR-seq datasets.