Computational Research Scientist, University of Texas Southwestern Medical Center, Dallas, Texas, United States
Introduction/Rationale: Building a knowledgebase that integrates multiple data repositories requires harmonizing the concepts, relationships, and data schemas/formats across those repositories. One approach is to design a common data model (CDM) that encompasses all concepts and relationships, transform the data into the CDM, and use ontologies to provide shared semantics.
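As a rough illustration of the approach described above, the following Python sketch maps records from two source repositories with different field names into one common class and replaces free-text species labels with ontology identifiers. Everything in it is an assumption made for illustration: the repository names (repo_a, repo_b), the field names, the Subject class, and the lookup table are hypothetical and are not taken from the AKC schema or its LinkML implementation. Only the NCBI Taxonomy identifiers for human and mouse are real.

```python
from dataclasses import dataclass

# Hypothetical lookup from source-specific species labels to ontology CURIEs.
# NCBITaxon:9606 (Homo sapiens) and NCBITaxon:10090 (Mus musculus) are real
# NCBI Taxonomy identifiers; the table itself is illustrative only.
SPECIES_TO_CURIE = {
    "homo sapiens": "NCBITaxon:9606",
    "human": "NCBITaxon:9606",
    "mus musculus": "NCBITaxon:10090",
    "mouse": "NCBITaxon:10090",
}


@dataclass(frozen=True)
class Subject:
    """Hypothetical CDM class; not the actual AKC definition."""
    subject_id: str  # identifier prefixed with its source repository
    species: str     # ontology CURIE instead of a free-text label
    source: str      # repository the record came from


def species_curie(label: str) -> str:
    """Normalize a free-text species label to an ontology CURIE."""
    key = label.strip().lower()
    if key not in SPECIES_TO_CURIE:
        raise ValueError(f"no ontology mapping for species label {label!r}")
    return SPECIES_TO_CURIE[key]


def from_repo_a(record: dict) -> Subject:
    """Transform a record from hypothetical repository A into the CDM."""
    return Subject(
        subject_id=f"repo_a:{record['subject_id']}",
        species=species_curie(record["species"]),
        source="repo_a",
    )


def from_repo_b(record: dict) -> Subject:
    """Transform a record from hypothetical repository B into the CDM."""
    return Subject(
        subject_id=f"repo_b:{record['donor']}",
        species=species_curie(record["organism_name"]),
        source="repo_b",
    )


# Two records with different schemas end up in the same shape.
subjects = [
    from_repo_a({"subject_id": "S1", "species": "Homo sapiens"}),
    from_repo_b({"donor": "D7", "organism_name": "human"}),
]
```

Once both records carry the same ontology identifier for species, a single query can select human subjects across repositories without knowing how each source originally labeled them.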
Methods: The Adaptive Immune Receptor Repertoire Knowledge Commons (AKC) is a publicly accessible repository that integrates data and knowledge about 1) adaptive immune receptors (AIRs) and AIR repertoires from the AIRR Data Commons, 2) AIR germline allele, genotype, haplotype, and population genetic data from the OGRDB and VDJbase, and 3) AIR specificity data from the IEDB and IRAD. We designed a CDM for the AKC based upon the Ontology for Biomedical Investigations, a community standard for scientific data integration, and we used the LinkML data modeling language for implementation.
Results: The AKC provides a consistent CDM for study, subject, and sample information; immune exposures and other study events; sample collection; assays and processing; and data processing and analysis workflows. The CDM also provides adaptive immunity domain knowledge for chains, receptors, antigens, epitopes, and MHC/HLA. LinkML’s flexible data modeling language allows existing data standards, such as the AIRR Standards, and ontologies from the OBO Foundry to be incorporated directly.
Conclusion: The foundation of the AKC is data integrated from these community-supported repositories and harmonized around a CDM based on widely adopted ontologies and data standards. The AKC assembles the critical mass of data required to develop highly accurate predictive algorithms for long-standing questions of critical importance (e.g., predicting AIR specificity, determining the contribution of AIR germline polymorphisms to disease propensity) and to ask questions across a large and diverse set of subjects with a variety of health and disease phenotypes.