2021 FDA Science Forum
Standardizing the Isolation Source Metadata for the Genomic Epidemiology of Foodborne Pathogens Using LexMapr
Contributing Office: Center for Food Safety and Applied Nutrition
Abstract
Introduction
FDA’s GenomeTrakr is a public/private genomic epidemiology network for foodborne pathogen surveillance, specifically targeting pathogens isolated from food or environmental sources. The raw genome data, plus a small set of associated metadata, are made publicly available at the National Center for Biotechnology Information (NCBI). Metadata include organism name, geographical location, collection date, isolate contributor, and isolation source. The isolation source field is currently a free-text field that requires no standard terminology or structure. As the GenomeTrakr database grew to over 100,000 isolates and the isolation sources became more diverse, this field became difficult to analyze and interpret computationally.
Purpose
To maximize the use of GenomeTrakr data and make this resource FAIR (findable, accessible, interoperable, and reusable), we standardized the isolation source metadata for the whole-genome sequencing (WGS) data in publicly available GenomeTrakr records.
Methods
We evaluated and used LexMapr, a rule-based text-mining tool, to automate the curation of isolation source metadata and to assign categories from IFSAC+, an expanded source categorization schema based on the Interagency Food Safety Analytics Collaboration (IFSAC) categories. LexMapr processes the free text of the isolation source and extracts entities that are mapped to standard terms from relevant ontologies such as FoodOn, ENVO, and UBERON, among others. These terms serve as new standard descriptors for the isolation source.
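The general idea of rule-based mapping from free text to standard terms and categories can be illustrated with a small Python sketch. This is not LexMapr's actual code or API. The lookup tables, the `EX:` identifiers, and the category labels below are hypothetical placeholders chosen for illustration, not real FoodOn, ENVO, UBERON, or IFSAC+ entries. The sketch only normalizes the text and performs a greedy longest-phrase dictionary match.

```python
import re

# Hypothetical lookup: normalized phrase -> (placeholder term ID, label).
# The "EX:" IDs are invented for this sketch, not real ontology identifiers.
TERM_LOOKUP = {
    "chicken breast": ("EX:0001", "chicken breast"),
    "chicken": ("EX:0002", "chicken"),
    "romaine lettuce": ("EX:0003", "romaine lettuce"),
    "swab": ("EX:0004", "environmental swab"),
}

# Hypothetical mapping from term ID to a source category (illustrative labels).
CATEGORY_BY_TERM = {
    "EX:0001": "poultry",
    "EX:0002": "poultry",
    "EX:0003": "produce",
    "EX:0004": "environmental",
}


def normalize(text):
    """Lowercase the text, replace punctuation with spaces, and tokenize."""
    cleaned = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return cleaned.split()


def extract_terms(text, max_len=3):
    """Greedily match the longest known phrase starting at each token."""
    tokens = normalize(text)
    matches = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in TERM_LOOKUP:
                matches.append(TERM_LOOKUP[phrase])
                i += n  # skip past the matched phrase
                break
        else:
            i += 1  # no phrase starts here; move to the next token
    return matches


def categorize(text):
    """Return the matched terms and the sorted set of their categories."""
    terms = extract_terms(text)
    categories = sorted({CATEGORY_BY_TERM[term_id] for term_id, _ in terms})
    return terms, categories
```

For example, `categorize("Chicken breast, raw")` matches the two-word phrase before the single word "chicken", so the more specific term wins. A production tool would need considerably more than this, such as handling plurals, abbreviations, and misspellings, and text with no match would still require manual curation.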
Results
GenomeTrakr has a total of 9,452 unique isolation sources. LexMapr successfully processed 88% of these unique sources, as determined by manual curation and verification. After this evaluation of LexMapr, 71,886 publicly available records were curated, assigned ontology terms, and categorized using the IFSAC+ categorization schema.
Significance
The use of standard terminologies in the context of metadata for WGS is essential to facilitate data exchange and generate machine-readable resources that can expand our understanding of the dynamics of pathogen transmission across the food chain.