Using Machine Learning To Predict Antimicrobial MICs and Associated Genomic Features for Nontyphoidal Salmonella.
Abstract
Nontyphoidal Salmonella species are the leading bacterial cause of foodborne disease in the United States. Whole-genome sequences and paired antimicrobial susceptibility data are available for Salmonella strains because of surveillance efforts from public health agencies. In this study, a collection of 5,278 nontyphoidal Salmonella genomes, collected over 15 years in the United States, was used to generate extreme gradient boosting (XGBoost)-based machine learning models for predicting MICs for 15 antibiotics. The MIC prediction models had an overall average accuracy of 95% within ±1 two-fold dilution step (confidence interval, 95% to 95%), an average very major error rate of 2.7% (confidence interval, 2.4% to 3.0%), and an average major error rate of 0.1% (confidence interval, 0.1% to 0.2%). The models predicted MICs with no a priori information about the underlying gene content or resistance phenotypes of the strains. By selecting diverse genomes for the training sets, we show that highly accurate MIC prediction models can be generated with fewer than 500 genomes. We also show that our approach for predicting MICs is stable over time, despite annual fluctuations in antimicrobial resistance gene content in the sampled genomes. Finally, using feature selection, we explore the important genomic regions identified by the models for predicting MICs. To date, this is one of the largest MIC modeling studies to be published. Our strategy for developing whole-genome sequence-based models for surveillance and clinical diagnostics can be readily applied to other important human pathogens.
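The three metrics reported in the abstract can be illustrated with a minimal sketch. It is not the authors' evaluation code. It assumes the conventional definitions: a prediction is accurate when it falls within ±1 two-fold dilution step of the laboratory MIC, a very major error is a truly resistant isolate predicted as susceptible, and a major error is a truly susceptible isolate predicted as resistant. The function names, the single resistance breakpoint (which ignores the intermediate category), and the example MIC values are all hypothetical.

```python
import math

def within_one_dilution(true_mic, pred_mic):
    """Return True if pred_mic is within +/-1 two-fold dilution step of true_mic."""
    # One two-fold dilution step corresponds to a difference of 1 on a log2 scale.
    return abs(math.log2(pred_mic) - math.log2(true_mic)) <= 1.0 + 1e-9

def is_resistant(mic, resistant_breakpoint):
    """Simplified two-category call (assumption: no intermediate category)."""
    return mic >= resistant_breakpoint

def evaluate(true_mics, pred_mics, resistant_breakpoint):
    """Compute dilution accuracy, very major error rate, and major error rate."""
    pairs = list(zip(true_mics, pred_mics))
    accurate = sum(within_one_dilution(t, p) for t, p in pairs)

    resistant = [(t, p) for t, p in pairs if is_resistant(t, resistant_breakpoint)]
    susceptible = [(t, p) for t, p in pairs if not is_resistant(t, resistant_breakpoint)]

    # Very major error: truly resistant, predicted susceptible.
    vme = sum(not is_resistant(p, resistant_breakpoint) for _, p in resistant)
    # Major error: truly susceptible, predicted resistant.
    me = sum(is_resistant(p, resistant_breakpoint) for _, p in susceptible)

    return {
        "accuracy": accurate / len(pairs),
        "vme_rate": vme / len(resistant) if resistant else 0.0,
        "me_rate": me / len(susceptible) if susceptible else 0.0,
    }

if __name__ == "__main__":
    # Hypothetical MICs in micrograms per milliliter; not data from the study.
    true_mics = [2, 4, 32, 64, 1, 32]
    pred_mics = [4, 4, 16, 64, 4, 8]
    print(evaluate(true_mics, pred_mics, resistant_breakpoint=32))
```

In this toy data the third isolate (true MIC 32, predicted 16) counts as accurate within one dilution step yet is still a very major error, because the prediction crosses the assumed breakpoint. That is one reason the error rates are reported alongside dilution accuracy rather than in place of it.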