
Performance Evaluation Methods for Evolving Artificial Intelligence (AI)-Enabled Medical Devices

As part of the Artificial Intelligence (AI) Program in the FDA’s Center for Devices and Radiological Health (CDRH), the goal of this regulatory science research is to develop methods for performance evaluation of model updates for artificial intelligence/machine learning (AI/ML)-enabled devices.

Overview

On March 30, 2023, the FDA’s Center for Devices and Radiological Health (CDRH) published the draft guidance document: Marketing Submission Recommendations for a Predetermined Change Control Plan (PCCP) for Artificial Intelligence/Machine Learning (AI/ML)-Enabled Device Software Functions. This draft guidance is intended to allow device manufacturers to include a plan in an FDA submission so that the device can evolve within controlled boundaries while on the market. This approach is expected to make it easier for manufacturers to modify and update their devices, while preserving the FDA’s ability to assure continued device safety and effectiveness. Although the draft guidance outlines a sound approach, some areas in the premarket evaluation of devices with PCCPs require further technical analysis to support a least burdensome path to market.

Well-curated, labeled, and representative datasets in medical applications are difficult and resource-intensive to collect, so device sponsors naturally wish to reuse their test datasets when evaluating devices with PCCPs. However, repeatedly using the same test dataset to evaluate a sequence of AI model updates can be problematic, because the AI model can end up overfitting to the test dataset. If this happens, the performance evaluation will yield misleading, overly optimistic results, and the models will fail to generalize to new data. There is a need for methods to reuse evaluation datasets safely for devices with a PCCP. Other knowledge gaps in this area include the implications of potential changes to the reference standard, how much change is acceptable to maintain an appropriate benefit-risk profile, and how to balance the plasticity and stability of continuously learning models.
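The toy simulation below illustrates this failure mode. It is a minimal sketch, not part of the FDA research program: the dataset sizes, the use of random linear scorers as stand-ins for model updates, and all variable names are assumptions chosen only for illustration. Because the features are generated independently of the labels, the true accuracy of any scorer is 50%, yet repeatedly keeping whichever update scores best on the fixed test set drives the reported test accuracy above chance, while accuracy on fresh data stays near 50%.

# Toy illustration of naive test data reuse (Python / NumPy); all names and
# sizes below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_test, n_fresh, n_features, n_rounds = 200, 100_000, 50, 30

# Pure-noise problem: the features carry no information about the labels,
# so the true accuracy of any linear scorer is 0.5.
X_test = rng.normal(size=(n_test, n_features))
y_test = rng.integers(0, 2, size=n_test)
X_fresh = rng.normal(size=(n_fresh, n_features))
y_fresh = rng.integers(0, 2, size=n_fresh)

def accuracy(w, X, y):
    return np.mean((X @ w > 0) == y)

best_w, best_test_acc = None, -np.inf
for _ in range(n_rounds):                 # "rounds of adaptivity"
    w = rng.normal(size=n_features)       # stand-in for a model update
    acc = accuracy(w, X_test, y_test)     # same test dataset reused every round
    if acc > best_test_acc:               # keep the update that looks best on it
        best_w, best_test_acc = w, acc

print(f"accuracy on the reused test set: {best_test_acc:.3f}")                        # inflated, > 0.5
print(f"accuracy on fresh data:          {accuracy(best_w, X_fresh, y_fresh):.3f}")   # ~ 0.5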

The goal of this effort is to address these issues by:

  • Developing statistical methods and theoretical results as well as performing empirical experiments and studies.
  • Releasing regulatory science tools that can be used to design studies that will continuously measure performance for evolving algorithms under a postmarket assurance plan.

Project

  • Develop Methods for Performance Evaluation of Model Updates for AI/ML-Enabled Devices with a PCCP
Figure: Different machine learning algorithms (corresponding to the five columns of panels in the figure) are modified and re-trained in each “round of adaptivity,” and subsequently tested repeatedly on the same relatively small test dataset. The row of panels titled “Naive test data reuse” shows that, as the test dataset is reused in this fashion, the measured performance metric, the area under the receiver operating characteristic curve (ROC AUC), becomes substantially inflated on the test data compared with the true performance on the population from which the fixed test dataset was drawn. The second row of panels shows that when the proposed method, Thresholdout, is used to mediate access to the test data, the bias in the measured performance values is reduced at the cost of higher uncertainty in the reported performance estimates. For additional details, please see Gossmann et al. (2021).
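For readers unfamiliar with Thresholdout (Dwork et al., 2015), the sketch below shows its core mechanism in Python. It is a simplified, hypothetical implementation, not the code used in Gossmann et al. (2021): the threshold value, the Laplace noise scale, and the example AUC values are assumptions chosen for illustration.

# Minimal sketch of a single Thresholdout query (after Dwork et al., 2015);
# the parameter values are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def thresholdout(train_metric, test_metric, threshold=0.02, sigma=0.01):
    """Answer one performance query about the held-out test dataset.

    If the training-set and test-set estimates agree to within a noisy
    threshold, only the training-set value is released, so the query reveals
    (almost) nothing about the test data. Otherwise a noise-perturbed
    test-set value is released. The added noise reduces the bias from test
    data reuse at the cost of extra variance, matching the wider uncertainty
    seen in the Thresholdout panels of the figure.
    """
    noisy_gap = abs(train_metric - test_metric) + rng.laplace(0.0, sigma)
    if noisy_gap > threshold:
        return test_metric + rng.laplace(0.0, sigma)
    return train_metric

# Hypothetical query after one round of adaptivity: AUC estimated on the
# training data vs. on the fixed test dataset.
reported_auc = thresholdout(train_metric=0.83, test_metric=0.78)
print(f"released AUC estimate: {reported_auc:.3f}")

The full method also tracks a budget that limits how many times a noisy test-set value may be released; that bookkeeping is omitted from this sketch.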

Resources

  • Feng, J., Pennello, G., Petrick, N., Sahiner, B., Pirracchio, R., & Gossmann, A. (2022). Sequential Algorithmic Modification with Test Data Reuse. Proceedings of the Thirty-Eighth Conference on Uncertainty in Artificial Intelligence, 674–684.
  • Burgon, A., Sahiner, B., Petrick, N., Pennello, G., & Samala, R. K. (2023). Methods for improved understanding of evolving AI model learning and knowledge retention across sequential modification steps. RSNA Program Book.
  • Gossmann, A. (2022). Test Data Reuse for the Evaluation of Continuously Evolving Machine Learning Algorithms in Medicine. Invited talk at the Tutorial on AI for medical image analysis in practice at MICCAI 2022. September 21, 2022.

For more information, email OSEL_AI@fda.hhs.gov.
