2023 FDA Science Forum
Predicting AI model behavior on unrepresented subgroups: A test-time approach to increase variability in a finite test set
Contributing Office: Center for Devices and Radiological Health
Abstract
Background:
Artificial intelligence (AI) models typically do not use patient attributes such as sex or race during decision making, in order to avoid disparate treatment. Despite this, AI models have shown systematic differences in performance between sex and race subgroups even when trained without access to these attributes. The causes of these systematic differences can be broadly divided into two categories: biological differences (e.g., male vs. female, White vs. Black) and patient-specific characteristics such as disease severity, underlying conditions, or other factors that depend on data collection or model development practices. The degree to which biological differences drive the systematic performance differences is not well understood.
Purpose:
Using model confidence measurements for the classification of COVID-19 status on chest X-ray (CXR) data, we assessed whether the confounding relationship between patient subgroups and model confidence is caused by biological differences or only by patient-specific characteristics.
Methodology:
We developed a novel approach to assess how variation in characteristics within (base) and across (mixed) subgroups affects model confidence. Linear interpolation along a convex hull between CXR samples is used to create virtual samples containing a mixture of characteristics from the original samples, simulating previously unseen combinations of characteristics in a series of systematic experiments. We assume that if biological differences are a confounder in the model classification, then mixing characteristics from different base subgroups should decrease model confidence. If performance differences are instead due to patient-specific characteristics, the resulting mixed confidence should fall between the confidence measurements of the base subgroups.
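The interpolation step can be illustrated with a minimal sketch. It assumes each CXR sample is a same-shaped numeric array and that a virtual sample is a convex combination of the originals (non-negative weights summing to one). The function name `convex_mix` and the weighting scheme are hypothetical and are not taken from the study, which may have mixed samples differently.

```python
import numpy as np


def convex_mix(images, weights):
    """Return a virtual sample as a convex combination of same-shaped images.

    Hypothetical helper for illustration only; not the study's implementation.
    """
    # Stack into one array of shape (n_images, ...); raises if shapes differ.
    stacked = np.stack([np.asarray(im, dtype=np.float64) for im in images])
    w = np.asarray(weights, dtype=np.float64)

    if w.shape != (stacked.shape[0],):
        raise ValueError("need exactly one weight per image")
    # Convexity: weights are non-negative and sum to one, so the result
    # stays inside the convex hull of the input images.
    if np.any(w < 0) or not np.isclose(w.sum(), 1.0):
        raise ValueError("weights must be non-negative and sum to 1")

    # Weighted sum over the image axis.
    return np.tensordot(w, stacked, axes=1)


# Toy 2x2 "images" standing in for CXR samples from two base subgroups.
sample_a = np.zeros((2, 2))
sample_b = np.full((2, 2), 4.0)
virtual = convex_mix([sample_a, sample_b], [0.25, 0.75])
```

With two images this reduces to ordinary linear interpolation. Sweeping the weight from 0 to 1 would generate a series of virtual samples between the two originals.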
Results:
For the model studied, every mixed subgroup either produced a confidence that was an average of the base-subgroup confidence measurements or differed only slightly from the base subgroups. No mixed subgroup showed a dramatic decrease in model confidence, indicating that biological differences from sex and/or race were not a confounding factor.
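The comparison between mixed and base confidence can be sketched as a simple decision rule. The abstract does not quantify a "small" difference or a "dramatic" decrease, so the `tolerance` threshold, the function name `compare_mixed_to_base`, and the returned labels are all assumptions made for illustration.

```python
def compare_mixed_to_base(conf_a, conf_b, conf_mixed, tolerance=0.05):
    """Label a mixed-subgroup mean confidence relative to its two base subgroups.

    Hypothetical rule for illustration; `tolerance` is an assumed threshold,
    not a value reported by the study.
    """
    low, high = min(conf_a, conf_b), max(conf_a, conf_b)

    if low <= conf_mixed <= high:
        # Consistent with patient-specific characteristics driving differences.
        return "between"
    if low - tolerance <= conf_mixed <= high + tolerance:
        # Outside the base range, but only by a small margin.
        return "small_difference"
    if conf_mixed < low:
        # The pattern expected if biological differences were a confounder.
        return "decrease"
    return "increase"


label = compare_mixed_to_base(0.80, 0.90, 0.85)
```

Under the stated hypothesis, only a "decrease" label would point toward biological differences as a confounder. The other outcomes match the pattern reported here.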
Conclusion:
In this study, we show that, for the clinical task, models, and data used in model development, patient-specific characteristics within subgroups consistently have a greater effect on the systematic differences in model confidence than biological differences between subgroups.