RADIOLOGICAL DEVICES ADVISORY PANEL
February 3, 2004
M.D. Anderson Cancer Center
Minesh P. Mehta, M.D.
University of Wisconsin-Madison
(participating by conference call)
Prabhakar Tripuraneni, M.D.
Scripps Clinic Green Hospital
Emily F. Conant, M.D.
University of Pennsylvania School of Medicine
Temporary Voting Members
Brent Blumenstein, Ph.D.
Thomas Ferguson, M.D.
Washington University School of Medicine
Elizabeth A. Krupinski, Ph.D.
University of Arizona
Stephen Solomon, M.D.
Johns Hopkins Hospital
David Stark, M.D.
Downstate Medical Center (formerly of)
Nonvoting Consumer Representative
Charles Barry Burns, M.S.
University of North Carolina School of Medicine
Deborah J. Moore, B.S.
Proxima Therapeutics, Inc.
Director, Division of Reproductive, Abdominal, and Radiological Devices
Robert J. Doyle
Panel Executive Secretary
Robert A. Phillips, Ph.D.
Chief, Radiological Devices Branch
William Sacks, Ph.D., M.D.
Nicholas Petrick, Ph.D.
Office of Science and Technology
Robert F. Wagner, Ph.D.
Office of Science and Technology
Acting Panel Chair Geoffrey S. Ibbott, Ph.D., called the meeting to order at 9:06 a.m. He noted for the record that the voting members present constituted a quorum and asked the panel members to introduce themselves.
Panel Executive Secretary Robert Doyle noted that Dr. Ibbott had been appointed acting chair for the duration of the meeting and read the appointment to temporary chair status. He also read the appointment of Brent Blumenstein, Ph.D., Thomas Ferguson, M.D., Elizabeth A. Krupinski, Ph.D., Stephen Solomon, M.D., and David Stark, M.D., to temporary voting status for the duration of the meeting. Mr. Doyle then read the conflict of interest statement and stated that the agency had no conflicts to report. Finally, he noted that Radiological Devices Advisory Panel meetings have been tentatively scheduled for May 18, August 10, and November 16 of this year.
Nancy Brogdon thanked Ernest Stern, industry representative; Wendy Berg; and Harry Genant for their service; they all had completed 4-year terms on the panel at the end of January.
Robert A. Phillips, Ph.D., Chief, Radiological Devices Branch, observed that few PMAs had come before the panel or are expected to come before the panel. No PMAs have been approved in the past year. Four new reviewers have joined the division: Nancy Wersto, Kish Chakrabarti, Barbara Shoback, and Sophie Paquerault.
No one asked to speak.
Robert F. Wagner, Ph.D., OST, CDRH, FDA, presented an overview of contemporary ROC methodology in medical imaging and computer-assist modalities. ROC refers to receiver operating characteristic or relative operating characteristic; some people prefer the terminology operating characteristic. The imaging field uses the first meaning.
No FDA guidance has been issued on how to use the concepts of sensitivity, specificity, and ROC analysis to assess performance of diagnostic and computer assist systems. Dr. Wagner summarized efforts toward consensus development on the issues and outlined the fundamentals and limitations of the ROC paradigm.
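The three concepts named above can be illustrated with a minimal sketch. It is not taken from Dr. Wagner's presentation: the toy ratings, the threshold, and the function names are hypothetical. It also uses the empirical (rank-based) area under the ROC curve rather than the fitted Az reported later in these minutes.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    truth: Sequence[int], ratings: Sequence[float], threshold: float
) -> Tuple[float, float]:
    """Call a case positive when its rating is at or above the threshold."""
    tp = sum(1 for t, r in zip(truth, ratings) if t == 1 and r >= threshold)
    fn = sum(1 for t, r in zip(truth, ratings) if t == 1 and r < threshold)
    tn = sum(1 for t, r in zip(truth, ratings) if t == 0 and r < threshold)
    fp = sum(1 for t, r in zip(truth, ratings) if t == 0 and r >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def empirical_auc(truth: Sequence[int], ratings: Sequence[float]) -> float:
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [r for t, r in zip(truth, ratings) if t == 1]
    neg = [r for t, r in zip(truth, ratings) if t == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = abnormal, 0 = normal; ratings are reader confidence (0-100).
truth = [1, 1, 1, 0, 0, 0]
ratings = [90, 60, 30, 70, 20, 10]

sens, spec = sensitivity_specificity(truth, ratings, threshold=50)
area = empirical_auc(truth, ratings)
print(f"sensitivity={sens:.3f} specificity={spec:.3f} AUC={area:.3f}")
```

Moving the threshold trades sensitivity against specificity along a single ROC curve, whereas the area summarizes performance across all thresholds.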
Kathy O’Shaughnessy, Ph.D., vice president, regulatory affairs, R2 Technology, introduced the R2 attendees and consultants. She noted that the ImageChecker CT is a computer-aided detection (CAD) system designed to assist radiologists in the detection of pulmonary nodules during review of multidetector computed tomography (MDCT) scans of the chest. It is intended to be used as a “second reader,” alerting the radiologist after his or her initial reading of the scan to regions of interest that may have been initially overlooked.
Heber MacMahon, M.D., consultant to R2, noted that lung nodule detection is a requirement for every chest CT exam, no matter what the original clinical indication is. The radiologist must first detect a nodule, then decide what action is appropriate. Nodule management strategies require first determining whether or not a nodule is “actionable” (a determination based on many factors, including size, morphology, and clinical history), then recommending the appropriate action. Actions can include additional imaging studies, biopsy, or resection. Research findings indicate that among patients with lung disease, initial nodule miss rates range from 24 to 39 percent. The trend toward thinner CT sections offers improved ability to detect and characterize lesions; however, the increase in the number of images means that more opportunities for misses exist. Visual scrutiny alone is no longer reliable enough.
Ronald A. Castellino, M.D., Chief Medical Officer, R2 Technology, defined CAD as computer algorithms that automatically identify regions of interest on a medical image for radiologists to evaluate. The goal is to decrease “observational oversights,” i.e., false negatives (FNs). The ImageChecker CT automatically detects regions of interest in chest CT exams with features suggestive of solid pulmonary nodules. It is intended to be used as a supplemental review, after an initial assessment has been made. The radiologist is responsible for the final interpretation of the image data.
Dr. Castellino described the device, presented a diagram of how it works, and showed a video clip demonstrating operation of the device. He noted that the only component under review is the CAD software component because the other components have already been cleared. The software is designed to search for solid lung nodules that are at least 4 mm in diameter. The nodules should have an approximately spherical shape; a smooth, lobulated, or spiculated margin; and soft tissue density of at least –100 Hounsfield units (HU). Other lung parenchymal densities, such as linear strands and ground glass opacities, are not targeted. Each exam provides a median of 160 images and a median of three false positive (FP) marks. Most FP marks can be readily dismissed.
The clinical study was designed around an ROC study and produced a combined measure of efficacy and safety; it was designed in cooperation with FDA staff. The study involved three main components: case collection; a reference truth panel; and a multireader, multicenter (MRMC) ROC study. MDCT is designed to help detect all solid nodules between 4 and 30 mm. Lung nodules are typically at least 10 mm in diameter when biopsied or resected. Consensus on actionability is the only practical standard that captures all solid nodules of clinical concern.
Five centers contributed consecutive, nonselected MDCT chest exams. Patients consisted of adults at least 18 years old who had a variety of clinical indications. Cases with more than 10 nodules were excluded. Contiguous slices of ≤ 3.0 mm collimation were provided. Five sites contributed a total of 63 nodule-present cases (148 slices) and 88 nodule-absent cases (176 slices). Nodule-present cases (by report) were those of patients with documented lung or extrathoracic cancer; however, the nodules themselves were not proven by biopsy to be cancer. Nodule-absent cases (by report) were those of patients with or without documented cancer. The final “truth” was determined by the truthing panel. Patients’ demographics were comparable across the two types of cases, although the nodule-present cases had a median age of 66 and nodule-absent cases had a median age of 55.
Dave Miller, Director, Statistical Analysis, Ovation Research Group, presented the clinical study methodology and results. The goal of the reference truth panel was to fully identify all nodules in the case set, rate their actionability (i.e., surveillance or intervention), and define the reference truth for the ROC study. Eight panels, each consisting of three board-certified radiologists, independently reviewed the cases; two passes were made to reduce observational oversights. A total of 11 radiologists participated in at least one of the panels. All participants had at least 6 months’ experience in reading thin-slice chest CT images.
Each radiologist independently reviewed a set of cases. The sponsor compiled the radiologists’ findings to determine the cases for which they had a consensus. Nodules smaller than 4 mm or larger than 30 mm were excluded. That process yielded 95 nodules on which there was unanimity as to actionability. For cases in which the readers did not have consensus, the reviewers reexamined the images of the areas of disagreement. That second pass yielded 47 consensus nodules, for a total of 142 consensus nodules in 65 nodule-present cases. The remaining 86 cases were categorized as nodule absent.
The goal of the MRMC study was to demonstrate that review of CAD output improves the performance of radiologists reviewing MDCT with respect to their ability to accurately find and identify actionable nodules. The outcome measure was the difference in the number of nodules deemed actionable before and after use of the ImageChecker (i.e., the change in the area under the ROC curve, or AzΔ). The study was conducted with board-certified radiologists who had at least 3 months’ experience reading chest MDCT. Fifteen readers read 90 cases (48 with at least one actionable nodule, 42 without any actionable nodules) both pre-CAD and post-CAD. Each case was divided into four quadrants. Ratings were evaluated against the reference truth.
Nodules were the unit of analysis. The quadrant truth was computed from the nodule truth. The sponsor used a quadrant approach because location-specific ROC methods have not been developed for MRMC. Quadrants were rated by ROC readers. The case, not the quadrant, is the unit of analysis for p values and confidence intervals.
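The minutes do not record the exact rule used to compute quadrant truth from nodule truth. The sketch below assumes the simplest plausible rule: a quadrant is positive when it contains at least one nodule that the reference panel judged actionable. The quadrant labels and the `Nodule` type are hypothetical.

```python
from typing import Dict, Iterable, NamedTuple

# Hypothetical labels for the four quadrants of a case.
QUADRANTS = ("upper_right", "lower_right", "upper_left", "lower_left")


class Nodule(NamedTuple):
    quadrant: str      # quadrant in which the nodule lies
    actionable: bool   # reference-panel judgment for this nodule


def quadrant_truth(nodules: Iterable[Nodule]) -> Dict[str, bool]:
    """Assumed rule: a quadrant is positive if any of its nodules is actionable."""
    truth = {q: False for q in QUADRANTS}
    for nodule in nodules:
        if nodule.quadrant not in truth:
            raise ValueError(f"unknown quadrant: {nodule.quadrant}")
        truth[nodule.quadrant] = truth[nodule.quadrant] or nodule.actionable
    return truth


# One hypothetical case with three nodules, only one of them actionable.
case = [
    Nodule("upper_right", True),
    Nodule("upper_right", False),
    Nodule("lower_left", False),
]
truth_by_quadrant = quadrant_truth(case)
print(truth_by_quadrant)
```

Under this assumption each 4-quadrant case contributes four binary truth values against which reader ratings can be scored.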
Readers practiced on three cases with a trainer present. Ambient lighting was adjusted to radiologist preference. Readers were instructed to search only for 4 to 30 mm nodules; to rate each post-CAD case immediately; and to consider age, gender, and clinical indication from the radiology report. ROC curves were created for individual readers, and reader performance change from pre- to post-CAD was plotted. Readers with the worst pre-CAD readings tended to improve the most. The average-reader ROC curves for pre-CAD and post-CAD use do not cross.
The sponsor repeated the primary data analysis using ANOVA-after-jackknife and bootstrap methodology to demonstrate that the results were not sensitive to study design. In addition, the sponsor repeated the analysis using different reference truth and random truth; results were significant for all analyses. Additional analysis found a significant reduction in misses post-CAD.
Mr. Miller concluded by stating that the study demonstrates that the ImageChecker CT improves reader performance for detection of actionable lung nodules. The results are robust to analytical methodology, choice of reference truth (i.e., consensus vs. majority), and variation associated with panelist selection.
Pablo N. Delgado, M.D., Clinical Associate Professor, University of Missouri at Kansas City, described the beta testing of the device. The testing took place at a resident training program at a private practice hospital and outpatient imaging center with a “typical Midwest community patient base.” The goals of the test were to assess the functionality of the CAD system, to answer R2 development group questions about reading practices and workflow, and to determine future training needs. It was not designed to assess clinical effectiveness. The study design involved retrospective review of CT cases from the hospital; the cases were read with and without CAD by 3 faculty and 2 residents. An R2 applications specialist was onsite for one day and shadowed the retrospective reading session. Participants received a description of the CAD algorithm and reviewed selected institutional cases.
The testing found a rapid learning curve for the ImageChecker CT. The system functioned as expected without technical errors or malfunctions; it identified missed nodules, and false CAD marks were easily dismissed by radiologists. Dr. Delgado emphasized that radiologists must review all images initially without CAD, because CAD may not identify all nodules. He concluded by reiterating that MDCT has led to increasing numbers of detailed diagnostic images and that many published studies have documented limitations in radiologists’ detection of lung nodules. CAD is an effective tool in the detection of lung nodules on MDCT chest exams.
Dr. O’Shaughnessy concluded the sponsor’s presentation by stating that the ImageChecker CT CAD Software System significantly improves radiologists’ ROC performance for detecting solid pulmonary nodules between 4 and 30 mm in size. She reviewed the proposed indications for use and emphasized that the device is intended to be an adjunct.
Panel members asked questions about the extent to which the device would reduce radiologists’ miss rate, and they expressed concern over how to factor rates of FPs and FNs into the study results. Mr. Miller replied that the data suggest a 20 percent reduction in miss rate, which is similar to the experience with CAD for mammography. False marks are easily dismissible. Participants were reacting to true positives. Dr. Stark asked Dr. Castellino to try to calculate the net decrease in FNs achieved with the device.
Another concern was bias in case selection; most cases were people with extrathoracic disease. Panel members also asked for information on user reactions to the device during beta testing and the average time spent per case, which sponsor representatives answered to their satisfaction.
Dr. Phillips described the device and its components and reviewed the indications for use. He then listed the FDA review team members. All FDA preclinical reviews found the device satisfactory.
William Sacks, Ph.D., M.D., Medical Officer, provided clinical background on the ImageChecker CT. The device is for chest CT scans made for any indication. It detects solid lung nodules between 4 and 30 mm using CAD, but it is not a diagnostic tool. CAD is intended to reduce the rate of missed lung nodules and increase the user’s sensitivity for detecting nodules. Dr. Sacks reviewed the instructions for use and emphasized that the reader must first review films unaided. The ImageChecker CT then marks the candidate nodules, and the reader looks at those marks. If CAD fails to mark a nodule judged actionable on unaided review, the reader should retain his or her initial judgment.
Issues involving this PMA include the CAD target, which consists of actionable nodules. In addition, the definition of truth in the clinical study was based not on biopsy or histology but on an expert panel. The unit of analysis is the person, broken down into lung quadrants. The endpoints are sensitivity and specificity of action recommendation and/or ROC area. Dr. Sacks reviewed the clinical study methodology.
Nicholas Petrick, Ph.D., Office of Science and Technology, CDRH, reviewed the clinical conclusions. Pre- and post-CAD ROC curves do not cross, and no substantial pre- and post-CAD crossing occurs in either averaged or individual ROC curves. AzΔ is a statistically appropriate performance measure. Panel variability in determining which nodules were actionable is an important issue: Only one-third of cases were unanimously defined as actionable.
The data set consisted of 90 cases divided into 360 quadrants. Variability, confidence interval, and significance testing consisted of ANOVA-after-jackknife and bootstrap methods. Dr. Petrick reviewed elements of each method. Statistical analysis found significant improvement in Az pre- to post-CAD (p = 0.003 for the ANOVA-after-jackknife analysis and p<0.001 for the bootstrap analysis). Results of the ANOVA-after-jackknife and bootstrap analysis are consistent. The analysis is limited, however, because it did not take into account any variation in the expert panel; this element is important because the study relied on panel truth, not a “gold standard” of truth. An important question is how the results would change with a different truthing panel.
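The general idea of a case-level bootstrap for a pre- to post-CAD change in ROC area can be sketched as follows. This is a simplified single-reader illustration under stated assumptions, not the sponsor's MRMC analysis: it resamples cases only (not readers or truth panelists), uses the empirical area rather than a fitted Az, and runs on hypothetical ratings.

```python
import random
from typing import List, Sequence, Tuple


def empirical_auc(truth: Sequence[int], ratings: Sequence[float]) -> float:
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [r for t, r in zip(truth, ratings) if t == 1]
    neg = [r for t, r in zip(truth, ratings) if t == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_delta_auc(
    truth: Sequence[int],
    pre: Sequence[float],
    post: Sequence[float],
    n_boot: int = 2000,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile 95% interval for AUC(post) - AUC(pre), resampling cases in pairs."""
    n = len(truth)
    if not 0 < sum(truth) < n:
        raise ValueError("need at least one positive and one negative case")
    rng = random.Random(seed)
    deltas: List[float] = []
    while len(deltas) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        t = [truth[i] for i in idx]
        if not 0 < sum(t) < n:
            continue  # skip resamples that lack one of the two classes
        delta = empirical_auc(t, [post[i] for i in idx]) - empirical_auc(
            t, [pre[i] for i in idx]
        )
        deltas.append(delta)
    deltas.sort()
    return deltas[int(0.025 * n_boot)], deltas[int(0.975 * n_boot) - 1]


# Hypothetical ratings for one reader on eight cases (1 = actionable nodule present).
truth = [1, 1, 1, 1, 0, 0, 0, 0]
pre = [80, 40, 30, 60, 50, 20, 10, 35]
post = [85, 70, 30, 60, 50, 20, 10, 35]

observed = empirical_auc(truth, post) - empirical_auc(truth, pre)
lo, hi = bootstrap_delta_auc(truth, pre, post)
print(f"observed change in AUC = {observed:.4f}; 95% interval = ({lo:.4f}, {hi:.4f})")
```

Because each resample keeps a case's pre-CAD and post-CAD ratings together, the interval respects the paired design. Accounting for reader and panel variability, as discussed by Dr. Petrick, would require resampling those sources as well.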
The secondary analysis applied bootstrapping to the panel of experts as well. Again, the results were consistent among panels even though a decrease in significance was expected. The analysis found similar p values as before (<0.001 to 0.002). The secondary analysis took into account the random nature of the expert panel for defining actionable nodules. All variations confirm a statistically significant improvement in Az from pre- to post-CAD. The analysis is appropriate for assessment of devices when only panel truth is available.
CAD standalone performance results are important because radiologists can use the information to weigh their confidence in the CAD markings. Such results could also be used as a benchmark for future improvements to the system. To assess standalone performance, the sponsor examined the unanimous panel findings and found that many of those nodules did not meet the criteria for a solid discrete spherical density. A second panel of five independent radiologists then reevaluated the nodules for appearance, placing each nodule in one of two categories: classic or nonclassic. Sensitivity was about 59 percent, with a median false marker rate of two to three per case. The median diameter was 7.9 mm. Minimal bias with regard to size was found. The sponsor concluded that the large variation in performance of the CAD was based on physicians’ assessment of nodules as classic.
Dr. Sacks reviewed the clinical conclusions. Although the results found that a gain in Az of 0.02 was statistically significant, the clinical significance is not clear. CAD is intended to increase the user’s sensitivity, and the gain in Az understates the relative gain in user sensitivity. When CAD is used according to instructions (that is, retaining all judgments of actionability even if unmarked by CAD), the user always maintains or increases sensitivity and always maintains or increases the FP fraction. Thus, any statistically significant improvement in Az means an even greater relative gain in sensitivity; loss of sensitivity is possible only if instructions are not followed. Can we infer from improved average user performance (i.e., AzΔ) in this study that the average user will improve performance with CAD in current clinical practice? In actual clinical practice with CAD, the unaided Az could be lowered by failure to read with adequate vigilance. In that case, the aided Az could be lower than current (CAD-less) practice. Dr. Sacks asked the panel to consider the implications of such lowering of vigilance for judging the safety and efficacy of CAD.
Turning to the issue of labeling, Dr. Sacks stated that two rules, if followed by CAD users, will help prevent missing more nodules than in CAD-less reading: (1) Always read unaided images first, and as carefully as if no CAD were available. Doing so will help keep the Az of aided reading higher than the Az of CAD-less reading. (2) Never back off from the unaided judgment of actionability of a nodule if CAD fails to mark it. This rule will prevent sensitivity from falling below that of the current, CAD-less sensitivity.
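The logic behind rule 2 can be shown with a small sketch. It assumes a simplified model in which the aided call for each site is the unaided call OR an accepted CAD mark, so CAD can add positive calls but never remove them. The data and function names are hypothetical.

```python
from typing import List, Sequence


def aided_call(unaided_call: bool, cad_marked: bool, accepts_mark: bool) -> bool:
    """Rule 2: an unaided positive call is never withdrawn; CAD can only add calls."""
    return unaided_call or (cad_marked and accepts_mark)


def fraction_called(calls: Sequence[bool], truth: Sequence[bool], target: bool) -> float:
    """Fraction of sites with the given truth status that were called positive."""
    selected = [c for c, t in zip(calls, truth) if t == target]
    return sum(selected) / len(selected)


# Hypothetical sites: the first four hold actionable nodules, the last four do not.
truth = [True, True, True, True, False, False, False, False]
unaided = [True, True, False, False, False, False, True, False]
cad_marked = [True, False, True, False, True, False, False, True]
accepts = [True, False, True, False, False, False, False, True]

aided: List[bool] = [
    aided_call(u, m, a) for u, m, a in zip(unaided, cad_marked, accepts)
]

sens_unaided = fraction_called(unaided, truth, True)
sens_aided = fraction_called(aided, truth, True)
fpf_unaided = fraction_called(unaided, truth, False)
fpf_aided = fraction_called(aided, truth, False)
print(f"sensitivity {sens_unaided:.2f} -> {sens_aided:.2f}; "
      f"FP fraction {fpf_unaided:.2f} -> {fpf_aided:.2f}")
```

In this model neither sensitivity nor the FP fraction can decrease relative to the unaided read. The guarantee depends on rule 1 as well: if the unaided read itself is done less carefully because CAD is available, the starting point is lower than in CAD-less practice.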
David Stark, M.D., stated that the sponsor’s study involves enormous biological variation, and numerous coincidental clinical issues come into play. AzΔ is a red herring. The device does not address issues such as reading under conditions of fatigue because the radiologist still must do conventional work. The real problem is the FN rate, for which the device does not produce much improvement. In one of two studies, some degradation of performance was demonstrated. The FP rate is derived using a large denominator. The consequences of FPs are serious and include unnecessary biopsies and follow-up scans. In addition, the study does not provide data on use with contrast media.
Another concern is that the study ignored many human factors. The radiologists in the study had a narrow task that did not reflect real-world conditions. The radiologist has to protect the patient from numerous FPs. The training and warnings that will be given to the radiologist are limited, and the temptation to misuse the product is significant. Radiologists in the study did not have to deal with artifacts such as arm placement or use of the device with contrast media.
The ImageChecker CT CAD is an ambitious and complex product, but real-world pressures on radiologists are of concern. Effectiveness has not been demonstrated. A reread might be just as effective as the device. A postmarketing follow-up placebo study is necessary. The achieved gain in ROC performance is not clinically significant, and the p values do not justify effectiveness for FDA approval. This product will help some people but will hurt others in direct and indirect ways; it is not approvable.
Brent Blumenstein, Ph.D., stated that the sponsor’s statistical methodology was adequate, but certain aspects are problematic. For example, no measures of uncertainty were provided for clinical measures. AzΔ measures device performance, not clinical performance. In addition, the cases were artificially sampled, and population prevalence is likely not reflected in the data set. It is difficult to assess clinical impact without an assumed prevalence.
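One way to see why an assumed prevalence matters is that positive predictive value depends on prevalence even when sensitivity and specificity are fixed. The sketch below applies Bayes' rule with hypothetical sensitivity and specificity values that are not taken from the PMA.

```python
def positive_predictive_value(
    sensitivity: float, specificity: float, prevalence: float
) -> float:
    """Probability that a positive call is a true positive (Bayes' rule)."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Hypothetical operating point, evaluated at an enriched and a low prevalence.
SENSITIVITY = 0.80
SPECIFICITY = 0.90

ppv_enriched = positive_predictive_value(SENSITIVITY, SPECIFICITY, prevalence=0.50)
ppv_low = positive_predictive_value(SENSITIVITY, SPECIFICITY, prevalence=0.05)
print(f"PPV at 50% prevalence: {ppv_enriched:.3f}")
print(f"PPV at  5% prevalence: {ppv_low:.3f}")
```

The same reader performance yields a much lower share of true positives among positive calls when the condition is rare, so results from an enriched case set do not translate directly into clinical impact.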
The correlation structure is related to quadrants. It is likely, however, that the correlation between the upper and lower right quadrants is stronger than the correlation between the two upper quadrants. The computations did not take that into account. Moreover, the panel’s knowledge of patient identity contributed to additional correlation structures; that possible impact on results was not taken into account. The computational method presumably assumes independent assessments, but would the p value have been different had correlation between assessments been taken into account?
The experiment did not measure intrareader variability, so the extent to which a measure of intrareader variability would modify the AzΔ p value is not known. Intrareader variability likely would be particularly important in computing the variability for clinical measures. Artificial scaling in ROC methodology depends on assumptions about reader consistency, but clinical measures that depend on ROC do not take that into account. Had it been taken into account in this study, the results might have been different.
Statistical methods depend on a definition of truth, and the sponsor did the best that could be done, but the results are conditional on acceptance of that definition of truth. Truth is subject to degeneration, however. A study of the impact of variations (or degeneration) in readings would be useful, perhaps using covariates, different reading times, or sample quadrants.
Heber MacMahon, M.D., provided additional information in response to earlier panel comments. With regard to whether prompting for a second read could improve performance, he noted that because the reader does not reread the entire study but only examines specific marks, the opportunity to generate additional true positives is small. He also elaborated on why AzΔ is relatively small: The readers were highly vigilant, they only had to find nodules, and they worked in an ideal reading environment. They start out at a high level, so there is not so much change on the second read. Case selection was also a factor because the cases had a high probability of nodules and were not selected for subtle abnormalities. In a clinical environment, a 20 or 30 percent reduction in missed nodules might translate to an increase of 5 percent in caught nodules. The sponsor is amenable to postmarket studies.
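The relationship between a relative reduction in misses and the gain in detected nodules can be worked through as simple arithmetic. The baseline sensitivity of 80 percent used here is an assumption chosen for illustration; the minutes do not state the baseline behind Dr. MacMahon's figures.

```python
def sensitivity_after_miss_reduction(
    baseline_sensitivity: float, relative_miss_reduction: float
) -> float:
    """New sensitivity when the miss rate (1 - sensitivity) shrinks by a given fraction."""
    miss_rate = 1.0 - baseline_sensitivity
    return baseline_sensitivity + relative_miss_reduction * miss_rate


BASELINE = 0.80  # assumed unaided sensitivity, not from the source

gains = {}
for reduction in (0.20, 0.25, 0.30):
    new_sensitivity = sensitivity_after_miss_reduction(BASELINE, reduction)
    gains[reduction] = new_sensitivity - BASELINE
    print(f"{reduction:.0%} fewer misses: sensitivity {BASELINE:.0%} -> "
          f"{new_sensitivity:.0%} (gain of {gains[reduction] * 100:.0f} points)")
```

Under this assumption, a 20 to 30 percent reduction in misses corresponds to a gain of 4 to 6 percentage points in detected nodules. The higher the baseline sensitivity, the smaller the absolute gain for the same relative reduction in misses.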
1. Please discuss whether the data in the PMA support the conclusion that the CAD can reduce observational errors by helping to identify overlooked actionable lung nodules on chest CTs. In particular, given that use of the CAD produced a statistically significant improvement in ROC performance, please discuss whether (a) the use of an expert panel is appropriate for determining actionable nodules, given that a tissue “gold standard” is not feasible.
The panel concurred that the device can reduce observational errors. However, FDA should consider ways to obtain more data; for example, were the actionable nodules indeed actionable?
1b. [Please discuss whether] actionable nodules are a reasonable target for a lung CT CAD to be judged safe and effective.
The panel concurred that actionable nodules are a reasonable target, but follow-up is needed on patients to find out whether the nodules were truly actionable. Once the product is approved, some radiologists might not follow instructions, making the device unsafe. Panel members also wanted to know how many true negatives turned into FPs through use of the device.
1c. [Please discuss whether] the achieved gain in ROC performance (Az) demonstrates safety and effectiveness of the CAD.
The panel did not reach consensus on the question. Several members noted that the safety question is partly dependent on radiologists’ following the instructions for use. Others were unconvinced that effectiveness had been demonstrated: The device performs adequately, but the sponsor provided no measures of confidence bounds or clinical efficacy. FPs are of concern. In addition, the sponsor did not provide data from Europe, where the device is currently marketed. Data on reproducibility in a clinical context are needed. Several panel members felt that despite the device’s limitations, any increase in nodule detection is of benefit.
2. Please discuss whether the labeling of this device, including the indications for use, is appropriate based on the data provided in the PMA.
Panel members suggested several modifications to the labeling. Because it is inaccurate, the language concerning significance and the language referring to “high sensitivity and low false positive CAD marker rates” should be removed. The phrase “automatic CAD processing requires no user interaction” is not true because the radiologist must deal with FPs. Other panel members were not so concerned about the FP rate and noted that it compared favorably with that of mammography. The panel agreed that the labeling should include the two rules described by Dr. Sacks: (1) Always read unaided images first, and as carefully as if no CAD were available. (2) Never back off from the unaided judgment of actionability of a nodule if CAD fails to mark it. The labeling should also emphasize that certain types of nodules, such as ground glass nodules, were excluded from the analysis.
3. Please discuss whether the sponsor’s proposed training plan for radiologists is adequate. If not, what other training would you recommend?
Panel members suggested that at each location where the device is used, the sponsor should train a “super user” who would have additional training. In addition, the sponsor should provide a training CD-ROM that would provide guidance on discerning FPs and other demonstration cases.
4. If the PMA were to be approved, please discuss whether the above, or any other issues not fully addressed in the PMA (a) require post-market surveillance measures in addition to the customary Medical Device Reporting (MDR), etc. or (b) suggest the need for a post-approval study.
Panel members agreed that prospective data that will test the system in real conditions are needed. The follow-up also should evaluate the clinical impact of the device and its efficacy in less common situations (e.g., with patients who cannot put their arms over their head) and with pediatric cases.
No one asked to speak.
Executive Secretary Doyle read the voting options. Dr. Mehta stated that he would abstain from voting due to poor telephone transmission quality. A motion was made to recommend that the device not be approved, but it was defeated in a 5-2 vote with 1 abstention. The panel voted unanimously to approve the device with the following conditions:
1. The sponsor must provide results of the clinical study by patient; in other words, what did CAD find, and what was the actual outcome?
2. The labeling must state (1) that the radiologist must always read the CT films unaided by the device first, as carefully as though there were no CAD device, and (2) that the radiologist should never back off from unaided judgment of actionability of a nodule if CAD fails to mark it. In addition, the labeling should delete the second paragraph in the instructions for use, which refers to fatigue and lapses.
3. Training should be required, and the sponsor should provide a CD-ROM resource. In addition, the sponsor should consider some sort of remote review.
4. The sponsor should work with FDA to design and conduct a postmarketing study to collect case data. The study should evaluate the impact of CAD on other disease detection, and exclusion criteria should not be as strict as those for the study on which the PMA is based. The postmarket study should include children.
Panel members stated that the conditions should ensure the safety and efficacy of the device. They urged collection of additional data and reanalysis of existing data to help clarify the efficacy issue. Panel members also expressed disappointment that the sponsor did not provide a clinical analysis and that the agency did not require it. It was noted that the viewpoints of both lead panel reviewers were overridden by the rest of the committee. Other panel members suggested that any improvement in diagnosis will lead to improvement in patient care.
I certify that I attended this meeting of the Radiological Devices Advisory Panel on February 3, 2004, and that these minutes accurately reflect what transpired.
Robert J. Doyle
I approve the minutes of the February 3, 2004, meeting as recorded in this summary.
Geoffrey S. Ibbott, Ph.D.
Summary prepared by
Caroline G. Polk
Polk Editorial Services
1112 Lamont St., NW
Washington, DC 20010