FOOD AND DRUG ADMINISTRATION
CENTER FOR DEVICES AND RADIOLOGICAL HEALTH
RADIOLOGICAL DEVICES ADVISORY PANEL
FEBRUARY 3, 2004
The Panel met at 9:00 a.m. in Salons B-D of the Gaithersburg Marriott Washingtonian Center, 9751 Washingtonian Boulevard, Gaithersburg, Maryland, Geoffrey S. Ibbott, Ph.D., Acting Chairman, presiding.
GEOFFREY S. IBBOTT, Ph.D., Acting Chairman
BRENT BLUMENSTEIN, Ph.D., Temporary Voting Member
CHARLES B. BURNS, M.S.P.H., Non-Voting Consumer Rep.
EMILY F. CONANT, M.D., Voting Member
THOMAS FERGUSON, M.D., Temporary Voting Member
ELIZABETH KRUPINSKI, Ph.D., Temporary Voting Member
MINESH P. MEHTA, M.D., via teleconference, Chairman
DEBORAH J. MOORE, Non-Voting Industry Representative
STEPHEN SOLOMON, M.D., Temporary Voting Member
DAVID STARK, M.D., Temporary Voting Member
PRABHAKAR TRIPURANENI, M.D., Voting Member
ROBERT DOYLE, Executive Secretary
NICHOLAS PETRICK, Ph.D.
ROBERT A. PHILLIPS, Ph.D.
WILLIAM SACKS, Ph.D., M.D.
ROBERT F. WAGNER, Ph.D.
RONALD CASTELLINO, M.D.
PABLO DELGADO, M.D.
HEBER MacMAHON, M.D.
KATHY O'SHAUGHNESSY, Ph.D.
Call to Order and Panel Introduction, Geoffrey S. Ibbott, Ph.D., Acting Chairman
FDA Introductory Remarks, Robert J. Doyle, Executive Secretary
Update on FDA Radiology Activities, Robert A. Phillips, Ph.D.
Open Public Hearing
Interested persons may present data, information, or views, orally or in writing, on issues pending before the committee
Open Committee Discussion
Charge to the Panel, Geoffrey S. Ibbott, Ph.D.
Overview of Contemporary ROC Methods, Robert F. Wagner, Ph.D.
Presentations on P030012 by Sponsor
Introduction, Kathy O'Shaughnessy, Ph.D.
Current Clinical Practice, Heber MacMahon, M.D.
Device Description and Clinical Trial Introduction, Ronald Castellino, M.D.
Clinical Study, Dave Miller
User Experience, Pablo Delgado, M.D.
Summary, Kathy O'Shaughnessy, Ph.D.
Presentations on P030012 by FDA
PMA Overview, Robert Phillips, Ph.D.
Clinical Background, William Sacks, Ph.D., M.D.
Clinical Results, Nicholas Petrick, Ph.D.
PMA Review Summary, William Sacks, Ph.D., M.D.
Reports by Panel Lead Reviewers
David Stark, M.D.
Brent Blumenstein, Ph.D.
Presentation of FDA Questions
Panel Discussion
Open Public Hearing
Interested persons may present data, information, or views, orally or in writing, on issues pending before the committee
Open Committee Deliberations
Panel Recommendation(s) and Vote
DR. IBBOTT: I would like to call this meeting of the Radiological Devices Panel to order. I also want to request that everyone in attendance at this meeting be sure to sign in on the attendance sheet that is available outside the door. I would note for the record that the voting members present constitute a quorum, as required by 21 CFR Part 14.
At this time I would like each panel member at the table to introduce himself or herself and state his or her specialty, position title, institution, and status on the panel.
I'll begin with myself. Some of you have already figured out that I'm not Dr. Mehta. Thanks to the vagaries of air travel and weather, Dr. Mehta is unable to be here but is joining us by speaker phone.
I'm Geoff Ibbott. I'm a medical physicist. I work at the University of Texas, M.D. Anderson Cancer Center in the Department of Radiation Oncology and Radiation Physics. I'm a voting member on this panel and have been for several years. Obviously I'm standing in as chair for this meeting.
Then, Charles, let's start with you and we'll go around the table and introduce ourselves.
MR. BURNS: Charles Burns, Professor of Radiologic Science at the University of North Carolina. My primary expertise is imaging diagnostic physics and I'm a nonvoting consumer representative.
DR. IBBOTT: Thank you.
DR. MOORE: I'm Deborah Moore. I'm the Vice President of Regulatory and Clinical Affairs for Proxima Therapeutics. I'm the industry representative for the panel and a nonvoting member.
DR. STARK: I'm David Stark. My current title is President of MRI of Dettum in Massachusetts. I'm a clinical radiologist. I've been a chairman for close to nine years and I know many of you. I'm pleased to be here. Thank you.
DR. TRIPURANENI: Prabhakar Tripuraneni. I'm head of Radiation Oncology at Scripps Clinic in La Jolla, California. I'm in practice as a full-time clinician, a radiation oncologist, and I am a voting member. I think this is my first or second meeting on the panel.
DR. DOYLE: I'm Bob Doyle. I'm the Exec. Sec. of this panel.
DR. BLUMENSTEIN: I'm Brent Blumenstein. I'm a biostatistician in private practice. I'm normally on the General and Plastic Surgery Panel.
DR. SOLOMON: I'm Steve Solomon. I'm a radiologist at Johns Hopkins Hospital. I'm a consultant to the panel.
DR. FERGUSON: I'm Tom Ferguson, professor emeritus of cardiothoracic surgery at Washington University School of Medicine, St. Louis. I'm a temporary voting member on this panel. I'm on the Cardiovascular Device Panel.
DR. CONANT: I'm Emily Conant. I'm the Chief of Breast Imaging at University of Pennsylvania and sort of half research and half clinical at this point. I'm a voting member.
DR. KRUPINSKI: I'm Elizabeth Krupinski from the University of Arizona. I'm a research professor in the Department of Radiology. My area of expertise is observer performance and image perception studies. I'm a voting member.
MS. BROGDON: I'm Nancy Brogdon. I'm not a member of the panel. I'm the liaison to the agency. I'm the Director of the Division of Reproductive, Abdominal, and Radiological Devices.
Dr. Mehta, would you like to introduce yourself?
DR. MEHTA: Yes, please. I'm Minesh Mehta. I'm a radiation oncologist in terms of specialty and I'm the Chair of the Department of Human Oncology at the University of Wisconsin. Generally when I'm there I'm chair of the panel but today I guess I'm listening in.
DR. IBBOTT: All right. Thank you, everyone. Mr. Doyle would now like to make some introductory remarks.
DR. DOYLE: Well, first on the agenda here is appointment of the Acting Chairperson. Pursuant to authority granted under the Medical Devices Advisory Committee Charter dated October 27, 1990, and as amended August 18, 1999, I appoint Geoffrey Ibbott, Ph.D., as Acting Chairperson of the Radiological Devices Panel Meeting on February 3, 2004. This is signed by David Feigal, the Director of the Center for Devices and Radiological Health.
Now I would like to read the appointment of temporary voting status. Again pursuant to the authority granted under the Medical Devices Advisory Committee Charter dated October 27, 1990, and as amended August 18, 1999, I appoint the following individuals as voting members of the Radiological Devices Panel for the meeting on February 3, 2004, and they are as follows:
Brent Blumenstein, Ph.D., Thomas Ferguson, M.D., Elizabeth A. Krupinski, Ph.D., Stephen Solomon, M.D., and David Stark, M.D.
For the record, these individuals are special government employees and consultants to this panel under the Medical Devices Advisory Committee. They have undergone the customary conflict of interest review and have reviewed the material to be considered at this meeting. Again, signed by David W. Feigal for the Center for Devices and Radiological Health.
Finally, the conflict of interest statement. The following announcement addresses conflict of interest issues associated with this meeting and is made part of the record to preclude even the appearance of impropriety.
To determine if any conflict existed, the agency reviewed the submitted agenda for the meeting and all financial interests reported by the committee participants. The agency has no conflicts to report.
In the event that the discussions involve any other products or firms not already on the agenda for which an FDA participant has a financial interest, the participant should excuse himself or herself from such involvement and the exclusion will be noted for the record.
With respect to all other participants we ask in the interest of fairness that all persons making statements or presentations disclose any current or previous financial involvement with any firm whose products they may wish to comment upon.
Now, if there is anyone who has anything to discuss concerning these matters which I have just mentioned, please advise me now and we can leave the room to discuss them. Seeing none, the FDA seeks communications with industry and the clinical community in a number of different ways.
First, the FDA welcomes and encourages pre-meetings with sponsors prior to all IDE and PMA submissions. This affords the sponsor an opportunity to discuss issues that could impact the review process. Second, the FDA communicates through the use of guidance documents. Toward this end, the FDA develops two types of guidance documents for manufacturers to follow when submitting a premarket application.
One type is simply a summary of the information that has historically been requested on devices that are well understood in order to determine substantial equivalence.
The second type of guidance document is one that develops as we learn about new technology. FDA welcomes and encourages the panel and industry to provide comments concerning our guidance documents. I would also like to remind you that the meetings of the Radiological Devices Panel for the remainder of this year are tentatively scheduled for May 18th, August 10th, and November 16th.
You may wish to pencil these dates in on your calendar but please recognize that these dates are tentative at this time. I'll repeat them in case you didn't get those. May 18th, August 10th, and November 16th.
DR. IBBOTT: Thank you, Mr. Doyle.
At this point Nancy Brogdon, who is Director of the Division of Reproductive, Abdominal, and Radiological Devices of the Office of Device Evaluation has a few words she would like to say.
MS. BROGDON: Thank you, Dr. Ibbott. We have three panel members whose terms just expired on January 31st. They are not present today but we wanted to recognize publicly their contributions to the panel.
The first is Mr. Ernest Stern. Mr. Stern was the Chairman and CEO of Thales Components located in Totowa, New Jersey, and he was the industry rep on the panel for the past four years. He is now retired from Thales.
Mr. Stern effectively represented various industries served by this panel and used his position on the panel to apprise other panel members of commercial considerations that they should take into account when making recommendations on the various applications under review.
Second is Dr. Wendy Berg. Dr. Berg was the Director of Breast Imaging in the Department of Radiology at University of Maryland at Baltimore. She served on the panel for four years as a voting member. Dr. Berg brought to the panel a high degree of expertise in the field of mammography.
That was continually called upon as novel mammography related devices were reviewed by the panel. In addition, when asked, she provided written reviews of complex devices applications that the agency used as part of our in-house review process.
Third is Dr. Harry Genant. Dr. Genant is Professor of Medicine and Epidemiology, Orthopedics, and Surgery at the University of California at San Francisco. He also served as a voting member for four years. Dr. Genant brought to the panel a broad spectrum of expertise with special emphasis on bone densitometry. His probing questions and insightful comments on the pros and cons of the devices being considered were very helpful to the agency as it reviewed the safety and effectiveness of new devices.
We thank all of these past panel members. Each will be sent a thank-you from the Commissioner along with a mounted service plaque. Thank you.
DR. IBBOTT: Thank you.
Dr. Robert Phillips, the Chief of the Radiology Branch of the Office of Device Evaluation will now give a brief update on the FDA radiology activities. Dr. Phillips.
DR. PHILLIPS: Well, good morning again. As you can see by the absence of meetings between December '02 and now, we have not had a whole bunch of brand new PMAs that we've brought to the panel. In fact, in the last year we have not approved any PMAs.
However, there have been some changes in the branch itself and we have brought four new people on board as reviewers. These are Nancy Wersto who comes to us from industry. She's a radiological physicist and her interest area is in radiation therapy products.
Then we have Kish Chakrabarti who comes to us from the mammography side of the center. He is a physicist. His area of interest is mammography and imaging systems. Kish, are you here today? No.
Dr. Barbara Shawback comes to us from outside. She's a medical officer and her area is study and design in rheumatology.
And then we just had a new employee come on board, Sophie Packerel. She is a physicist who comes from the University of Chicago and her area is CAD systems.
Those are the four people that have come on board, and that ends my talk. Thank you.
DR. IBBOTT: Thank you. We'll now proceed with the first of two half-hour open public hearing sessions for this meeting. The second half hour open public hearing session will follow the panel discussion this afternoon.
Both the Food and Drug Administration and the public believe in a transparent process for information gathering and decision making. To ensure such transparency at the open public hearing session of the advisory committee meeting, FDA believes that it is important to understand the context of an individual's presentation.
For this reason, FDA encourages you, the open public hearing speaker, at the beginning of your written or oral statement to advise the committee of any financial relationship that you may have with the sponsor, its product and, if known, its direct competitors.
For example, this financial information may include the sponsor's payment of your travel, lodging, or other expenses in connection with your attendance at the meeting. Likewise, FDA encourages you at the beginning of your statement to advise the committee if you do not have any such financial relationships. If you choose not to address this issue of financial relationships at the beginning of your statement, it will not preclude you from speaking.
No individual has given advance notice of wishing to address the panel. If there is anyone now wishing to address the panel, please identify yourselves at this time.
Seeing none, I would like to remind public observers at this meeting that while this portion of the meeting is open to public observation, public attendees may not participate except at the specific request of the chair.
We can now begin the first open public portion of the meeting. We will now, as I said, proceed with the open committee discussion portion of this meeting, which has been called for the consideration of PMA P030012 for a computer-aided detection (CAD) device that assists a physician in identifying actionable, solid nodules in CT images of the lung.
The first presentation will be by Dr. Robert F. Wagner of the FDA who will give an overview of contemporary ROC methods such as may be used in measuring the effectiveness of the CAD and other imaging devices.
The sponsor, R2 Technology, Inc., will then state its case for the PMA and they will be followed by the FDA with its review of the device. We will proceed now with Dr. Wagner's presentation.
DR. WAGNER: Cybersource as I am, let us see if I can -- okay. Progress or regress? Let's not start from the back. Marvelous.
Thank you very much, Bob. I'm glad we planned this together this way. Good morning to the members of the panel, my colleagues and visitors today. I must acknowledge the fact that Dr. Bill Sacks and I were awakened by our respective wives at our respective homes every two hours this morning to see what the weather would be like to see if we would be able to make it and what time we should really get up. We are working against that as our background.
I would also like to thank my colleagues for giving me this opportunity to present this tutorial information on an overview of the contemporary ROC methodology as it is used today in the field of medical imaging and computer assisted devices.
Of course, most of us know what the letters stand for. ROC stands for receiver operating characteristic. This is the historic name that comes down to us from the field of radar in signal detection studies where the problem is you're looking at a field of clutter and the question is is there an airplane in that clutter.
In the field of psychology and visual perception, in eye and brain coordination studies, this subject is often called the relative operating characteristic. Some people are just wary of the R and refer to this simply as the operating characteristic, because that's really what it is.
Those of us in the field of medical imaging have retained the name of receiver operating characteristic. I think it is because of our devotion to the classic literature from about 30 years or so ago that we have just retained, the conservative people that we are. I see a person who has worked in this field looking back at us.
Well, now here is an outline of the talk. We will spend a few minutes talking about efforts toward consensus development on the present issues. Then we'll move right into the ROC paradigm. We'll talk about how it gets complicated by the problem of reader variability. How the multiple reader multiple case, or so-called MRMC ROC paradigm, arose to address this problem of reader variability.
Since the ROC is a measurement, you have to have a meter stick of some kind so we'll talk about measurement scales. There will be a categorical scale, patient management or action scale and a probability scale that we'll talk about.
Then for today's submission, and submissions like it, there are additional complications from the problem of location uncertainty, from the problem of not really knowing the truth and dealing with uncertainty in the truth. Since the truth is uncertain, you really don't know how many effective number of samples you really have.
When you have a system that's going to cue readers about the possibility of lesions on a case, there is a problem of reader vigilance that we will discuss. Finally, we'll give a little wrap-up which I won't have to give because Bob Phillips just presented it for me.
Let's start off now with efforts toward consensus development on the present issues. The fact is that at the moment we do not have an explicit FDA guidance on how to submit and review issues like the present one. There has been a lot of work going on, and there is a deep background as to how we got here.
The basic idea is how do you use the classic concepts of sensitivity, specificity, and ROC analysis to assess performance of diagnostic imaging and computer-assisted systems. Especially since there are many new issues and levels of complexity that come to the fore as more complex technologies emerge.
At the moment, you see, there is really no software to do the assessment task of the problem we have before us. That's why I would like to talk piecemeal about all the different pieces, what is known and what exists at the moment, because the sponsor had to put together a creative combination of these many things. So, continuing on this little laundry list, I'll give you an historical laundry list of efforts toward consensus development on these present issues.
That's RSNA. Most of you recognize that. That's the big Radiological Society of North America meeting that's held every year in November in Chicago that makes this weather look very mild today. Then following RSNA by a few months is the big SPIE medical imaging meeting. At the SPIE meetings we generally handle the more technical aspects of the issues that come up at the RSNA.
Then there's a society that meets every two years called the Medical Image Perception Society, of which Elizabeth Krupinski on our panel has been president for, I think it has been, four years. Elizabeth is the President of the Medical Image Perception Society. We hold various workshops and publish literature every two years.
In all these meetings every few years we do note progress in this field. There is tremendous progress going on but it's without a doubt still an evolving work in progress. We are still not at the holy grail point that we would like to be at but a lot of progress has indeed been made.
Then at the good old FDA, at our center, CDRH. One of the methods that I'll be talking about today is the so-called multiple reader multiple case, or MRMC, scheme, which has already been used for several submissions.
It was used to break the log jam that was holding back digital mammography from the market place so the MRMC scheme that I'll talk about in a few minutes was used there. It has been used for all successful submissions of digital mammography PMAs to our center.
This method that we'll talk about in a few moments has also been used for a successful submission in the area of a computer aid for lung nodule detection on chest x-ray film that is in some way analogous to the present submission but it's just on plain film.
NCI, the National Cancer Institute, also has a lung image database consortium and workshops. This is an NCI-funded group of five universities, and the principal director of that project, I thought I saw him come in a moment ago. There he is, Larry Clarke.
There are five universities that work as part of this consortium and they are seeking consensus on a number of things, one of which is how to put together a database of annotated films of the kind that you would use, annotated CT slice images of the kind you would use to train and test a classifier in this field of computer-aided detection and diagnosis in lung cancer screening for nodules.
So that project is about half-way through its five-year history. A good two years underway right now. They are also addressing consensus on the many issues that you have to deal with when you want to deal with such a product.
For example, how do you keep score statistically? Once you know how to keep score, then you can start to design the size of a database. How do you outline the nodules? How do you keep score when there's a hit when there is just finite overlap between what is known of the lesion and what the reader marks? We'll talk about this in a few moments.
Now, two of us here in our center have been quite active members of this LIDC from the beginning. Let me see if I have another comment here. Yeah. The thing I would like to bring to your attention this morning is that there has been a great amount of communication among all these resources here. A number of us in our center here are active members of the research community in this field.
Many of us here and sitting just behind me have been very active in this area of applying these methods to several of the submissions in the area of imaging and computer-aided diagnosis. Several of us are very active members of Larry Clarke's group here.
What we have tried to do is see this as several quarters, four quarters if you will, of a quadrangle, all holding the windows open to the others, so the people who come in to us from industry at any given moment will know what is the state of the art from academia, from our own center, and from the LIDC.
We presented them all the papers, all the current drafts even, and made sure that everyone knows what's on the other people's mind methodology wise that is outside the area of anything that is proprietary. Anything that is not proprietary is all strictly methodology or statistics. We have tried to keep these communication channels as open as we could.
Here we go with the promised little tutorial and the fundamentals of the ROC paradigm itself. The idea is, of course, that you have two populations, one a population of actually diseased people. You might think of these as people with diabetes, for example, and a population of people who do not have the disease.
You would like to have a test that puts out a result something like a volt meter or a biochemical assay or, in the case of a simple blood sugar test, this would just be the blood sugar concentration. You would love to have the world such that the two populations would be separated and you could just drop a threshold in here and say these patients are diseased and these patients can go home and not worry about it.
Now, in the field of medical imaging, as those of us who have done work in that field know, you don't have a simple meter or biochemical assay. What you get is a reader looking at about a million pixels of a picture, trying to get the features out of it and reduce that to what we call the subjective likelihood, the subjective judgment of the likelihood that the case is diseased.
Now, as I say, this is really not quite the way the diabetes blood sugar test works but if you think of what I am about to tell you in that context for the next few minutes, you won't be far off base. It's not precise but it wouldn't be misleading.
So here is what happens more typically. The two populations are not separated. The diseased population and the nondiseased population as far as their test result is concerned have a very great overlap. The idea is now who do you send home and who do you send on for further workup or people that you want to treat for a condition.
Those of you who have seen this before, what I've just done I've taken these two and dropped this population down so that you won't get mixed up with the colors. Now we have the nondiseased cases and the diseased cases on the same axis, the same relative position. Now in a practical situation with the overlap, now we have to set ourselves a threshold.
If this is a blood sugar test, for example, you could set it at 150 blood sugar level. If you do that, you'll pick up about half of the actual diabetic patients so we say we have a true positive fraction of 50 percent but you have to pay for this price. You have about a 10 percent false positive fraction so here is this point, 50 percent true positive and roughly 10 percent false positive.
We call this a less aggressive mind set and I think you'll see the reason for that in just a moment. So if we get a little bit more aggressive to try to pick up more patients in our sieve, we might set the threshold down here at 100 instead of 150. Now we get about 80 percent of the diabetic patients and now at the price of about 20 percent false positive or 25 percent. Here I've put this point about 80 percent and 25 percent.
Let's get even more aggressive, and what I mean by that is I want to pick up more diseased patients in my sieve, the sieve being the test. If you set the threshold in the 90s, now we might get almost 95 percent of the actual diabetic patients in our sieve, but then we have to pay the price of 50 percent of the nondiabetic patients picked up, so now we have roughly a 95 percent sensitivity and roughly a 50 percent false positive fraction.
Now, you can take this to the extreme and we talk about this particular test all the time and I think this might not work because the threshold now -- oh, it did work. Okay. We can put the threshold all the way to the left and call everybody to the right of this diseased and we would get all the diabetic patients. There's a little mark right up here. We would get also -- the price we would pay is we would have to call everybody who is not a diabetic a diseased patient here so we would generate that point.
I think you can see and let your imagination go wild that you can certainly fill in all these points. Don't blink, anyone. I saw Dr. Bob Doyle blink there so I have to go back and do that again. Instead of working up more and more levels of aggressiveness, you could back off. You could start off with everybody at the sick point and then just back off, move the threshold the other way and fill in the complete ROC curve. You can see at this time of day I'm very easily amused.
Okay. Here is the overall picture now. This is the case of the schematic of, let us say, blood sugar as a test for diabetes. These are these two populations and the way they overlap and here is the corresponding ROC curve with the level of aggressiveness increasing.
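The threshold-sweep idea just described can be sketched numerically. This is a minimal illustration only, not anything from the submission: the population means and spreads below are made up, loosely echoing the blood-sugar-for-diabetes example, and the numbers they produce will not exactly match the percentages quoted in the talk.

```python
import random

random.seed(0)

# Hypothetical test values for two overlapping populations
# (made-up means and standard deviations).
nondiseased = [random.gauss(100, 25) for _ in range(1000)]
diseased = [random.gauss(150, 35) for _ in range(1000)]

def operating_point(threshold):
    """Call everyone at or above `threshold` positive; return
    (false positive fraction, true positive fraction)."""
    fpf = sum(x >= threshold for x in nondiseased) / len(nondiseased)
    tpf = sum(x >= threshold for x in diseased) / len(diseased)
    return fpf, tpf

# Sweeping the threshold from very aggressive (call everyone diseased)
# to very conservative traces out one point per setting: the ROC curve.
roc = [operating_point(t) for t in range(0, 301, 5)]

for t in (150, 100):
    fpf, tpf = operating_point(t)
    print(f"threshold {t}: TPF {tpf:.2f}, FPF {fpf:.2f}")
```

Each choice of threshold is one "mind set": lowering it (more aggressive) raises both the true positive fraction and the false positive fraction, which is exactly the trade-off the curve displays.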
Now, it can happen and, in fact, we've seen things like this in our center and you see this in the laboratory once in a while, the two populations could fall right on top of one another so that a test cannot actually discriminate between the two conditions so what we've done here is just drop this population and this population on top of each other. Now if you generate an ROC curve the way I just showed you, you would generate what we call the chance line or guessing line.
Toward the other extreme you could have a test that separates the two populations very well. In that case, as we move the threshold across from less aggressive to more aggressive, we'll generate this ROC curve. Now we have the guessing line, we have the ROC curve corresponding to an almost typical clinical laboratory test, and we have the ROC curve here for a very good test. We call this direction the direction of increasing reader skill or increasing level of technology.
Now, many people like to have a single summary measure of ROC curve performance and what has traditionally been used is you take the area under the curve so the area under this curve, say the diabetic discrimination test, is something in the high 70s. Let's call it 78 percent or something like that.
If you use the area under the curve as a summary measure of performance, in effect (remember, if you think of calculus, in getting this area you are just integrating) you are effectively replacing the curve with a line that is flat at the level of that area.
In effect, what you've done is you have averaged the sensitivity, the true positive fraction, over all false positive fractions. In effect, if you use the area under the curve you are getting the sensitivity averaged over all false positive fractions, or the sensitivity averaged over all specificities, coming from the other direction.
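The area-under-the-curve summary can be sketched two ways (again with made-up populations, purely for illustration): integrate the swept-out curve directly, or use the standard equivalent reading of the area as the probability that a randomly chosen diseased case scores higher than a randomly chosen nondiseased case (the Wilcoxon/Mann-Whitney statistic). The two estimates should agree closely.

```python
import random

random.seed(1)

# Made-up overlapping populations of test scores.
nondiseased = [random.gauss(0.0, 1.0) for _ in range(500)]
diseased = [random.gauss(1.0, 1.0) for _ in range(500)]

def auc_by_trapezoid(neg, pos, n_thresholds=400):
    """Sweep a threshold, collect (FPF, TPF) points, integrate
    the resulting empirical ROC curve by the trapezoidal rule."""
    lo, hi = min(neg + pos), max(neg + pos)
    pts = []
    for i in range(n_thresholds + 1):
        t = lo + (hi - lo) * i / n_thresholds
        fpf = sum(x >= t for x in neg) / len(neg)
        tpf = sum(x >= t for x in pos) / len(pos)
        pts.append((fpf, tpf))
    pts.sort()
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))

def auc_by_ranking(neg, pos):
    """Equivalent reading: fraction of (diseased, nondiseased)
    pairs in which the diseased case outscores the other."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(f"trapezoid AUC: {auc_by_trapezoid(nondiseased, diseased):.3f}")
print(f"ranking AUC:   {auc_by_ranking(nondiseased, diseased):.3f}")
```

Either number is the "average sensitivity over all false positive fractions" described above, collapsed to a single figure of merit.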
Well, I hope it gets interesting now. That was the easy part. That's the idea. Let's see what really happens in the real world. In the real world in the last decade those of us who work in this field have been made acutely aware of the complication of reader variability.
I'm going to show you some very famous data. I think Emily Conant knows this like the back of her hand from having worked with Craig Beam. For those of you who have not seen this before, I have to give a little build up to this.
This is a set of data from Beam, Layde and Sullivan that I'm going to show you in which they studied 108 mammographers randomly chosen from around the United States. The mammographers in this study were given a set of mammograms. They were asked to set their threshold for action.
Remember when we were talking about this ROC paradigm we were moving a threshold and we wanted to set it at some place and the question is in a clinical laboratory test you could just dial that in somehow. How do you do it in medical imaging? You don't have a dial.
You have to deal with the human reader and they were asked to set their threshold between their sense of the boundary on the BIRADS scale, Breast Imaging and Reporting and -- Reporting or Recording? Anyway, Reporting and Data System. That's the American College of Radiology Scale that is used for managing patients in mammography.
These readers were asked to set their sense of the boundary between category 3, which is generally six-month follow-up recommendation, and category 4 which is highly suspicious and recommend consideration of biopsy. I'm sure I'm garbling that but you get the general idea. I wasn't asked to leave the room so I couldn't be too far off there.
Here's what happened. This is a true positive fraction versus a false positive fraction for 108 readers. There are 108 points here. Each one of these people thinks that they had set the boundary between category 3 and category 4.
If you try to do public policy based on category 3 and category 4 and thinking that people have optimized that, the optimum is very broad. People have not figured out how to optimize that. That's a big problem.
Let's look at this reader. This is one out of 108 people. This person has a sensitivity of 70 percent and a false positive rate of about 25 percent. Now, this person thinks they are being as aggressive as they should be in the context but this person is more aggressive than this one, this reader is more aggressive than this one, this reader is the most aggressive on this bottom curve here, and these readers are less aggressive.
Now, as we go in the other direction, we now see the variability due to the range of reader skill. We can say that these readers have a greater skill at this task than these readers and these readers have the greatest skill yet.
At any level of reader skill we have different readers thinking that they have optimally set their threshold. This is a tremendous range of reader variability. There are 108 mammographers represented on this graph. This is classic work from Craig Beam, Peter Layde and Dan Sullivan.
What have I just told you? There is no unique ROC operating point. Each one of these people is set to be at a certain operating point. There is no unique ROC operating point. There is not even a unique ROC curve. There is only a band or region of ROCs as you can see. There is a very broad band.
I hope I've convinced you all now that this gets to be a more complex issue. In particular, here is the question. Suppose we have two technologies that manifest themselves in reader's hands with this level of variability?
How do you compare those two technologies? That's the issue before us with a whole class of problems that we've been discussing over the last few years and we'll be seeing more of over the next few years. How do you do it?
This is not an isolated example. People have gotten used to this and said this is really an extreme example. This is not the most extreme example we've ever seen.
In our group we have actually looked at over a dozen real-world publicly available data sets and the example I just showed you is sort of in the middle. Sometimes things are a little bit better. Sometimes they are even much worse than what I just showed you. The following is an example from Dr. Jim Potchen, from plain chest x-ray, picking up disease on chest films. These are ROC curves. Dr. Potchen looked at over 100 radiologists and 71 residents. He averaged the ROC score card of his top 20 radiologists. Here they are.
Then he presents here the average ROC curve for his radiology residents. There are 71 of them here representing this average line. The bottom 20 radiologists in the study performed here. The range that we see here is comparable to what we saw in the Beam, et al. study for mammography. So this is the real world.
Well, you can imagine that if you wanted to keep score under that setting you have to use a lot of readers and a lot of cases. The paradigm that has emerged to address this is, thus, called, almost eponymously, I guess, if I could pronounce that word, the multiple reader multiple case, or MRMC paradigm.
There are a lot of designs for this. There are many ways to do it. Today we will just talk about something that is called the fully -- oh, I forgot my prop. We'll talk about the fully-crossed design. The fully-crossed design is one of many but it is the most efficient in some way so we will talk about it.
You match cases across modalities and you match readers across modalities. If I can pull this off. I'm used to having leaves of paper here. Okay. You have a bunch of patients who have been imaged with modality A here. The same patients imaged with modality B so we say that the cases are matched across modalities.
If we were working with computer-aided diagnosis, modality A would be readers reading without the computer aid and modality B would be readers with the use of the computer aid. There is a stack of images here. Same patients.
We recruit a panel of radiologists, something like 15 of you people here. All of you read every patient case in both modalities. What we have then is we have the cases matched across modalities and we have the readers matched across modalities.
This design gives the most statistical power for a given number of readers and for a given number of cases with verified truth. Thus, we say it's the least demanding of these resources. Around here in Rockville we speak of this as the least burdensome paradigm because, as you probably heard in previous meetings, the FDA has been commissioned by Congress to enable sponsors to seek and to find, if possible, the least burdensome path to the marketplace through the review process.
So what we've done is we've always called this to the attention of incoming sponsors that this design is most powerful. You can use alternative designs and you can come close sometimes to the efficiency of this scheme but this is the most powerful in terms of the ground rules I have on the slide right there.
Well, if you are familiar with the literature in this field, you will say, you know, this is no modern big deal. This stuff has been known for a good 20 years or so. If you read the classic book by Swets and Pickett the whole idea is laid out there. The trouble is there was no practical way to implement this scheme 20 years ago until people started to understand what's called the statistical approach of resampling strategies.
I probably shouldn't spend any time on the past history but the fact of the matter is in past years, before they realized about resampling, they just started to stratify the data and then you give up a lot of statistical power. In modern times, in the last 10 years, people realized that if you use statistical resampling, you can use the data over and over again in a well-pedigreed way and get statistically valid inferences.
So the two most famous resampling schemes are called the statistical jackknife and the statistical bootstrap. The big breakthrough came in this field in 1992. This is the classic so-called DBM paper. That's Donald Dorfman of happy memory, whom we lost to our community very sadly two years ago; his colleague, Kevin Berbaum; and the well-known Charles Metz at the University of Chicago.
This paper broke the log jam in this field. They suggested using the statistical jackknife in combination with classical ANOVA and the statistical jackknife just being a leave-one-out method where you leave Mrs. Jones out one time and you leave Mrs. Smith out the next time and you generate a lot of data sets that way, submit it to classical ANOVA, and you can do your inference about the difference between these two competing technologies.
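The leave-one-out idea can be sketched in a few lines. This is not the DBM code itself, just the generic jackknife pseudovalue computation that gets submitted to a classical ANOVA; the statistic and the scores below are hypothetical.

```python
# Sketch (not the DBM implementation): jackknife pseudovalues for a figure
# of merit, leaving one case out at a time.  theta_full is the statistic on
# the full sample; theta_i drops case i.  DBM submits pseudovalues to ANOVA.

def jackknife_pseudovalues(cases, statistic):
    n = len(cases)
    theta_full = statistic(cases)
    pseudo = []
    for i in range(n):
        theta_i = statistic(cases[:i] + cases[i + 1:])  # leave case i out
        pseudo.append(n * theta_full - (n - 1) * theta_i)
    return pseudo

# Hypothetical per-case scores; when the statistic is the mean, each
# pseudovalue recovers the corresponding individual score.
scores = [0.8, 0.6, 0.9, 0.7]
print(jackknife_pseudovalues(scores, lambda c: sum(c) / len(c)))
```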
Well, it turns out this is a little bit more difficult to explain in any more detail than that. But the bootstrap method is very trivial to explain in some detail so I'm going to ask you to sit through that with me for the next minute or so.
The idea with the statistical bootstrap is that we are going to -- the bootstrap itself means you are going to resample from a set of data points with replacement. I'll show you that in a moment. We are going to bootstrap the experiment of interest. We'll draw random readers, random cases, and then carry out the experiment of interest many times.
Here is an example of some possible bootstrap samples from a set of -- suppose there are 15 of you here. We might have a set of numbers one through 15. We start drawing them with replacement. If you wait long enough, you might get a list that has one, two, three, four, five, six, seven -- you have to wait a long time before that happens.
In the meantime you get more random looking samples like this. When I was thinking about this, you know, if you did this with letters it reminds you of that proverbial experiment where they have the monkeys trying to type out the soliloquy of Polonius or something like that. It's going to happen but you may have to wait a long time.
Instead what you do is you get random samples like this. The number one never showed up in this group. The number two showed up once. Number three showed up a couple times. Number 14 showed up three times and so on. You randomly sample a number and then put it back. Write it down. This can go on for an astronomical number of times.
Then another example, the number one shows up, number 15 shows up and so on. You get a lot of these, a very great number of these but you don't have time to do them all so, in practice, people use about 1,000. It depends on the complexity of the problem.
So you draw about 1,000 bootstrap samples of readers and cases. The number of cases you draw is comparable to the experiment you are trying to mock up. Then what you do is, with that bootstrap sample of readers and that random case sample, you have all the readers in the bootstrap sample read all the cases in both modalities in that bootstrap sample, carry out the experiment of interest, and get the performance measure.
That's the area under the ROC curve for the one modality. You get that number for the other. You take the difference. You do that 1,000 times and then you put the differences in order from the lowest to the highest. Then it's very easy to get the mean, and then you can take the central 95 percent chunk and that would give you a 95 percent confidence interval. That's a simple way to explain the story.
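As a hedged sketch of that recipe (the figure of merit and data layout here are simplified placeholders; a real MRMC analysis would recompute the ROC area from the resampled ratings): draw readers and cases with replacement, form the difference in the figure of merit between the two modalities, repeat, sort, and take the central 95 percent.

```python
import random

# Sketch of the bootstrap recipe described above.  scores[reader][case] is a
# hypothetical per-case figure-of-merit contribution for one modality; a
# real study would recompute the ROC area from the resampled ratings.

def bootstrap_difference_ci(scores_a, scores_b, n_boot=1000, seed=1):
    n_readers, n_cases = len(scores_a), len(scores_a[0])
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        readers = [rng.randrange(n_readers) for _ in range(n_readers)]
        cases = [rng.randrange(n_cases) for _ in range(n_cases)]
        # Every reader in the bootstrap sample "reads" every case in the
        # case sample in both modalities; the figure of merit is a mean.
        fom_a = sum(scores_a[r][c] for r in readers for c in cases)
        fom_b = sum(scores_b[r][c] for r in readers for c in cases)
        diffs.append((fom_b - fom_a) / (n_readers * n_cases))
    diffs.sort()  # keep the central 95 percent of the sorted differences
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot) - 1]
```

For example, with hypothetical tables in which modality B outscores modality A by 0.1 on every case, the returned interval collapses to that constant difference.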
In the jackknife plus ANOVA it's a little bit more elaborate than that but you can actually think of the jackknife as the first order of approximation to the bootstrap. So these two approaches are sort of in the same spirit but one is completely nonparametric and the other is -- the classical ANOVA is heavily based on the multi-variate normal so it's highly parametric.
As I just said, you obtain a mean performance over readers and cases but it's much more interesting. The mean is always easy to get no matter how you approach a problem. Well, it can be tricky. But the big thing you want is error bars that account for both the variability of readers and cases.
You know, in the DBM paper they included a quote that has become very famous from Jim Hanley. Many of us know Jim Hanley from McGill University in Montreal.
Jim Hanley says, "When you report the results of your experiment to your readership, it's not so important just to report the mean performance or the results you got in the very experiment at hand because, after all, this experiment will never be done again. No one will ever do this particular experiment.
What readers want is they want a sense of the range of performance to be expected if this experiment could be repeated many, many times drawing randomly, one hopes, from the same population from which the current samples were drawn. So that is the idea.
You ought to be able to report to your readership not just a p-value, because we all know it takes a p-value to get a paper published in a medical journal. You want to actually be able to explain the range of variability you expect to see if this experiment is done over and over again. That's what you get when you keep score this way.
Okay. We said that the ROC curve is a measurement. Above all else it is a measurement so you have to think about a measurement science. You have to think about the scale you'd be using for reporting and doing the measurements.
Historically -- I should just stop for a moment to tell those of you who were not around in the late '70s and early '80s that the National Cancer Institute gave a contract to people in Cambridge, Massachusetts, at Bolt Beranek and Newman, where John Swets, David Getty, and Ronald Pickett and colleagues were working to develop a protocol for how to do ROC experiments and how to keep score and how to do the data analysis.
That is published in a paper in Science in 1979. The book came out in 1982 and many of us have that book on our shelf. The protocol used at that time was the historic so-called ordered category scale. There was no "does this patient go to biopsy or not." You just looked at the case and you said this patient -- you use five or six categories.
One patient you might say this patient almost definitely does not have disease. There are several intermediate levels. The patient probably does not have disease, might have disease, probably does have disease, or almost definitely has the disease. That scheme of five or six categories was almost exclusively used and there was software for analyzing that for 25 years.
I'm being a little defensive because people may say why do people use that. That was approved by -- the experts in the field put it out and it was supported by NCI. There was a lot of science underneath it and today people say, "Why did people do that?" Well, that's what they had.
In the last 10 years in the field of mammography we have this BIRADS scale which is what we call an action item or a patient management oriented scale. In that idea you don't categorize the data. People think of the BIRADS scheme as a categorization scheme. Let's just put that to the side for a moment.
We'll just think of using the BIRADS scale to dichotomize patients. We'll say these patients will not be followed up at all versus these patients who will get a six-month follow-up. That's one way to dichotomize the data.
Another way to dichotomize the data is to say we will try to make the break as we did with the Beam, et al. data. We'll make the cut in this dichotomization between those patients who would get six-month follow-up versus those who we think should be biopsied right now. So this is a patient management scheme. This is just a dichotomization scheme.
About 10 years ago people realized for very technical reasons that it would be useful to use what they called the continuous probability rating scale, or quasi-continuous. It's a hundred-point scale, one, two, three, four, five, but you wouldn't get 1.5 for example so they call it quasi-continuous, hundred-point scale.
Nobody expects anybody literally to use probability 13 or probability 17 or anything, but the idea is to scale your probability or your sense of the likelihood of disease along a probability scale. It seems natural, if it's a probability, to use a scale from zero to 100.
So this is the most popular scheme that's been used to generate ROC data in the last five or seven years or so. This felt strange to many people, especially people who are used to using the categorical scale. But I've talked to a lot of people about this and very few people outside of the mammographers have read the BIRADS document.
If you go through the BIRADS document and you go to category four, which is suspicious and recommend for biopsy, it actually tells you there that the radiologist should tell the referring physician their sense of the probability of cancer. There is actually a culture already existing in which you can use this kind of patient management action items like a BIRADS three, four, five, and at the same time give a continuous probability of disease rating.
I see some puzzled looks. I'm trying to figure out just what I should comment on next. So to make a long story short then, this continuous probability rating scale has been used for most ROC curves generated in this community for the last eight or so years. In the breast imaging --
Oh, I remember what I was going to say. That's why I'm stalling here. In the breast imaging community many people, it may not be more than half, do use this BIRADS scale. But it's really important to realize that this BIRADS scale was not designed to generate ROC curves. People who have tried to use a five-category scale and the BIRADS scale at the same time have met with a lot of confusion. It does not work out very well, and I see somebody who may have witnessed people having that experience.
Well, I gave a lot of background here because I would like people to understand that this is a real issue for the community. You would really like to have both, because every clinician says, "I want to know the patient management and I want to know the score card of the patient management." Every clinician you talk to, that's what they want.
Everybody who measures ROC curves says, "I want to measure it as finely as I can. I want to use this quasi-continuous reporting scale." The best of both worlds would be to get both the quasi-continuous rating to get the ROC curve and the patient management action item to get a single sensitivity specificity point.
I'll get a little dramatic for a moment here. I've talked to many friends. I'm very familiar with the literature. I could find one example in all the literature at the moment that's in print where both of these were done. I could only find one example of where the best of both worlds was done. This is a paper on classification, what Bill Sacks and others called CADx using a computer not to detect but to classify lesions on a film that are already known. I know that I have a stack of films here that have microcalcification clusters on them. My task is just to say which ones are benign and which ones are malignant. That's the task. But I'm going to keep score ROC wise and I'm also going to keep score patient management wise. I'll show you what they got in a moment.
These authors -- Yulei Jiang, I guess, was expected here today -- are from a group in Chicago under Kunio Doi. They studied this task with 10 readers and they studied the complete ROC curves. They studied all the summary measures and they also studied the patient management, or action item, sensitivity-specificity point.
Here are the results. Here is the average of 10 ROC curves for 10 readers trying to make this dichotomy, trying to make this distinction between benign and malignant lesions. Here is the ROC curve in the unaided by computer condition. This curve was generated using the hundred-point probability scale.
This is the curve in the computer-aided condition, again generated by the hundred-point probability scale. This point is the mean sensitivity specificity point generated just by making the threshold, dichotomizing the data. These patients benign, these patients malignant. This is a single dichotomy patient action point in the unaided condition.
That's the same point in the aided condition. You would love these points to fall on top of the curves and, for all statistical purposes, they do, because remember the mean -- I have to remind you of this famous joke that we use around here. There was a six-foot statistician. You know what happened to this fellow, right? He drowned while wading in a stream that had an average depth of five feet. You have to know about the variability.
This is not about means, okay? This curve moves all over the place and this curve moves all over the place in practice. This is the average of 10. Same thing. This point moves all over the place as does this. For all practical purposes this is a great experiment. This point falls on that curve.
Well, it's the only case I could find in the literature. How come you don't see more of this? When you live with these people that I live with, it's a great crowd of people and the clinicians say, "I want the action point." I say, "The committee wants to measure the ROC curve." Everybody says, "Let's do both." We are trying to come to that position. Why don't we see more of it?
Well, the area under the ROC curve, remember, you have your ROC curve and you've got the area under it. You are essentially getting the sensitivity averaged over all specificities. Right? You're averaging. You're going to average away a lot of noise.
The variation -- the variance of the area under the ROC curve -- oh, my goodness. The most important number of my entire talk is missing. The variance of the area under the ROC curve is the binomial variance over two. There's a two here, a very important two. Those of you who know me know I'm an expert in factors of two. It's the binomial variance over two.
What's the binomial variance? Well, I thought if you had a group as we have here today, about a third of you -- maybe 40 percent of you as I look around -- know what the binomial variance is. Suppose we had this meeting next week and we drew from the same population from which you all came.
The next time we did it we might get 32 percent of you knowing what the binomial variance is. If we do it three weeks from now and join another group in, maybe 49 percent or 52 percent of you will know what the binomial variance is.
What we've just done is what Bill Sacks refers to. We just made a self-referential example here. The binomial variance is the variance I would experience if I did the experiment I just discussed with you. The area under the ROC curve experiences only half of that variance.
If I studied sensitivity by itself and was able to tell you ahead of time what the specificity was so you didn't have to estimate the specificity, the variance of sensitivity is the entire binomial variance.
In the real world you have to estimate both the specificity and the sensitivity, so the uncertainty in the specificity propagates into the sensitivity estimate and adds to the variance. So if you wanted to estimate the uncertainty in that action item that I showed, that point, the circle or the triangle in the previous data, you would have to live with an uncertainty that was greater than the binomial variance.
If you use the area under the ROC curve you get a great reduction. You get the binomial variance over that famous factor of two. This is all approximate but it works out very well in very practical examples.
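For concreteness, the factor of two works out as follows, using the hypothetical audience numbers from a moment ago:

```python
# Binomial variance of an estimated proportion p-hat from N draws:
#   var(p_hat) = p * (1 - p) / N
# The talk's point: the ROC area carries roughly half this variance, while
# a sensitivity estimate at a known specificity carries the full amount.

def binomial_variance(p, n):
    return p * (1.0 - p) / n

p, n = 0.40, 50  # hypothetical: 40 percent of a 50-person audience
var_sensitivity = binomial_variance(p, n)   # full binomial variance
var_auc_approx = var_sensitivity / 2.0      # the famous factor of two
print(var_sensitivity, var_auc_approx)      # approximately 0.0048 and 0.0024
```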
So what we say is that the area under the ROC curve, with its reduced variance, is the least burdensome approach to putting quantification into this problem. I remind you that is something that we are supposed to enable sponsors to appreciate.
Another thing that we realize in many discussions with academics and within our house and with the sponsors and so on is if you want to live in both of these worlds, that requires consistent conventions. If you want to be able to get both categorical reporting and BIRADS reporting, that's a lot of work to try to get people to be consistent that way. People have dropped the categorical scheme for all practical purposes.
Even if you want people to be consistent between BIRADS and the quasi-continuous scale, that's difficult. We've seen a lot of data in our own group and from some of the universities. When you train people, this can be done but not everybody is trainable right away to be able to do this so it's an issue. To get data in both worlds then, it's going to require some convention development.
My final point here says this may require consensus bodies to promote the practice. We would hope that the American College of Radiology, some of the other professional societies, and even the fact that this is of interest to NCI and the FDA, would encourage people to try to do measurements so that we could get both the point and the curve. Then I think everybody would be happy.
Well, this brings us to a little interim here. Some of you are very familiar with the next few slides. These are what we call the most famous slides in the ROC archives. Those of you who know Charles Metz have seen these many times and his followers have used them many times. Charles has been using these slides for over 25 years.
Here's the classic question. You have two diagnostic modalities, modality A and modality B. Which one is better? You look at them and you have people doing public policy thinking in their minds. Which one of those is better? You start calculating something you've seen in a statistical decision theory book.
But the way this is approached in the field of medical imaging is the following. There are several possibilities here. Those two points may lie on completely different ROC curves. In that case we say that modality B is unambiguously better than modality A because at any false positive fraction the sensitivity of A is lower than that of B.
There's a different scenario. The two points could fall on the same ROC curve. Then you have these same people scratching their heads and saying, "Where should they really operate?" Well, in principle we believe that readers can move their level of aggressiveness. Not on any fine scale, but we know that they adjust depending on the risk group they're seeing. Some people do move around on their ROC curve, so in principle these two points represent equivalent modalities.
As I say, people will for years say, "There must be one of these operating points that's better than the other." Remember when I showed you that data from Craig Beam you saw people at every level of aggressiveness. Each one of these people in some way thinks they've optimized.
This is what we call the expected utility function or the expected value function. Every one of those people thinks in some way they have found the optimal operating point but they disagree with each other so this is another reason for using the ROC method.
There's yet another scenario. ROC curves may actually fall in such a way that modality A is everywhere higher than modality B. For the same reasons we would say that modality A is the superior modality in this scheme. Three different possibilities. B higher, equivalent, A higher.
This is the motivation for trying to get a finer measurement on this hundred-point scale. Then if the clinicians really want to know about the actual operating point, that is another step and we are all for that if you can coordinate the measurements but it's very difficult to do that.
Well, I'm sure many of you are sitting there thinking what about if the ROC curves cross? We know if that happens the situation enters the world of ambiguity. Then you can no longer necessarily use the total area under the curve as a sufficient summary measure of performance.
Other summary measures may be necessary. There are any number of other ways to make a summary measure of curves that cross. You can use partial areas. There's actually software even for that today. Or you can use parametric summaries of the curve and there are several other ways to look at this.
If you decided you're going to use other summary measures, if you anticipate this possibility, the study protocol is expected to address this because if you wait until after the study and say, "I was going to use the partial area in this region," we have a name for that. That's called data dredging. You have to build that into your study up front. Otherwise, when people do not expect to see the curves cross in any real way, they tend to use the area under the curve as a summary measure.
Well, for submissions such as are coming before us in the area of computer-aided detection schemes, there is the question of how you keep score when location is scored. I must remind you this is shocking to people who have never heard it before.
The basic ROC paradigm is an assessment of the decision making at the level of the patient. You don't say, "Where does the patient have diabetes?" You say, "This patient has diabetes." Or you say, "This patient has TB." You don't say, "The TB is here." You say, "This patient has TB." So the score keeping until recent years has been based on decision making at the level of the patient.
In more complex imaging you want to do the assessment of the decision making at a finer level. You would like to assess how well the localization was done. Well, there are little errors there that come across funny. If you do localization, of course, you will be providing the experimenter with more information.
If you have more information in the study, you get more statistical power. The trouble is to do all this adds complexity to the experiment. I would just like to review for you a couple of the highlights of the issues that have come up when you try to do location specific ROC analysis, so-called LROC for location specific ROC analysis.
The biggest problem is that if you want to keep score of a hit, the measurement of the hit depends on the criterion you use for localization. If the lesion really is here and you draw your circle and you say the lesion is here, there is a certain amount of overlap, and you would be surprised to see how sensitive the measurements are to that degree of overlap, to the criterion you use. That's a real issue. There's no unique result. There's no unique LROC curve at the moment for the state of the field.
There are a couple of subtle points here that are very technical. I would just like to mention one of them. People have studied this for 20 or 30 years. For a certain class of problems, if you study the ROC and if you study location specific ROC, the curves and the summary figures track with each other monotonically.
If the one goes up, the other goes up. If one comes down, the other comes down. They might change at different rates but they go together monotonically. So people haven't felt bad about just using ROC analysis instead of LROC analysis if they were not willing to invest the extra resources, even though you lose some statistical power.
But people have been willing not to go to this level of complexity, and to go to that higher level of complexity requires more elaborate models, more elaborate assumptions. These are still debated to this day. You can see in the SPIE handbooks that people are debating this back and forth, Charles Metz and Dave Chakraborty.
But I must mention that a lot of progress has been made in this field. The bottom line of this slide if you haven't followed any of this is that essentially there's a lack of validated software for analysis of such experiments. Now, Elizabeth and the MIPS, Medical Image Perception Society, website actually has software for several of these approaches.
The writers of that software feel very good about the state of their software but there continues to be discussions in the field about how far have they validated. Have they checked whether the alpha level and the reject rate are agreeing and what is the power and so on.
The debate goes on, but I expect people will be coming down from Pittsburgh any day or any week now saying, "You've got to start using this because it's been validated." That's the state of the knowledge right now. There is software there, but there are still people discussing the condition of the validation of the software.
So a few years ago to find some kind of a happy medium Nancy Obuchowski of the Cleveland Clinic and colleagues said, "Why don't we just simplify the task? Why don't we do something called region of interest location specific ROC analysis. Let's only require localization to within a quadrant so you don't have to say there's a lesion here or a lesion here. You just have to say I see a nodule in this quadrant. You require localization only up to a quadrant."
Similarly for the other quadrants. You could say, "Why don't we do it for octants or 16-fold or 32-fold?" Well, you could. This is sort of the entry level of this problem, but as you add the number of possibilities, then you get more into questions of overlap and ambiguity, so people have decided, "Let's start at the level of just quadrants." As I say, it's sort of the entry into this problem.
Continuing on discussing this so-called ROI approach to location specific ROC analysis, right away Dave Chakraborty jumps into the literature and says, "Wait a minute. This doesn't correspond at all to the clinical task." People have debated back and forth whether it does or not.
But from the other wing of this Greek chorus comes the methodologist to say, "Yeah, it may not be quite right but it's really straightforward to account for correlations without getting into these assumptions that people have debated for a while."
What do I mean by that? Here are four quadrants, the right side of the lung, the left side, the top, and the bottom if you will. Whatever is going on in this quadrant is expected to be correlated with what is going on in this quadrant, or at least could be, and similarly across the quadrants.
After all, this is the same person, with the same genes, who experienced the same environment and had a picture taken with the same imaging system. One has to allow for the possibility that these quadrants are correlated. The nice thing is that Carolyn Rutter and others came along about a year later and said, "Wait a minute.
All you have to do to preserve those correlations is, when you resample, resample on a patient basis. You can't start resampling quadrants, taking this one from this person and that one from that person. You have to resample on a patient basis, so if I sample you, all four quadrants from you come into that sample, and so on." When you do this, you actually preserve the correlation structure, and you are said to be using the patient as the independent statistical unit.
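The patient-level resampling described here can be sketched in a few lines. This is a minimal illustration, not the actual software discussed at the meeting; the data layout (`patients`, `quadrants`) and the function name are hypothetical.

```python
import random

def patient_bootstrap(patients, n_boot=1000, seed=0):
    """Resample patients (not individual quadrants) with replacement.

    Each drawn patient contributes all four quadrant scores, so
    within-patient correlations are preserved: the patient is the
    independent statistical unit.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n_boot):
        draw = [rng.choice(patients) for _ in patients]
        # Flatten: every sampled patient brings along all its quadrants.
        samples.append([q for p in draw for q in p["quadrants"]])
    return samples

# Hypothetical data: five patients, each with four quadrant scores.
patients = [{"id": i, "quadrants": [0.1 * i + j for j in range(4)]}
            for i in range(5)]
boot = patient_bootstrap(patients, n_boot=10)
# Each bootstrap replicate keeps quadrants grouped by patient.
assert all(len(s) == 4 * len(patients) for s in boot)
```

Resampling individual quadrants instead would break the correlation structure across quadrants of the same patient and understate the true variance.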
Well, that's all I'll be saying about location-specific scorekeeping, and now to one of the real problematic issues in the submissions we'll be seeing in the next couple of years. This is the problem of uncertainty in the truth state. There's a classic paper that all of us have almost memorized by now, from Revesz, Kundel, and Bonitatibus 20 years ago.
This is Harold Kundel, known to many of us as one of the pioneers of this field and the mentor of someone on our panel today, who was at Temple University and is now emeritus at the University of Pennsylvania. What did these authors say? They compared various ways of obtaining panel consensus truth.
They actually did a study comparing three different ways of doing chest imaging, and they had the truth, but they set the truth aside. They said, instead of depending on the truth to keep score, let's get a truthing panel. They had several ways of obtaining consensus from that panel. They could use unanimity. They could use majority. They could use some kind of expert review. They had three or four ways of reducing this panel to truth. They compared three imaging modalities, as I said, and here's what they found: any of the three imaging modalities could be found to outperform the others depending on the rule you used for reducing the panel to truth.
So this sobers a lot of us in the field about using a panel as truth. However, the target of the experiment we'll be discussing today is not to say this nodule is a cancer. It is only to say this is a target -- this is a region that a panel of experts would consider to be an actionable nodule.
We're not trying to keep score based on the truth. We're trying to keep score based on what a panel of experts would do: would they cue this region or not? Nevertheless, even though we changed the target, this classic reference tells us that there's going to be additional uncertainty because of this panel. The panel will have variability in it, and if you have been to RSNA over the last few years, you'll have heard papers on this subject.
What we've said to incoming sponsors is that we strongly encourage you to come up with some resampling schemes to resample the panel, to get a feel for the additional uncertainty that comes into this problem over and above the MRMC paradigm, due to the fact that there is noise in the panel. You can start to see why there is no canned software for this problem.
Well, since the truth is uncertain, it turns out that leads to uncertainty, in effect, in the number of samples you have. Let's talk about designing an experiment for a moment. Suppose you want to design an experiment that is going to have very tight error bars on the sensitivity. Everybody knows that if you want to do that, you want to have a lot of actually diseased cases to tighten up the error bars that way.
If you want to tighten up the error bars the false-positive way, you would want a lot of actually non-diseased cases. If your endpoint is the area under the ROC curve, what distribution should you have between non-diseased and diseased cases? Well, it turns out it should be some kind of average between the two. It turns out that the number you should be using is the harmonic mean of the numbers in the two classes.
The numbers in the two classes are going to depend on the panel, right? Because some of the panel members will say these cases are diseased and others will say they are not. The actual number of diseased cases depends on the panel. We have uncertainty in the truth, and that leads to uncertainty in the number of samples.
This is almost a trivial curve and I'm just going to tell you about the highlights because we think it might factor in today. Suppose you are told you can design an experiment with 100 patients. You say, "How should I distribute them?"
Well, you distribute them, let's say, at the beginning of an experiment like this, so that you have 20 that are actually nodule-containing cases and 80 non-nodule cases -- an 80/20 break.
This effective number, the harmonic mean of those two numbers, is 32. Whereas if I make a more even split -- 60/40, 50/50 -- for 60/40 the effective number would be up in the 40s. On a 50/50 split the effective number of samples for that experiment would be 50. That's not surprising.
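The effective number of samples for an area-under-the-ROC-curve endpoint is just the harmonic mean of the two class sizes, and the figures quoted above can be checked directly. A minimal sketch (the function name is ours, not from any package):

```python
def effective_n(n_diseased, n_normal):
    """Harmonic mean of the two class sizes: the effective number of
    samples for an area-under-the-ROC-curve endpoint."""
    return 2 * n_diseased * n_normal / (n_diseased + n_normal)

# 100 patients split different ways:
print(effective_n(20, 80))  # 80/20 split -> 32.0
print(effective_n(40, 60))  # 60/40 split -> 48.0
print(effective_n(50, 50))  # 50/50 split -> 50.0
```

An even 50/50 split maximizes the harmonic mean for a fixed total, which is why an unbalanced panel-dependent split costs effective samples.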
The reason we're showing this is suppose you start out with an experiment like this and you are requiring unanimity in the panel to declare a nodule-present. Then suppose you relax that criterion and say instead of requiring unanimity, we'll just require two out of three. Then you expect that whatever the number was before you're going to move up this curve.
So you are adding sampling variability and losing power, but gaining samples. The effects may tend to cancel. We don't know this. We are speculating about this. We'll discuss this. What I just said is that if you want to get into the realm of resampling your panel, you could start by relaxing the panel criterion from unanimous to majority, and there are several other ways of doing this.
This is, again, just an entry level. When you do this, it gets you into the game. It allows you to resample, to assess the variability, but it may also increase the effective number of samples. These effects may tend to cancel. This is, again, speculation based just on the direction of these effects.
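Relaxing the truthing-panel rule from unanimity to majority can only grow the set of cases declared nodule-present, moving the class split toward even and so raising the effective sample size. A small hypothetical illustration with a three-reader panel (the votes and function are invented for this sketch):

```python
def positives(panel_votes, rule):
    """Count cases declared nodule-present under a consensus rule.

    panel_votes: list of per-case vote tuples (1 = reader saw a nodule).
    rule: minimum number of agreeing readers required.
    """
    return sum(1 for votes in panel_votes if sum(votes) >= rule)

# Hypothetical votes from a three-reader panel on 10 cases.
votes = [(1, 1, 1), (1, 1, 0), (0, 1, 1), (1, 0, 0), (0, 0, 0),
         (1, 1, 1), (0, 0, 1), (1, 1, 0), (0, 0, 0), (1, 1, 1)]

unanimous = positives(votes, rule=3)  # 3 cases under unanimity
majority = positives(votes, rule=2)   # 6 cases under majority (2 of 3)
```

Trying several rules and resampling the panel in this way gives a handle on the extra variability that panel noise adds on top of the MRMC paradigm.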
The last thing I want to talk about today is the problem of controlling for reader vigilance. When you do an experiment -- with my two little pads of paper here -- reading in the unaided condition versus the aided reading condition, there are some people in this room who may be competitive.
If you're reading in the unaided reading condition you say, "The computer is about to tell me what it thinks." If you are a little bit competitive, you are going to say, "I've got to be careful when I read this." You may increase your vigilance.
How do you mock this up? How do you do this experiment? This is a challenge that hasn't been quite sorted out. Any measurement setting is an artificial condition compared to the actual real world of practice. What I just described to you is the possibility that some readers might be more vigilant in their unaided reading because they know they are subjects in a test.
Well, when you turn a modality loose in the real world, just the opposite could happen, right? The readers might be less vigilant in the real world because they know, "Well, I can brush through this. The computer is going to give me what it thinks in just a minute." In the real world the vigilance could go down. In some experiments it could go up, and I think we've seen experiments where the vigilance didn't change, but I'm not sure you can guarantee that.
The only thing we've seen as a practical solution to this problem: Heang-Ping Chan and colleagues about a dozen years ago wrote a paper in which they said, "Look, this is a real issue, this vigilance.
How do you do a controlled experiment controlling for reader vigilance?" They said, "Well, just simply control the time available to readers in the unaided reading condition to mimic the actual clinic." That was their suggestion. I don't know how many people have tried that yet, but it's in the air.
Well, you can all take a deep breath now. We're in the summary. Here we are. This field has been going on for 30 years. In the last 10 years the whole issue of reader variability has complicated it, and ways have been proposed to address the issue of reader variability.
In the last few years we've had to deal with the complications from location uncertainty, from uncertainty in the truth, and this issue of reader vigilance. What we've tried to do -- this is like a quadrangle, as I said. We are here sitting at the FDA and also doing some research here.
We have our academic colleagues doing research in academia, industry sponsors doing research on all these issues in another side of the quadrangle, and NCI and the Lung Image Database Consortium that we've been very actively working with and who are very interested in these issues.
We've tried to hold the windows open so that this quadrangle has been open to everyone from all quarters. Whenever industry sponsors have come in with issues like this we've said, "Look, the windows are open.
Here's what is known from all these quarters. Here are the papers. Here are the drafts that are not even published yet. Here's what we know at the moment. We don't have guidance. We can't say this is where the FDA or anyone is holding the bar but this is all the knowledge that we have at the moment."
There is no canned software. There's canned software for little pieces of this problem so any industry sponsor would have to be creative to come forth with a novel way of putting all these pieces together.
Well, that's the state of the world as we know it today. Thank you very much for your interest in this. Oh, there are some papers. The "tz" are obviously Charlie Metz's papers. There are a few papers from our own group in which we have actually worked with Charlie Metz and our own statisticians and clinicians to try to review the state of the world.
This is the first LIDC document. It's going to come out in April. Then in your notes there are many other pages of references.
DR. IBBOTT: Thank you, Dr. Wagner. Before you go too far, I would like to ask if there are any questions from the panel for Dr. Wagner.
DR. KRUPINSKI: What's the consensus? I mean, the quadrant approach gets rid of the localization problem if you end up with a nodule in each quadrant. What it still hasn't addressed is what you do, for example, when you've got two lesions in a quadrant.
DR. WAGNER: That's right.
DR. KRUPINSKI: You still have that basic uncertainty.
DR. WAGNER: That's right.
DR. KRUPINSKI: The flip side of that is what if there is a false positive in the quadrant along with a true positive? You've just simply squished it --
DR. WAGNER: That's right.
DR. KRUPINSKI: -- into a quadrant and you still have avoided the localization problem and the problem of a false positive and true positive.
DR. WAGNER: That's right. That's been sidestepped. As you know, the higher levels of software attempt to address this one way or another, and I think the jury is still out on whether we are ready to use that. I think the inventors of those other methods think they are ready to go, and they might be, but we also know there are people in the wings saying, "I'm not sure about these assumptions," and so on. That software does not have general acceptance right now. Maybe that's too bad. Maybe it should. These are real issues.
DR. BLUMENSTEIN: I'm impressed by the MRMC study design. I think that's a nice step forward. I'm wondering if anybody has ever subjected the same reader to the same image multiple times and studied the effect of that so that you could get at this issue about how a single reader uses their own personal scale?
DR. WAGNER: Yes. That's a classic question. There are experiments on that. I'm making this up, but this is the spirit in which I remember it. David Getty has shown some data on this in mammography, and I think that readers are correlated with each other in the 60 percent range and are correlated with themselves only 70-some percent on repeats. There is, indeed, a lot of intra-reader variability.
However, you get more bang for the buck -- if you have only so much radiology reading time to spend, there's more bang for the buck in getting a different reader than in using the same reader over again, because you are so correlated with yourself that you get more independent information if you bring in a sample that's not so correlated with the preceding reads. Bang-for-buck-wise, people have said this is a question of reading time. People in the MRMC paradigm have not in general tended to have readers reproduce their readings. You can do it, and there are terms in the model to accommodate that, of course. It's just not common.
DR. BLUMENSTEIN: Actually, you took my question as a suggestion of maybe changing the study design. I didn't make it clear. What I'm actually concerned about is whether the methodology that's been developed to give p-values and estimate variances, which you rightly point out are the big issues here, properly accounts for intra-observer variability in the use of the scales.
DR. WAGNER: I believe it does, and I'll tell you why. The full model has seven terms. I won't take you through all seven: pure case, pure reader, various interactions. One of them is a three-way interaction between modality, reader, and case.
That's the sixth term. The seventh term is what you're talking about: the lack of reader reproducibility. If you do enough experiments, you can "identify" these terms, as it's called in statistical language. You can separate the two. If you don't do the right experiment, you can't -- they get lumped together. The term you're trying to get at is the reader inconsistency. That is sampled in the experiment, but it cannot be identified. It cannot be broken out, but it is in there.
In fact, the way we do it is with a family bootstrap experiment, so we can actually pull out all these effects, but we cannot pull out the modality-reader-case term from the epsilon. They come together. That represents not only the three-way interaction but also the inconsistency of all the data sets together. So that is actually in there. Are you surprised?
DR. BLUMENSTEIN: No, no, I'm not. But since you don't measure that in the experiment, you can't estimate it, obviously. That's the issue. I guess this is what I've been concerned about ever since I first heard about the use of ROC curves, where the reader is recording their result on a subjective scale -- categorical or probability or whatever it is. It's a device to get you to the point of being able to use ROC methodology. What has always concerned me is that there is this underlying source of variability that isn't taken into account in the models you are estimating. It's only if you do the experiment that way that you actually get an estimate of that intra-observer variability, or inconsistency, or whatever you called it.
DR. WAGNER: Right.
DR. BLUMENSTEIN: I just wondered about the degree to which this has been studied in actuality.
DR. WAGNER: Not very much, because of the bang-for-buck point. As you can see, if you are inconsistent with yourself, and everyone is, that will show up case to case within a given experiment. You won't be able to peel it out, but it's in there, and it's accounted for in the inference. It's a subtle point, but we can discuss it.
DR. TRIPURANENI: That was an excellent presentation, Dr. Wagner. You used the MRMC for the intra-observer variation. If you are looking at two different modalities, such as a chest x-ray and a CAT scan, have you looked at whether there is any difference in the intra-observer variation between one modality and the other?
DR. WAGNER: It turns out to be a really neat point actually. Our own group has three papers on this subject. In the first one, you want to know if you can see the difference in the variance structure between the two modalities. Is that what you're asking?
DR. TRIPURANENI: That's right.
DR. WAGNER: There's a model that has six terms. We were just talking about that. There is another model -- you would think you would have to go to 12 terms to do that. It turns out there is a parsimonious way to do it with just nine terms, and two ways to do that.
When you do it, you find out that the extra issues brought up by the wrinkles you were just discussing come in in such a way that they average, and it's only their average that goes into the inference, so you can forget about the issue. It's a really interesting issue. We have two papers on it. But you could forget about it. You could come right off the metro, just hear about this, and say, "I'm going to use the DBM software." You could forget about the difference in the variance structure across the competing modalities, and if you do, the inference is still the same inference. It doesn't matter. It's a really interesting point.
DR. IBBOTT: Dr. Solomon.
DR. SOLOMON: How do you -- I mean, I have a feeling this topic is going to be discussed throughout the day but how do you translate changes in ROC curves into clinical significance? Especially since if you look at an individual's change in the ROC one person might do worse and another person might do better and then how do you make that determination?
DR. WAGNER: Right. Well, you might have been a fly on the wall in many meetings. I mean, this is a real issue. Dr. Sacks will say something about it later on. All I can tell you is that the most statistical powerful method to get at these differences is the one I've discussed today.
We really would like -- well, I take you back to the Yulei Jiang stuff. We really do want to see those action items. You can't go from the curve easily to the action items if you haven't measured those action items. Is that what you're getting at? I'm not sure I see what you're getting at.
You want to know how we can go from this ROC summary and inference to an inference for the clinic. Is that where you're going? I think it's difficult. What we're saying here is that we are making a measurement that averages over all these variabilities that we have talked about. It averages over all that, and here's the summary.
If you want something more clinically relevant than that, you would have to actually measure the action item, the dichotomization, if you will, and give it error bars. When you finish the problem is here would be the action item sensitivity specificity for the one modality and here it would be for another one or this way. Now, what do you do?
Suppose they go this way? What are you going to do at this point if they don't match up sensitivity wise or specificity? What are you going to do? There are things you can do but you have to start getting into expected utility analysis. I didn't mention it but I have some very strong professional opinions on this.
I think it's impossible to do that because to do the expected benefit analysis you need to have an idea of the prevalence of the disease and that changes from risk group to risk group so that is a big uncertainty. You have to have a sense of something called the utility matrix, the number of false alarms that you are willing to trade for a hit, if you will, different from the positive predictive value.
You have to have a sense of that utility matrix and you have to actually know the ROC curve already because all these things come in. I think this is almost impossible to do without this being taken on at a national level.
You can see from the data of Beam, et al. that each one of these people thought they were working at the optimal operating point, and they had completely different points of view. What I'm saying is that's an important question. I think it's a societal question.
I think it's very complicated, and it calls for a lot of wise people with a lot of data to sit down with professional societies and say, "Where are we and where do we want to be?" This is a really big issue. I don't have an easy answer. I insist to my colleagues there is not an easy answer.
DR. IBBOTT: Brent.
DR. BLUMENSTEIN: I think it is the key question. What we are asked to do here is to basically judge whether this difference in the area of an ROC curve --
DR. WAGNER: That's right.
DR. BLUMENSTEIN: -- has any translation to the clinical setting. What we have is a measure of the significance of the difference in the area of the ROC curve. What we don't have is a measure of uncertainty around the clinical interpretation of the ROC curve.
What is particularly bothersome to me is that I don't know how to do that, and I don't see any methodology that gives me that answer. I'm concerned that we have started building a building whose foundation uses subjective scales to measure things so that we can use ROC methodology, and we are using resampling methodologies to do this.
We're not taking into account all the various sources of variability and so forth so we are way out there and our foundation may be collapsing and not giving us what we need with respect to the clinical outcomes.
DR. WAGNER: Well, if this were broadcast on academic TV today, apoplexy would abound in the community, because we all feel we are building, as you say. We're building on decades of people trying to measure complex perceptual phenomena. This is where we are right now.
It may not be the endpoint you would like to be at, but this is about the best of where we are at the moment. I tried to challenge you a moment ago: if you wanted to work on any action-oriented clinical endpoints, I think it's very difficult to sort that out.
It's very difficult because you'll get bigger error bars and it's very difficult because the expected utility problem is one that every person in this room has a different answer to that problem. I think it's very difficult. I agree with you that we are constantly besieged by our clinical colleagues who would like to have better answers to this problem.
One case which is kind of unambiguous is Yulei Jiang's data that I showed you: an ROC curve that went up. The unaided condition was lower. The action item, the dichotomization, went from a certain sensitivity to a higher sensitivity and a lower false-positive fraction.
I think everyone loves that scenario. Wouldn't you say? That's the world we want to live in. Right? That doesn't happen a lot. These more ambiguous things happen more often. So what we can do is average over the relevant parameters and say this is what we found.
In principle if one ROC curve is higher than the other, in principle one can operate at a given false positive in one modality and increase the sensitivity. For every time B is higher than A, if the specificity is here and the curve is everywhere higher, in principle I can operate at a higher sensitivity. In practice how to do that, wide open. This is a professional society issue that is bigger than all of us. That is a really tough question. I agree.
DR. BLUMENSTEIN: And just to throw one more complicated issue into all this is that a lot of this stuff that you presented here assumed that the modalities were assessed independently. In other words, modality A versus Modality B but the experiments that we are asked to look at are modality B added to modality A.
DR. WAGNER: Right.
DR. BLUMENSTEIN: Where the experiment itself has built-in constraints with respect to how one behaves in doing that. I don't see that taken into account.
DR. WAGNER: No.
DR. BLUMENSTEIN: And I'm concerned about that.
DR. WAGNER: This is a point of confusion. I would disagree with you. Modality A here is the reader unaided. Modality B here is the aided condition, the reader aided by the computer aid. This is a standard paradigm, and it actually corresponds to an experiment in the real world that you would like to do.
It may not line up exactly with the clinical setting but you actually would want to know something about the performance of readers unaided and then you want to know about how they would perform in the aided condition. That is actually the comparison of interest.
DR. BLUMENSTEIN: I realize that, but the way in which the data are recorded is such that the judgment -- as I understand it, the judgment under A is there and is never backed off from. You can only improve.
DR. WAGNER: Oh.
DR. BLUMENSTEIN: And that's not taken into account in any of these models that I see. All the models that you presented, everything that you said, is based on having an independent assessment of the two modalities.
DR. WAGNER: Well, you have also touched on something that we have had a lot of discussions on. These are real issues. I'm not making light of anything you're talking about here. One hopes the day will come when these modalities are really good. These computer aids are really good and then you'll be allowed to back off. You could depend more heavily on the modality.
Today people are being encouraged not to back off but the measurement doesn't require them not to back off. They are just encouraged, "Do not back off," and there is a basic reason for that I think Dr. Sacks will explain later on so people are encouraged not to back off.
But when the systems are really good, as they are in mammography -- these computer-aided systems in mammography are almost flawless for picking up clusters of microcalcifications. They are far from perfect for masses, but they are almost flawless for microcalcification clusters, so a lot of the readers using these systems have thrown away their eye loupes; they are willing to depend on the computer.
I'm just giving you anecdotal evidence. You have a really good point. I don't have a really good answer to it, but in principle it doesn't have to be this way. At the moment it is this way.
DR. IBBOTT: I would like to remind everyone we will have time to discuss this specific proposal in front of us later on this afternoon.
DR. STARK: May I ask a question exactly on the point of the presentation, I believe?
DR. IBBOTT: Yes, please.
DR. STARK: Using the classic -- thank you. That was an outstanding presentation.
DR. WAGNER: Thanks.
DR. STARK: Let me just get to the point because I know we are running short on time. Which would be the better test, A or B, in some context in terms of clinical utility -- perhaps the one that had less scatter? You showed the Beam paper where the radiologists' skills cause scatter in the distribution of the family of curves.
It would seem to me that there would be two criteria applicable here, where we face a different choice: the test with the larger Az is not the better test if that test is less flexible -- I'm sorry, has a larger scatter in terms of variability of radiologist performance and radiologist implementation, creating a management problem, an implementation problem, on top of the clinical utility problem that this fabulously sophisticated group here is focused on.
The other area where the larger Az -- so if there is more scatter in the test with the larger Az, it will likely be an inferior test: more cumbersome, more costly, less safe, and less effective in clinical utilization.
The other thing is, if there are two tests with comparable scatter but one is easier to train on, with experience or inexperience -- if you have a trained panel of readers like you do under these study conditions, under very circumscribed conditions where they know they are in a test and are not distracted by clinicians or by the busy, realistic environment of real mammography or chest CT practices -- you can have a curve that is more pliant in the direction that you want doctors either to start at with distractions or to move into with experience. So it does seem to me that the scatter, or the flexibility, of the performance matters.
The ROC curve I think is unassailable, and I have learned -- I have enjoyed a ton here learning from Dr. Blumenstein's analysis, from yours, and those of you who have seen whatever I wrote here know my group had to do this 20 years ago. We published papers on ROC analysis, and I know we're on the right -- I believe we're on the right foundation.
I think this is the right place to start, but the breadth of the challenge facing us all here today is: let's not get obsessed with the ROC curves. I know we have the whole day for this, but the safety and effectiveness of this is going to be what happens when you drop it into a clinical environment.
And we have a lot of experience with breast, and this panel has a lot of experienced people on it, but can you tell me if you would agree that we need to see the scatter in these Az plots and know how they respond to inexperience or training to really know if the larger Az is better?
DR. WAGNER: Well, I would say that I think there is a little bit of a second-order phenomenon here that is important. Just because something is second order doesn't mean it's not important. For the practical inferences that have been -- the endpoints of the studies we've seen to date, it has been the performance in the mean.
People have addressed that. There is software. We have several papers on how to do just what you say and how to split out every piece so we can see how much variation is from the cases, from the readers, from the various interactions. There is actually software to do that and we are encouraging people who operate at a higher level, say NCI or some academic consortium, to address these very issues and we can see it. We know how to peel all this stuff apart. As far as the inference on the table today, it was not done.
DR. STARK: The burdens would be huge. I mean, the sample sizes, the whole time period, the number of people that have to be involved.
DR. WAGNER: That's right.
DR. STARK: That's why you talked about the need for national studies and we would all like to do that in oncology and everything but we have to treat people and make decisions today.
On the other hand, let me ask my final question. Are you aware, or is anybody aware of any evidence that a p-value or some other statistical measure comparing your test A, B under whatever conditions, today's conditions or the ones I am dreaming about, we hope it has some clinical relevance but couldn't it all be counter intuitive? I mean, this is a very subtle business and couldn't we be missing the forest for the trees here?
DR. WAGNER: Again, that's a very wise question and I think that is why we have several medical officers involved in our center on the panel here so I'll defer to them.
DR. STARK: So the p-value of .003 doesn't necessarily mean a thing.
DR. WAGNER: I defer to my clinical colleagues for that.
DR. STARK: Thank you.
DR. IBBOTT: I want to make sure that we give Dr. Mehta a chance to ask a question if he has one. Dr. Mehta, do you have any questions? He may not be able to hear me.
DR. MEHTA: No, I don't have any questions.
DR. IBBOTT: Thank you.
All right. We are a few minutes ahead of schedule at this point, so we'll take a short break. Let's make it 10 minutes, and we'll be back at 10:50.
(Whereupon, at 10:40 a.m., the proceedings went off the record until 10:55 a.m.)
DR. IBBOTT: Take your seats, please. I'd like to continue the panel now, if you will take your seats, please. For those of you who, like me, are concerned, we are getting the heat turned down in this room. At least in one sense.
We will now proceed with the sponsor's presentation which will be introduced by Dr. Kathy O'Shaughnessy who is Vice President of R2 Technology. Dr. O'Shaughnessy.
DR. O'SHAUGHNESSY: Thank you very much, Dr. Ibbott. We are very pleased to be here today to present our ImageChecker CT CAD software. I would like to introduce the attendees who are here from R2 and some consultants whom we have asked to be here today to both present and answer questions from the panel.
Besides myself from R2 Technology there's Dr. Castellino, our Chief Medical Officer; Dr. Wood who is the head of our CT Products group; and Mr. Schneider who is the lead algorithm architect that designed the algorithm that we are reviewing today.
In addition, we have asked the following people to join us. Dr. Delgado was a beta user of the system so he can describe a little bit about his experience using the system at his facility. Dr. MacMahon is a thoracic radiologist from Chicago with extensive experience in both CAD and ROC research. Mr. Miller is a biostatistician for the study. Dr. Stanford was one of the site investigators where we collected cases from one of the sites.
Here is a brief overview of our agenda. After my introduction we'll go into the current clinical practice for some background on lung CT and, in particular, the detection and management of nodules and lung CT images. Then we'll describe the device both in terms of how it works and how the user uses it.
The clinical study will start first with how we collected the cases that were used and then go into detail into the methods and results from the clinical study. After that we'll have a brief discussion, presentation about the beta test that describes a little bit about the usability of the system. And I'll finally summarize.
Before we move into the presentation, I wanted to put up our proposed indications for use of this device. I thought it was important to go over this to put what we are presenting today in context. The ImageChecker CT is a computer-aided detection or CAD system designed to assist radiologists in the detection of pulmonary nodules during review of multi-detector CT scans of the chest.
It's intended to be used as a second reader alerting the radiologist after his or her initial reading of the scan to regions of interest that might have been initially overlooked.
I would like to ask Dr. MacMahon to come to the podium, please.
DR. MacMAHON: Thank you. Again, I'm Heber MacMahon. I should say I have a small equity interest in R2 Technology. The company has also paid for my time and expenses for this meeting.
I would just like to make some brief comments about the actual clinical practice of radiology as it relates to thoracic CT scans and the importance of detection of pulmonary nodules.
Some of the common indications for performing thoracic CT scans would include characterization of an abnormal finding on a chest x-ray. In this situation an abnormality may have been detected and the purpose of the CT scan would be to characterize it as possibly a lung cancer. And in addition to detect additional abnormalities that might be relevant such as metastatic nodules.
We also used thoracic CT scans extensively for staging and monitoring lung cancer and other kinds of tumors. In this situation we are looking not only for pulmonary nodules, but also for enlarged mediastinal lymph nodes and upper abdominal abnormalities.
In the case of extra-thoracic tumors we are commonly also looking for pulmonary nodules and for enlarged lymph nodes in the mediastinum. Then there are a range of other applications of thoracic CT, some of which are developing and will be used more extensively, such as detection of pulmonary embolism. However, in all these situations, although the pulmonary nodules are not the primary focus of the examination, there is an opportunity to detect pulmonary nodules that may be present in the lungs of these patients.
Finally, lung cancer screening which is investigational and depending on the outcome of the ongoing NLST study may be used more widely. And, of course, in lung cancer screening pulmonary nodules are the main focus of the investigation.
But the point I would make is that lung nodule detection is a requirement in every chest CT scan no matter what the original clinical indication. Only when the radiologist has detected a nodule can he or she decide what course of action is then appropriate.
There are various management strategies that can be used to manage a pulmonary nodule. In order to determine whether it's an actionable nodule, we need to consider the size. Generally larger nodules are more dangerous and more likely to be cancerous.
We consider the shape, whether it's spiculated, ground glass, and so forth, and whether there's been interval change from a previous examination in the same institution; that would be part of the normal diagnostic process to make that comparison. We would consider, of course, the clinical context, the age and gender of the patient, smoking history, and so forth. There are a number of factors that play into that decision in addition to the image itself.
If the nodule is considered actionable, we can recommend a number of courses of action. One of the most common would be to obtain outside prior imaging studies from other institutions. If we can establish stability over a period of time, no further action may be necessary.
A follow-up CT scan might be prudent at anything from three months to 12 months depending on the nature of the nodule and the radiologist's level of suspicion. Other kinds of imaging studies such as a PET scan may be applicable, especially in larger nodules that are in the range of 8 to 10 millimeters. This may distinguish cancer from a benign nodule.
Finally, we can consider biopsy, either transthoracic needle biopsy, bronchoscopy, or thoracoscopic resection.
Just to illustrate the clinical problem, here is an example of a very small pulmonary nodule which I think might easily be overlooked in clinical practice. It's almost indistinguishable on the single section from surrounding blood vessels but this is, in fact, a small lung cancer which was detected one year later, as you can see, at which time it is much more advanced.
So this is a very challenging problem for radiologists, to visually detect these very small nodules on CT scans. We are aware that we do miss nodules and I'll just cite two particular studies of interest that have addressed this issue of missed nodules on CT scans.
Dr. Hartman and others at the Mayo Clinic looked at over 1,000 screening CT scans and compared them with prior screening CT scans one year earlier to see how many nodules may have been overlooked. They found that as many as 24 percent of the prior prevalent scans had nodules that were not recorded at that time.
This might seem an astonishingly large number but this is consistent with some other studies. Now, a large number of these nodules were relatively small, but more than one-third of them were above three millimeters and in the size range where they are likely to be considered actionable.
And, in fact, 6 percent of them had grown, which would mean that they were highly suspicious for lung cancers, so there seems little doubt that nodules are being missed even in excellent centers such as the Mayo Clinic in a study that was focusing specifically on the detection of nodules.
One other study performed by Gruden and others at Emory University looked at 25 patients with presumed lung metastases. These patients had soft tissue sarcomas and melanoma and they established truth by consensus which is a practical method using five readers. These nodules were three to nine millimeters in size and they were solid nodules. Two to nine solid nodules in each case by consensus.
They found that the miss rate for individual readers ranged from 20 percent to 39 percent of all of the nodules in this size range. This was in an observer test setting where the readers were focused on detecting nodules and presumably had no other task in mind so one would expect a relatively good performance in that situation.
So between these two studies we can see that there is a considerable problem with oversight errors in reading CT scans. Now we have a trend towards thinner CT sections with the newer multi-detector scanners. This allows improved ability to detect and characterize lesions. It also allows us to do high-quality off-axis reconstructions.
On the other hand, it does present us with more image data, more opportunities for error. In a chest CT scan performed with a multi-detector unit we may have anything from 18 to almost 300 images of the chest and the radiologist has to interpret those visually.
I think that the evidence that we've seen strongly suggests that traditional visual interpretation is no longer sufficiently reliable for detecting these very small and potentially dangerous nodules.
At this point I would like to introduce Ronald Castellino, Chief Medical Officer for R2 Technology.
DR. CASTELLINO: Thank you. My name is Ron Castellino. I'm also a diagnostic radiologist but currently I'm the Chief Medical Officer of R2 Technology.
At the outset I'd like to particularly emphasize the definition of computer-aided detection which is also called CAD as we will be using it in the presentation today. Computer-aided detection as we use it refers to the availability of computer algorithms that automatically identify regions of interest on a medical image for the radiologist to evaluate.
Its purpose, of course, would be to decrease what I would term observational oversights. That is, findings that are present on the image but, in fact, are not seen by the radiologist. This is not a device to tease out very unusual nodules that are barely visible on the image. These nodules are actually clearly visible on the image.
The ImageChecker CT CAD system specifically is designed to automatically detect regions of interest with features suggestive of solid pulmonary nodules on CT exams of the chest. It's important to remember that it is to be used as a supplemental review. That is, after the initial assessment has been made by the radiologist. It is not a first reader.
The radiologist, most importantly, remains responsible for the final interpretation of the findings that the CAD marks may put on the image. That is, to determine if the mark is actually a true mark or if it is a false mark.
A brief review of the device description. The CT scan is performed in the standard fashion. The images, or the data set, are moved to the increasingly common types of workstations that radiologists review the images on, in what we call a soft-copy display. These images may be reviewed slice by slice but increasingly they are reviewed in some type of scroll-through or cine mode to facilitate reviewing these hundreds of images that are generated.
By the same DICOM standard the data set can also go through a server computer. Various image analysis algorithms can be put into place. In this case, I point out segmentation. This type of information can also be transmitted to the workstation to help the radiologist further analyze the images and this is the ImageChecker CT workstation which was cleared by the FDA in 2002. This is an existing product that has been cleared.
The same DICOM data set can also go through the ImageChecker CT CAD software system and provide on the workstation CAD information as well. It is this specific piece of the product that is under review today by the panel.
I'll show you a few screen capture images of the front end of the work station on which the CAD marks are displayed. The view port on the right is familiar to radiologists. This is where we can see the axial images. I guess I can't use this thing. Thank you. We are a high-tech business as you can see.
There we go. On the large view port on the right we can see the axial image displayed to the radiologist which is viewed either singly or, like I said, in scroll-through or cine mode. The smaller view port on the upper left is a three-dimensional reconstruction of the contents of the lung.
You can see the pulmonary vessels. In fact, a few nodules perhaps you can see there. And the horizontal line simply indicates to the radiologist at what level the axial image is displayed. We see a nodule here quite clearly in the right apex.
The radiologist then will move down the entire sequence of the lung in the lung windows looking for other abnormalities, nodules as well as a multitude of other features that the radiologist searches for sometimes seeing nodules and sometimes not seeing nodules.
When they completely review the entire study, which I'm giving to you in a very schematic fashion here, the radiologist then will activate with a mouse click the CAD button we call the R2 button. At that point in time the CAD process takes over and presents the following.
The circles indicate candidate nodules that the CAD system has identified shown to the radiologist on the three-dimensional display of the lungs, as well as brings the radiologist automatically to that specific site where the nodule is best seen by the CAD system.
In addition, our other view port on the lower left is shown. This is a three-dimensional reconstruction that can be rotated to separate the nodule out from adjacent vasculature. I would like to emphasize that upon the CAD review the radiologist need not go through the entire data set once again; simply hitting one of these little buttons here with a mouse click -- which you can't read here -- automatically jumps the image. By the way, the size is automatically shown as well.
It automatically jumps the image to the next CAD detected nodule and the next and so forth. For example, this nodule, as I showed you and, for example, a nodule at the right base which is clearly a nodule but, in this case, had been overlooked by the radiologist on the set of images.
That is the CAD display on the workstation. What does the CAD search for? It is specifically designed to search for solid lung nodules that are 4 mm. or greater in size, and we define that further as follows. They should have an approximate spherical shape.
The margins can be smooth, lobulated or spiculated, and they should have soft tissue density which we define as having average density of minus 100 Hounsfield units or greater. Some of the typical CAD marks you've seen already. They circle the nodule. We consider this a true mark if it actually encompasses the nodule, whether quite small or moderate in size.
I would also like to emphasize that although we look for spherical nodules, if, in fact, the nodule is adjacent to a pleural surface, where a portion of the sphere is obliterated by contact with the pleural surface, the algorithm tries to find these as well.
Secondly, on this image, as perhaps some of you can see, although it is easier for the radiologist and the CAD system to detect a nodule that is surrounded by completely normally aerated lung, if there is adjacent modestly non-aerated lung, as we see here with the dependent edema, the CAD algorithm often is successful in teasing out the nodule as well.
There are a multitude of other parenchymal abnormalities within the lung tissue that the CAD algorithm does not search for; the radiologist must look for these. For example, linear strands, which do not fit the criteria. I would like to point out, importantly, that although this fits the criteria of being a spherical nodule, we call these ground glass opacities.
They are increasingly noted to be of importance, particularly for lung cancer screening programs, but because of the Hounsfield density cutoff that we have, this type of nodule currently is not searched for with our set of algorithms.
All CAD systems have false marks. We see a few here, such as this one where a branching vessel exists. The CAD algorithm thought this was a nodule and marked it incorrectly. Pleural tags are at times marked incorrectly. I can tell you that our experience internally as well as with users indicates that the vast majority of these false marks can be readily dismissed, as you see here.
As an aside, we have found in our regulatory database a median of three false marks per exam. I would like to emphasize this is per exam. There is a median of 160 images per exam so we're talking about approximately one false positive mark for every 50 to 55 individual images.
Now, the clinical study was designed around an ROC study, as you've heard from Dr. Wagner. It was done in close collaboration with and support from the people from the FDA. The ROC study to a large extent provides a combined measure of efficacy and safety. There is some discussion about that and Dave Miller will fill you in on that as we see it, at least.
There are three parts. We've collected cases. I'll review that. These cases were sent to a reference truth panel and finally to the MRMC ROC study which you'll hear about from Dave Miller.
I would like to make only a brief comment upon the target nodules. You've heard from Dr. MacMahon that we are increasingly seeing smaller nodules on our CT scans in our clinical practice. We wanted to design a CAD system to help radiologists detect all solid nodules between 4 and 30 mm. That was the focus of our research effort.
And, as those of you in clinical practice will recognize, most lung nodules most of the time are typically sampled by biopsy or thoracoscopic resection only if they are 8 or 10 mm. or so greater in size. There are obviously exceptions to this but, in general, that is the case.
A biopsy-proven, so-called gold standard to evaluate nodules in this smaller size range was just not available to us. We settled on a reference standard of a consensus on actionability as being the only practical standard that would capture all solid nodules of clinical concern in this size range. We are really focusing on trying to help the radiologist in the 4 to 8, 10 to 12 mm. range. The larger nodules, of course, radiologists will almost always see.
We collected cases from five centers. They contributed consecutive non-selected cases. We tried to make this as representative as possible. They were all in adults. They were performed for a variety of clinical indications. There were no screening studies in this group.
Cases with greater than 10 nodules were excluded. We felt that in cases with such a multiplicity of nodules, the issue of searching for an additional nodule when the radiologist has already seen 8, 10, 12, or 15 would be of little relevance. The images, of course, had to meet certain technical parameters.
These cases were divided into two categories to begin with, by report. The nodule-present cases had in the report the presence of one nodule or more described by the reviewing radiologist. These patients by definition had a history of biopsy-proven documented cancer either primary to the lung or in an extra-thoracic site.
We did this to try to increase the likelihood that nodules in this group might have clinical significance because they were in patients with cancer but I would like to point out that the specific nodules themselves were not biopsy proven. The nodule absent cases, once again by report, no nodules were described within the context of the report. These patients could have a history of cancer or not.
The final truth was determined by the reference panel, which you'll hear about from Mr. Miller. Five sites contributed to the study. Three of these are community imaging centers and two are university centers. They were from the East Coast, Midwest, and West Coast. There were 63 cases that were nodule-present by report and 88 nodule-absent by report.
You can see the distribution between males and females was similar. The age range was similar in the two groups. There was a slight increase in median age in the nodule-present cases, perhaps because they all had documented histories of cancer as compared to this group. As for the type of cancer in the nodule-present cases, 38 percent had a documented primary lung cancer and 62 percent had documented extra-thoracic primaries.
Here are some of the parameters of the technical aspects of the case characteristics, the median number of slices you see here. There is a slight predominance of thinner slice sections in the nodule absent cases mainly because one of the centers was doing much thinner slices routinely and they contributed a larger amount of nodule absent cases.
The CT vendors used in these five sites were General Electric or Toshiba.
I would like to ask Dave Miller to present the methods and the results of the study.
MR. MILLER: Thank you. My name is Dave Miller and I am currently the Director of Statistical Analysis at Ovation Research Group. At the time that this study was conducted I was the Director of Biostatistics at R2 Technology. R2 is paying for my time and travel. However, I do not have any financial interest in R2 Technology.
Just want to quickly go through an outline of what I'm going to discuss because I'll be up here for a little while. I'm going to go through some definitions that I'll be using during the talk. Then I'll talk about the reference truth panel. I'll talk about the ROC study design, our primary analysis. Then we did a large set of robustness analyses. Then finally the study conclusions.
So gold standard, and these are definitions that I'm going to use. They are not necessarily dictionary definitions of these but gold standard is something that I'll define as an objective and definite measure of truth.
The reference truth is a truth standard for a subjective construct. It is a term that is fairly widely used and it's a term that I'll be using here as a standard that's used in lieu of an available gold standard. The kind of thing that reference truths are used for are things like actionability where actionability is something I'm defining as a subjective point-of-care decision which is really what we're targeting with actionable nodules.
Nodule also is a subjective definition. It's a subjective characterization of a lung abnormality. Finally, a panel is a group of radiologists with a given task. In this case, their task was to identify and characterize actionable nodules. Consensus is a term I'll use only for unanimous agreements. When you hear we use consensus, that means unanimous agreement as opposed to majority agreement.
Then, finally, a few study definitions. I'll run through these very quickly because you've got a very nice tutorial from Bob Wagner this morning. The ROC curve is the receiver operating characteristics curve. AZ is the area under the ROC curve, the measure of interest in the study.
MRMC stands for multi-reader, multi-case. I'll use the term primary analysis for our protocol specified primary analysis and the term ANOVA-after-jackknife. The ANOVA there is analysis of variance and you've got a nice description of both the jackknife and the bootstrap earlier.
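For readers unfamiliar with Az, here is a minimal sketch of how an empirical area under the ROC curve can be computed from confidence ratings, using the standard Mann-Whitney (rank) form. The ratings are invented for illustration and are not data from the study:

```python
def empirical_auc(ratings_present, ratings_absent):
    """Mann-Whitney estimate of the area under the ROC curve (Az):
    the fraction of (present, absent) pairs in which the truly
    positive unit received the higher rating; ties count one-half."""
    wins = 0.0
    for p in ratings_present:
        for a in ratings_absent:
            if p > a:
                wins += 1.0
            elif p == a:
                wins += 0.5
    return wins / (len(ratings_present) * len(ratings_absent))

# Hypothetical 0-100 confidence ratings like those the readers assigned.
present = [90, 70, 60, 85]   # units truly containing an actionable nodule
absent = [10, 40, 60, 20]    # units containing none
print(empirical_auc(present, absent))  # 0.96875
```

An Az of 0.5 corresponds to guessing and 1.0 to perfect separation of present from absent units.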
So under the reference truth panel the goal of the reference truth panel was to fully identify all nodules in the case sets. These are the cases that Ron described how they were collected. We wanted them to rate the actionability of any nodules that they found. Specifically we are defining actionable as a nodule that requires surveillance or intervention so it could be follow-up or it could be more of an intervention.
We defined the reference truth so that we could use it in the ROC study. The method was to have a panel of three radiologists independently review the cases, and we followed a two-pass process to reduce observational oversights.
The reference truth panel qualifications were that they needed to be board-certified radiologists, that they had at least six months of experience reading thin-slice (which we defined as less than or equal to 3 mm. collimation) CT of the chest, and that they needed to have experience with reading soft copy.
A total of 11 panelists participated in at least one of the three-member panels that were convened. Just to be clear, we didn't have a single three-member panel because it just would have taken weeks for three people to review the set of cases that we had. We had a succession of panels and there were a total of 11 different panelists that participated in at least one of those panels.
Nobody participated in more than three and obviously nobody participated in fewer than one. This is how the panels worked. We brought the radiologists in and we put them in three different rooms. This was after a brief sort of training that we gave them prior to going to the three different rooms. They had three different workstations set up and they each independently reviewed a set of cases. In a typical session we had about 20 cases reviewed.
After they had reviewed all of the cases for a given day, and this usually took maybe four to six hours or so, we took the computer files of all of their findings -- these are findings with the exact locations -- and we brought them together to get the union of all findings so that redundant findings were captured and we knew every finding that any panelist had found.
This is a little hard to see up there but we also at this stage excluded nodules that were less than 4 mm. in size or greater than 30 mm. in size. Those were protocol exclusions and we had asked the radiologists not to spend too much time taking precise measurements as they were doing this.
After this there were 95 findings where three out of three of the panelists agreed that it was a consensus actionable nodule. I couldn't say consensus. Three out of three agreed and, thus, there was a consensus that it was an actionable nodule.
Now, there was also a large set where there was disagreement. Either one out of three or two out of three of the radiologists had identified the finding and the other radiologists either had overlooked the finding or didn't feel that it was an actionable nodule. These went to a second pass.
The way the second pass worked is that after about a half hour of prep or so they went back into their individual rooms, so they didn't come together and talk about the cases. They each went back to their individual rooms and they had the locations of each of these disagreement findings identified for them. So the second pass went fairly quickly because they didn't need to go through the whole case. They were just being directed to specific spots and being asked to rate the actionability.
After this there were 47 additional nodules that went into our truth set of unanimous nodules. There was also a fair number that went into what we call the majority group, that two out of three felt that it was actionable, and a minority group that one out of three felt that it was actionable.
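The bookkeeping described here, sorting each finding by how many of the three panelists rated it an actionable nodule, can be sketched as follows. The agreement counts are invented for illustration:

```python
from collections import Counter

# Hypothetical per-finding agreement: how many of the three panelists
# rated each candidate finding an actionable nodule (after both passes).
agreement_counts = [3, 3, 2, 1, 3, 2, 3, 1, 1, 3]

tally = Counter(agreement_counts)
consensus = tally[3]   # unanimous -- the truth set for the primary analysis
majority = tally[2]    # two out of three
minority = tally[1]    # one out of three
print(consensus, majority, minority)  # 5 2 3
```

Only the unanimous group enters the truth set; the majority and minority groups are held aside for robustness analyses.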
Our primary analysis focuses on consensus agreement but we did do some robustness analyses around the majority and minority. I'll be talking about that later but for now I'm focused on the unanimous nodules.
So as a result of this process, the eight three-radiologist panels -- I told you there was a series of panels; there were, in fact, eight of them -- identified 142 consensus nodules in 65 nodule-present cases. You might notice that the number 65 is slightly different from the 63 number that you saw earlier. That's because now our consensus panel is the definition of truth for this study.
You can see the size of these findings. The median size was 7.9 mm. and there were a lot of them that were in the 5, 6, 7 millimeter range. The remaining 86 cases were categorized as nodule absent by virtue of not having any of the unanimous nodules in them.
So moving onto the MRMC ROC study, the objective of this study per protocol was to demonstrate that review of CAD output improves performance of radiologists reviewing MDCT with respect to their ability to accurately identify actionable nodules.
Our outcome measures were AzB. That is, the before CAD area under the curve, AzA, that is the after CAD, the area under the curve and, most importantly, Azdelta. This is basically the difference between the two curves. And the hypothesis in a formal statistical sense -- the null hypothesis was that the mean change in the area under the curve was zero and the alternative hypothesis, of course, is that Azdelta is greater than zero meaning the CAD did have a benefit.
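As a sketch of these outcome measures, the per-reader figures below are invented for illustration (they are not the study's results); only the computation of Azdelta and the direction of the hypotheses follow the description above:

```python
# Hypothetical per-reader areas under the curve (not the study's results).
az_before = [0.80, 0.78, 0.85, 0.82, 0.79]   # AzB, before CAD
az_after = [0.84, 0.81, 0.86, 0.85, 0.83]    # AzA, after CAD

deltas = [a - b for a, b in zip(az_after, az_before)]
az_delta = sum(deltas) / len(deltas)   # mean change in Az

# Null hypothesis: mean Azdelta == 0.
# Alternative: Azdelta > 0, meaning CAD had a benefit.
print(round(az_delta, 3))  # 0.03
```

The significance of that mean change is what the jackknife and bootstrap procedures described later are used to assess.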
The study was conducted in two phases. We first did a 32-patient study and then after doing that study we had some discussions with FDA and we outlined what would be the appropriate methodology to use for a second study, what the appropriate size for the second study would be based on the type of methodology that was suggested. So I'm going to be talking about that second 90-case study as the focus of this talk.
The reader qualifications for the ROC study -- so this is, again, a new set of readers. Don't confuse them with the reference truth panel. Completely different people. It would be wrong to have the same people. The reader qualifications were that they be board-certified radiologists and have at least three months of experience reading MDCT of the chest.
The basics of the study are that we had 15 readers read all cases. We had 90 cases. Of the 90 cases, 48 had at least one actionable nodule and 42 did not have any actionable nodules, and that was based on a stratified random sample of our complete set of cases.
There were, of course, four quadrants per case by definition but the important point is that these quadrants, all four of them, were rated pre-CAD and then sequentially post-CAD. The ratings were finally evaluated against the reference truth so the ROC curves were drawn by comparing the ratings which were on a continuous scale to the reference truth established by the panel.
I want to clarify what the unit of analysis is because I know people have a tendency to want to sort of track the numbers as they go through the slides and see where things add up so, just to be clear, nodules were the unit of analysis for the reference truth. The reference panel was supposed to identify every nodule.
Quadrants -- the quadrant truth was computed from the nodule truth. For instance, if there was a quadrant that had one actionable nodule and one non-actionable nodule, the quadrant was, nonetheless, considered nodule-present quadrant because it had at least one.
On the other hand, if there was a quadrant that had a minority nodule in it, in other words, a nodule that at least one person on the panel thought was a nodule but not unanimous, that was considered a nodule absent quadrant. Every quadrant counted in every analysis that we did.
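The mapping from nodule-level truth to quadrant-level truth just described can be sketched as a small function. The quadrant names and agreement labels here are illustrative, not the study's actual data format:

```python
def quadrant_truth(findings_by_quadrant):
    """Derive quadrant-level truth from nodule-level truth.
    A quadrant counts as nodule-present only if it contains at least
    one unanimous (consensus) actionable nodule; majority or minority
    findings alone leave it nodule-absent."""
    return {quadrant: any(level == "unanimous" for level in levels)
            for quadrant, levels in findings_by_quadrant.items()}

# Hypothetical case: one quadrant holds a consensus nodule plus a
# minority finding; another holds only a minority finding.
case = {
    "upper_right": ["unanimous", "minority"],
    "upper_left": [],
    "lower_right": [],
    "lower_left": ["minority"],
}
truth = quadrant_truth(case)
print(truth["upper_right"], truth["lower_left"])  # True False
```

This mirrors the two rules given above: one actionable nodule makes a quadrant nodule-present, and a minority-only finding leaves it nodule-absent.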
Now, the reason that we went with this quadrant approach is that the LROC methods for multi-reader, multi-case studies were not developed at the time that we embarked on this. I think they probably will be in time and they may even be right now, but at the time we began the study, they were not.
Bob Wagner described it a little bit as these being sort of competing fields that people that went with the ROI approach versus the people that go with the full localization. I think really there are two camps that are going after the same thing of trying to get some measure of localization added to the ROC method.
We felt that for this particular case, where you might have a nodule that was quite large in one lung and then a smaller nodule in the contralateral lung, that smaller nodule in some cases might be the really important one that actually drove the care. We felt that getting at localization in some way was important. We went with the quadrant approach.
The quadrants were rated by the ROC readers but then the case, not the quadrant, is the unit of analysis for the computation of the p-values and the confidence intervals based on the jackknife and the bootstrap. You heard these references mentioned earlier but Obuchowski specifically is the reference for using this region of interest or quadrant approach. Carolyn Rutter is the person that developed the method of using the bootstrap to sample cases.
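A minimal sketch of the case-level bootstrap idea, resampling cases (not quadrants or readers) with replacement to get a percentile confidence interval, is below. The per-case values and sample sizes are invented, and the real analysis (ANOVA-after-jackknife, Obuchowski's region-of-interest method) involves considerably more machinery than this:

```python
import random

def case_bootstrap_ci(per_case_values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for a mean, resampling CASES with
    replacement -- the case, not the quadrant, is the unit of analysis."""
    rng = random.Random(seed)
    n = len(per_case_values)
    means = []
    for _ in range(n_boot):
        resample = [per_case_values[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Hypothetical per-case contributions to the change in Az.
per_case = [0.00, 0.05, 0.10, -0.02, 0.04, 0.06, 0.01, 0.08]
low, high = case_bootstrap_ci(per_case)
print(low <= sum(per_case) / len(per_case) <= high)
```

Resampling whole cases preserves the within-case correlation among the four quadrant ratings, which is why the case rather than the quadrant is the sampling unit.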
The reading environment for our study is that readers were trained on workstation use and we really tried to create a reading environment that was as similar to their individual practices as possible. So the usual workstation controls were available to them. If any individual reader had particular window or leveling preferences, they were allowed to modify that. We didn't have it in the protocol that they had to read a particular way that would take them out of their reading environment.
They were allowed to practice on three cases with the trainer present. The ambient lighting was adjusted to the radiologist preference. There was no hard time limit.
The instructions given to the readers were to search only for 4 to 30 mm actionable solid nodules and to rate each case post-CAD immediately after the pre-CAD rating, so they had to go through the entire case pre-CAD and provide the ratings before the computer would even allow them to turn on CAD and then provide the post-CAD ratings.
They were instructed to consider age, gender, and clinical indication. These were taken from the radiology report. We did not provide them with the full radiology report, as that obviously would have provided too much information for them to be able to make their own decisions.
So the basic study work flow here -- let's see which of these works. Yeah, this one works. When you saw the work station earlier, there was no blue line. The blue line is separating the upper quadrant from the lower quadrant. We didn't feel like we needed a line to separate left and right. The yellow line is indicating where they are in the exam.
As they were reading the case, they had the opportunity to bring up a pop-up menu to rate the quadrants at which point they would get this little cartoon of sorts with these slider bars. They would move the slider bars either all the way over -- you can't see. There's a little 100 there -- to indicate complete confidence that there was at least one actionable solid nodule present in the quadrant, or zero to indicate complete confidence that there were none.
In this particular case you can see that the reader has gone through and given a pretty low confidence or, I should say, a high confidence that there are no nodules present in any of the quadrants.
Having done that they then have the opportunity to click this button up here and turn on CAD. It's a little bit hard to see here but there is a potential nodule. I'm not a radiologist. I won't tell you whether it is a nodule but it is located there in the upper right quadrant. Then they would have the opportunity to rate the case again.
In this case they might have changed their rating for that quadrant. Since there was only a mark in the upper right-hand quadrant, it's fairly unlikely that they would have changed any of their other ratings, but they were allowed to.
So after doing this with our 15 readers, who each read the 90 cases both pre-CAD and post-CAD, we were able to draw the ROC curves for each of the individual readers. This is just an example of a single reader, and so the area under the dashed line is the pre-CAD Az, the area under the blue line is the post-CAD Az, and the area in between the lines is the delta Az.
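As a rough illustration of how an area under the ROC curve like these per-reader Az values can be computed nonparametrically from 0-to-100 confidence ratings, here is a generic Mann-Whitney (trapezoidal) sketch. The ratings are made up for illustration; this is not the sponsor's software or data.

```python
import numpy as np

def empirical_auc(pos_ratings, neg_ratings):
    """Nonparametric (Mann-Whitney / trapezoidal) area under the ROC curve:
    the probability that a nodule-present quadrant is rated above a
    nodule-absent one, with ties counting one half."""
    pos = np.asarray(pos_ratings, dtype=float)
    neg = np.asarray(neg_ratings, dtype=float)
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (pos.size * neg.size)

# Hypothetical 0-100 ratings for one reader, pre- and post-CAD.
pre_az  = empirical_auc([60, 80, 95, 40], [10, 20, 55, 5])
post_az = empirical_auc([70, 85, 95, 60], [10, 20, 50, 5])
delta_az = post_az - pre_az
```

The difference `delta_az` corresponds to the area between the two curves that the speaker describes.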
These are the 15 pairs of readings. I didn't produce this plot specifically to answer some of the questions that came up earlier this morning, but I think it might answer some of them a little bit. Now, this is not the same plot that you saw earlier. This has the pre-CAD area under the curve on the X axis and the post-CAD area under the curve on the Y axis. So pre-CAD the range was from about .82 up to .96. That's the range of the 15 readers' areas under the curve. Post-CAD the range was .86 to .96, so you can see a narrowing of the range post-CAD with respect to Az.
In particular, these three readers who had -- I'm trying to look for a different word than worst -- had the worst pre-CAD Az performance, around .82 to .84, were the ones that improved the most, or were among those who improved the most. You might wonder, what about readers that did pretty well? Well, these two readers did very well pre-CAD, at least measured by Az. And post-CAD they also had some improvement. It was a more modest improvement. They didn't have as much room to improve.
Now, finally, there's this reader up here. This reader had a nearly perfect pre-CAD performance. This does just go to .96, not all the way to 1 so they weren't absolutely perfect. What you worry about with a reader such as this is you don't want CAD to cause them to change their impressions so they get worse and they did not.
So moving onto the primary analysis this is the average reader ROC curve. Again, here is the pre-CAD line, the post-CAD line, and the area in between is the Azdelta. I'm just going to focus in on this part right here because it is an important point about whether or not the curves cross.
The curves do not cross and so you can see that they are always apart. Especially in this area here I think is the area where people are most likely to have their individual operating points, although, as you saw, they might go all the way out here.
These are the same 15 dots just plotted against a different axis, so this is sort of how far away they were from that line. You can see individual reader improvements ranging from about .06 down to zero, that is, no improvement. And then the idea behind the Dorfman-Berbaum-Metz ANOVA-after-jackknife analysis is to create a confidence interval and computed p-value that would allow us to figure out what might happen with a new reader and a new case.
I mean, that's really the idea of this confidence interval is what kind of performance would we expect from a new reader with a new case. You can see that both the individual readers as well as the average delta and the confidence intervals are well on the side of CAD better as opposed to the side of CAD worse.
Now, we went ahead and did a number of robustness analyses, and these were basically about repeating the primary analysis varying different assumptions to demonstrate that the primary results are not sensitive to the study design. I think these are very, very important because there is a considerable literature showing that you can tweak different things and end up with different results. If we had found that, we would have been in a difficult position because we wouldn't have known whether or not we really did have a robust result.
I'm going to talk about this with reference to the statistical methodology, specifically the ANOVA approach versus the bootstrap approach. There are lots and lots of different iterations on this but I'm just going to focus on these two. I'm going to talk about the reference truth. I'll focus on the consensus standard versus the majority standard but there are a number of other reference truths that we examined and I'll just focus on those two.
And then panel variability. I've talked about the confidence interval being a way of getting at what would happen with a future reader with a future case. What you really want to know is what would happen with a future reader and a future case evaluated against a new truth, right?
That means that you don't just have to have the random reader and the random case components of the ANOVA model. You also have to have some way of evaluating your truth against the random panel if you are going to fully capture the variability.
So, the ANOVA-after-jackknife compared to the bootstrap. I'll run through this quickly because you heard this earlier. The ANOVA-after-jackknife is based on leave-one-out samples. Again, the leave-one-out here is cases. A case is being left out of each sample, as opposed to a quadrant.
The Az, the area under the curve, is computed for each reader-case combination, and then an analysis of variance random effects model is fit. This is the standard analysis of variance random effects model with full interactions described by Dorfman, Berbaum, and Metz.
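The case-jackknife step described here can be sketched in a few lines. The ratings and truth labels below are hypothetical, and the Mann-Whitney AUC stands in for whatever Az estimator is used; in the Dorfman-Berbaum-Metz approach, pseudovalues like these (per reader and modality) are what the random effects ANOVA is fit to.

```python
import numpy as np

def wilcoxon_auc(ratings, truth):
    # Mann-Whitney estimate of the area under the ROC curve
    pos, neg = ratings[truth == 1], ratings[truth == 0]
    gt = (pos[:, None] > neg[None, :]).mean()
    eq = (pos[:, None] == neg[None, :]).mean()
    return gt + 0.5 * eq

def jackknife_pseudovalues(ratings, truth, auc_fn):
    """Leave-one-case-out jackknife for one reader/modality: the pseudovalue
    for case k is n*Az(all cases) - (n-1)*Az(all cases but k)."""
    n = len(truth)
    full = auc_fn(ratings, truth)
    pv = np.empty(n)
    for k in range(n):
        keep = np.arange(n) != k
        pv[k] = n * full - (n - 1) * auc_fn(ratings[keep], truth[keep])
    return pv

# Hypothetical 0-100 ratings for 8 cases (1 = nodule present, 0 = absent).
ratings = np.array([80., 60, 90, 30, 20, 55, 10, 70])
truth   = np.array([1,   1,  1,  1,  0,  0,  0,  0])
pv = jackknife_pseudovalues(ratings, truth, wilcoxon_auc)
```

The mean of the pseudovalues recovers a jackknife estimate of the full-sample Az.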
The bootstrap -- I think nonstatisticians a lot of times find the bootstrap a little bit more intuitive. The experiment is replicated in 1,000 random samples, so from our sample of readers and cases we generated a random sample of readers and a random sample of cases, and for each sample we matched our random readers with the random cases and repeated the entire analysis.
It is very computationally intensive, but it gives you a way of coming up with confidence intervals that allow a fully nonparametric approach to evaluating what would happen with a future reader and a future case. I do want to point out that the ANOVA-after-jackknife is semi-parametric. The ANOVA piece is parametric but the jackknife piece is nonparametric.
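The reader-and-case resampling described here can be sketched as follows. The per-reader, per-case scores are simulated stand-ins for the study's figure of merit, not the actual data, and the percentile interval is one of several bootstrap confidence-interval constructions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-reader, per-case scores standing in for the study's
# delta-Az computation: 15 readers x 90 cases, centered near .024.
n_readers, n_cases = 15, 90
scores = rng.normal(0.024, 0.1, size=(n_readers, n_cases))

def bootstrap_ci(scores, n_boot=1000, alpha=0.05):
    """Resample readers AND cases with replacement, recompute the mean
    figure of merit each time, and take percentile confidence bounds."""
    n_r, n_c = scores.shape
    stats = np.empty(n_boot)
    for b in range(n_boot):
        r = rng.integers(0, n_r, size=n_r)   # random sample of readers
        c = rng.integers(0, n_c, size=n_c)   # random sample of cases
        stats[b] = scores[np.ix_(r, c)].mean()
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return scores.mean(), lo, hi

mean, lo, hi = bootstrap_ci(scores)
```

Because both readers and cases are resampled, the interval reflects variability from a future reader reading a future case, which is the point the speaker makes.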
So these are the confidence intervals for the ANOVA versus the bootstrap. You can see that the confidence interval for the ANOVA is a little bit tighter; for the bootstrap it's a little bit broader. One of the things the bootstrap is known for is being able to come up with confidence intervals that are not actually symmetric about the mean, because often there is not really any reason to believe that the confidence intervals would be symmetric about the mean. In this case you can see it actually goes out further on the CAD-better side. Even though the confidence interval is wider, it does not in any way diminish the results.
So returning again to the primary analysis, the primary analysis, as I showed you earlier, is based on a delta Az of .024 and a p-value of .003. I just showed you a different methodology using the bootstrap and came up with .0246, very close, and a p-value of less than .001.
Then we went on to a different reference truth. The different reference truth that I'm talking about here, and I apologize that it's not on the slide. We didn't want to make it too dense, but this different reference truth is majority so this means that a quadrant would be considered nodule present if there was at least one majority or consensus nodule and it would be considered nodule absent if it did not have any majority nodules in it.
A really important thing to point out here is that the majority quadrants, the ones that two out of three radiologists on the panel considered to be actionable, are included in every single analysis. That means that when we're talking about the unanimous truth, they go into the false positive side of things, as somebody called it.
On the other hand, if we talk about this reference truth, they go into the true positive side. We felt like we don't know if those are nodules or not, and so the most conservative approach to take is to always put them in every analysis.
The delta Az here is a little bit lower, but the p-value is actually more significant, to use a loaded term. This has to do, I think, with the sample-size paradox that Bob Wagner was describing earlier. The final step was to do the random reference truth.
We did the random reference -- actually, before I go to that, I want to mention that for the different reference truths, in addition to majority and consensus, we also looked at a minority reference truth, which is sort of the loosest possible standard we could come up with.
We also did a tighter truth based on having a second panel of five people look at the cases and define the truth more tightly. In all four of those cases we came up with a similar statistically significant result. So the random reference truth is based on picking two panelists at random to review each case.
Pretend that the three-member panels didn't exist. Redo the truth assuming that third person just wasn't in the room. When you bring together the first-pass findings, their data doesn't come in. When you go to the second pass, it's only the two-out-of-two consensus. This allowed us to come up with confidence bounds that captured that piece of the variance. It ended up being fairly similar, although the delta Az is somewhat diminished from that of the primary analysis.
So all variations gave statistically significant results. I'm a statistician, so that's what I know best and that's what I'm best prepared to talk to you about. I take the point of some of the panelists -- by panelists here I'm referring to you all, as opposed to any of our other panelists.
You want some sense of what it all means. What does this delta Az of .02 mean? For myself, I find it useful to think about individual operating points. This is the pooled curve where we pool all of the readers together. You can't really translate this to a new reader and a new case.
These are analyses that you don't do to find statistical significance or to get a particular confidence interval or particular estimate. They are analyses you do to try to understand the data. They were analyses that we put in our protocol that we would be doing, but they were secondary analyses, just to try to get some sense of what's going on here.
So this is the operating point of 20. Recall that we have this 0 to 100 scale, so 20 reflects sort of the most aggressive end of the spectrum. We could go all the way out to 0, but 0 is just all the way at that end. Twenty was an area where you could imagine a fairly aggressive reader would say, "Even for a 20 I might want to do some kind of follow-up." Fifty was indeterminate on our scale, so that is one operating point that is interesting to look at. Eighty would reflect sort of the least aggressive reader. This is by no means all readers. If I put this plot out with all 15 of the readers, you would get sort of that weird scatter plot similar to what you saw earlier, but this is just to get a rough sense of what kinds of improvements are maybe plausible.
So this dotted vertical line here is the line that corresponds to having the same false positive fraction. This is saying that if you started out at 50, your sensitivity could increase by this much without sacrificing your false positive fraction at all. Not one iota. If you think of the false positive fraction as your measure of safety and you think of the true positive fraction as your measure of efficacy, that is saying you can go up and get efficacy without any safety tradeoff.
Now, it's probably more likely that people are going to go a little bit up and over, so maybe they are going to call more things. That's what we see with our individual ratings. You can go up and over and still have the same positive predictive value. Even though you are giving up a little bit on the false positive fraction, you still have the same positive predictive value.
This 50 here is still a little bit over from that, so it's not exactly the same positive predictive value, but the basic point is that you can go up and over without having a sacrifice, or without having a substantial sacrifice.
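The "up and over with the same positive predictive value" point can be checked with a little arithmetic. The two operating points below are hypothetical; only the 75 positive quadrants out of 360 total (90 cases times four quadrants) come from the study description.

```python
def ppv(tpf, fpf, prevalence):
    """Positive predictive value from the true and false positive fractions
    at an operating point and the prevalence of nodule-present quadrants."""
    tp = tpf * prevalence
    fp = fpf * (1 - prevalence)
    return tp / (tp + fp)

# Hypothetical operating points: moving "up and over" from pre-CAD to
# post-CAD raises sensitivity and false positive fraction together in a
# way that leaves PPV unchanged (here the TPF/FPF ratio is held fixed).
prev = 75 / 360          # 75 nodule-present quadrants of 360 (90 cases x 4)
pre  = ppv(0.80, 0.10, prev)
post = ppv(0.88, 0.11, prev)
```

Because PPV depends only on the ratio of the true to false positive fractions at a given prevalence, the two hypothetical points yield the same PPV, which is the geometry of the "same-PPV line" the speaker describes.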
So these are the analyses that I mentioned. They were in our protocol as analyses that we were going to do, but I really am very sympathetic to what Bob Wagner said about these numbers. It's so hard to say what they mean. What are these numbers? I don't want anybody to run too far with these numbers, but I do feel like it's necessary, especially for people who aren't statisticians, to want to understand what's going on with some of the raw data.
If we take 20 as the threshold -- pretend that all readers treat 20 as their criterion for actionability -- then 16 percent of the total would have corresponded to misses: there were 1,125 positive quadrant readings that the 15 readers looked at, and 16 percent of those would correspond to misses. With this very aggressive cutoff I think odds are those are, in fact, observational oversights.
Post-CAD that goes down to 11 percent so the 16 percent versus 11 percent, that's a 30 percent reduction in misses at that threshold. Now, that is a very aggressive threshold. Probably most readers aren't at that threshold. Fifty might be closer to where most people are at. It goes from 20 percent down to 16 percent. That's a 22 percent reduction in misses.
Then finally, if we imagine that 80 is sort of a higher-end threshold for what might be called a miss, there is still a 15 percent reduction in misses. Now, these numbers are presented without confidence intervals, without p-values. Take them with a grain of salt. But in terms of understanding the potential clinical importance, I think this may satisfy some of the desire to see a number other than just the delta Az.
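The percent-reduction-in-misses arithmetic can be reproduced from the quadrant counts. The 228 pre-CAD missed quadrants out of 1,125 positive quadrant readings at the 50 threshold are quoted later in the discussion; the post-CAD count of 178 is a hypothetical figure consistent with the roughly 16 percent quoted.

```python
def pct_reduction_in_misses(pre_misses, post_misses, total_positive):
    """Relative reduction in missed positive quadrants, where each rate is
    misses / total positive quadrant readings (15 readers x 75 quadrants)."""
    pre_rate = pre_misses / total_positive
    post_rate = post_misses / total_positive
    return pre_rate, post_rate, (pre_rate - post_rate) / pre_rate

# 228 pre-CAD misses is from the transcript; 178 post-CAD misses is a
# hypothetical count matching the ~16 percent post-CAD figure.
pre, post, reduction = pct_reduction_in_misses(228, 178, 1125)
```

With these counts, the pre- and post-CAD miss rates round to 20 and 16 percent, and the relative reduction rounds to the 22 percent cited.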
I also wanted to show you what happens if we look at the true positive fraction and we look at the false positive fraction in a way that is probably more similar to the way that a lot of academic studies are done where you look at the cases where you are most likely to see an effect on the true positive side and you look at the unambiguous nodule absent quadrants on the other side.
Here I really am throwing out quadrants. As a statistician I hate to throw out data, but I'm throwing them out just to get a clearer idea of what's going on here. So we are looking at the true positive fraction just for the smaller nodules -- and they are not really small; I think a lot of people would define small as less than 4 or less than 3 -- but the intermediate-size nodules serve as a proxy for difficult-to-find or easily overlooked nodules. Then you can see that you get more of a rise in the curve without quite as much of a tradeoff early on in terms of the false positive fraction. This is an analysis that was not included in our protocol. It's just something that I added to try to get a little bit more understanding of what is taking place here.
So the study conclusions. Again, the study conclusions go back to the primary analyses that we did and the robustness analyses. The study conclusions are that the ImageChecker CT improves reader performance for the detection of actionable nodules. That was our objective and that's what we feel that we demonstrated. And specifically, the results are robust to the analytical methodology and to the choice of the reference truth.
Again, it wasn't just looking at consensus and majority. We looked at minority, majority, consensus, and sort of a super consensus. Then it is also robust to the additional variation associated with selection of panelists. I described identifying two random panelists. We also did it with a single random panelist, with three random panelists and came up with very similar results.
With that, I'll turn it over to Dr. Delgado. Thank you.
DR. DELGADO: Thank you and good morning. I am Dr. Pablo Delgado. I'm clinical associate professor of radiology at the University of Missouri, Kansas City. I also practice at St. Luke's Hospital. I'm here to describe the beta experience that we're involved with.
First of all, I'll tell you a little bit about where I practice, the setting where the beta site study was performed. We are a private institution affiliated with the university. We have a hospital setting as well as an affiliated imaging center adjacent to us. We practice with residents available, and we have an on-site residency training program of which I am the program director.
Our patient base is quite varied and, I think, rather commonplace for the region. It's a typical midwest community base of private as well as community patients. As for the CT equipment for our radiology department, we currently have two four-channel multi-detector CT scanners, which happen to be GE LightSpeed QX/i scanners, although I don't think that's of importance to this device as long as it's DICOM data and meets the collimation thickness.
We currently perform anywhere between 20 and 30 CT studies of the chest a day, for different diagnostic indications including CT pulmonary angiography, high-resolution CT of the chest, detection of other lung diseases, as well as multi-organ disease workups.
The beta study that we performed was between June and August of 2003, for a total of eight weeks. We processed numerous studies. However, the goals of the study that we agreed upon and embarked upon were to assess the functionality of this ImageChecker CT CAD software and how we would work with it; to answer the R2 development group's questions about radiologists' preferred reading practices as well as work flow issues of how this would be incorporated into our practice; and to determine future training needs for radiologists in how to use this device. It should be noted that we were not asked to assess the clinical effectiveness of the CAD system.
The design of the study involved retrospective review of CT chest cases from our institution from previous months that had already been acquired and interpreted outside of the study, that met the collimation thickness which, I think, was already mentioned, 3 mm or less, and that were contiguous slices of the chest.
The cases were read by faculty radiologists as well as residents, so we got feedback from both experienced radiologists as well as radiologists presently in training.
For training on the device, we had an R2 application specialist on site for an entire day who got to work with most of the radiologists. The few who were not available at that time were given the training subsequently by those who had received it from the application specialist. That training process involved a description of the CAD algorithm, what it does and what it doesn't do, with a review of the manual.
We also reviewed several institutional cases. First, R2 had some cases of their own. Then, through the DICOM hookup, we were able to push some of our cases to the R2 device and process them, so they were our cases. We also performed shadowing of retrospective reading sessions, where the radiologists were able to work with the CAD device and subsequently ask questions if they felt it was necessary or encountered any issues.
Our observations from using the beta product demonstrated that most radiologists, in fact all, demonstrated a rather rapid learning curve for using the CAD device. In a rather short period of time most people felt very comfortable in utilizing the product as is intended.
We encountered no specific technical errors or malfunctions. We had no difficulties. We did, indeed, use it in the way it was intended and we asked radiologists to first look at the case in a soft copy reading mode and then subsequently push the CAD button and activate it and then review it immediately thereafter. We found that all radiologists missed nodules that were detected by the CAD.
There certainly are false CAD positive marks as Dr. Castellino pointed out. However, most of these are easily dismissed by radiologists and that includes both faculty and residents.
Of course, I would agree with the comments made by other panel -- excuse me, other presenters from R2 that we feel radiologists definitely should review all images initially without CAD and then do a subsequent read with CAD. The reasons for this are that, No. 1, CAD is not really made to detect every single nodule and, No. 2, the algorithm is such that it does not detect every single lung abnormality, and radiologists are still responsible for detecting any lung abnormality.
In conclusion, I think that this product is very timely given what radiologists are facing on a daily basis. The development of multi-detector CT has led to an explosion, if you will, or a significant increase, in the number of very detailed images that radiologists are asked to interpret.
Numerous published studies have already documented that there are limitations in radiologists' ability to detect lung nodules. I believe detection really is the limiting factor in eventually determining actionability, whether that is related to further diagnostic, therapeutic, or interventional workups. We found CAD to be an effective tool in assisting the radiologist in the detection of lung nodules with multi-detector CT.
I will now reintroduce Dr. O'Shaughnessy of R2 Technology.
DR. O'SHAUGHNESSY: Thank you very much. I just have a couple of summary slides kind of to bring it all together at the end. I just wanted to reiterate the main conclusion from our clinical study: for multi-detector CT exams of the chest, the ImageChecker CT CAD software system significantly, at a p-value of .003, improves radiologist ROC performance for detecting solid pulmonary nodules between 4 and 30 millimeters in size.
And as both Mr. Miller and Dr. Castellino talked about, and Dr. Wagner this morning, we feel that is a good measure -- a reasonable measure -- for evaluating both the safety and efficacy aspects of the product. Also from the safety aspect, the product is intended to be used as an adjunctive device, and with appropriate training we don't think there are any issues there.
Just to summarize, I'll put up again the same slides of the proposed indications for use. We thank you very much for your attention.
DR. IBBOTT: Thank you, Dr. O'Shaughnessy.
We are going to have time this afternoon for detailed discussion of this presentation but let's take a few minutes now to see if there are any questions for the previous speakers or clarification that's needed.
DR. STARK: I have a few questions. Other panelists, please jump in. Dr. O'Shaughnessy, thank you. By the way, it was a fabulous presentation.
DR. O'SHAUGHNESSY: Thank you.
DR. STARK: Very interesting subject and I think everyone is interested in seeing this technology succeed. Certainly I am so forgive me. Some of my questions are, I guess, by nature going to be -- are intended to be challenging.
Mr. Miller talked about, as the panel did, the word significant -- he said significance is a very loaded term. Later on, when we discuss the marketing materials and things like that, I'm worried about the pressures on radiologists to buy and use a technology, and I want to shift the significance to what really is clinically significant. In your presentation -- I believe several of your experts pointed this out -- the real clinical problem is that we're missing nodules at a significant rate. I think it was something like 24 percent, perhaps you can refresh me, that were seen in retrospect.
One significant figure of merit here would be what fraction of those nodules that are missed -- that 24 percent that are detectable in retrospect -- are now detected with this technology, given that the technology by itself has a sensitivity of about 50 percent for detecting majority and unanimous nodules? I'm just asking. It's very, very low.
That would suggest to me that at best the technology is going to reduce that 24 percent miss rate to about a 12 percent miss rate, at the cost of generating 100 percent false positives and then having a radiologist comb through and sort all this out by basically being told, "Do it again."
I'm wondering, if we had a placebo in this FDA trial of, "Radiologist, just do it again," or, "Here is the sugar pill. Just read it again," would we achieve the same presumptive 50 percent improvement in finding half of the lesions we know the current standard of care is to miss?
DR. O'SHAUGHNESSY: Right. I would like to answer that sort of in two parts. The first part I would like Dr. Miller to go over what we measured in our study and then have Dr. Castellino talk about translating that to the clinical environment if that's okay.
MR. MILLER: I guess there were a number of questions there. Is there one you would like for me to start out with?
DR. STARK: I think you will do a great job.
MR. MILLER: Okay. So the analyses that I showed at the end, with the percent reduction in misses -- sort of approximated percent reductions in misses -- were an attempt to get at that very issue. I suppose that it is to some degree your job and, to some degree, our job to determine what is clinically significant.
Now, the numbers that I showed you were sort of in the range of a percent reduction in misses of somewhere close to 20 percent. Actually more like 20 percent on the low end. That is similar to what the experience has been with CAD for mammography.
For CAD in mammography the percent reduction in misses has been in that range. I think if you are a person that's affected -- I guess I'm drifting off from statistics here. I should have handed it over to a clinician but, I mean, my hunch is that is a number that would be meaningful.
As far as the stand-alone sensitivity, I do want to sort of bring us back to the fact that we evaluated two modalities here. The two modalities that we evaluated were the readers stand-alone performance and the reader plus CAD. The whole MRMC framework is developed around those particular modalities.
CAD as a stand-alone modality is not something that anybody is recommending that people use. Therefore, those stand-alone numbers, I think, are less valuable but are more valuable if they pick up some of the more important things.
Also, I think some of those things in the 4 to 10 millimeter range are ones that readers react to and say, "Oh, I missed that. I'm glad CAD pointed it out." It's more about what CAD finds than it is about exactly what the percentage is.
DR. STARK: Did you answer the core question? The current standard of care, I would suggest, and clinicians can debate this, is that we miss a quarter of the lesions that are actually there in retrospect. If we can accept that as a statement, then, as you designed the experiment, what data are there to suggest we would cut that miss rate, and by how much?
MR. MILLER: Will you permit me to go back to the slide? Sorry. I'll get there soon. Okay. This, again, is presented as an analysis that was specified in the protocol that we would do, but you don't have confidence intervals there, so these are numbers that you would want to put confidence intervals on if you were going to put a lot of weight behind them.
Also, they make the presumption that readers all read with the same threshold cutoff, and we know that's not the case. At a threshold cutoff of 50 -- let's focus on 50 for just a second -- there were 228 missed quadrants. In other words, out of the total number of quadrants that the radiologists looked at: 75 positive quadrants times 15 readers, so there are 1,125 instances in which one of the readers looked at a positive quadrant.
They gave a rating less than 50 about 20 percent of the time. That is actually kind of a nice number because it is not radically different from what I think we see in the literature. It may be a little bit lower. I think there's a little bit of a relaxed environment in the readings, so they may be a little bit more likely to identify things. But in 20 percent of the quadrants something is missed.
Post-CAD it goes to 16 percent so that's a 22 percent reduction in the misses. That is, I think, the number that is closest to answering the question that you raised. Is that correct?
DR. STARK: I think so. Let me see if I understand it, and then I'll ask you about the effect on this analysis of the quadrant versus the lesion methodology.
MR. MILLER: Okay.
DR. STARK: I think that prejudices things in favor of the technology. I'm not sure. So you're saying, if the standard of care currently is to miss a quarter of lesions, then of that 25 percent we'll miss one-fifth less, so now we'll miss 20 percent of the lesions.
MR. MILLER: Yes. A miss here is defined loosely as not actioning a nodule that a consensus panel believes should be actioned. I don't think that they are actually missing it in every case. Sometimes they are giving it a low rating.
DR. STARK: Correct. But as far as --
MR. MILLER: Yeah.
DR. STARK: You can debate the inference, but the literature talks about a miss rate of 25 percent, which we are going to equate with actionable nodules. As we talk about the apparent efficacy of this, and I appreciate your honesty, we are taking a standard of care of a 25 percent miss rate that juries and patients think is horrible in retrospect, and we are going to cut that to a 20 percent miss rate. We can judge the -- that's the efficacy.
MR. MILLER: I should also add this is just based on jumping from one 50 to the other 50 on the curve. We did another set of analyses based on what happens if you jump from 50 to the other point on the curve where you -- I'm sorry.
I should say jump from 20 from one point on the curve to the other point with the same PPV, and jump from 20 to the same point without sacrificing the false positive fraction. That also was a protocol-specified analysis, and the numbers go down a little bit. I don't remember how much, but it may be five or 10 percentage points.
DR. CONANT: May I interrupt or just jump in for a second, because you are on the slide that I'm curious about. You mentioned it's similar to mammography. It is, but it's so different. I'm very interested in the by-case analysis of this compared to by-quadrant. The reason being I think you have a little bias in your case selection and I'm not sure if that is okay or not.
You have the majority of your cases, 62 percent of the nodule present cases, as people with extra-thoracic disease. I'm not sure I really care about the absolute number of quadrants you've missed because once you've got three nodules in both lung fields, who really cares? It's metastatic disease so I would want to see these numbers by case.
I also think the comparison to mammography is very different because I think that, again, chest analysis is much more multi-focal and reflective of systemic disease than mammography in terms of a bilateral fairly somewhat independent process. I would just like your comments on that if you could take this another step and then do it by case.
MR. MILLER: We did not do these analyses by case. I suppose the data are there to do it. I think the challenge with doing it by case is that the way -- I should let a physician get up here in just a second but the way that one would action a case where you had one lung where you had a very high likelihood of it being something bad, using my simple statistical language, and you had the contralateral lung where you had something that was probably bad. That one that's probably bad may actually be the one that drives the care of the patient.
Figuring out how you sort of wrap this all up and do something like this at the patient level was something that was sort of beyond the scope of what I was able to imagine. I absolutely do not disagree that it's something that would be useful to try to investigate in some way. Having said that, I think I really need a physician to answer the question.
DR. CONANT: I'm not sure what the answer is, though. However, in your cases it's very different if a person -- if you're looking for a primary lung carcinoma versus metastatic disease so they are very different clinical questions.
MR. MILLER: Yes. Let me let Dr. Castellino answer that.
DR. CASTELLINO: I'm not going to answer any statistical questions. I can guarantee you that. It is hard to answer that question. I would like to put it more in a clinical context of how we read cases every day.
I agree that if you have a patient with a soft-tissue sarcoma and you find three, four, five nodules, unless you are in a setting where you have surgeons who aggressively pursue that, as I was at Sloan-Kettering, at times it is important to find a sixth or seventh nodule. There is a spectrum of surgical behavior.
Let's assume that if you find six or seven you don't have to find the last three. We had very few cases like that. The second thing is that we are not positioning this product as a lung cancer detection product, although it does work that way. For patients with lung cancer who had a nodule, it was not necessarily the primary lung cancer. They may have had lung cancer before, treated post-op, post-radiation.
We accepted those cases that had a lung nodule in the lung for whatever reason, so it wasn't really a primary detection issue. I'm not sure I answered that completely, and I do recognize that certainly mammography is quite different, as I think we have discussed before, than chest CT.
I would like to go back to a couple of comments you made. If I understood you correctly, I think you said, Dr. Stark, that the issue was that we had a 50 percent sensitivity for consensus nodules. As I recall from looking at that, I think, with consensus we were closer to 80 or 83 with the classic nodule definition. I'm looking at the -- you'll see that later with Petrick.
If you stratify those nodules by what would be more of a definition that radiologists would call a classic nodule, it ranges from 83 to 59, I think is the number. Is that correct?
DR. STARK: We can study it but I'm trying to draw data from table 10. When I suggested 50 percent, it was based on this so maybe over lunch you can --
DR. CASTELLINO: We can go through it. I thought it was about 59. But I think it's a good point. We would love to have developed an algorithm, to be very honest, that was 100 percent sensitive, but this is the best we've come up with so far. I think the issue to me as a clinical radiologist is how this would affect me or my colleagues in practice in finding more nodules that we look at a year later and say, "My goodness. How did I miss that? Why did I miss that?" The ROC study, to some extent, I think, approaches that. I think this table here to some extent would also address that. These are nodules potentially that could be missed or are missed where the radiologist would say, "I would have liked to have seen that nodule to make a decision as to whether or not it's actionable." I don't know if I'm addressing the myriad of questions that you had, but if you can rephrase some of them I would like to try to answer them.
DR. STARK: If the chair and the panel think we have time.
DR. IBBOTT: Let's wait until after lunch and we'll have that detailed discussion this afternoon.
DR. CASTELLINO: Can you write them out so I can think about them?
DR. STARK: I'm not sure of the protocol. I'll ask for advice.
DR. IBBOTT: I don't think there is any reason why you shouldn't present those questions and let them think about them over lunch.
DR. CASTELLINO: That would be very helpful because they are a lot and I think they are important questions. Thank you.
DR. IBBOTT: Again, I'll take this opportunity to ask Dr. Mehta if he has any questions that require clarification at this point.
DR. MEHTA: No, I don't.
DR. IBBOTT: All right. Thank you.
DR. SOLOMON: Do we have time for any more questions?
DR. IBBOTT: Well, certainly. Especially if it's appropriate now to get clarification on something before we break.
DR. SOLOMON: I guess I have a couple of questions for Dr. Delgado. I'll start off by asking you a little bit more about what your experience was with the system, and then, more specifically, did you find that you as a radiologist or any of your colleagues were becoming more dependent on the CAD system and not quite giving it the same kind of read that you would give ordinarily? Also, what was the impact on the time that you spent on a case? Did it make it longer or shorter? Why don't you answer those.
DR. DELGADO: Okay. Thank you. I think those are good questions. First of all, we did not do any time analysis with and without CAD or separate, just soft-copy interpretation and then soft-copy interpretation without CAD and then subsequently with CAD.
I think it goes without saying that if you are doing a second review there might be a time factor that would be slightly increased, and that may be something to be quantified. However, in my experience, to the first question, people were instructed through the training phase that this device was to be utilized after a primary read in which you make decisions on whether you see or detect a lesion, and then there is a way for you to mark it. Then you activate the CAD and then you go through, as Dr. Castellino said, really not the whole entire study again but only those images where it identified a lung nodule. It might be on average three per case or so, where you might click on a button and that would take you immediately to that axial image and show you a lesion, on which the radiologist would then make a decision: "Did I miss this? Is this a significant mark that I would consider actionable?"
Or, if not, then easily discard it and be done with it. If it was a mark that is considered a false positive, that would be discarded easily. I think we did have a few of our radiologists who initially asked the question, "Well, is this benign or malignant?"
Yet, we made sure, and I as the principal doctor in charge of this made sure, to remind them that this was not the purpose of this device. It's really only to present you with a nodule that you may have missed and give you the ability to either add that to your findings or completely discard it. Does that answer your question perhaps?
DR. KRUPINSKI: This will probably be more for Dave. One point of clarification: you've got a quadrant, and suppose during the initial view the reader says there's nothing there. There really is a nodule, and then the CAD comes up and points out the nodule and a false positive.
Now the reader increases their confidence. Do you consider that in the analysis, and how can you be sure? Do you consider that a true positive and an increase in behavior when, in fact, the radiologist was looking at the false positive? Is there any way without localization to establish that? If you were then to take your cases and throw away any instances where the CAD marked a true and a false positive and the reader went from "false negative to true positive," what then happens to the ROC curves? Admittedly, although you've got statistical significance, those curves are pretty darn close and you've got these ambiguous cases now. How do you deal with that?
MR. MILLER: Well, the short answer is that we don't know precisely what happens in those instances. It was not captured. Bob Wagner talked about this best of both worlds scenario. We really tried in the way that we did the study not to take the readers out of their normal reading environment.
We felt that was very important and so capturing additional data was something that we thought could take them outside of their reading environment and create some kind of placebo effect essentially. We don't have that data on which one of the nodules or which one of the findings, I should say, which one of the CAD marks they are reacting to.
Now, having said that, after we completed the ANOVA-after-jackknife analysis, you can pull out from that analysis which cases are the ones that were most favorable in terms of producing a CAD effect and which cases are least favorable, in terms of producing a CAD-worse effect.
I sat down with a dozen or so of those cases with Ron Castellino, our chief medical officer, and went through them and said, "Is it obvious what they're reacting to here?" In the overwhelming majority of the cases it was obvious what they were reacting to.
The number of marks per case is small enough that it is fairly unlikely -- I should say fairly. The case where you have multiple close to positive findings in a quadrant is not very common. It's common to have two in a quadrant but most of the false marks are very easily dismissable.
I mean, our engineers hate it when I say this, but there are some vessels. I mean, not being a statistician, I look at it and I say, "That's a vessel." So for the radiologists, it's really easy for them to dismiss those.
I guess the short answer is we did not do the analysis that you are suggesting but I completely take your point that it's important to figure out what was really going on in the ratings. I think I have a pretty good feel for it that they were reacting to true positives.
DR. KRUPINSKI: So you rate them all as true positives?
MR. MILLER: Yeah. I mean, the only thing that -- I mean, just from a programming perspective, the only thing that is fed into the analysis is the truth for the quadrants and the ratings. Whether there were or were not CAD marks there is not actually in the analysis.
You could do an analysis that was more of a parametric model, a fixed-effect model, where you tried to capture whether it was the quadrants with CAD marks that were causing the increase, but I think it's reasonably obvious that they are, and in trying to model that it gets pretty messy building that on top of the models that we already did.
Just while I'm up here, I did really quickly want to comment on the issue about the sensitivity, the back and forth about that table. I think you were doing a weighted average of some numbers in a table and we'll come back to that later, I think.
The sensitivity number -- I mean, it's just incredibly variable depending on sort of which reference truth you use and so if you hear different numbers going back and forth, it's not necessarily inconsistent. Two people may actually be both reading sort of off the same page but in a slightly different spot on the page. Thanks.
DR. IBBOTT: Thank you. At this point Dr. Stark has a couple of questions he's going to raise now to be discussed later this afternoon.
DR. STARK: Actually, it's a response to Dr. Castellino's question, which I respect and is fair. I have been working very, very hard at this because, as we'll discuss later, I have spent 15 years wondering why my ROC-based prediction in 1985, that MRI for detection of liver cancer was significantly better than CT, was wrong. I think I know why, and I think this group here, the industry group and the panel, is at the nub of it.
Dr. Castellino, rather than give this the formality and importance of scratching on pieces of paper, I've asked the chair to allow me to read. I've formed a question and I'm going to read it into the record, and I'll give you my handwritten copy of what I'm going to read just so that we're clear on this. Forgive me; you've seen me scrambling over three minutes here. If any of this is unclear, I'll rephrase it. Thank you for offering to do this. Would you please calculate, from the data and/or literature discussed or presented here today, and in your submission, the net decrease in false negative rate, which we have here today estimated to be 24 percent, for practicing radiologists working by themselves when those radiologists in the future, we're projecting, add this technology and these results, these data, to their practice; specifically accounting for what Dr. Conant was just asking about, that is, accounting for, and not crediting as a detection or improvement with the addition of CAD, those quadrants or patients, as you compile the data, where CAD marked a false positive lesion in a quadrant where the radiologist alone had a false negative.
Where that radiologist, in other words, failed to recognize a true lesion, a false negative for the radiologist, that was not subsequently marked by the CAD.
I have this written down. I think that translates into English and I would be happy to clarify. Feel free to grab me during lunch if there is some nuance of that that would make a better question.
DR. IBBOTT: All right. Thank you. At this point then, we'll call this session to a close and break for lunch and we will reconvene at 1:15, just a little less than an hour. Thank you.
(Whereupon, at 12:21 p.m. off the record until 1:18 p.m.)
DR. IBBOTT: Could I get you to take your seats, please, and we'll continue. Thank you. I would like now to call the meeting back to order and I would like to remind public observers of the meeting that while this portion of the meeting is open to public observation, public attendees may not participate unless specifically requested to do so by the chair. At this point Mr. Doyle has a statement to make.
MR. DOYLE: Yes. R2 has approached me and indicated that they have developed answers to the questions that Dr. Stark proposed at the end of the morning session. In an effort to keep the meeting moving with the schedule we have, I have asked them to present those answers at the beginning of the discussion section this afternoon. They have the answers ready and I would just ask, for the flow of the meeting, that they present those at that time. Thank you.
DR. IBBOTT: Thank you. We will now continue with the FDA's presentation on this PMA which will be introduced by Dr. Phillips.
DR. PHILLIPS: Well, in case you forgot what we're doing over lunch, we are discussing the ImageChecker CT CAD by R2 Technology. It is a system that analyzes and displays multi-slice CT exams of the chest to assist radiologists in their review and in the detection of solid pulmonary nodules.
It is composed of several items. It's a combination of software and a computer. The system is a workstation, the ImageChecker CT Model LN-500, which was approved for marketing under 510(k) K023003, and the software, which is the operating system for the product that we are looking at today.
Again, the indications for use, and I don't need to read those. This was reviewed within FDA by a rather extensive team. Michael Kuchinski was the team leader; William Sacks was the clinical reviewer; Teng Weng was the statistics reviewer; Robert Wagner and Nicholas Petrick reviewed the analysis methodology; Joseph Jorgens reviewed the software; Larry Stevens did bioresearch monitoring; Fleadia Farrah did the manufacturing, that is, the quality systems regulation; and Ronald Kaczmarek reviewed it from an epidemiological basis.
Two people will present to you today, Bill Sacks and Nicholas Petrick, discussing the PMA. The other reviews were all found to be satisfactory and we are moving on from there.
With that, Bill Sacks.
DR. SACKS: I apologize for the jaundiced look of that. It wasn't so bad in the rooms we were testing this in. Okay. I'm going to just give some background. Then Nick Petrick will present the data from the clinical study and then I'll come back and draw some conclusions.
The outline of my introductory comments: I'll say something about the character of the device, for those of you who did, in fact, forget over lunch; something about the clinical utility; a point about the instructions for use; and some issues that are new to this particular PMA.
First, on the character of the device. Just to remind you, this is for chest CT scans, and for CTs that are done for any indication, the algorithm is trained to detect solid lung nodules, not, for example, ground glass opacities. It is trained to detect nodules between 4 and 30 mm.
Also there was a Hounsfield unit cutoff, which is just a CT number; the amount of radiographic attenuation needs to be above -100. In particular, this is a computer-aided detector. Just to say a word about the difference between computer-aided detection and computer-aided diagnosis, a point I made earlier.
The difference between detection and discrimination lies not in the instrument but in the clinical use to which it's being put. The detector system, which is what we're talking about today, this left-hand column, scans entire images whereas a discriminator only scans portions that are selected by the user. The detector marks the images where a discriminator will give a level of suspicion that is just a number. As I say, the same device will do both but it is thresholded to give you marks when it's acting as a detector.
On clinical utility, as we've heard, many nodules are missed in clinical practice for two major reasons: other pathology distracts, and hundreds of images are present in one CT of the chest. Indeed, you may start out as a board certified radiologist and after reading 500 images you are certified bored.
This CAD is intended to reduce the missed nodules. That is, it is intended to increase the user's sensitivity in detecting lung nodules. We will come back to this point.
Instructions for use. The important points are that the reader should review the films unaided first. Then the CAD marks the candidate nodules. Then the reader looks again in the vicinity of those marks.
If the CAD fails to mark a nodule that was judged actionable on the initial unaided review, the instruction in the labeling reads that the reader should retain that initial judgment, not back off just because the CAD failed to mark it. We will come back to this in my closing comments.
Issues that are new to this PMA are the particular choice of target for the CAD algorithm, the definition of truth, the unit of analysis, and the endpoints. I'll say something about each of those.
First, on the CAD target, the target is not malignant nodules but actionable nodules as we've heard which, among other things, means that the definition of truth is not based on biopsy or tissue histology which would be an external standard, but rather based on the judgment of an expert panel that is an internal standard based on the very images that are being evaluated here.
The unit of analysis, as we've seen: at one level the statistical unit is the person, but it's further broken down into lung quadrants, and Nick Petrick will say more about that.
Finally, the endpoints. One could do entire ROC curves, as was done, and one could, as Bob Wagner explained this morning, in addition or instead, do the sensitivity and specificity of a particular action recommendation, which was not, in fact, done in this particular study.
In summary, again, just to remind you, the clinical study consisted of three expert radiologists, drawn from a group of 11 but three at a time on a panel, to determine what was called by the company reference truth for each nodule. Then there were 15 completely different radiologists with a range of experience, not necessarily experts, that were called the readers. All 15 read all 90 cases, with the 90 subjects divided into 360 lung quadrants. Those 15 readers used a 100-point scale for a confidence and actionability rating for each case.
Now I'll introduce Nick Petrick who will give you the clinical data.
DR. PETRICK: Okay. So my name is Nick Petrick and I will go through -- let me see which one of these works. I'll go through the clinical results that were done by the sponsor and some of our perspective. The outline of my talk will be first to talk about the applicability of Az in the analysis. Here I'm using the term Az, which is somewhat more of a technical term, but this is the same as the area under the curve or AUC. Other people may call it the area under the curve or AUC, but I'm going to use the terms as meaning the same thing here.
I will also talk about and somewhat review what the sponsor presented on the pool of cases used for the clinical study. I'll talk about the definition of actionable nodules by the panel of experts. Then I'll go into the particulars of the clinical study.
In particular, I'll talk about the primary analysis, which was the analysis using a fixed panel of experts, and then, what is of some importance here, the secondary analysis, which was the analysis using random panels of experts.
Then I'll finish up my presentation by talking about the measurement of CAD stand-alone performance. When I'm talking about stand-alone performance, this is the algorithm performance with no reader involvement.
Okay. So for the applicability of the Az here, I show one of the sponsor's curves for the average reader ROC from pre- to post-CAD, and this had a change in the area under the curve of .024 and a p-value, as shown there, of .003.
What's important to note about the applicability of the Az is that the green curve here is the pre-CAD and the reddish curve is the post-CAD, and what we're looking for is that the two curves don't cross. That is an important consideration if we are going to use Az as an overall performance measure for ROC analysis. What we find from this average curve is that generally the post-CAD curve is higher than or on the same order as the pre-CAD curve.
So just to summarize this, the pre- and post-CAD curves did not cross in the average performance I showed before. I think, more importantly, there was no substantial pre- or post-CAD crossing in either the average or individual ROC curves. This is important. That makes the Az a statistically appropriate performance measure for this type of analysis. If there had been a significant crossing, we would have had to look at some sort of partial area or some other measure of performance in that situation. Because of this conclusion the sponsor used Az as the figure of merit in all the analyses that follow.
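As a concrete illustration of the figure of merit being discussed, the empirical Az (AUC) can be computed in its Mann-Whitney form: the probability that a randomly chosen positive quadrant is rated higher than a randomly chosen negative one, counting ties as one half. The ratings and labels below are hypothetical, not the study's data.

```python
# Empirical AUC (Az) via the Mann-Whitney statistic: the chance a
# positive quadrant outranks a negative one, with ties counted 0.5.

def empirical_auc(pos_ratings, neg_ratings):
    wins = 0.0
    for p in pos_ratings:
        for n in neg_ratings:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_ratings) * len(neg_ratings))

# Hypothetical 0-100 actionability ratings for positive and
# negative quadrants (illustrative only).
pre_cad_pos = [70, 85, 40, 90, 55]
pre_cad_neg = [10, 30, 45, 20]

print(empirical_auc(pre_cad_pos, pre_cad_neg))  # 19 of 20 pairs -> 0.95
```

When the two ROC curves do not cross, this single number summarizes each reading condition, which is why the crossing check above matters.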
Okay. Now to talk about the pool of cases. Again, just sort of a summary of what the sponsor had talked about before. There is a pool of cases. There was a subset of that which was made up of nodule cases. These were documented cancer cases, so either a primary neoplasm or an extra-thoracic neoplasm with presumptive spread to the lungs. That is the set of nodule cases. The cases were allowed to contain non-nodule pathologic processes; things like pneumonia or emphysema and so forth were allowed to be part of that subgroup.
They took another set of cases. These were considered the non-nodule cases and what they term or what can be termed as normal cases where there was no nodule deemed present by the site PI and that site PI primarily relied upon original radiology reports in coming to that determination.
Cases with a history of cancer, radiation therapy, or even previous thoracotomy were allowed to be in this data set. This is the pool of cases from which the sponsor will now pull cases to run their ROC reading studies.
At this point we're not going to talk about -- we are going to talk about actionable nodules or the object of interest in this application. In particular, there is a panel of expert radiologists that identified the actionable nodules. This was done in a two-stage process, again, just as a review as before.
In the first reading the cases were read independently and blinded by three expert radiologists. The information provided to the radiologists was the subject's age, gender, and indication for the exam, obviously along with the exam as well.
Each individual radiologist marked all findings deemed to be lung nodules. Then the radiologist provided ratings for each of those nodules, so there is a detection task and then there's a rating of the actionability of that nodule. It could have fallen into an interventional category; that is an actionable finding where further workup was advised.
Or surveillance, which is, again, considered an actionable finding, one monitored with follow-up studies, probably more typically additional CTs. Also, they could have rated it as probably benign calcified, again, no action required here, or probably benign noncalcified, no action required.
After the first pass was done, findings that lacked 100 percent consensus were reviewed unblinded by all three radiologists, and basically they reevaluated locations where either two out of three or one out of three of the panel called the location a nodule. Then the radiologists would rate or rerate these on the actionability of the nodule candidates.
Along with this, thresholding was applied to match the general area where the algorithm should be performing: thresholds of greater than 4 mm in diameter for each nodule candidate and a peak density of greater than -100 Hounsfield units. This is a CT number and is related to the attenuation coefficient in grayscale in the CT exam.
Then after each nodule was identified, each lung quadrant was categorized based on the highest actionable finding within that quadrant. Then subsequently the quadrants will be used in the observer studies.
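The categorization just described, taking the most actionable finding in each quadrant, can be sketched as follows. The category names and their ordering are assumptions for illustration; the transcript gives the categories but not a machine-readable encoding.

```python
# Sketch: categorize each lung quadrant by its highest actionable
# finding, as described for the truth definition. Category names
# and ranks are illustrative assumptions, not the sponsor's code.

ACTION_RANK = {
    "probably_benign_noncalcified": 0,  # no action required
    "probably_benign_calcified": 0,     # no action required
    "surveillance": 1,                  # actionable: follow-up studies
    "intervention": 2,                  # actionable: further workup
}

def quadrant_category(findings):
    """Return the highest-ranked category among a quadrant's findings."""
    if not findings:
        return "no_finding"
    return max(findings, key=lambda f: ACTION_RANK[f])

# A quadrant with one benign and one surveillance finding takes the
# more actionable label.
print(quadrant_category(["probably_benign_calcified", "surveillance"]))
```

The quadrant label, rather than the individual nodule, then becomes the unit fed into the observer studies.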
Now, just to summarize what was found in that initial pass, again, this is three experts per panel. I'll show in this column the unanimous actionable, that's a three out of three finding; majority actionable, two out of three; minority actionable, one out of three. You can see that for unanimous actionable there were 142 findings. For majority there were 168. For minority there were 149 findings.
This gives you somewhat of an indication that panel variability is an important component here. Only about a third of the findings were unanimously actionable, another third or so were two out of three, and another third were one out of three. This gave the FDA an indication that panel variability was an important component and probably should be taken into account in the clinical study.
Now to go into the clinical study. These were multi-reader, multi-case ROC observer studies. Again, the test statistic was the Az or area under the curve. I'll present results based on analysis of the 90-case data set, 360 quadrants. The sponsor also performed a 32-case study and presented pooled results of the 32 and 90 cases, but I'll limit myself to the 90-case study.
What's important is that the MRMC analysis allows us to look at the variability, confidence intervals, and significance testing, and we can take those into account. That is important, obviously, in this case to determine significance and then to try to get an idea of what the separation is between reading without CAD and reading with the CAD device.
In order to analyze the variability, confidence intervals, and significance, two approaches were used: ANOVA-after-jackknife and bootstrap analysis. So here is just the general flow chart of the clinical study, and this will be followed for all the clinical studies. The study starts out with a pool of readers. These are the group of radiologists that are going to actually read the cases and give ratings for each quadrant.
There's a pool of cases and there's a pool of experts and the experts will be used to define truth. There will be a sample pulled out of cases. It will be used by the pool of experts to define nodules. There will be a set of readers picked out. Those cases will then be read using multi-reader multi-case ROC observer study and an estimate of the Az will be calculated. This could then be redone for different case sets, different reader sets, and potentially different experts on a panel.
So the important components here are how to measure the variability and confidence intervals and do significance testing. Again, two approaches were taken. ANOVA-after-jackknife analysis is a parametric type of analysis, and jackknife is a leave-one-case-out type of analysis.
Again, we're talking about leaving out a whole case, so you're leaving out all four quadrants together and then performing a quadrant-based analysis on that. So just as a quick example, if we had a case set of cases one, two, and three, when jackknifing is performed, or leave one case out, the first partition is going to be one and two; we've left out case three. The second partition may be cases one and three; case two has been left out.
The final partition would be two and three, leaving case one out. Then, using those partitions and looking at the pseudovalues that come out of them, you can use ANOVA to estimate the variability, confidence intervals, and significance. The analysis assumes modality as a fixed effect and readers, cases, and all interactions as random effects in the ANOVA.
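The leave-one-case-out procedure just described can be sketched as follows. The figure of merit here is a simple mean standing in for the Az computation (an assumption for illustration), which has the convenient property that the pseudovalues reduce to the original observations.

```python
# Sketch of the jackknife resampling step: leave one case out at a
# time (all of its quadrants leave together) and form pseudovalues
# for the figure of merit. `fom` is a stand-in for the Az routine.

def jackknife_pseudovalues(cases, fom):
    n = len(cases)
    full = fom(cases)
    pseudo = []
    for i in range(n):
        partition = cases[:i] + cases[i + 1:]   # leave case i out
        pseudo.append(n * full - (n - 1) * fom(partition))
    return pseudo

# Toy figure of merit: mean of per-case scores (illustrative only).
def fom(cases):
    return sum(cases) / len(cases)

# For a mean, the pseudovalues recover the original observations;
# ANOVA is then run on these pseudovalues.
print(jackknife_pseudovalues([0.8, 0.9, 1.0], fom))
```

The variance of these pseudovalues, partitioned by the ANOVA into reader, case, and interaction terms, is what supplies the confidence intervals and the significance test.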
A second approach to doing this is bootstrap analysis, and this becomes important for looking at variability of the truth panel. This, again, just to repeat, is a nonparametric analysis. What happens is randomly generated data sets are created based on the original data, sampling with replacement. Just as another quick example, with a case set of one, two, and three, when you run bootstrap you sample with replacement; the first partition might randomly pick case three, case two, and case three.
When you do the analysis you assume that case three and case three are really separate events, and we bootstrap across those to get the potential partitions. In the second partition you may pick case three, case one, and case two; here all the cases have shown up equally. Then a third partition may be case one, case one, and case two, and so forth.
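The with-replacement resampling just described can be sketched as follows. The figure of merit is again an illustrative mean rather than the actual Az computation, and the case scores are made up; only the resampling mechanics follow the description above.

```python
# Sketch of nonparametric bootstrap over cases: draw cases with
# replacement to form each partition, recompute the figure of
# merit, and use the spread of the estimates for confidence
# intervals. Figure of merit and case scores are illustrative.
import random

def bootstrap_estimates(cases, fom, n_boot, seed=0):
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        partition = [rng.choice(cases) for _ in cases]  # with replacement
        estimates.append(fom(partition))
    return estimates

def fom(cases):
    return sum(cases) / len(cases)

est = bootstrap_estimates([0.80, 0.85, 0.90, 0.95], fom, n_boot=1000)
est.sort()
# An approximate 95% interval from the 2.5th and 97.5th percentiles.
print(est[25], est[975])
```

Because it makes no parametric assumptions, the same resampling can later be extended to the truth panel itself, which is why it matters for the secondary analysis.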
So the primary analysis: again, the same basic diagram as before, but now there's a resampling scheme introduced into the analysis. The resampling is used for the pool of readers, again, the radiologists that are going to rate the quadrants, and for the pool of cases.
The truth is based on a fixed three-member nodule definition panel, again based on unanimous consensus. The analysis will be based on ANOVA-after-jackknife; bootstrap analysis was also performed. What happens here is the pool of readers goes in and is resampled, so it picks out a subset of readers. Likewise a subset of cases is selected using a resampling scheme. The cases go to the definition panel, where the panel is fixed, to define the actual nodules of interest, or the quadrants that are positive and those that are negative.
The set of readers are then randomly selected and go in and perform the ROC experiment. That gives one estimate of Az. This process is repeated either through jackknife or bootstrapping in order to get estimates for the variability and allow for confidence intervals and significance testing.
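The repeat-and-summarize loop just described can be sketched like this. It is a sketch only: `run_study` stands in for the whole ROC experiment and is a hypothetical callable supplied by the analyst, and the percentile interval is one simple way to summarize the spread:

```python
import random
import statistics

def resampled_az(readers, cases, run_study, n_reps=2000, seed=0):
    """Bootstrap both the reader pool and the case pool, rerun the ROC
    study summary each time, and report the spread of the resulting Az
    values. run_study(readers, cases) -> Az is supplied by the analyst."""
    rng = random.Random(seed)
    azs = []
    for _ in range(n_reps):
        r = [rng.choice(readers) for _ in readers]  # resample readers
        c = [rng.choice(cases) for _ in cases]      # resample cases
        azs.append(run_study(r, c))
    azs.sort()
    lo, hi = azs[int(0.025 * n_reps)], azs[int(0.975 * n_reps)]
    return statistics.mean(azs), (lo, hi)  # point estimate, rough 95% CI
```

Each pass through the loop yields one estimate of Az, and the collection of estimates gives the confidence intervals and significance testing described above.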
So just the result of the clinical study. Again, this is for a fixed three-member nodule definition panel. In the first column I show the pre-CAD Az for both jackknife and bootstrap. The second column is post-CAD, the change in the Az, the p-value for that particular test, and the lower and upper confidence intervals.
You can see that the results are fairly consistent between jackknife and bootstrap, with a pre-CAD Az of .881 or .879, increasing post-CAD to .905 or .903. With a change on the order of .024 we see fairly small p-values for both the jackknife and the bootstrapping, and the confidence intervals are also fairly consistent.
We wouldn't necessarily expect the bootstrap and the ANOVA to give us the same values but it's nice actually to see that there is consistency here between the two analyses.
So just some conclusions on the primary analysis. The sponsor has shown a statistically significant improvement in Az from pre to post-CAD, on the order of a .024 change in area under the curve.
The ANOVA-after-jackknife and bootstrap analyses showed consistent performance in both significance and confidence intervals. The analysis, however, was limited because it did not take into account any variation in the expert panel; we anticipate that variability in the panel would add uncertainty to the performance estimates.
This is, I think, an important factor because we don't have a gold standard of truth. We are dealing with a panel truth. We expect that if we sampled a new panel, they might come up with a different set of cases. They certainly would come up with some different nodules.
One of the important questions is how performance would change with a different panel makeup. That is one of the questions that we had asked the sponsor to address: in particular, looking at a different number of panel members, so that you have a different panel makeup and potentially a different definition of truth. What happens if another set of experts is used?
So a secondary analysis was conducted here. There are many different types of analysis done by the sponsor; I'll concentrate on one set, the random panel makeup. This will be based on random three-, two-, or one-member nodule definition panels, assuming the definition of truth is unanimous consensus.
Because of this type of analysis the ANOVA-after-jackknife isn't applicable at this point so only bootstrap analysis is possible. It follows a similar scheme as before. We, again, start with a pool of readers, pool of cases, pool of experts. Here, however, bootstrapping is applied to the pool of experts as well so that we have a different panel makeup for defining truth. That adds variability into that definition of truth and we can use our MRMC ROC observer study to take into account that variability.
So we use bootstrapping to select a group of readers, a group of cases, and a group of experts. Again, with that particular combination we get an estimate for Az. That study is repeated a number of times to allow again to look at variability where we have included variability of the truth.
So, again, these are random three-, two-, and one-member nodule definition panels. When I'm talking about three-member panels, I mean unanimous consensus: three out of three have to agree. When I give results for two members, that means both members have to agree. Obviously, for a one-member panel it is the opinion of that one member. The sponsor randomly sampled the panel so that we get the added variability from having many different experts involved.
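The random-panel consensus rule can be sketched as a small function. This is an illustration only; the expert names and marks below are hypothetical, and drawing experts with replacement mirrors the bootstrap of the panel described above:

```python
import random

def panel_truth(expert_marks, k, rng):
    """Truth for one quadrant under unanimous consensus of a randomly
    drawn k-member panel. expert_marks maps expert id -> True/False
    (whether that expert calls the quadrant positive). Panel members are
    drawn with replacement, as in a bootstrap of the expert pool."""
    panel = [rng.choice(list(expert_marks)) for _ in range(k)]
    return all(expert_marks[e] for e in panel)  # all k drawn members must agree

# Hypothetical marks: experts A and B see a nodule, C does not. With k=1
# a single expert's opinion defines truth; with k=3, drawing C anywhere
# makes the quadrant negative under unanimous consensus.
verdict = panel_truth({"A": True, "B": True, "C": False}, k=3,
                      rng=random.Random(0))
```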
Again, the same layout here. The pre-CAD Az, the post-CAD, the change, the p-value, and the lower and upper confidence intervals. We can see from pre-CAD this measurement of performance was .845 increasing to .868.
For the three-member random panel, a change of .022. For a two-member panel it was .832 increasing to .854, again a change of about .022. For a one-member panel, .817 increasing to .838, again a change of about .021, very similar to the .022 average.
We also see fairly consistent upper and lower confidence intervals for all the different definitions of the truth, and we see significance values which are fairly small as well. That's rather interesting, because what I said before was that we expected that when we incorporate randomness of the panel here, we would see a decrease in the statistical significance; that is, it would be harder to achieve statistical significance.
Really we see p-values similar to what we saw with a fixed-member panel. One possible trade-off is something that Dr. Wagner talked about this morning: when the definition of truth is varied, we have also varied the case mix, or the differentiation between negative and positive findings, so we may have moved ourselves toward a more closely balanced study, which gives us effectively a larger number of cases.
That was traded off against the variation in the truth, and the two seem to have traded each other off, so that we don't see a big difference in the performance. This is one possibility. It's certainly not conclusive in any way, but it is somewhat surprising that we didn't see a larger variation when we randomized the truth.
So just some conclusions on the secondary analysis. This analysis takes into account the random nature of the expert panel for defining actual nodules. In particular, it took into account different numbers of panel members and different panel makeups using a bootstrap selection of the panel.
All variations of the panel makeup confirmed a statistically significant improvement in the Az from pre to post-CAD, and this change was on the order of .02. As a more general conclusion, this type of analysis, where we actually try to randomize the panel makeup, is likely to be a more appropriate analysis for assessment of devices when only panel truth is available. That's obviously the case here, but we can anticipate other devices coming in where this will again be an issue.
Finally, I would like to talk about CAD stand-alone performance. In particular, this is a performance of the CAD algorithm alone and it's the algorithm's sensitivity and specificity with no reader involvement so we are just going to measure the performance of the algorithm on some set of cases or defined nodules.
Why might this be important? Well, it's generally important because the radiologist can use this information to appropriately weigh their confidence in the CAD markings. If you are a reader or a radiologist trying to purchase this device, you would generally like to know how it works; or if you have the device in use, to get a feel for how it's performing and what it might be marking.
Likewise, it potentially can be used as a benchmark for future revisions of the algorithm; from the FDA's perspective, knowing some benchmark of performance may help us determine how to evaluate new revisions of this particular algorithm when they come in.
The question becomes what's an appropriate performance measure for this particular device, and this isn't necessarily an easy question to answer. Anecdotally, the sponsor went back to the findings of the unanimous three-out-of-three fixed-member panel and looked at the appearance of the nodules that the radiologists marked.
What they found was that many of those 142 findings did not meet the criteria of a solid, discrete, spherical density. They subsequently went back and convened a second panel to reevaluate the nodules, but only based on appearance: not to find new nodules, just to look at the appearance of the nodules already defined.
They put together a set of five independent radiologists who were asked to categorize the nodules into two categories: what they define as classic nodules, which are discrete, solid, spherical or ovoid nodules, or nonclassic nodules. These would be nodules that may not be discrete; they may be hyperdense or irregular in shape, and they may be potentially normal structures that for whatever reason may not be considered nodules at all. This new panel only looked at the appearance of the nodules and determined whether they were classic or nonclassic in appearance.
This is the performance. In the first column I'll show the number of panelists defining the nodule as classic; again, there was a total of five, and I'll group together zero, one, and two out of five. I'll give the number of findings and the true positive fraction, the sensitivity of the CAD algorithm on that particular subset of cases.
In general I'll just summarize the CAD false marker rate. Then I'll give a final column with the median diameter of the true positives detected, just to give an idea of whether there is any bias in the size of the nodule based on how many panelists defined it as classic.
So in the first category less than three out of five there was a total of about 65 findings. The sensitivity was on the order of about 32 percent. For three out of five there was a total of 13 findings, sensitivity of approximately 70 percent. Four out of five of the panelists saying this is classic in appearance the performance jumps up to about 82 percent. All five the performance is about 83 percent.
If you combine all these findings together, a total, again, of 142 based on the definition of truth, the sensitivity is on the order of about 59 percent. The CAD false marker rate varied between two and three per case depending on whether the sponsor counted the equivocal nodules: whether findings rated below five out of five were included as false positives or not would change the median false marker rate, but it's on the order of two or three per case.
In the final column we see the range of the diameters of those true positives. You can see that it ranges from about eight to nine. For the less-than-three-out-of-five group it was 7.4; for three out of five it jumped up to about 11, then fell back to seven again. The idea of this column is just to show that there doesn't really seem to be a bias in how large the lesion was based on whether it was rated as classic or not.
Just as a final summary: if fewer than three out of five panelists called it classic, there were approximately 65 findings and the sensitivity was about 32 percent. If three or more out of five did, there were about 77 findings, relatively close to half and half for the data set, and the sensitivity jumped up to about 81 percent.
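The two stand-alone numbers being quoted, sensitivity and false marker rate, are simple ratios; a minimal sketch follows. The counts in the usage line are illustrative stand-ins chosen to land near the quoted rates, not the study data:

```python
def standalone_performance(detected_flags, total_false_marks, n_cases):
    """Stand-alone CAD metrics with no reader in the loop: sensitivity
    (true positive fraction) over the panel-defined findings, and the
    false marker rate per case. detected_flags: one bool per finding,
    True if the CAD marked that true finding."""
    tpf = sum(detected_flags) / len(detected_flags)
    return tpf, total_false_marks / n_cases

# Illustrative numbers only, not the study data: 84 of 142 findings hit
# and 250 false marks spread over 100 cases.
tpf, fmr = standalone_performance([True] * 84 + [False] * 58, 250, 100)
# tpf is about .59 and fmr is 2.5, the same order as the rates quoted above.
```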
So just in summary for the CAD stand-alone performance, what was found by the sponsor was there was a large variation in performance of the CAD based on the physician's assessment of the nodule's appearance as classic. Whether it was classic or not would make a big difference on how well the CAD performed.
Just a note: generally, the sponsors talked about the CAD being geared toward these discrete, spherical types of lesions and not necessarily toward some of the other types of lesions that were potentially marked.
So just in summary for this part of the presentation, what we found was that the Az was an appropriate test statistic for the clinical analysis, based on the fact that there was no substantial crossing of the pre- and post-CAD ROC curves.
The primary analysis, this was based on a fixed three-member expert panel. It showed a statistically significant Az improvement in the detection with the CAD. What was also found was the ANOVA-after-jackknife and bootstrap showed comparable significance testing and confidence intervals.
In the secondary analysis the sponsor varied the number of panel members and also varied the panel makeup using a bootstrap selection of the panel members, so this is a random panel mix. This confirmed a statistically significant Az improvement in detection with CAD.
Then, finally, for the CAD stand-alone performance, what was found was that there was a large variation in CAD performance based on the reassessment of the nodules' appearance. A more general conclusion on stand-alone performance is that this type of analysis is necessary for appropriate utilization of the device by clinicians in the field and potentially for reassessment of future algorithm revisions.
Now I'll turn it over to Dr. Sacks again to make some conclusions.
DR. SACKS: Okay. I want to then draw some clinical conclusions about this statistically significant gain. Granting the statistical significance of a gain in Az of .02, what is the clinical significance and this is a point that was discussed somewhat this morning.
Let me recall for you an earlier slide that I have excerpted this from. That is, that the clinical utility of this device is that the CAD is intended to reduce the number of missed nodules. That is, it is intended to increase the user's sensitivity, not increase the area under the curve, although that is related.
A gain of .02 in Az understates the relative gain in sensitivity. Why is that? When the CAD is used according to instructions, to retain all judgments of actionability even if unmarked by the CAD, the user always necessarily maintains or increases his or her sensitivity and, indeed, always maintains or increases the false positive fraction as well. They both have to go up. They could stay the same, but that would be an extreme case that isn't likely to happen; neither of them, however, can go down.
What that means in ROC space is that -- let me walk you through this slide -- the blue curve is intended to be a representation of the unaided initial reading. The red curve is the aided reading. We've been talking about the difference in area between under the red curve and under the blue curve.
But if you talk about a particular operating point on the blue curve unaided and ask what happens when you use the CAD, you move to some point on the red curve and if you obey those instructions not to back off when the CAD fails to mark something that you thought was actionable, you necessarily move up and to the right somewhere in that quadrant such as this arrow here so you move to some point here.
Now, Dave Miller showed you a number of representative arrows if you were to use a particular point on the rating scale on the blue curve and keep that same point on the rating curve -- on the red curve, the same rating, 80 or 50 or 20.
The problem is that radiologists while they could read by assigning a number to a study and always obeying a preset range for themselves saying, "If I assign any case 70 or more, then I am always going to act on it the same way.
If I assign between 40 and 70, I'm always going to act on it the same way. If I assign under 40, I'm always going to act on it in the same way," then those points might be relevant. Radiologists could do that but I'm a radiologist and I can tell you radiologists don't do that.
What they do do is they look at a case and they decide, "Do I act on this or do I not?" Or if there is a trichotomy such as in mammography where there is biopsy or short-term follow-up or return in a year for screening, that is the decision you make. That gives you an operating point that may or may not lie on the curve that you would construct if you gave a rating.
It wouldn't necessarily lie on that curve. It would lie on that curve if you always assigned your action based on a preset fixed range of ratings. But because those are done independently, those modes of thinking, the point that you operate on in terms of actual sensitivity and specificity may or may not lie on the ROC curve.
For this particular clinical study we don't know but what we do know is if you maintain that rule, and you are free to violate it if you are going to, but in this clinical study people did not violate it and what we can see is if we put this in the labeling and say to potential users out there, "Stick with this rule and you are not going to lose sensitivity," then what you're going to be doing is moving up and to the right.
And you can see from this gain in sensitivity, this increment here which is along -- TPR is just true positive rate or fraction. It's just another word for sensitivity -- that increase is a little more impressive than .02. I can't quantify it but you can expect that your gain in sensitivity is going to be greater. The utility of knowing that the red curve is higher than the blue is that you know that you're not so greatly increasing your false positive rate as the fall to a lower curve.
Now, here is an example. For example, if I start here and I maintain that rule, I'll go up and to the right but if I don't, I could fall even though I'm going to a higher curve from blue to red. Those are the same two curves as in the previous slide. Nevertheless, I could drop my sensitivity if I don't follow that instruction.
So any statistically significant improvement in Az means an even greater relative gain in sensitivity and one achieved without falling to a lower ROC curve if the reader maintains that rule not to back off if the CAD fails to mark something that he or she thought was actionable to begin with.
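One way to see this point numerically is the small sketch below. The 76 and 82 percent sensitivities are hypothetical operating points moving up and to the right; only the .881 and .905 Az values echo the primary analysis:

```python
def relative_gains(sens_pre, sens_post, az_pre, az_post):
    """Contrast an absolute Az gain with the relative reduction in the
    miss rate (1 - sensitivity), which is what 'fewer missed nodules'
    actually measures."""
    az_gain = az_post - az_pre
    miss_reduction = ((1 - sens_pre) - (1 - sens_post)) / (1 - sens_pre)
    return az_gain, miss_reduction

# Hypothetical sensitivities (0.76 -> 0.82) with the study's average Az.
az_gain, miss_reduction = relative_gains(0.76, 0.82, 0.881, 0.905)
# The Az gain is .024, yet a quarter of the previously missed nodules
# are recovered under these assumed operating points.
```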
Now, another point. The real question for judging the safety and effectiveness of any device is how does its introduction into general use compare to what we have today where it doesn't exist? The same question applies to a CAD.
Can we infer from the fact that there was an improvement in the average user performance measured in terms of Az in a clinical study that the average user will improve his or her performance, again measured in terms of Az, with the CAD in clinical practice.
That is, improve over his or her current clinical performance which is in the absence of any CAD for miles around. To put it another way, is the unaided reading in a clinical study a good surrogate for current CADless clinical practice?
What I'm showing here is let's suppose it is a good surrogate. Current clinical practice may have a CADless reading Az somewhere here along some Az scale. In a clinical study if the unaided reading is a good surrogate for that, then the fact that the aided reading is higher than the unaided reading, then the aided reading is also higher than current CADless clinical practice.
But, for example, in actual clinical practice with CAD, that is, in the future, the unaided Az could be lowered, potentially by failure to read first as one would normally read, that is, with adequate vigilance. If this were to happen, then the aided Az could also be lower than in current CADless practice. To show that in a diagram: if the unaided reading had an Az that was significantly lower than the Az in current clinical practice, it could pull the aided reading down with it, so that the aided reading fell below current practice.
What would be the implications of such a lowering of vigilance for judging the safety and effectiveness of the CAD, and can labeling help prevent it? On the labeling issues: there are two rules that, if followed by CAD users in future clinical practice with the CAD, will help prevent missing more nodules than in the former reading without a CAD.
The first rule is an always rule. Always read unaided first and as carefully as if you had no CAD. This would help keep the Az of the aided reading higher than the Az of the current CADless reading. We can't make users of this follow these instructions but we can guarantee that it's in the labeling.
Secondly, the never rule. Never back off from unaided judgment of actionability of a nodule if the CAD fails to mark it. This would prevent the sensitivity from falling below that of current CADless sensitivity. That is, it would prevent the radiologist from missing more rather than fewer nodules.
DR. IBBOTT: Thank you. At this point Dr. Wagner has a short presentation to make.
DR. WAGNER: Yes. This is just a trivial comment but it is along the lines, I think, of where Bill Sacks was going just a moment ago. The number .024 may sound small and he showed how it may have a bigger impact than that small number sounds.
If you look at the area under the ROC curve, here is the good stuff, .85, and here is the bad stuff, .15 or .12 or whatever. That .024 is also the reduction in the false negative piece, and all the inference that was done on the area-under-the-curve difference carries over, because it's just a difference from one. Here is the curve, here is the area under it, and here is the difference: the difference is just one minus everything else we've been discussing today.
The statistics of one minus something are the same as the statistics of that something so the area under the curve is also the reduction in false negatives with all the statistics in there averaged over all the false positive rate so that is another interpretation of that.
So .024 may not sound like a lot compared to .86 or so, but it should also be compared to .15 or .14 or whatever the missing piece is. If you considered the statistics tight for the previous part, they are the same statistics here. I don't know if that helps, but .024 looks a lot better compared to .12 or .13 than it does to .85, and that comparison is statistically robust. Thank you.
DR. IBBOTT: Thank you. Before we go onto the lead reviewers, I'll take a moment to see if people have any questions of these recent speakers, particularly questions in the nature of clarification again before we get onto the real discussion later.
DR. KRUPINSKI: Can you clarify or explain without getting into all the gory mathematical details when you go from looking at quadrants, then the Az is based on patient. For example, suppose you've got your four quadrants and you've got a true positive here, a false negative here, a false positive here, and a true negative here as quadrants. When you then go to Az on a patient, is that patient true positive, false negative, false positive, or some weighted combination there? Anybody who knows the answer.
MR. MILLER: I think it's a quick answer. So when you compute the Az all four quadrants are in there for computing the Az but that is to compute it originally. Then when you do the jackknife you pull out all four quadrants so, therefore, the jackknifed Az, which is the unit of analysis for the ANOVA, is based on the case because you pulled out the four quadrants together.
But when you compute the Az, you do have each of the four quadrant ratings compared against each of the four quadrant truths. This is discussed in Nancy Obuchowski's paper from Biometrics in '95. I ran those programs as well as the ones that we did just to make sure that we were getting the same estimates.
DR. KRUPINSKI: So all decisions are preserved basically.
MR. MILLER: All decisions are preserved.
DR. IBBOTT: All right. Then we will now have some brief presentations by the panel's lead clinical reviewer, Dr. David Stark, and the lead statistical reviewer, Dr. Brent Blumenstein.
DR. STARK: Thank you. I would like to begin by congratulating the applicant, the industry in general, the FDA staff, and the panelists. This discussion and the record of it, I think, documents substantial progress that's been made in the methodology for research and product development in this field of computer-aided diagnosis or detection and, frankly, the verification of these results so that we can apportion resources responsibly and regulate and improve overall quality of clinical care.
As is noted in what I've read from the FDA's notes from last year, this application, this issue, is a very prodigious task, and the technology is really more similar to putting Spirit and Opportunity on Mars than to most other things that we clinicians face, or have historically faced in our training, in deciding how to care for patients.
Nothing really could be further from the way a surgeon decides whether to start doing laparoscopic cholecystectomies, without any review or oversight, as opposed to exploratory laparotomies. I'm a little bit concerned about the fastidiousness and the zeal with which we are obsessed with technology. I'm obsessed with technology myself.
Some of the panelists here are devotees of little PDAs, as I am, but there are many red herrings here. There are many unintended consequences, and this is an extremely important task we have in front of us: not to move too quickly and not to move too slowly.
I just want to remind everybody that with those two space landers, the spacecraft that we've been following with our families and children, the unintended consequences come even for something that is manmade and simply mechanical, like an overheated solar panel or a flash memory that's choked with data. This is more complicated than the Challenger accident or putting Spirit and Opportunity on Mars, in my humble submission.
That is because we are limited. We chose to go into biology but there are enormous biological variations here. Even as a group I doubt we have the collective wisdom and strength to recognize all of them given the time that we've had.
There are numerous coincidental clinical issues and this panel has focused largely on what I believe is a red herring of the statistic of Az. I implore people not to think that just because we can launch a bottle rocket we can reach the moon.
My own papers which I have cited to the committee have shown a larger and more convincing increase in the detection of liver metastases with magnetic resonance imaging using exactly this same methodology with some of the same authors and we were wrong because of some of the issues that have been raised here today.
A statistically significant phenomenon in the laboratory with all of these little nuisances tearing and pulling at it, and these are only a fraction of the issues, can give us the enthusiasm that we can reach the moon but there will be problems with insulation flying off rockets and things like that. The unintended consequences are what I'm concerned about as we talk about safety and efficacy.
First about some of the red herrings. The burden of the 300 scans, it is a burden to have to look at so many images but that's a bit of a myth. One of the ways that we have improved our efficiency as radiologists in reading these images is we no longer tile them.
We melt through a stack with a trackball, so soft-copy reading to a certain extent mitigates the number of scans. To a large degree it really doesn't matter whether you have 50 or 500 if you are trackballing through a stack.
Furthermore, this product doesn't address that issue because it really is largely asking the radiologist to still do his conventional work and then add additional readings to it, albeit slice by slice computer selected.
The problem that we're here to face, to solve, and the industry is trying to address, as the physicians are, is that we have a false negative rate in detecting nodules in the lungs that is unacceptable to clinicians, to the public, and to healthcare providers and those who fund it.
This study population does not reflect that. The false negative rate of 24 percent is a number we've all been using by assent here today, but it differs depending on the study group, I'm sure; it might easily be off by 10 or 15 percent.
But that's 24 percent of the one in 100 that is positive. The radiologist faces 99 negative scans for every one that he has to find. Under these study conditions the radiologists faced cases of which two out of three were positive, and positive perhaps in multiple quadrants.
The false positive fraction, which is quite large here, is over a very large denominator. 99 out of those 100 patients who have truly negative scans will bear the burden of the false positives, the patients and the radiologists caring for them.
The question of efficacy can be as simple as this: if we assume the radiologist is perfect, plays by the rules, and adds no false positives, then he has the ability perhaps to improve his false negative rate, which is embarrassing but is the state of the art of medicine, from perhaps 24 percent or worse down to perhaps 10 percent. That is still horribly embarrassing, so we have some meager pickings and still an unacceptable result, I think, from the standpoint of the final objective. Nonetheless, it would be a step in the right direction.
The false positives, though: I believe we do have, in one of the curves, I think it's figure 11 on page 53 in one of the two studies, some showing of degradation of performance. Elsewhere the radiologists somehow managed to be perfect and to eliminate all the false positives, which is unbelievable, that they can do that.
I believe some radiologists are going to be induced to call things positive. It's just not realistic to think otherwise: another look, when you are prompted, is going to cause some more false positives, and there is the possibility of degradation, with scatter around the ROC curve. Whether they are due to distraction error or not, there will be, perhaps unmeasurably, as happens in medicine, unnecessary biopsies.
Dr. Castellino talked about the effect on treatment if you call five lesions instead of three. Some surgeons will say, "I won't operate on three pulmonary nodules and do a metastasectomy. I will go to chemotherapy." Or if there's two. One pulmonary nodule, we'll excise it out. If there's two and it's a false positive, no chemotherapy. Unilateral versus bilateral disease, no surgery.
So the consequences of a mistake, a false positive is huge because they add to that minority, that one in 100. And there are, of course, the complications of follow-up CT scans with or without contrast. I'll get to contrast media later. We haven't discussed it today but one of the claims is that this is effective with or without contrast media but we haven't seen, I believe, data on that point.
So one of my concerns is the nonclinical circumstances, in terms of the patient mix and the circumstances of the readers. We had at least one reader who read 90 cases in a day, and there was more than one. They may have been exceptionally strong readers, but we know they weren't reading under clinical conditions. They were ignoring many of the things that radiologists are obligated to worry about. Radiologists' obligations are not limited to working with this machine: they have to look at the neck, the spine, the ribs, the chest wall, the abdomen, and the adrenal glands, especially for lung cancer.
So these radiologists in this study had a very, very narrow task in front of them, not even looking at the pulmonary vessels or the mediastinum, not even looking at lymph nodes. They were just looking at airspace, for nodules abutting airspaces, trying to match the technology.
This technology forces the radiologist, in effect, to work for it, even though we insist the radiologist first do his own job. He then has to come back, read in a skewed way, and correct the numerous false positives, protecting the patient from the roughly three-per-study false positives that this technology causes.
Now, I'm ignoring cost in this analysis. I've been instructed that effectiveness here comes at any cost so I'll leave that for other people to address that but there are still risks because the radiologist has a certain amount of time and he is going to make mistakes.
The fast readings may have, as we've heard from the statisticians, inflated this very small, though statistically significant, increase in Az; it may make it evaporate. I submit that even a larger Az gain, as my own papers have shown, is often not clinically significant for the innumerable reasons that we've touched upon, albeit quickly, because we've mostly spent our time on the red herring of ROC methodology: a red herring for a decision today, I believe, but extremely important for the future of this technology.
If we do move on to the next phase and improve this product, I did not see that it calls for significant training of the radiologists. I think the warnings that will be given to the radiologists are limited, and I think the temptation and the ability to misuse the product are significant.
I think there should be very substantial discussion and interaction with the FDA about what would be appropriate warnings and training and, importantly, post-market surveillance to see how this actually performs with realistic clinical readings, not in the unrealistic setting that was designed here to feed an ROC study.
These radiologists, who performed safely here, were diligent, paid, and focused on eliminating this false positive rate. They did not have to deal with coincident chronic obstructive pulmonary disease, artifacts from patients having their arms by their sides, or contrast agent given in large boluses, which can cause artifacts and change the appearance of the blood vessels throughout the lungs.
We didn't discuss how the algorithm operates. It sounds to me much like it does not use a maximum intensity projection. It does not identify the vessels per se. It's really looking at ovoid intrusions on airspace. As for product development, I don't have enough information to comment further.
Let me see if I have more from my notes and I'll try to wrap up. Well, I've been asked to state my views, and I hope it's clear that I am sincerely impressed with the progress that has been made, but I think this is an extremely ambiguous and complex project, and I am really worried about the real-world pressures on the radiologists. I do not believe that we have demonstrated effectiveness, and effectiveness can come in two ways.
One is improving our accuracy, assuming we show that we do not increase the false positive rate and that we can significantly affect, in a clinical setting, the 24 percent false negative rate for real lesions. I think there is evidence there that it's going in the right direction, but I really am not persuaded that we are looking at much more than a statistical trend that, because of the way the study was conducted, statistically reached significance for the ROC.
The other way to reach effectiveness would be to improve the efficiency of the radiologist working so the radiologist would have time to read more carefully. I really do believe that a careful re-read or a second read of these scans might be more effective, accurate, and efficient than the use of this modality.
I believe that we need a placebo study. There is no placebo study where we see the effect of simply introducing the random false positives in a population that is 99 percent negative and see if we do any better at finding that one in 100 who has a true positive.
I believe this is such a statistically based application, and we have such a skewed set of circumstances for collecting the data (the data set we looked at, the way the examinations were done, and the very narrow statistical analysis that was done), that I do think we have to look at the history of the ROC, where it remains unproven that a p-value for an ROC should stand as proof of sufficient effectiveness for FDA approval.
And in terms of safety, ignoring cost, I think we have seen in at least one of the graphs provided that there is possible degradation. We have an intuitive understanding that there is possible degradation, and I have no doubt this product will help some patients, but I think it may hurt others in direct and indirect ways. Since I'm supposed to say what I think, I would say that I do not think this is, at this point, ready for approval. If the panel disagrees with me and it is approved, I would have numerous comments about the labeling that we see in the proposed commercial materials. I've made some notes on that and could comment here or leave that for later.
DR. IBBOTT: I think we get to that later.
DR. STARK: Okay. Thank you. Thank you, everybody.
DR. IBBOTT: Good. Thank you.
DR. BLUMENSTEIN: Amazing. It worked. I wanted to say a few words about my thoughts on the statistical concerns, some of which you've heard already a bit this morning.
First of all, I want to say that it appears the sponsors have done a really excellent study by today's standards. Nonetheless, I can't escape concerns about the success and impact of the device, concerns related simply to assessing its significance. Most of the concerns I have are rooted in the unique features of the study design rather than in the methodology, which I think has come to be accepted and used in this area. In other words, there are unique features of this study design that may make this difficult. I'm not concerned about the general statistical methodology and, in particular, the resampling part of it, but I do have concerns about whether all the important features of this study have been taken into account in the resampling methodologies. Let me explain that.
The first major class of discomfort I have is with the accuracy of the measures of success, in particular their translation to the clinical measures of success. In particular, we see no measures of uncertainty for the clinical measures.
In other words, the Az measures something about device performance and not clinical performance. While we have been given some indications of clinical performance through ROC curves, little arrows, performance points, and so forth, we don't have any measures of uncertainty with respect to those clinical measures.
I'm concerned about the sampling of the cases included in this study. They were artificially sampled. Population prevalence is likely not reflected in the data set that was analyzed and, therefore, it's difficult to assess the clinical impact of these results without some kind of an assumed prevalence. This is just fundamental in any kind of diagnostic evaluation. I'm not sure this could be avoided, and I'm not sure how to deal with it, but it does leave me with some concerns.
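The dependence on an assumed prevalence can be made concrete with Bayes' rule: holding a reader's sensitivity and specificity fixed, the positive predictive value swings widely as the assumed prevalence moves from a screening-like 1 percent to an enriched 30 percent. The operating point below is purely illustrative, not a figure from the PMA:

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value at an assumed prevalence (Bayes' rule):
    true positives divided by all positive calls."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)

# The same hypothetical reader, evaluated at two assumed prevalences.
print(round(ppv(0.85, 0.90, 0.01), 3))   # ~0.079: most positives are false
print(round(ppv(0.85, 0.90, 0.30), 3))   # ~0.785: most positives are real
```

The same reader performance thus implies very different clinical impact depending on the prevalence assumed, which is why an enriched case set alone cannot settle the clinical question.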
Perhaps one of my major concerns is this: there is a correlation structure having to do with this quadrant implementation, which was some kind of partial localization methodology. Let me add a parenthesis here.
The correlation between the upper and lower quadrants of the right lung, that is, between the results from those quadrants, is likely to be larger than the correlation between, say, the upper right and the upper left quadrants in the same patient. In other words, there is more correlation within a lung than between quadrants of opposite lungs.
I didn't see that the computations took this into account in any explicit way. I'm not sure how you would; I'm just expressing a concern here. There's a lack of complete understanding of the methods used to analyze this kind of partial localization maneuver to get to these quadrants.
I'm also concerned about whether the expert panel had knowledge of the patient's identity. I assume that they did; I don't see any evidence otherwise. So when they were making a judgment as to the status of the quadrants within a patient, the results of one quadrant may have led them to feel differently about the results in the other quadrants as they were looking at these things. I don't see that taken into account, and I'm not sure how to do it. I'm just concerned about it.
Then I'm concerned about the incremental structure of the study. The instructions to the readers were definitely additive; in other words, they were supposed to use traditional methods and then add the CAD. The computations apparently didn't take into account the correlation between methods (that is, between methods, not between quadrants of the lung), and I didn't see that addressed.
I'm not sure this makes a difference, but I'm left with a feeling that it should and that it should have been taken into account, because the computational methods for ROC curves and for comparing areas under ROC curves seem to be based on having done independent assessments of the two methods. Therefore, I'm left wondering whether the p-value would be different had the correlation between methods been taken into account.
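The effect of correlation between the two reading conditions can be illustrated numerically: when the same readers produce both the unaided and the aided scores, the variance of the difference shrinks by twice the covariance, so an analysis that treats the two reads as independent misstates the uncertainty of the difference. The per-reader Az values below are invented for illustration, not study results:

```python
import statistics

# Hypothetical per-reader Az values, unaided vs. CAD-aided (NOT study data).
unaided = [0.86, 0.89, 0.84, 0.91, 0.88]
aided   = [0.88, 0.90, 0.87, 0.92, 0.90]

diffs = [a - u for u, a in zip(unaided, aided)]

# Naive variance of the difference, treating the two reads as independent...
var_indep = statistics.variance(unaided) + statistics.variance(aided)

# ...versus the paired variance, which credits the correlation between a
# reader's two reads (the same reader tends to be high or low on both).
var_paired = statistics.variance(diffs)

print(var_indep, var_paired)
```

With these invented numbers the paired variance is more than an order of magnitude smaller than the independence assumption suggests, which is exactly why the handling of this correlation matters to the reported p-value.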
And, finally, in this area of concern is intra-reader variability. The experiment didn't measure intra-reader variability by giving a given reader multiple opportunities to read an image from the same patient; therefore, you don't know how that reader is going to perform, how much variability there is going to be from seeing that same patient over and over. You would want to do that in a way that the reader wouldn't know it was the same patient, separated in time and so forth.
But how much would a measure of intra-reader variability modify the p-value associated with Az? I don't know. I was trying to get at that this morning, and apparently there's not much understanding of that yet. But my intuition is that intra-reader variability would be particularly important in the computations of variability for the clinical measures. It goes like this: the artificial scaling of measuring on a probability scale, or however you do it, in order to use ROC methodology depends on assumptions about the performance of the reader with respect to their consistency over use. Yet the clinical measures that depend on that ROC don't take that into account, so I'm left with a bunch of concerns about whether, had intra-reader variability been taken into account, we might be seeing different results.
Then I have just another concern or two, this business about truth. I think it's important to note that the statistical methods absolutely depend on a definition of truth, but I feel that the sponsor did the best that can be done. I have no criticism of that.
But it's important to realize that the results are conditional on acceptance of the definition of truth they got from this panel. Then what was going on was that they degenerated truth, and I found that really weird. I couldn't think of a better word for it. Sorry. I wondered why an analysis of the impact of variations in the truth readings couldn't have been done.
For example, I would have liked to have seen some covariate analyses or some sampling of quadrants or sides of the lung. I don't know how to do these; I'm just throwing them out. I hope some statistics students are listening. Maybe it's an area of methodologic research.
You could use readers with smaller or larger areas as a covariate, or readers with more or less experience. Or, what I think is particularly promising, maybe you perturb some of the thresholds that individual readers are using; this might be some kind of Bayesian analysis whereby you throw in a distribution of thresholds, getting back at that intra-reader variability.
At any rate, I'm of mixed mind. I'm trying to be here but I think I'm here. Where I can read that, it says, "I am a bomb technician. If you see me running, try to keep up."
DR. IBBOTT: Thank you, Dr. Blumenstein.
All right. At this point we will see the questions that the FDA is going to ask of the panel. We will take a break shortly after that. When we come back we'll consider those questions. I believe Dr. Sacks is going to project those questions. When we come back from our break, the first thing we will do is address the questions to the sponsor and hear your responses.
DR. SACKS: Okay. I'll go through these slowly but I think you have printed copies of them. This is more for the audience.
First, please discuss whether the data in the PMA support the conclusion that the CAD can reduce observational errors by helping to identify overlooked actionable lung nodules on chest CTs. In particular, given that use of the CAD produced a statistically significant improvement in ROC performance, please discuss whether:
(A) The use of an expert panel is appropriate for determining actionable nodules given that a tissue gold standard is not feasible.
(B) Actionable nodules are a reasonable target for a lung CT CAD to be judged safe and effective.
(C) The achieved gain in ROC performance in terms of the area under the curve demonstrates safety and effectiveness of the CAD.
Second, please discuss whether the labeling of this device including the indications for use is appropriate based on the data provided in the PMA.
Third, please discuss whether the sponsor's proposed training plan for radiologists is adequate. If not, what other training would you recommend?
Fourth, if the PMA were to be approved, please discuss whether the above or any other issues not fully addressed in the PMA (A) require post-market surveillance measures in addition to the customary medical device reporting, etc., and (B) suggest a need for a post-approval study.
DR. IBBOTT: Thank you, Dr. Sacks.
All right. We will take a 15-minute break and we'll reconvene at 10 minutes to 3:00. Thank you.
(Whereupon, at 2:36 p.m., the proceedings went off the record until 2:55 p.m.)
DR. IBBOTT: Thank you. We'll continue now with the discussion and we are going to go straight to a response from the sponsor to the questions that were raised before lunch. I believe Dr. MacMahon is going to start with that response.
DR. MacMAHON: Thank you. Again, I'm Heber MacMahon from the University of Chicago. I would just like to start out by making a few points that may clarify some of the issues that have been raised. Let me start by just mentioning a few smaller issues that have received a lot of attention.
Briefly, the question of the placebo effect. Dr. Stark has raised the question whether the need for the observer to review the case a second time after being prompted by the CAD may have actually improved performance because anytime there's a second read, there's reason to believe that additional nodules may be noticed.
However, I think it's worth emphasizing that the average false positive rate of this system is three per entire examination, and we're talking about examinations with up to several hundred sections. What the observer does in those situations is not re-read the entire study but go directly to the sections on which he or she is prompted, an average of three sections, and just look at that particular mark and decide: is that a nodule or not?
I would suggest that the opportunity for picking up additional true positives in that situation is really pretty small if one looks at the number of sections and the number of false positives with this system. I would just like to make that comment.
But the larger issue I would like to talk about, and I think it touches on all of the questions that have been raised, is why the difference in Az is so small in this experiment. I think there is a sense of disappointment that, with what looks like a very strong CAD detection system, we didn't see a larger improvement.
I would suggest if we had a larger improvement that a lot of the questions about the statistical methodology and the design of the experiment would become moot because it would become apparent that such a large improvement could not be accounted for by some of these issues.
We have had a discussion, and both Dr. Wagner and Dr. Stark mentioned it: an observer performance test is not like real life. I can attest to this. I've conducted several observer performance tests myself, mostly related to digital chest radiography and image processing. I have to say, if I had conducted my experiments in this way, I don't think I would have achieved even statistical significance in most cases, and I probably would not be here now.
Let me explain why. There are a number of factors going on in an observer test. We've already heard how the observers are working in an undisturbed environment. They are highly motivated. They are highly vigilant. These are radiologists.
We are sitting them down and we are saying, "All you have to do is find nodules. We are going to measure your performance and see how good you are. You don't have to look at the mediastinum. You don't have to look at the pleurae. You don't have to consider interstitial lung disease. We are not going to disturb you. The telephone is not going to ring. The technologist isn't going to tell you there is a patient on the table for a biopsy. A clinician will not stop by and ask you to look at a study." This is an ideal reading environment.
For these and many other reasons, performance in an observer test is extremely high. Basically, observers do not, by and large, miss obvious abnormalities in an observer test. But we know from our own experience and from studies that have been done that radiologists miss relatively obvious abnormalities all the time, every day.
That is actually the issue we are trying to address, and that is the difficulty in trying to extrapolate from an observer test to clinical practice. I would put it to you that these observers were working at an extremely high level. If we look at the average Az before CAD, the average was 0.88, and some of the observers were over 0.9.
In my experience, when you start at this level in the unaided situation, whether it's some kind of image processing or energy subtraction or whatever, it is very difficult to show a substantial improvement. There is not a lot of room left for improvement when the observers are right up there. The situation where we do see a large improvement is when they start out at a lower level, missing a lot of abnormalities that they can then pick up in the second-read situation.
So what happened here, and why did they perform so well? It was not only the observer situation but the selection of the cases. In many observer tests, including the ones we quoted from the literature that show a large difference, we tend to go to difficult cases, because we know it's only in those difficult cases that our CAD or whatever will make a difference. So we go to selected cases, perhaps cases that were missed on the original reading, or perhaps a panel goes through and selects chest radiographs that have subtle nodules.
This is a very well-accepted way of doing it, because we know that in those kinds of cases, whatever the modality is, it is likely to make an impact. Although these are selected cases, we know it is usually impractical to take a random selection of the whole population and expect that there will be enough of those subtle cases for the difference to be statistically significant. We do some kind of selection in most cases.
However, here although the cases were selected for having a high probability of nodules, and there was a high incidence of nodules there, they were not selected for having subtle abnormalities. We have to assume that most of the nodules were easy. Most of the observers detected them and, therefore, there was no opportunity for the CAD to show an improvement.
So I think that this is a critical point and to me this explains why that apparent improvement is small in the observer test. I strongly believe that if this kind of a system were implemented in clinical practice where we were subject to these various distractions where obvious abnormalities are missed, there would be a much larger improvement and this would be a useful clinical system.
In that regard, even if the amount of improvement shown in the observer test were the amount of improvement in clinical practice, I would say that in my own practice I encounter a high proportion of patients with pulmonary nodules, certainly larger than 1 in 100. I don't know exactly what the number is, but I would say up to half of all the CT scans that I read have either a nodule or a question of a pulmonary nodule.
This is a very pervasive issue that affects almost every CT scan we read. In some screening studies the incidence of nodules has been over 20 percent. Indeed, the incidence of even cancer in some screening studies has been up to 2.7 percent in the initial prevalence screen so nodules are not rare abnormalities.
If I can reduce my miss rate by 15 percent or anything in that area, I would be very happy, because that is going to benefit a lot of my patients. I'm going to see a benefit multiple times, probably at least once a day. I would say that throughout the whole country the magnitude of that improvement is not at all meager or insubstantial.
On that point I would like to hand off to Dr. Castellino who has some more comments.
DR. CASTELLINO: I just have a few. I think Dr. MacMahon has addressed some of the issues that I was going to talk about but he certainly can do it better and with more authority.
I would like to clear up the issue of consecutive cases. These were consecutive cases. They were not selected for nodules. It turned out that the practice we got them from had a distribution of cases reported with nodules and cases reported without nodules, so there was no selection whatsoever.
I would agree that, I guess, it depends on your practice, but if you're in a standard community hospital or a hospital setting of some nature, the number of nodules you see on routinely performed CT scans every day, on a variety of patients, many of whom, by the way, happen to be oncology patients, out-patient or in-patient, is high.
There was a comment made something like, "We don't want to have the radiologist work for CAD." I agree. We don't want to have the radiologist work for CAD. In fact, I don't think the radiologists do work for CAD if the nature of our product is correctly understood.
The only additional work required of the radiologist is to go back and review those several slices (two, three, four, five, whatever it might be), look at the circle on the image, and determine whether it's a true positive or a false positive.
Now, we have not quantified how much additional time that would take, but it probably takes on the order of anywhere from no seconds, if there are no marks, to maybe 15 or 20 seconds. There may be a nodule that is pointed out that the radiologist has to think about and make a clinical decision on.
That often takes time, but that's perfectly fine. That's the whole point of the product: to get the radiologist's attention directed at something that may be important and then to tease it out and decide what has to be done.
I would echo the fact that a 20 percent or 30 percent reduction in nodules that are missed might represent only a five percent increase in the nodules that I detect. I personally think that is a very substantial improvement in my performance. That is a very important issue.
Consider when radiology residents finish four years of training and go on to a year of fellowship. If we can improve their performance in their subspecialty by five or seven percent compared to general radiologists in training, that is probably a significant improvement. I don't denigrate the number whatsoever. In fact, I think it's an important number in clinical practice.
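The relationship described above, in which a large relative cut in misses shows up as a small absolute gain in detection, can be checked with one line of arithmetic. The 80 percent baseline sensitivity is an assumed round number, not a study result:

```python
def aided_sensitivity(base_sens, miss_reduction):
    """Sensitivity after a second-look aid removes a given
    fraction of the baseline misses."""
    return 1.0 - (1.0 - base_sens) * (1.0 - miss_reduction)

base = 0.80                          # assume 20% of nodules missed unaided
new = aided_sensitivity(base, 0.25)  # remove a quarter of the misses
print(new, new - base)               # approximately 0.85, a 5-point gain
```

So a 25 percent reduction in missed nodules against an assumed 80 percent baseline yields roughly a 5-percentage-point rise in sensitivity, consistent with the "20 or 30 percent reduction in misses, 5 percent more nodules detected" framing above.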
There was a comment that radiologists may not follow the rules. I think it's an important comment, but we don't expect that to be the case. Certainly when we introduced our breast CAD product, as far as we could tell, radiologists followed the rules pretty assiduously.
Certainly for masses, for which the code is not anywhere nearly as perfect or as robust as it could be. But I think that is probably true of any device you have to consider. If there are physicians out there who will use a device incorrectly, I don't know how you address that, but certainly that is not how our device is supposed to be used.
Lastly, I think it's important to note that should we gain approval, and if post-market follow-up studies are recommended to further investigate performance in the real world, we obviously would discuss this with the FDA and would be very happy to do that. Thank you.
DR. IBBOTT: Very good. Thank you.
We are now going to consider the questions and also the panel's questions regarding the presentation. What I would like to do given the time is ask that the first question be projected again. I would like to ask the panel to consider the questions one at a time -- we have the four questions and these have been distributed to us -- and use this as our opportunity to ask further questions of the sponsor as they are relevant to the questions that we've been asked to consider by the FDA.
While Dr. Phillips is getting that up, I'll remind you the first question is to discuss whether the data in the PMA support the conclusion that the CAD can reduce observational errors by helping to identify overlooked actionable lung nodules on chest CTs.
In particular, given that the use of the CAD produced a statistically significant improvement in ROC performance, please discuss whether (A) the use of an expert panel is appropriate for determining actionable nodules given that a tissue gold standard is not feasible.
I would like to invite the panel now to discuss this question. I throw it open to anyone who would like to lead off.
DR. KRUPINSKI: Offhand I would say yes, it is. I mean, I don't see really that many other ways to do it and I think the analysis where they broke down and showed the different ways of doing it, leave one of the observers out, put them back in. I honestly don't think there is any other way at this point in time that you could get at some other truth than using an expert panel. I think it was appropriate.
DR. IBBOTT: I would be interested in knowing from the radiologists on the panel and other people in the room if the variability among the panel that developed the reference standard if that sort of variability seems to you to be typical. I know radiologists don't agree with each other 100 percent of the time. I'm not naive but I do want to know if the sort of differences that we're seeing here if you believe those are representative.
DR. CONANT: On a good day or a bad day? I think definitely. I think they did a very eloquent job of creating the expert panel and coming up with really the best situation possible in this case.
DR. TRIPURANENI: I echo the same comments. As a clinician, those are the vagaries of clinical practice, and I think what they defined as an actionable nodule is probably the best that we can do today.
DR. IBBOTT: There seems to be agreement then.
DR. STARK: One question along those lines, Mr. Chairman, making the most benevolent presumption: on its face it looks like they've done absolutely everything that could be done, but this is a very, very complicated business of selecting images. There are all sorts of selection biases, even in selecting the institutions and the CT scanners. There can always be more information on this that I think the FDA should consider.
We have an able group of FDA staffers, and so I think the FDA should consider how these patients were selected, the institutions that were selected, and why there were not more scans from certain institutions that are clearly generating them, if these were consecutively obtained.
I think for any further studies, whether they are done for a PMA revision or post-market surveillance, more information on why this number of exams came from these institutions should be offered, because it will lead to more questions which, if nothing else, will advance the science that has already become quite sophisticated here.
The other thing is that I do believe we should learn whether the truth in this case, which we are all saying was reasonable when these cases were gathered two years ago, worked out that way, because we, or rather the applicant, know more about these patients today. I'm very keen to know whether these people with nodules have had follow-up, whether we have proof on these actionable nodules.
I don't know if I missed it, but I would be keen to know how many of these were reasonably deemed actionable but turned out to be benign, did not change, and did not require treatment, and how many that were considered not actionable turned out to be cancer.
That's not only important for this product but for its post-market surveillance and the development of new algorithms for improved products in the future.
DR. IBBOTT: Are you asking the sponsor that question?
DR. STARK: If I'm permitted. I really would like to know, using the real-world clinical definition, how many of the actionable nodules were actionable, and vice versa.
DR. IBBOTT: Yes, Dr. O'Shaughnessy.
DR. O'SHAUGHNESSY: Yeah, I think that's a very good point. Basically, we designed the protocol in consultation with FDA. We identified who the people were that would qualify for inclusion in the study. Because of IRB and other issues, the sponsor is blinded to who the patients are and to their follow-up; we collected a prespecified amount of information.
If necessary, we can work with FDA to determine if it's possible to go forward and find out what happened with these patients. Again, they were collected with a certain concept in mind to do the study.
DR. STARK: Thank you.
DR. IBBOTT: All right. Are we ready to go on to the next question? Okay. The next question asks us whether actionable nodules are a reasonable target for a lung CT CAD to be judged safe and effective.
DR. KRUPINSKI: Again, I would say it's reasonable given the caveat that Dr. Stark brought up. If you could follow up on these and find out if they truly were actionable versus not, that would certainly be a benefit. I think that is the most reasonable thing for it to be looking at.
DR. IBBOTT: Yes, Prabhakar.
DR. TRIPURANENI: It's interesting. It all depends on how you define safety and efficacy. I think Dr. Stark called it on this one. As a clinician, to me the effectiveness is ultimately tied to whether it has any clinical impact. To me, it really comes down to the management of the patient and ultimately what the physician is going to do.
I really can't answer this question at this point in time, because I just don't have enough information to say that it is actually effective. Yes, statistically you picked up a few extra nodules, but I really would like to see the clinical data. I do understand that's not how the protocol was designed, but I strongly recommend that we really look at what the ultimate clinical impact and clinical significance are.
As far as safety is concerned, I think Dr. Bill Sacks already raised this question, and it keeps bothering me: once the product is approved, if it is approved, when it goes into the real world, it's quite possible that somebody might get a little lax and not use the methodology that was recommended, that is, reading the whole CT scan unaided first and then using the CAD system. If the system is used as it is actually described, I think it is safe.
But, on the other hand, I keep thinking about whether there is a way you can come back and make sure that people do it the way they are supposed to, but I can't think of any. I don't have an answer; I'm just raising the question. If somebody is not going to use the system as it is supposed to be used, could it be potentially unsafe? I don't know the answer.
I actually have a question for the sponsor. In your 90 patients, 43 patients had nodules. Were there any instances where the radiologist unaided picked up the nodule but the CAD missed the nodule completely?
DR. CASTELLINO: I don't have the number for that, but the answer is, of course, that happened. The CAD system is unfortunately not 100 percent sensitive. In fact, it fails to mark a certain set of nodules that the radiologist clearly sees. That's why it is really viewed as an adjunctive review.
To get at the prior comment, which I think is a very good one, let me remind everybody how the radiologist looks at a CT scan of the chest. We give it at least two passes over the entire image set, maybe three. One is at what we call mediastinal or soft tissue windows, looking for abnormalities in the mediastinum, chest wall, etc.
One perhaps is bone windows. Sometimes it is, sometimes not. And one is at lung windows. At the lung windows we can see abnormalities within the lung parenchyma. Now, as we look through those 100 to 250 images in a scroll-through, cine fashion, I don't know of any radiologist who looks through the entire data set saying, "I'm looking for nodules.
I'm looking for airspace disease. I'm looking for bronchial wall abnormalities. I'm looking for emphysema." I'm looking for this, looking for this, and looking for this. But instead we look at the lung images globally and we see if there are any features within the lung parenchyma that shouldn't be there.
Nodules, infiltrates, pulmonary infarction, etc., etc. Just that alone means that the radiologist has to look at every lung image either individually or sequentially in some sort of more efficient mode. No. 1 is that that's how it has to be used. In the process a radiologist will detect nodules.
Secondly, the radiologist knows that it's not going to detect all nodules. If it ever got to the point of 100 percent sensitivity, they could use it only the first time as the first reader. We are a long way away from that. But they still would have to look at all of the lung images to see everything else. I hope I have answered that question.
DR. TRIPURANENI: I guess as humans we are good at pattern recognition. That's what I do. Even though I'm a radiologist and oncologist I keep looking at the CTs and all those things and we are good at recognizing patterns. I guess the computer is not quite dead yet.
I have another question, which is the flip side of the other one. For how many patients did the radiologist, unaided, say there were no nodules? And in what percentage of those patients did the computer say there was a real nodule, so that the CAD really helped turn a negative nodule patient into a positive nodule patient?
MR. MILLER: I think I'm probably the one with your answer, but I didn't quite get it. Would you mind repeating? I think you're looking for a fraction, but what's the numerator and what's the denominator?
DR. TRIPURANENI: What I'm looking for is if the radiologist read the scan and he basically said there are no nodules in any of the four quadrants.
MR. MILLER: Yes.
DR. TRIPURANENI: And when you use the CAD what percentage of those patients were turned into positive nodule patients?
MR. MILLER: Right. That's this issue of the percent reduction in misses, I think. In order to answer the question, you have to make assumptions about what an individual reader's true threshold would be. We really can't do that. We can speculate at what the number would be if everybody's true threshold was 20.
If everybody's true threshold was 20, then they missed things on the first read 16 percent of the time and on the second read only 11 percent of the time, and that's a 30 percent reduction. If their threshold was 80, then it's a different number that I don't have at my fingertips. Is that answering your question?
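The percent-reduction-in-misses arithmetic described above can be sketched in a few lines. The 16 percent and 11 percent miss rates are the figures quoted in the testimony under the assumption that every reader's true threshold is 20; this is a minimal illustration of the calculation, not the sponsor's actual analysis.

```python
# Relative reduction in misses, using the miss rates quoted in the testimony
# (16% unaided, 11% CAD-aided, assuming a uniform reader threshold of 20).
unaided_miss = 0.16  # fraction of nodules missed on the unaided first read
aided_miss = 0.11    # fraction still missed after the CAD-aided second read

absolute_gain = unaided_miss - aided_miss           # 5 percentage points
relative_reduction = absolute_gain / unaided_miss   # ~0.31, the "30 percent"

print(f"Relative reduction in misses: {relative_reduction:.0%}")
```

Note that the relative figure depends entirely on which absolute miss rates are assumed, which is why the answer changes if the threshold is taken to be 80 instead of 20.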
DR. TRIPURANENI: Partly. The absolute number that you picked up was about 4 to 5 percent. I think the improvement, whether the threshold is 20 or 80, is approximately 16 to 28 percent or something like that.
I'm actually going back to the actual number of patients. If somebody has three nodules in one lung, it doesn't matter if you pick up two more nodules in that lung. What I'm really interested in is the patient who never had any nodules in either lung, where the CAD helped to pick up an extra nodule that would really make all the difference for that particular patient.
MR. MILLER: I don't know the number on that. I can tell you that there were a fair number of patients like that. I mean, maybe about half of the cases in our study only had a single nodule so for that nodule to be identified caused the ratings to go up. Again, I don't know the percentage but there were quite a few cases like that.
DR. KRUPINSKI: Do you know the flip side to that? How many of the absolutely normal patients that the radiologist called normal had the CAD point something out, turning a totally true negative into a false positive, so that now you've got a false positive patient? How many of those?
MR. MILLER: I don't know that number. Again, I know that there were patients like that but I don't know the number. I would be speculating.
DR. SOLOMON: I think you're hearing the question essentially of how you translate the statistics into clinically significant issues. That is, changing the patient who is negative into a positive, or whether it is significant if you add one more nodule, a sixth nodule.
MR. MILLER: I am hearing that and I think that is something we can probably work with FDA on from the data that we collected.
DR. CONANT: May I just say something quickly in terms of answering this question? I think actionable nodules are really the target that we have clinically. It's wonderful to look for a two-year follow-up or biopsy proof but that is not what the task is at hand.
The question is whether we are going to recommend short-term follow-up. We need that stuff eventually and, yeah, we're all curious about it, but in terms of the detection task, it's really an actionable nodule. I agree that this is a good target but, again, I'm concerned about looking at the data by patient, not just by nodule or quadrant, because it does make a difference in patient management whether it's nine or 10 nodules versus zero to one, so I agree and disagree with that.
The other thing I just really quickly need to comment on is this comparison, and I hate to make it, with mammography. But, you know, I've seen CAD in place in mammography and, yeah, people cheat. That's not what this is about. This is about marketing and education, and you can't prevent people from cheating.
That's not really our task here. It does happen, but hopefully, you know, people will get better about that. The thing about a chest CT is that nodule detection is one task in that chest CT that radiologists are being asked to do and that this company is addressing, so this idea of cheating -- "I'm going to look at the whole CT but I'm not going to look for nodules until I have my prompt" -- I don't think we as a panel can really go there, but I've seen it happen. I don't do it.
DR. BLUMENSTEIN: How do you cheat?
DR. CONANT: I actually have not used --
DR. BLUMENSTEIN: It just doesn't seem --
DR. CONANT: I have to admit I have not used CAD in clinical practice. I am waiting for it to come off the direct digital images in my clinic. It used to be you digitize the images, you had your film screen there, and you pushed a button and your little prompt came up and you didn't have to wait until after you saw the images.
You just pushed the button and you never had to look at the images. Your answer was there. Now, one thing that has been, or I think potentially could be, built into a soft-copy review of digital mammography and chest CTs is a lag time before the information is available, or the requirement to go through the image with multiple window levels, mediastinal and all that other stuff that chest people do.
Potentially in mammography, to prevent cheaters you could say, "Okay, you've got to scroll through every image at all the resolutions and so forth before your CAD prompts will come up." Again, I don't think that's what we're being asked to do here, to create a safeguard against cheating.
DR. SOLOMON: It's important for safety issues, and maybe even a warning that you had to click before you actually -- you know, just a reminder to the average user that this is something that could be dangerous unless you have looked at the scan already.
DR. CONANT: But that's education and training and eventually you're liable anyway.
DR. FERGUSON: My question is tangential to this because as I listen to you describe the instrument and its use, you said that -- I thought I heard you say that the radiologist had to go through the scan before he could click on your button.
I mean, is there a fail-safe there which keeps the radiologist honest -- few or none of those people are around, you understand -- or could he go in and click and get your markings for the whole lung scan for nodules and then use those as his reference points?
DR. IBBOTT: Actually, Dr. O'Shaughnessy, I was going to invite you to come up and your colleagues to come up to this table so you don't have to keep jumping up and down. If you pull up a couple more chairs, perhaps three or four of you could sit at that table.
DR. O'SHAUGHNESSY: Thank you.
DR. CASTELLINO: I perhaps was misleading when I made that comment or wasn't understood. First of all, let me emphasize there is no fail safe mechanism. We thought about building that in in some fashion. We feel that labeling and training will address it. There are work-arounds if you made everybody look at the lung windows first.
You could go through the whole lung windows and then push the button, so, I mean, radiologists are very clever people, but I don't think it would work. What I was trying to get across -- I see you would agree with me.
What I'm trying to get across is that you have to look at all the lung windows for a whole host of other abnormalities within the lung, of which nodules are one feature, let's say, of maybe eight or 10 features that you're looking for. Even if you pushed the button first and it said there's a nodule or two, you still are required to look at everything because you have to do that.
I think radiologists will be more likely to use it in the prescribed fashion. With mammography it's different; with calcification detection being about 98 percent accurate, almost approaching 100 percent, yes, I think some radiologists probably do use it as a first reader for calcs, but certainly not for masses.
DR. IBBOTT: Thank you. It, again, appears that we have consensus on this second question, that actionable nodules are an appropriate target for this question.
So then the third question is whether the achieved gain in ROC performance demonstrates safety and effectiveness of the CAD. We've already been discussing this to some extent. Clearly it does seem to depend on how rigorously the radiologist follows the always and the never rules.
People being people, I'm sure that not everybody will always follow the always and never rules. The question is whether the company has done the appropriate things to encourage people to use this device correctly.
We've seen some of the information that they have provided us today, and there is a fair amount more in the information we've reviewed, with the labeling that describes the warnings. I would like to ask how you feel, if you haven't already volunteered your opinions, about the labeling and the adequacy of these warnings, and whether you consider them acceptable.
I don't mean to swing us away if you view that question as asking something a little different, but certainly I think that the safety question is at least partly dependent on people following the never rule, not changing their diagnosis based on the response of the CAD system.
DR. CONANT: Just real quickly, I'm very positive about the first two. This one I have problems with, though, because I don't think that we've really definitively shown the effectiveness without looking at this by case. You're actually specifically asking about ROC performance as the measure of effectiveness. Until I have it broken out by patient, I'm not really sure of that.
DR. BLUMENSTEIN: I see two measures here. We have ROC performance, which, I think, is a measure of device performance. Then what we've been talking around, and what we all seem to have some degree of discomfort with, is whether it performs clinically the way that we would expect it to or would hope that it would.
I have misgivings about whether the ROC performance measures are accurate and I have expressed those but I definitely have issues about whether there's clinical safety and effectiveness demonstrated because we don't have measures of confidence bounds on sensitivity or any other kind of measure that shows us an estimate of the clinical efficacy.
Now, I don't know whether the FDA is inclined to give a device approval based on device performance or whether there is a need for demonstration of clinical effectiveness. But as a panel member given the data that I have, I have to say that the answer to C is no for me.
DR. SOLOMON: I have two questions for you that are related to this. The first is that we weren't presented with any data on reproducibility of the system. I don't know if you have anything to say about that. If I ran an R2 on the same scan or same patient, is it always going to give the same result?
DR. O'SHAUGHNESSY: In this particular case -- this is Kathy O'Shaughnessy -- the images are digital images, so the algorithm will perform exactly the same on the same digital image. Reproducibility isn't an issue.
DR. SOLOMON: Okay. And then the second question has to do with the fact that I guess you are currently selling the product in Europe and I'm not sure how many months now it's been that way but do you have any feedback from the physicians in Europe who are using the system? How is it working as far as safety and efficacy goes?
DR. O'SHAUGHNESSY: It hasn't been on the market very long in Europe, so we only have a limited number of sites. In terms of safety, there certainly have been no adverse events that have occurred with the device. I believe that physicians are very happy with the use of the system. They are not collecting clinical data, as far as I know, that could support this application.
DR. SOLOMON: Do you have any post-market studies of data that you are collecting right now in Europe?
DR. O'SHAUGHNESSY: No, we're not.
DR. IBBOTT: Yes.
DR. TRIPURANENI: Regarding the clinical effectiveness, even though that is not the topic of the discussion, we heard from Dr. MacMahon about what he felt about this. I would like to ask, if the Chairman lets me indulge, Dr. Delgado about his particular clinical impressions.
I'm not talking about the protocol per se. What is your feel having looked at 20 or 30 patients in your institution? Do you think it's going to have an impact on the clinical practice? Perhaps it's not a fair question.
DR. DELGADO: Well, we did not do a dedicated analytical study, but we did get comments from the different radiologists, of whom I'm one, who worked with the system. We do handle a large volume of CTs per day, including multi-slice CT cases.
Like I said, most radiologists found that there were nodules that we missed, and increasing nodule detection is something that I think is only a good thing, so I think it's effective in terms of what it's stated to do, that is, increasing the detection rate of nodules.
DR. STARK: If I can touch on a couple of things on this one, question C. I mean, we see effectiveness where the radiologists are limited so much in their task, and safety because they are constrained to just looking at the airspace without distractions, under these conditions that we all agree were designed to ask a very focused question for this ROC study.

But we don't know what the radiologists would do otherwise. I certainly agree with Dr. MacMahon's suggestion that a more reasonable study group would have 80 out of 100 scans be completely normal, maybe 18 out of that 100 have some other abnormality like COPD, some atelectasis or pneumonia or pleural effusion, and 2 out of that hundred perhaps have a solitary pulmonary nodule. We can make arguments here that you have perhaps undersold the technology, that it might be particularly useful at helping the radiologist find a needle in a haystack when he's distracted. But that has to be shown; that is the efficacy argument that has not been proved, and it might be better than what you say. It might be worse.

The safety argument is, under those conditions, can you prevent these radiologists from falsely causing additional scans, biopsies, etc., fighting off these false positives, when you do have to look at the mediastinum and there is an infiltrate and there is some adenopathy or some post-operative changes. That's one issue in terms of the study population.
I also wanted to mention that I think my colleague Dr. Solomon's reproducibility question is particularly important. What happens after a patient has been operated on? We all agree that the computer is going to run the same file the same way twice, absent, again, your flash card getting overloaded with photographs of Mars. But what about the patient who is scanned on another day and breathed differently, or had their arms by their side, or had a contrast injection?

There must be data available to you that doesn't even require a new reader study -- each patient serves as their own control. I mean, just go into the archives at Sloan-Kettering and you can come up with 100 scans digitally, run them through your computers, and show here are patients where we have six scans. We have 100 patients that have had six scans, and how many of those, if 20 percent have an abnormality, did this machine treat that abnormality the same way each time? That is a very simple, not labor-intensive study -- there's no physician work at all. That would really answer the reproducibility question in a clinical context, and it would show that doctors can rely on this from day to day.
Lastly, I am concerned to hear about this product's experience in Europe. Clinical radiologists, especially when something is like this -- like surgeons deciding lap cholecystectomy works, it's good for patients -- we decide based on word of mouth and anecdotes, and I very much appreciate Dr. Delgado's excellent presentation of his anecdotal experience. It brings this to life, but where are the European papers saying, "This has changed my practice. This has made my life easier. I feel more comfortable"? There are usually anecdotal reports at venues that have less of a standard than we have here, like the RSNA or national meetings, and why aren't they appended as written testimonials, at a higher level than, forgive me, but, you know, one user at a beta test site? Where are the published testimonials or anecdotes or clinical case reports in the European literature?
DR. IBBOTT: Yes, please.
DR. MacMAHON: I think there were a number of issues. One was a suggestion of doing the observer test in a different way, perhaps with more normals and with multiple kinds of abnormalities in the set. I agree that would be ideal in a sense.
I should point out there were multiple abnormalities in the scans that were used. These were not just pristine normals versus typical nodules. I think, in fact, you saw in the really typical classical nodules the results were much more impressive.
A lot of the disagreement among the radiologists in nodule detection, I think, although I didn't participate, was not so much is this a nodule or a vessel. It was does this qualify according to these very specific criteria as an actionable nodule above a certain size and above a certain density, when does it become a scar or when does it become an airspace opacity.
Those are the things we struggle with every day. That was partly a matter of definition. But I think the mix of normals and abnormals was used to maximize the statistical power in the experiment. Of course, one could do more ideal experiments if time and money are no object but this was already pretty extensive. I think that was probably a reasonable approach.
There are some other issues. Perhaps I'll have the other people address them.
DR. CASTELLINO: Well, just a couple of comments. It turns out, it just so happens, that half of the patients in the 90-patient study group, 45, were done with bolus IV contrast injection and the other half were not, so we didn't design it that way. It just happened to fall out that way. We saw no difference in the appearance of the nodules. In fact, with contrast you might expect some of these things to be easier to detect.
To answer one of the questions, I would like to reemphasize that we didn't cherry-pick for clean lungs. We had an independent radiologist come in and rate the lungs as clean, intermediate, or dirty. I don't know the exact numbers, but I think something like 15 percent would be dirty lungs, about 30 or 40 percent intermediate, and whatever remained would be relatively pristine lungs. As I said before, a number of these patients did have prior surgery or radiation therapy. They were included in the study group.
I would dearly like to go into Sloan-Kettering's radiology department or any other radiology department and get a bunch of cases like I used to do and do clinical studies. I can tell you, trying to get cases from institutions to do this type of research work is extraordinarily difficult. I know the academic community is very aware of this. We are trying to develop such databases so everybody can have access to them. Let me tell you, this is not a trivial issue. To identify these five sites and get the cooperation from these people is extraordinary, and we are deeply indebted to them. I think your suggestion is great. You get me the studies and we'll do the research on them.
Reproducibility. I think the issue with mammography, and I don't like to keep bringing this up, is that when you're scanning a film, the noise within the scanner, the digitizer, is a problem for reproducibility. We've done those studies with film-based cases. With a digitally acquired image, there is no issue for the algorithm, since it always works on exactly the same digital data set.
Going from one patient to the next, it all depends on how that patient is. Two days later the patient may have motion artifacts and whatnot. The CAD obviously will perform differently on that type of case material.
Lastly, there are some reports that were presented with this product. I'll be glad to get them together and ship them out to you to take a look at. They are all, of course, retrospective studies looking at cases where the read was negative, reviewed in retrospect to see whether there were nodules in the lung, and the CAD identified a number of nodules.
One comes out of the Brigham at Harvard. In 22 percent of the cases that were read as negative for lung nodules, not counting other abnormalities, they found nodules in retrospect that they felt were important to recognize. Oh, I just answered that question. Okay.
DR. IBBOTT: How about Nancy since you haven't said anything.
MS. BROGDON: I just wanted to comment. When you mentioned shipping some information out, please make sure that anything that you submit comes to the agency directly. Thank you.
DR. CASTELLINO: Absolutely.
DR. IBBOTT: Dr. Krupinski.
DR. KRUPINSKI: An issue sort of following up on what was already brought up: not reliability, but engendering trust in your users. I notice that when you're reporting the false positive rates of the stand-alone system, you report the median. Now, typically the median is used when you have a skewed distribution, so I'm assuming that your distribution is positively skewed and the average false positive rate was higher than the median. Could you tell me what the average was, whether it was skewed, and then the range of false positives per case? Not just the median, because at a median of two to three most people are going to look at that and think average. I think it might be a little bit misleading.
MR. MILLER: I agree. People sometimes use the word average to mean either a median or a mean, and I think we have to be very careful not to refer to that number as an average, because the distribution is skewed and the median and the mean are different. Because there are some patients that could actually have 100 nodules, we don't have a cap on the number of marks. The system could actually find 100 true positives on a given case, so we actually do have one case out of the 151 that had 47 false marks.
Now, I think on that case when people hit it they sort of just ignored all the marks because it was just obviously a very, very dirty lung. I don't know the number off the top of my head but I think the mean false marks is four if we are defining false as marks that were not panel findings at all.
If we include some of those equivocal findings, the one-thirds and the two-thirds, I think it may go up to five and the number is different if it's false marks per normal case or total number of false marks. That's in the ballpark of what it is.
There are actually a fair number of cases with zero marks -- a lot with zeros and ones and so forth -- so that goes back to your other questions about correct localization.
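The median-versus-mean point raised in this exchange can be illustrated with a small sketch. The counts below are hypothetical, not the study's actual false-mark data, but they show how one very dirty case with many false marks pulls the mean well above the median of an otherwise low distribution.

```python
import statistics

# Hypothetical false-mark counts per case (NOT the study data): mostly zeros,
# ones, and twos, with one very "dirty" case contributing 47 false marks.
false_marks = [0, 0, 1, 1, 2, 2, 2, 2, 3, 4, 5, 6, 9, 47]

median = statistics.median(false_marks)  # 2.0 -- what a "median of 2-3" reports
mean = statistics.mean(false_marks)      # 6.0 -- pulled up by the outlier case

print(f"median={median}, mean={mean}")
```

This is why the panel asks for the mean and the full range rather than the median alone: the median hides the tail of heavily marked cases.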
DR. KRUPINSKI: This is unrelated, but the stand-alone performance was analyzed differently from the ROC analysis. You broke it out into classic versus nonclassic. Did you go back and look at the observers' performance data using that breakdown instead of what was used?
MR. MILLER: Yes, we did, using a cut point of four-fifths classic. We don't have the distribution here, but it sort of splits out neatly, in that people are more likely to be on one end or the other, so using that four-fifths definition you actually get more of a separation of the curves than with what we showed you. I think that is essentially because we have a higher true positive percent and people are reacting more often to it.
DR. IBBOTT: Okay. I think then we'll go on to the next question. Question No. 2 is: please discuss whether the labeling of this device, including the indications for use, is appropriate based on the data provided in the PMA. This is, again, the question of whether the instructions for use and the warnings about the always and never rules are sufficient. Maybe we have discussed that enough. I'll see if there are any comments from the panel.
DR. STARK: I have a few. If people could turn to Tab 8. I'm not sure if I've directed myself to the most important place, but this is where I've started. I think, by the way, since I'm a primary reviewer I should fill in some of these things. Suffice it to say, I conclude from the discussion that I've heard today that the word "significant" -- that if this product is approved, now or in the future, any claim to significance really should be toned down.
I don't know -- I'm not trying to lawyer anybody here. I know there are people in the FDA that know how to do this, but I would be offended to see the word. I think there is a future for this technology. I'm not sure today is going to be the biggest step forward, whether it's a positive or negative result in terms of approval.
This is a step forward because there is going to be this technology but I do not think we are close to where I would feel comfortable being part of something where a radiologist is told that this product makes a significant difference. I think this is an aid like a better light bulb in a view box.
I mean, I think it should be -- if you are allowed to sell this, I think the word significant should be in a footnote and only when it's within two words, if you put it in Google, of the word ROC so that we have a significance statistical ROC result in a footnote.
But to tell radiologists this is going to make a significant difference in their practice or significantly help their patients -- I think this panel and everybody who has been candid has labored mightily to say that is not a correct claim, and it would be misleading. I would rather be on the plaintiff's side of a malpractice suit related to that.
Similarly, for example, some of the language that I would use as an example, and, again, I'm not trained in this and forgive me for being blunt. I'm just trying to help because I'm presuming in these comments that there is something to be decided here and we're just talking about language.
The phrase under "efficient detection of lung nodules," paragraph 2, second sentence. By the way, here is an example of the confusion. You have "clinically significant nodules" here, and elsewhere the word significant is used, and we talk about it being loaded, spun, twisted despite our presumed innocence, but marketing people will get carried away and you would be on the edge of fraud just due to concatenation. So forget about that word significant; but "high sensitivity and low false positive CAD marker rates" -- I do not see how someone can make that concatenation. That is just a little bit too artful to me.
We have a very high rate of false positives with CAD. I mean, to characterize the CAD marker rates we are seeing as low false positive is the exact opposite of the truth. Again, this is my opinion. I would love to parse the language if that is what we are supposed to do here.
Let's see. In terms of improving sensitivity and efficiency: the sensitivity argument, I think, may pass muster with an asterisk. I don't know that we've shown there is any increase in efficiency at all. I really don't. I think we have basically said to the radiologist, read it again.
I appreciate the back and forth, and I think everything Dr. MacMahon said is correct and everything that I have said is correct as we, again, focus people on this. You are redirected to a single slice, and perhaps the computer workstation, whatever it costs, leads you to that slice, but no radiologist is going to decide real or not real based on looking at that one slice.
They are either going to tile up the adjacent slices until they are fully through the lesion, or they are going to trackball through it, and in most cases, human nature being what it is, you are going to trackball through a significant fraction of the images.
All I can say is touche, back and forth on this. You are not just going bing, bang, boom, there are three slices it picked out. They were all obviously nothing. No way. No way at all that's going to happen. You are going to trackball through it and that's going to take time. I think the efficiency claims really would have to go.
On the next page it says, "Automatic CAD processing for lung nodule detection requires no user interaction." Again, please -- my opinion is that some person probably was just being enthusiastic, but this requires that the radiologist be responsible for dealing with this snowstorm of false positives.
It's the worst kind of user interaction. It's the kind of user interaction that causes radiologists to stop doing mammography or to leave the field entirely. It's like, I'm going to say there are all these positives here and I'm going to be a malpractice lawyer's dream. Now you have to bat away all of these snowflakes and take the time to interact.
You definitely have to interact and take the time to do it and be liable. I doubt this is something that should be considered here, but the effect on people's ability to read -- the psychodynamics that produce these ROC curves, that produce radiologists' performance -- really is largely affected by people's anxiety, and I know there are people here that are expert on that and I'm not.
But I think it's going to make people very edgy, and it's going to have a lot of unintended consequences; they are going to be thinking about what the malpractice lawyer is going to do with or without application of this approved technology.
That alone might have a bigger effect on reader performance. Those of us who don't have the machine will be more careful, and those that do may or may not be more careful. I think the labeling and the training are extremely important. I know we'll get into the training next.
DR. CONANT: May I say something real quickly?
DR. IBBOTT: Yes.
DR. CONANT: Just a little rebuttal there, Dr. Stark.
DR. STARK: Please.
DR. CONANT: Sorry, David. From experience in breast imaging, I just have to say two marks is not a high false positive rate. When I'm looking at the task at hand which is 300 images, I don't know that's a high false positive rate until I know how it impacts a single patient.
It doesn't sound that bad to me compared to what we're doing with mammography and where we've come from and where we're going. I don't think you can jump to say -- I mean, I agree with all your other points here, but I would be hesitant to say that's too high until we have the data, because it doesn't sound that bad unless it impacts the individual patients where those two false positives occur.
DR. IBBOTT: Yes, Dr. Ferguson.
DR. FERGUSON: Speaking of the labeling, I'm looking here and I'm sorry. I apologize. I have gone through here several times -- not talking about the advertisements but the manual that you have -- looking for clear definition of what we saw on the slides which is what I think should be somewhere in here up front, and that is the two slides which we showed about what you must do and what you must not do to use this device. Is it somewhere in here?
DR. IBBOTT: You're talking about the always and never rules?
DR. FERGUSON: Yes.
DR. O'SHAUGHNESSY: I agree it's very important. We're looking for the advice of the panel on this issue and labeling in general. I should comment that that particular situation is at the front of your Tab 4, where we've got preliminary warnings and precautions that would be given to the radiologists. That's where in our mammography product we typically -- these are gone through during your training session to make sure that the information gets across. Again, we would look to work with FDA, with the advice from the panel, to come up with appropriate labeling for the device covering both the manual and any advertising labeling. That is part of the job when we finally work with FDA and settle on final labeling for the device.
DR. SOLOMON: Two other quick questions on the labeling. One is whether vendors matter. I mean, you have two vendors. There are several others out there and whether or not there's any impact on your system. The second one, as far as labeling goes, whether or not there's an optimal slice thickness and whether or not that should be stressed because maybe the protocol should be changed to optimize your system.
DR. O'SHAUGHNESSY: I can answer that at a high level. If you want to go into more detail, I have the technical people here. Although the five sites we chose to select cases from for the regulatory study happened to have scanners from the two vendors mentioned, GE and Toshiba, the separate database of cases that was gathered for training the algorithm has representations from all the major CT vendors.
In addition, as part of the approval for a CT machine there are very rigorous controls on the quality of the images. Those type of controls more than adequately make sure that the images are adequate for CAD. I believe that's okay. The second question again? I'm sorry.
DR. SOLOMON: Optimal slice thickness and protocol design for optimizing your system.
DR. O'SHAUGHNESSY: Right. Because the system was designed to address the issue, especially in an information overload situation, we focused the development of the algorithm for slices of 3 mm. collimation or less.
In fact, the system won't process CT images unless they have collimation less than that. Part of it is that's where radiologists are most likely to miss. The other factor is it's a more volumetric description of the lung and so the algorithm is designed to perform in that environment.
DR. TRIPURANENI: I heard Dr. Emily Conant loud and clear that it's not our business to actually decide how the user is going to use the system, but I think I have to agree with Dr. Ferguson. I really would like to see in big letters always and never somewhere loud and clear.
When you look at these fancy color graphics, for somebody not paying attention it looks like you can push the button and the machine is going to tell you everything, even though it says "improves" and all those things, but I think those two points need to come out loud and clear.
DR. STARK: Is there anything in here to give comfort to a radiologist, once this product is approved, for not buying it? Is there any justification for not feeling bound to use this in every patient, whether they have pneumonia, they are in for a car accident, or a follow-up on a pleural effusion? I'm wondering what type of marketing pressures that we haven't yet seen are going to drive people to feel that they will be left as a wounded calf behind the herd for the malpractice lawyers if they don't take on the burden of using this product for every CAT scan done in America after the FDA gives this its imprimatur.
DR. CASTELLINO: I thought I got two questions there. One might be, I think, if you have this in your department would you choose to use it on patient A and not on patient B. If they meet the technical requirements, the CAD works in the background.
I think it takes an average of three to five minutes to process the images for the CAD results. If you're reading in a standard fashion, which is not really that much on line, the CAD information will be available to you. You can choose to use it or not to use it.
My suggestion as a radiologist is, if it's there and you think it's worthwhile, since you have acquired the technology, you probably should use it in every case, but this is definitely up to the person who wants to use it.
The second question is a little more difficult to address. I think your question is really saying, if I don't have one, should I get one. Our experience with mammography and, Emily, I hate to go back to that but I guess I have to, is that the utilization of CAD mammography, which was approved five and a half years ago or more, has been relatively slow.
I mean, there are many mammography practices that don't have it. In fact, apparently you don't have it. I don't think this is going to force radiologists to get it or not to get it. Just like a 16-channel CT scanner is not a necessity if you're doing CT if you have an 8 or a 4, and some people still have a single slice scanner.
Or having all the probes on an ultrasound machine, or having all this or all that. Radiology programs make decisions on what technology they wish to acquire. If they think this is valuable, that it will help them in their practice, they will acquire it. If they don't think it's any good, they won't. I think the marketplace will decide.
DR. STARK: Shouldn't the labeling of products like this -- this is perhaps a broader question but I think it pertains here -- contain disclaimers so that someone does not make inferences about the standard of care or what is the required minimal diligence of a physician or a hospital who chooses not to be an early adopter of this technology.
DR. O'SHAUGHNESSY: I think that would be up to the panel to discuss. Again, if appropriate labeling is found to be important for this product, then, you know, we'll work with the FDA to include it.
DR. KRUPINSKI: Sort of a tangential question. With mammography now when you use CAD you get extra reimbursement above and beyond. Do you foresee this happening with this as well?
DR. O'SHAUGHNESSY: I think it's a little early at this stage of this technology to figure out what the reimbursement situation will be.
DR. IBBOTT: Let's move on then -- oh, sorry. Go ahead.
DR. CONANT: Can I ask just a real quick technical question? Maybe this is very naive and I didn't understand your illustrations but does the algorithm that analyzes the images, does it come -- I guess can I hook up lots of scanners to it? Is it one box for each scanner or is it one box for each department? I know there are issues with mammography. I'm just curious.
DR. O'SHAUGHNESSY: In this situation, depending on how many CT images you are going to feed through, the fact that we've utilized the DICOM standard means it's just an appliance sitting on the network. You just push images from any scanner available in your system, and as long as you don't exceed the computing capability of the computer to keep up with your case load, there is no restriction.
DR. IBBOTT: Well, that brings us to the question of the training program, No. 3. Please discuss whether the sponsor's proposed training program for radiologists is adequate. If not, what other training would you recommend? I would like to start by asking my question about that. I couldn't find anything in the material here that provided a lot of detail about the training.
In particular, how long the training is and how closely supervised it is. You presented a bit more during your presentations but I wasn't sure if that was the type of training you would propose for customers or if that was training for the people who were doing the evaluation.
DR. O'SHAUGHNESSY: I think that is a great issue and good question to bring up. We didn't have the formal training program written up at the time we were submitting the PMA and part of the goal of the training at institutions like Dr. Delgado's was to take a first run at it, assess what changes needed to be made, and then bring that forward.
So the format that he described was very similar to what we ended up with, which is, basically, depending on the number of radiologists, typically a site would have one of our specialists there for a day. They would work with the radiologists one on one to go over the manual, in particular the algorithm description.
Every system that ships will have demonstration cases that are good examples of what CAD marks and what it doesn't mark, and the types of false marks they are going to see. And then, as the radiologists get more comfortable with the system, there is the shadowing that we talked about, where our people are available to answer questions while the radiologist is reading on their own -- the radiologist asks, "Why is that mark there?" and the applications person can answer that. Then in addition to that, the application specialists usually follow up with the site within a week or two of that training to make sure that no other issues have come up. Of course, we are always available by telephone or e-mail if any issues come up. The general outline of the training program is similar to what we do in mammography, and we found that to be very effective.
DR. KRUPINSKI: As sort of a follow-up, Dr. Delgado said that some people weren't there for the training and then some of the other radiologists trained them. Is that enough? Is that acceptable? Because obviously I wouldn't think they would be able to answer some of the more technical questions so how did you feel about that?
DR. DELGADO: That's a good question. We were able to do it quite readily. The training experience that I had with the application specialist was really just three or four hours in the morning. We had some lunch, they were around for the afternoon and stuck around and watched us read and shadowed us.
I think that perhaps that is something that R2 could consider, if they want to actually mandate that physicians go through the training in that fashion or some kind of course or proctored period. That was not strictly applied in my case as a beta experiment, but I see that potentially being used in clinical practice. That is probably a good recommendation.
MR. BURNS: A time is not given; you used eight hours. I would suggest a super user trained at the facility and the production of a training CD, so that when you have new radiologists and staff coming on board -- training CDs are not that hard to produce, and you have your own product. Three to four hours sounds about right to teach someone how to use this work station.
DR. DELGADO: I should add that is something that we went through, at least for the physicians that did receive the course or the small introductory application seminar. We did process, I think, about 15 or 20 cases, some of which were provided by R2 and some of which were from our institution. That is the kind of case load that should be either predefined or drawn from the institutions. Definitely valid.
DR. STARK: I think the most important part of training is going to be identifying what causes these false positives and cataloging them because there are going to be -- there's going to be a pattern and frequency of artifacts or anatomic coincidences that probably the company already has some good idea what they are that are going to be very different than the false positives that we train our residents to recognize in the normal practice.
The false positives that the radiologist has to fight off on his own going through the studies are likely to be a very different mix of appearances and locations than the false positives that you are going to see with the device. Also with and without contrast. We have heard that 50 percent of these patients had contrast.
It would be reassuring to actually just see it written down if it's subject to analysis that there are no unique issues post contrast. So in your educational material it might even -- one could even say that someone has to deal with that at the PMA stage but we should see atlases or a CD.
It may not be extensive. It might just be 10 appearances. You have some examples already in the PMA. These are the things that you can expect that you're going to see 80 percent of the time in eliminating these false positives and let's see the 10 or 15 most common variants. A radiologist would train on that in an hour. I think that is an important supplement.
DR. O'SHAUGHNESSY: Yeah. I think that's basically -- maybe I didn't explain it clearly enough, but that is basically what the manual does: it goes through examples, and then we use those demonstration cases that were chosen to give a representative range of the types of both true and false positives that you see on CAD.
DR. IBBOTT: Dr. Delgado, in your experience with the system did you and your colleagues -- I guess I should ask how long do you feel it took before you became familiar with these sorts of presentations of false positives? Did you find it a complicated process?
DR. DELGADO: No, I did not. First of all, one of the comments by Dr. Stark was in my experience in the cases that we processed from our institution many of them were CT contrast-enhanced pulmonary angiography studies. Many of them were contrast enhanced.
And we also had many cases that were for lung nodule workups in oncology patients where lesions were detected in chest x-rays. We noticed no significant difference in false positive rates based on contrast or no contrast.
DR. STARK: When you say we noticed, you're talking about an anecdote?
DR. DELGADO: True. That's my experience and that of my colleagues. As far as the false positives -- is that your question? -- recognizing artifacts or false positives, I believe that -- I mean, those are normal things that radiologists have to look at now on a daily basis.
We have artifacts that are either generated from noise or from post-operative changes, or from other technical parameters such as contrast coming into the SVC and being rather dense. I don't see a particular difference when the CAD presents a false positive mark. The radiologist's decision making upon that CAD mark is no different than for something that he might have identified himself. That's my perception of the issue.
DR. IBBOTT: Any other comments about this question before we go on to the next one? All right. We'll go on to the fourth one.
MS. BROGDON: Dr. Ibbott, could I ask the panel to go back to question No. 2, please? Part of our intention in asking this question was that the panel also address the indications for use. Do you believe as a panel that the requested indications for use are appropriate?
DR. IBBOTT: And you are referring to the published indications from the sponsor?
MS. BROGDON: That's right.
DR. CASTELLINO: Tab 1, page 1.
DR. IBBOTT: And this is being presented also in the sponsor's presentation. Are there comments about this? From a physicist's point of view it seems straightforward but perhaps that's not the appropriate -- I'm not the appropriate reader for this. It's the person who would be using the system.
DR. SOLOMON: This may be a good place to include the always and never thing that we've been talking about. I don't know if this is the appropriate place.
DR. KRUPINSKI: This is also why it would be interesting that we could have seen the difference between the classic and the not classic. Here you're talking more about classic nodules and performance based strictly on those to see if this truly is appropriate.
MR. MILLER: Just to clarify quickly, the primary analysis is based on all unanimous nodules. It was one of the sensitivity analyses that --
DR. KRUPINSKI: Right, but not all of those were classic.
MR. MILLER: That's correct.
DR. SOLOMON: The only other thing, I guess, is to possibly emphasize the fact that somebody doesn't realize that ground glass nodules would not be included in this. If I just read it kind of casually, it's a solid pulmonary nodule, I might think all nodules would be included, whereas you might want to distinguish the fact that the system is not meant for ground glass nodules or other things.
DR. KRUPINSKI: You mentioned satisfaction of search here, and I'm just wondering if there is a reverse. You are going through -- there are all these other abnormalities. You note, yeah, there's atelectasis back here. Then you go and you bring up the nodules. Has anybody looked at the possibility that you are going to get a reverse SOS, and now you're all concentrated on the nodules and you forget to report the initial findings? Has anybody looked at that?
I mean, if you're not going to give your report, you know, if you're not going to sit there and dictate before you look at the CAD, there's the possibility that now you're all wrapped up in the CAD and all of a sudden the other stuff goes out of your mind. Clinically do you see that happening?
DR. MacMAHON: Well, I haven't actually used this system, so I'm just speaking from general experience. Of course, in reading CT scans, as I think Dr. Castellino described, we go through it multiple times already.
We go through the mediastinum and personally I make notes, or my resident makes notes as we go through because it's really hard to remember all of the abnormalities in all of the areas so I take a second run or a third run and look for pulmonary nodules and abnormalities and make more notes. My instinct is that it would not be an issue.
DR. TRIPURANENI: As I read through this again, I guess now that we have Dr. Stark's comments and others, the second paragraph is an interesting paragraph. I don't want to wordsmith. That is certainly not my expertise.
If you look at the first sentence in the second paragraph, it kind of ends with the other recognized causes of suboptimal review. I'm just raising the question. Potentially a radiologist could actually game the system by looking at the indications. I'm not saying he will. The hole in the whole system is that he could actually say, "I can slack off a little bit. The system is going to pick up the nodule there." I think, once again, always and never are very important to really put on the face of it, stating it every single time. The whole system is predicated on those two.
DR. CONANT: I think there could also be further contraindications in the warnings and precautions. I mean, again, just emphasizing the always and the nevers but I'm not sure you can dictate what people actually do.
DR. STARK: But isn't it fair to say, given the combination, that they are making a claim here that it relieves you of fatigue and distraction or other recognized causes of suboptimal review? I mean, these are bold statements that are going to be used by marketing people to get radiologists to look at this.
DR. CONANT: Where does it say "relieves?"
DR. STARK: I'm sorry. It says lapses. I misconstrued it. "The chance of observational lapses by the reader due to fatigue." Well, for the next patient that the same radiologist reads after having to deal with these false positives, one could make an argument there's more risk to that next patient.
DR. CONANT: One way to deal with this is basically the second paragraph nobody really likes a lot because who wants to read about our lapses and fatigue, right? Maybe that's not necessary here if always and never is emphasized. Is that happy?
DR. STARK: I think if the FDA has our point that we are unhappy with the language, I'll leave it at that.
DR. CONANT: We don't like to be called tired and distractable.
DR. IBBOTT: Mr. Burns.
MR. BURNS: If I remember correctly earlier during your presentation, you indicated this algorithm does not work with low dose chest CT. Correct?
DR. CASTELLINO: No, I did not. I said that the clinical cases that were collected for the ROC study were all clinically indicated studies. That is, they did not contain any type of screening low-dose exam. In our test database a substantial number of the cases are, in fact, low-dose CT scans, and the system performs quite well, or equivalently well, in those. But specifically for the ROC study they just happen to be clinically indicated exams like you see in most hospital practices or outpatient practices.
MR. BURNS: Okay. So what you have in the warnings regarding the mAs levels covers that issue. Correct?
DR. CASTELLINO: Correct.
DR. IBBOTT: All right. Then let's move on again to the fourth question. I think we have an indication where we're going on this one, too. If the PMA were to be approved, please discuss whether the above or any other issues not fully addressed in the PMA (A) require post-market surveillance measures in addition to the customary medical device reporting. Several people have suggested that they would like to see additional studies done if this device were to be approved. Those of you who have called for that, would you like to elaborate?
DR. STARK: Well, as I've mentioned, I would like to actually see data. I'm not inclined to argue with the perceptions, because I think they're likely correct on low dose and contrast, but the public needs to see this. This needs to be written down somewhere so it's objective, and hopefully some statistics can be applied to it.
Artifacts due to common thoracic interventions such as excision of one of these nodules, a clip left behind, radiation damage, patients who can't put their arms over their heads. I think those are the major things that are medical in nature. I think there needs to be something negotiated with the FDA in terms of minimums.
You've already got minimum CT technology requirements, but as CT technology evolves, what would trigger a change in surveillance? It may be a different category, but if this PMA were approved, again, the technical experts at the FDA need to negotiate what minimum quantum of change in the technology would require a new PMA and review. Is it going to remain a class three device, or what would it be? Is it going to be a 510(k) application of substantial equivalence?
Again, as I alluded to earlier, I don't know what algorithm is used here and I'm not a computer scientist, but what is a trivial change to a layman may be very significant to a copyright attorney or a radiologist. If the algorithm switched entirely to being, say, a MIP or subtraction or something like that, at some point there has to be some disclosure and review, I would think, of the performance.
DR. O'SHAUGHNESSY: Can I just comment on that last point? There are very well-established guidelines that FDA has with manufacturers as to what requires review of a change. Any change in the product has to be evaluated against certain criteria, and those will be based on the approved labeling. Everything that the panel contributes here today will go into deciding what changes in the product require further review by FDA.
DR. STARK: Well, then, for the FDA's sake, I'm not aware of what those are, and they would do well to diligently merge them with some of the insights we have learned today, because certainly we've heard a lot of novel things today that are novel to everybody in this room. They are going to be novel to the people who developed those guidelines, perhaps with breast nodule detection in mind, and they may not be totally apposite here.
DR. CONANT: The things that I raised before just to summarize, and I'm not sure where they fit in preapproval or post-approval because I'm not sure if we made that decision yet but, again, it's a case-based analysis versus multiple nodules, quadrants, all that. You've heard that multiple times. A little more insight based on case-based analysis of false positives and false negatives.
I think that's really important. We've been talking a lot about the false positives, but I think the false negatives are fascinating. What happens when you've got really diffuse lung disease? One of the exclusion criteria here was greater than 10 nodules. I mean, what about someone who has -- I don't know what disease that would be, but a gazillion -- yeah, sarcoid, right. Granulomas everywhere, old TB, whatever. Where can this really be used effectively, and where does it really just fall down?
Also, your cases were all over 19 years of age. What happens in the pediatric population? You know people are going to start applying this everywhere. That just came to me recently. That has to be included, I guess, in the labeling and certainly analyzed. Whether it's pre- or post-approval -- I mean, that's what we're here for.
DR. SOLOMON: I would just add the thought of making the study more real-life, so collecting data maybe in a prospective fashion that will essentially test the system under real-life conditions. Real-life conditions for the doctor, real-life conditions of disease and everything -- a real-life test, essentially.
DR. TRIPURANENI: I would recommend the same. I think whether it's pre or post I think there needs to be a follow-up study of the patients that are going to go through this to see what is the clinical impact ultimately.
DR. IBBOTT: All right. Well, I think we're on the verge then of deciding if it's going to be a pre or a post-approval study. Unless there are other concerns that you want to address now, I suggest that we move on.
We now come to a second half-hour open public hearing session. If there are any individuals wishing to address the panel, please raise your hands and identify yourselves at this time. Seeing none, then we move on.
Before we move to the panel recommendations and vote, is there anything additional the FDA would like to address?
DR. DOYLE: Now that the panel discussion is over, we would ask the sponsors to go back to their seats, please.
DR. WAGNER: Fear not. I will not make a technical comment but since Dr. Blumenstein's position is heavily influenced by some of his statistical comments, I would just like to tell you that the issue about correlation across modalities has been addressed in the literature by a number of authors including myself and it's at the bottom of the third page of the references there.
Modalities are not a random effect but cases and readers are. The entire correlation structure is accommodated by the model here. Also the sampling scheme does sample the intra-reader variability, as I said this morning. Two out of three of your points are, in fact, addressed in the literature. Thank you.
DR. IBBOTT: And, finally, is there anything else the sponsor would like to address?
DR. O'SHAUGHNESSY: No, thank you. We appreciate the questions very much.
DR. IBBOTT: Thank you.
DR. DOYLE: All right. We will now move to the panel's recommendations concerning PMA P030012. The Medical Device Amendments to the Federal Food, Drug, and Cosmetic Act (the Act), as amended by the Safe Medical Devices Act of 1990, allow the Food and Drug Administration to obtain recommendations from an expert advisory panel on designated medical device premarket approval applications, PMAs, that are filed with the agency.
The PMA must stand on its own merits, and your recommendation must be supported by safety and effectiveness data in the application or by applicable publicly available information. Safety is defined in the Act as reasonable assurance, based on valid scientific evidence, that the probable benefits to health under conditions of intended use outweigh any probable risks.
Effectiveness is defined as reasonable assurance that in a significant portion of the population, the use of the device for its intended uses and conditions of use when labeled will provide clinically significant results.
Your recommendation options for the vote are as follows: Approvable if there are no conditions attached. Approvable with conditions. The panel may recommend that the PMA be found approvable subject to specified conditions such as physician or patient education, labeling changes, or further analysis of existing data. Prior to voting all the conditions should be discussed by the panel.
Finally, not approvable. The panel may recommend the PMA is not approvable if the data do not provide reasonable assurance that the device is safe or if a reasonable assurance has not been given that the device is effective under the conditions of use prescribed, recommended, or suggested in the proposed labeling. If the vote is for not approvable, the panel should indicate what steps the sponsor may take to make the device approvable.
DR. IBBOTT: All right.
DR. TRIPURANENI: May I ask you to read the effectiveness statement again please? I want to listen to it again.
DR. DOYLE: I would be happy to do that. Effectiveness is defined as reasonable assurance that in a significant portion of the population, the use of the device for its intended uses and conditions of use when labeled will provide clinically significant results.
DR. TRIPURANENI: Thank you.
DR. IBBOTT: Would anyone on the panel care to make a motion?
DR. BLUMENSTEIN: I move "not approvable."
DR. IBBOTT: It's been moved not approvable. Is there a second to this motion?
DR. STARK: I'll offer a second.
DR. IBBOTT: I'm sorry?
DR. STARK: I would offer a second.
DR. IBBOTT: All right. It's been moved and seconded. Is there discussion then of this motion?
DR. KRUPINSKI: Can we discuss the procedure? Do we discuss it --
DR. STARK: And then vote.
DR. IBBOTT: And then we will vote.
DR. KRUPINSKI: On that motion?
DR. IBBOTT: On this motion.
DR. KRUPINSKI: And then it takes two-thirds to --
DR. STARK: Majority.
DR. KRUPINSKI: Majority.
DR. STARK: If that motion doesn't pass, then we'll ask for another motion.
DR. CONANT: I'll say something. I think there is a lot of very rich data here. There's more data we'd like, of course, that they don't have like follow-up studies to your follow-ups. You know, what happened to the patients. But within the data that they've given us, I'm sure they can look at it by case and look at false positives, even false negatives.
I would hesitate to jump yet to not approvable without at least getting that data, which should be obtainable without IRBs and all that stuff, because you guys should have it on those spreadsheets by patient and could have a second look at that. That's where I stand with the not-approvable part.
DR. SOLOMON: I agree with what you just said. I mean, I think we're put in a difficult position here. I think all of us seem to be asking for more clinically relevant case data. It seems like something you might have, but we don't have that information right now. That's difficult when the statement for efficacy says clinically significant results, and it's hard for us without necessarily having that clinically relevant information. I think that pretty much sums up the issue right there.
DR. TRIPURANENI: As a clinician we have high-tech in radiation therapy using lots of machines and equipment to follow-up things right in there. Having practiced for more than 20 years, I have come to believe that any process you improve typically improves the patient outcomes. Sometimes I believe it's a leap of faith but I think most of the things that you do in the clinic that you improve usually improves the outcome.
I would like to believe that the fact that you can actually pick up a few more nodules will eventually translate into some sort of positive impact on patient management. I really would love to see some data. In fact, that's where I have the dilemma. I asked Mr. Doyle to repeat the effectiveness statement right there.
I think, if you follow the rule of the law right there, I have to make a real leap of faith that this is actually an improvement. My personal belief is that any improvement in the process will improve the care, so I have to really make the leap of faith to actually vote for it, but I think it's a dilemma, as Dr. Solomon said, that we're all in. I really would love to see some clinical data.
DR. KRUPINSKI: Just to be specific, I think what we're after is, on a patient basis, how many normals were converted to a false positive -- to abnormal -- and then how many false negative patients, and back and forth on each one. I mean, all possible combinations. I think that is specifically what we're looking for.
DR. STARK: If I could offer another analogy, a brief one.
DR. IBBOTT: Are you addressing the motion?
DR. STARK: Yes, I think so. I'll be brief. The issue of approving gadolinium DTPA for MR scanning of the brain was obvious, as it is here, as we've heard from statisticians and clinicians. Given the constraints of this study, it's really obvious to us that this technology likely makes things better. But when we decided to approve gadolinium, at a cost of billions of dollars, because we saw a few anecdotes where it made things better, no one had an argument that it could make things worse or less efficient. Here there are serious concerns that the marginal improvement in efficacy, which is perhaps buried in the statistics, is offset by a much more obvious risk to the patients. Forgive me if that's not on the point of the motion, but I think the panel has done a lot of soul searching, and that's the reason why I think we have hesitated -- my hesitation.
DR. IBBOTT: It seems to me that this device provides information that is not available otherwise, and more information is usually better. I share your concern to some extent -- certainly not to the degree that you do -- that people may misuse the device or take advantage of it to relax their own vigilance. I think the sponsor can address that.
Yes, Dr. Ferguson.
DR. FERGUSON: It seems to me that the company has followed very carefully the suggestions of the FDA and I applaud them for that. I don't think that we should necessarily penalize them for that unless that's the will of the group here because we are advisers only to the FDA. I would side with those who think that more information is required and I think it's been outlined very, very well what that information should be but I don't think -- I would not vote for nonapprovable.
DR. IBBOTT: Is there anymore discussion before we prepare for a vote?
MS. BROGDON: Dr. Mehta?
DR. MEHTA: Yes, I'm here. I can hear the conversation.
MS. BROGDON: Do you have a comment?
DR. MEHTA: No, actually I don't have anything to add at this point.
DR. IBBOTT: All right. Well, in that case we will proceed to the vote.
MS. BROGDON: Dr. Mehta can vote if he wishes.
DR. MEHTA: Actually, I'm uncomfortable voting because quite a bit of the time it was breaking up and I feel it would do an injustice to the sponsor for me to vote if I've not heard everything clearly.
DR. IBBOTT: All right. Fair enough.
DR. SOLOMON: Can I ask one question? As far as the categories go if the nonapprovable and the approvable with conditions, where would going back to your data and coming up with some of this clinical evidence that we're asking for fall into?
DR. IBBOTT: Well, at the moment we are voting on a motion to declare this application not approvable. If that motion passes, then that's the end of the discussion here.
DR. DOYLE: But I think Dr. Solomon's question is where would reanalysis of existing data fall?
DR. STARK: Yes, that was the question based on your definition.
DR. DOYLE: That could be part of approvable with conditions. That comes under that definition.
DR. STARK: Would non-approvable also invite the manufacturer to resubmit answering the same questions? This doesn't go away forever.
DR. DOYLE: No. In fact, if that were the case, we would ask each one of you to recommend what you think the sponsor should do to make the device approvable.
DR. MOORE: Can I make a point? I would also second Dr. Conant's point that a lot of the data that's being asked for by the panel is in the data that the sponsor has available. I think that really should be taken into consideration, particularly if we're thinking about additional studies here, whether post-market or pre-market. Obviously if it was non-approvable that would be pre-market. We really need to think about the reasonableness and what it would take for a sponsor to do that.
I think the company has worked very well with FDA in trying to identify what is appropriate. I think it's not only FDA that's worked on that but sort of the industry -- what's appropriate for evaluating this. I think that needs to be taken into consideration.
DR. IBBOTT: All right. We will proceed to the vote then and I'll remind you that the motion is not approvable. I'll ask you to state whether you vote yes which means that you are in favor of declaring not approvable, or no in which case you disagree with the motion and would consider a different motion, or abstain. We note that Dr. Mehta has abstained. Dr. Krupinski, I would like to start with you.
DR. KRUPINSKI: No.
DR. IBBOTT: No. Thank you. Dr. Conant.
DR. CONANT: No.
DR. IBBOTT: Thank you.
DR. FERGUSON: No.
DR. IBBOTT: Dr. Solomon.
DR. SOLOMON: No.
DR. IBBOTT: Dr. Blumenstein.
DR. BLUMENSTEIN: Yes.
DR. IBBOTT: Dr. Tripuraneni.
DR. TRIPURANENI: No.
DR. IBBOTT: Dr. Stark.
DR. STARK: Yes.
DR. IBBOTT: All right. Well, we have two in favor, five opposed, and one abstention. This motion does not carry. We now come back to entertaining another motion. I would like to ask if someone on the panel would like to make a motion.
DR. KRUPINSKI: Approve with conditions.
DR. IBBOTT: The motion is approve with conditions. Is there a second?
DR. FERGUSON: Second.
DR. IBBOTT: Dr. Ferguson. Now, we've had quite a bit of discussion, but perhaps either Dr. Krupinski or Dr. Ferguson would like to speak to the motion. I'm sorry. The next step is to establish the conditions.
DR. KRUPINSKI: One at a time?
DR. IBBOTT: One at a time, yes.
DR. KRUPINSKI: One condition would be for the post-analysis of the by-patient data.
DR. IBBOTT: And each condition requires a second.
DR. CONANT: Second.
DR. IBBOTT: Dr. Conant seconded. Now, is there discussion about this condition that would be attached to a motion to approve with conditions?
DR. STARK: My question is does the motion imply or should we specify that we are saying that's a condition where the FDA must be satisfied before the product is permitted to be marketed?
DR. IBBOTT: That is the meaning of conditions, that it is approvable once the conditions are satisfied.
DR. STARK: Okay. And approvable means it would be subject to FDA approval?
DR. IBBOTT: That's right. We're making a recommendation to the FDA which they then consider.
DR. TRIPURANENI: Dr. Krupinski, could you elaborate the condition? I didn't understand. I'm sorry.
DR. KRUPINSKI: Basically, what we want instead of the ROC analysis based on the quadrants is to say, okay, here is a patient who is classified as normal. How many times did the radiologist call that normal and then, because of the CAD, call it a false positive? And vice versa: where they initially called it a false positive, did the CAD make them now call it a true negative?
Then, how many patients -- no matter how many nodules they had -- did the radiologist call a false negative that the CAD correctly turned into a true positive? And vice versa, how many times did the radiologist call it a true positive and the CAD made them reverse their patient decision and call it a false negative?
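The by-patient reanalysis described here amounts to tallying every pre-CAD to post-CAD decision transition against the ground truth. A minimal sketch of that tally follows; the patient data, the "normal"/"abnormal" labels, and the `label` helper are hypothetical, for illustration only, not the sponsor's actual analysis.

```python
from collections import Counter

# Hypothetical per-patient reads: (truth, call_without_CAD, call_with_CAD).
# Each entry is one patient; values are "normal" or "abnormal".
reads = [
    ("normal",   "normal",   "normal"),    # stays a true negative
    ("normal",   "normal",   "abnormal"),  # CAD induces a false positive
    ("abnormal", "normal",   "abnormal"),  # CAD rescues a false negative
    ("abnormal", "abnormal", "normal"),    # CAD reverses a true positive
]

def label(truth, call):
    """Classify one patient-level call against ground truth."""
    if call == "abnormal":
        return "TP" if truth == "abnormal" else "FP"
    return "TN" if truth == "normal" else "FN"

# Count every (pre-CAD -> post-CAD) transition, e.g. "FN->TP".
transitions = Counter(
    f"{label(t, pre)}->{label(t, post)}" for t, pre, post in reads
)
```

A table of these counts covers all the combinations requested: the harmful transitions (TN->FP, TP->FN) can then be weighed directly against the beneficial ones (FN->TP, FP->TN).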
DR. TRIPURANENI: Are you asking for a post-marketing analysis or a pre-market analysis?
DR. KRUPINSKI: No, re-analysis of the existing data.
DR. TRIPURANENI: Okay. Thank you.
DR. STARK: Is it also implied that the FDA -- that's a specific question, but I think it is implied -- certainly not to the exclusion, I would think, of the many other questions the FDA might have based on our discussion today. Or should we add our own conditions and try to broaden that? I think so many things have been raised here today.
I'm so impressed personally with the qualifications of the FDA staff, the clinical staff, Dr. Sacks, the statisticians, that I would want to give them broad discretion and encourage them -- in fact, insist -- that in addition to answering your question they address many of the other issues that they will see fit to recognize in the transcripts of this proceeding.
DR. KRUPINSKI: I'm not sure how broad each condition has to be.
DR. DOYLE: There's no requirement either way. Keep in mind that the FDA will interpret these conditions so that you can state them in broad terms and we certainly will work with the sponsor to refine them to specific actions. You don't have to spend a lot of time wordsmithing these conditions is what I'm basically saying.
DR. IBBOTT: Dr. Blumenstein.
DR. BLUMENSTEIN: Let me have clarification here. Are we talking about conditions prior to approval or post-approval conditions? I'm a little confused about that.
DR. IBBOTT: These are conditions prior to approval.
MS. BROGDON: If you have post-approval conditions you want to include here, then you should.
DR. IBBOTT: Thank you.
DR. KRUPINSKI: So those would be like follow-up on new patients. That would be a post-approval?
DR. IBBOTT: A post-approval condition for approval.
MS. BROGDON: I'm sorry. I didn't understand your question.
DR. IBBOTT: If we impose conditions that cannot be met until after the device is marketed, then how can that be a condition for approval? Or is it a recommendation at that point?
MS. BROGDON: These are all recommendations. If some of them are about post-approval data, then just identify them as such and we'll know how to sort them out.
DR. IBBOTT: Thank you.
MS. BROGDON: If you have things that you are specifically looking for, you ought to name them in your conditions.
DR. IBBOTT: Good.
DR. CONANT: I think there are things that are pre-approval conditions before we get to post-approval.
DR. IBBOTT: Let's deal with them one at a time.
DR. DOYLE: Let's try and deal with this one condition.
DR. IBBOTT: By the way, we need to vote on each condition so before you --
DR. CONANT: I seconded hers, didn't I?
DR. IBBOTT: Yes. And are you speaking to that condition?
DR. CONANT: No.
DR. IBBOTT: Let's vote to make sure we're in agreement to attach this condition and then we'll come back and add more conditions. Is there any other discussion about this condition? Then let's ask Dr. Mehta again if he wishes to vote on these conditions.
MS. BROGDON: Dr. Mehta, do you wish to vote on any of the conditions?
DR. MEHTA: I think I'm going to abstain on that as well.
DR. IBBOTT: All right.
DR. SOLOMON: The only other thing on her condition is to -- I mean, it was a very broad statement. Obviously the implication is that the statistics remain favorable on the case analysis. I mean, it's implied.
DR. IBBOTT: Good point. Yes.
DR. KRUPINSKI: Yes.
DR. IBBOTT: Thank you. Dr. Conant.
DR. CONANT: Yes.
DR. FERGUSON: Yes.
DR. SOLOMON: Yes.
DR. BLUMENSTEIN: Yes.
DR. TRIPURANENI: Yes.
DR. STARK: Yes.
DR. IBBOTT: All right. Unanimously in favor of that condition.
Now, at this point, Dr. Conant, you could introduce another condition.
DR. CONANT: Always and never. Labeling issues. I think everybody agrees on that -- to clarify the labeling, addressing the many issues we discussed.
DR. IBBOTT: Is there a second?
DR. KRUPINSKI: Second.
DR. IBBOTT: It's been seconded. Do you want to elaborate on just how you would like them to do that?
DR. CONANT: Nobody really liked the second paragraph about fatigue and lapses and to really emphasize this always and never and to have the radiologist be ethical and moral and all those good things. And to really downplay the issues of statistical significance, to try to lay off that if possible.
I think even right now, the efficiency issues -- we don't really know that, or we haven't quantitated that, so I wouldn't go there either. Not even soft-pedal it; I wouldn't go there. I'm sure other people have other things to include in that condition.
DR. IBBOTT: Dr. Krupinski.
DR. KRUPINSKI: I think we should maybe consider the possibility of adding the always/never to the software. Not only are you trained on it but, say, every 20th case -- because you can keep track of who logs in -- the reminder comes up, so it's made a part of their conscientiousness and you don't just have it in that initial three-hour training session. No one is going to read the manual; we know that. So if it's not in the initial three hours, it comes up later as a reminder.
DR. IBBOTT: Any other comments regarding this condition? All right. Then I think we are ready to vote on this one.
DR. KRUPINSKI: Yes.
DR. CONANT: Yes.
DR. FERGUSON: Yes.
DR. SOLOMON: Yes.
DR. BLUMENSTEIN: Yes.
DR. TRIPURANENI: Yes.
DR. STARK: Yes.
DR. IBBOTT: Unanimously in favor again. Then we'll -- oh, I'm sorry. Dr. Mehta. He's abstaining from all these, we think. One abstention.
Would someone like to entertain another condition?
DR. FERGUSON: The issue of formalized training for those that are going to use the device. I like the idea of a CD-ROM. I don't have to spell those out; everybody knows what those would be. Most of the panel feels that it's appropriate to spell out a time. I don't think it's necessary for this device, personally.
DR. IBBOTT: Are you suggesting that the condition mandate training when the device is sold?
DR. FERGUSON: Yes, I am.
DR. IBBOTT: Is there a second?
DR. KRUPINSKI: Second.
DR. IBBOTT: Dr. Krupinski. Anymore discussion about this condition?
DR. TRIPURANENI: Could you elaborate, Dr. Ferguson, what exactly in broad context. You want the technicians to be trained and you want a CD-ROM to be given with some cases of false positives, false negatives?
DR. FERGUSON: Yes. I think we've talked about all of those things before. I can't remember all of them or elaborate on them, but I think they have a clear idea of what we need to have, rather than somebody buying the instrument and putting it in. I think we need a little more than just having a technician, if you will, or even an M.D. -- I don't know what level this person is -- go in for two or three hours to train. This will be protective for you as well as the patients.
DR. IBBOTT: I'd like to comment. I also support this, and I would like to see the sponsor consider some sort of remote review. This is digital data with DICOM; there probably are mechanisms by which a review could be done, sort of looking over the shoulder but from a distance, so that the training session wouldn't be restricted to the time that the company's representative is on site.
Any other comments? Okay. Then we'll vote on this motion. Dr. Krupinski, we'll start with you again.
DR. KRUPINSKI: Yes.
DR. CONANT: Yes.
DR. FERGUSON: Yes.
DR. SOLOMON: Yes.
DR. BLUMENSTEIN: Yes.
DR. TRIPURANENI: Yes.
DR. STARK: Yes.
DR. IBBOTT: One abstention and the remaining all vote yes. All right. Are there other conditions?
DR. TRIPURANENI: I'd like to propose a post-marketing surveillance. The reason for that is I think, with the number of patients that they have, even though they are going to do the pre-marketing analysis of the data, I'm afraid we may not have enough patients to really tell us what is going on there. They looked at the quadrants and the number of nodules increased and all those things. When you look at live human beings and the clinical impact, the significance is going to change. I think it's going to be really small.
I would like to propose that we give a broad description to the FDA to come up with, in their best judgment, a post-marketing surveillance where they can actually track whether it really has clinical significance.
DR. KRUPINSKI: Second.
DR. IBBOTT: Thank you. Any discussion?
DR. CONANT: I think this is part of this. I'm interested in the impact of the CAD and other disease detection. I don't quite know how to do this so I would want panel members to help with this. For example, ground glass opacities and things like that. I wonder if this might not impact one's detection of some of these other things.
Again, it's broadening the population, and I would recommend that they do a study with less strict criteria, looking at a more prospective group and analyzing the impact of the CAD on the interpretation. You would have to look at the interpretation of all diseases before application of the CAD and look at it after. I don't know if that is of interest to anyone else.
DR. SOLOMON: I think that is essentially what the post-market study would be is to look at any changes that come about as a result of the CAD usage.
DR. CONANT: Very general, right?
DR. KRUPINSKI: Not just on nodules but other things as well.
DR. CONANT: Yeah, like mediastinal adenopathy. It's that distraction aspect I think someone brought up earlier.
DR. IBBOTT: It would be difficult for us to design a useful study in the next 10 minutes.
DR. STARK: But there is a potential condition of approval: to limit its approval to patients like those studied. Perhaps data can be shown to the FDA so it could be approvable for use with contrast media; we've heard that's possible and we haven't voiced any objections to that. But conditional approval that it not be applied to patients with obvious artifacts or other lung disease such as ground glass nodules or pneumonia. It hasn't been studied in children, and I don't know if we're obligated to point that out and ask for that.
DR. CONANT: They did have other diseases in their first group but they didn't look at how the -- there were others, emphysema, ground glass, post-op, all that stuff. I'm not sure you can restrict it.
DR. STARK: Have they shown us enough that they can market to all comers or is it a condition of approval that this would be marketed to all?
DR. IBBOTT: This would be a new condition.
DR. STARK: Either an amendment to the existing motion or a new one.
DR. IBBOTT: The motion is for a post-marketing study which would certainly address the issues that you've mentioned.
DR. STARK: I didn't realize we had moved to the --
DR. IBBOTT: Yes. This motion we are discussing now is for a post-marketing study. Surveillance.
DR. MOORE: Just to make a point of clarification to Dr. Stark's comments, I think in the company's labeling they have made it very clear that there are certain type of abnormalities that are not appropriate for this device. I think some of the labeling already takes into consideration some of the points that you've raised.
DR. IBBOTT: Let's come back to the discussion on the post-marketing surveillance. Then, if necessary, we'll discuss the labeling again. Further discussion? If not, let's vote on this motion for post-marketing surveillance.
DR. KRUPINSKI: Yes.
DR. CONANT: Yes.
DR. FERGUSON: Yes.
DR. SOLOMON: Yes.
DR. BLUMENSTEIN: Yes.
DR. TRIPURANENI: Yes.
DR. STARK: Yes.
DR. IBBOTT: And one abstention. All right.
DR. STARK: I'm sorry if I missed the boat. I didn't realize we had closed the window and moved on.
DR. IBBOTT: I don't think we've closed any windows. We jumped to a motion to attach a condition or recommendation for post-approval surveillance but I don't think that prevents us from considering more conditions to approval.
DR. STARK: Well, if I can, to catch up, I've jotted down three to consider. All of these are, of course, subject to the FDA staff's decision.
DR. DOYLE: Hopefully one at a time.
DR. STARK: Yes. I would suggest that until it has been proved otherwise -- which means in the current condition it hasn't been proved -- there be no claims, expressed or implied, of clinical significance, and no use of the term significance.
I'm not just talking about lawyering this but about the spirit as well as the letter of this recommendation: no significance or the like except, as I discussed before, in the very narrow reference to ROC statistics, and even then with some type of explicit disclaimer that it was in a nonclinical setting.
The only significant things we've seen are statistics in a nonclinical setting, and those have helped assure us of the safety and efficacy, but I don't think that should lead clinical radiologists to have to juggle claims of significance.
DR. SOLOMON: Do you see that being dependent upon the results of this clinical analysis that we're talking about?
DR. STARK: I don't think so. I would not say that satisfying anything that we have made as a condition would release them from this condition, but if the FDA finds additional data have established that this is clinically significant, then I would say the FDA should be free to waive that condition as a separate condition.
DR. IBBOTT: All right. This is a condition you would place on the labeling that the manufacturer must meet for approval.
DR. STARK: Yes.
DR. IBBOTT: Are your other -- you mentioned that you had three items. Do they also address the labeling?
DR. STARK: They are labeling, yes.
DR. IBBOTT: So perhaps we could group them together?
DR. STARK: Well, they might fail one at a time.
DR. IBBOTT: Then let's get a second on this one.
DR. KRUPINSKI: Can I ask is labeling the same as advertising?
DR. DOYLE: It comes under labeling.
DR. KRUPINSKI: It is? Okay.
DR. IBBOTT: Is there a second?
DR. FERGUSON: Second.
DR. IBBOTT: All right, Dr. Ferguson. Okay. Any further discussion about this? This would be another condition placed on approval to presumably modify the labeling -- existing labeling and certainly when designing any new labeling to avoid claims of clinical significance.
DR. CONANT: I'm not quite sure we can do that yet. I want to see their data first. I think that could come later but I don't want to close the door on their data so I would be hesitant to vote yes. I'm sorry.
DR. STARK: I'm just saying if there is no more data or if the FDA finds that data insufficient.
DR. CONANT: Yes, sure. I trust that the FDA will do that but I'm not sure -- yeah, it's kind of a condition on a condition. It's sort of one step at a time. I think we have asked a big condition of looking at the data and that may all not show any kind of significance, clinical or other that we are asking for and then that becomes obvious. I don't get that really.
DR. IBBOTT: Dr. Blumenstein.
DR. BLUMENSTEIN: I'm going to vote no on this because I feel that I trust the FDA to deal with that given that we have a preapproval condition for clinical data.
DR. IBBOTT: Any further discussion?
DR. CONANT: One other. Sorry. David, in spirit I agree very much with what you're saying, but we already voted yes on a condition on labeling saying they had to take that stuff out. We did that a couple steps ago. I think we have suggested that we really feel this is important by voting on that. And then, again, the FDA is going to take it from there.
DR. IBBOTT: I think I feel the same way that we have asked them to do some more analyses of the existing data. The FDA may determine that detracts from the significance.
DR. STARK: I'd be happy to withdraw the motion if there is a consensus, or we better take a vote.
DR. IBBOTT: I think we can just go ahead and vote if that's all right. I should ask, though, is there anymore discussion before we vote? Dr. Krupinski?
DR. KRUPINSKI: No.
DR. CONANT: No.
DR. FERGUSON: Yes.
DR. SOLOMON: No.
DR. BLUMENSTEIN: No.
DR. TRIPURANENI: No.
DR. STARK: Yes.
DR. IBBOTT: There were two yeses and five nos and one abstention. So that motion is defeated. Are there motions for other conditions to attach to the approval.
DR. STARK: I have two more and I'll be brief.
DR. IBBOTT: Sorry.
DR. STARK: That's okay. I would ask that it be added to the label something to the effect or spirit of the following words. "Careful rereading or second reading may be equally or more safe and effective in a clinical setting."
DR. DOYLE: Could you say that again?
DR. STARK: "Careful rereading or second reading may be equally or more safe and effective than a computed second reading in a clinical setting."
DR. IBBOTT: Is there a second for this motion?
DR. FERGUSON: Is that a directive to the radiologists rather than the instrument?
DR. STARK: It's a directive for -- I'm intending it, and forgive me for exploring this, but what a radiologist faced with purchasing this or using it will be told. I am proposing that he should be told that if he simply reread the scan himself or had a colleague double read it, that actually might be more efficient and safe than this product.
DR. KRUPINSKI: But you don't have any data to support your contention.
DR. STARK: That's why I said may be. They don't have any data to support theirs. I'm trying. I've only got one more.
DR. TRIPURANENI: I have difficulty with this.
DR. IBBOTT: We're looking for a second.
DR. STARK: If I don't have a second it goes. We'll move on.
DR. IBBOTT: No seconds. All right.
DR. STARK: Last, it's the same family. I'm just probing this boundary between nonapprovable and approvable with conditions. Not demonstrated safe or effective until there's data in patients with artifacts, concomitant lung disease, contrast media use, or pediatric populations.
DR. KRUPINSKI: Doesn't this come under the post-surveillance type stuff that we were asking for?
DR. STARK: I thought labeling. Condition of the labeling.
DR. CONANT: I think we are asking again for the data to be analyzed and included in that by case is looking at -- I mean, there were cases with artifacts and things like that. I think that is part of what the false negative and false positive analysis is going to provide us with. Again, it's a limited case set but depending on what that shows, the next set may be --
DR. STARK: If it is understandable to the FDA that we are assuming they are going to check this, I'm saying that we haven't seen these data and I was asking as a condition that the FDA ask to see it. I was just making that a motion. I mean, I know we can assume that they'll do this anyway.
I'm just trying to make it a specific direction. Of course, this is all advice and they can ignore all of this but if there is a consensus that they should do this, then that is, I think, the purpose of the motion I'm making which is to ask them to.
DR. IBBOTT: Go ahead.
DR. CONANT: Could it be that we could put this in the first condition which was the first preapproval condition that was to go back and look at these cases and we talked about by-case compared to by-nodule and quadrant, etc. Do you want to step back and beef that one up a little bit?
DR. SOLOMON: I think procedurally that will be a problem.
DR. CONANT: We can't do that? Okay.
DR. IBBOTT: We can address this motion with the understanding that, in fact, that is what will happen. We can deal with this motion independently at the first.
DR. CONANT: Could you reword your motion or could you restate it again? I didn't mean reword it. Just say it again.
DR. STARK: Yeah, and certainly someone -- I think all of these we are understanding that we haven't wordsmithed these. I'm simply suggesting that until the FDA sees data, which we hope is available, it should be a condition of premarket approval that the product will be labeled as not demonstrated safe or effective, or safe and effective with the use of contrast media in the presence of artifacts or concomitant lung disease or in pediatric patients.
DR. KRUPINSKI: From a nonclinician --
DR. FERGUSON: It's totally unexplored. I don't think we can suggest that the FDA look at these because I don't think we can put that into a formal motion because those things are unexplored as far as I know.
DR. CONANT: I think you're saying that the labeling should read this, but the point is, if it doesn't get approved and it doesn't follow this condition that we first said about reanalyzing the data, there's no labeling here because it's not going anywhere. You're already jumping to labeling based on the data. It's kind of contradictory.
DR. STARK: I am suggesting that if it is approved based on whatever, but we see no data on contrast media, artifacts, pediatrics, or lung disease that the labeling contain these restrictions.
DR. CONANT: If we see no data on those things.
DR. STARK: If the FDA is not satisfied with the data which includes not seeing any further data.
DR. CONANT: Okay.
DR. IBBOTT: The sponsor has indicated that their data do include cases with contrast and cases with artifacts. Are you suggesting that when they do the reanalysis that we've already asked them to do that they also pay attention or conduct an analysis to look specifically at the impact of artifacts or with versus without contrast?
DR. STARK: Yes. I'm saying that they say they have data that we haven't seen, and that we are offering them a choice: either satisfy the FDA that when they offer statistics on their data it's convincing and the labeling shouldn't apply, or simply say we can market it, and market it with the warning that if your patient has artifacts we haven't demonstrated safety and efficiency -- sorry, safety and efficacy.
DR. IBBOTT: So, yes. I'm not going to try and rephrase your motion but I believe that you're asking that the reanalysis we've asked them to do contain those elements to look at artifacts, contrasts. There were no pediatric patients so we won't include that.
Then depending on the results the labeling should be modified to indicate that the device is not appropriate for pediatric patients. For example, if the data don't support its use in pediatric patients. Is that right?
DR. STARK: That's correct.
DR. DOYLE: We need a second.
DR. IBBOTT: Yes. We need a second.
DR. KRUPINSKI: The pediatric issue, I just talked to a clinician, could be significant. I mean, if the CAD --
DR. CONANT: I'm not really sure about this. I haven't looked at a pediatric chest -- well, actually I do on the weekends.
DR. IBBOTT: No one will ever know.
DR. CONANT: It won't get out of this room. Obviously kids were not analyzed. It was 19 and above so obviously pediatrics should be a contraindication. That should be included in the labeling. I think we all agree about that definitely.
DR. STARK: I think if we don't make a motion it's not obvious at all because I could wear the other hat.
DR. CONANT: I think I brought that up earlier when I said that has to be one of the things that we address with looking back over the data. At least, I'm sorry if I didn't. I don't remember what the transcript was but that's got to be something in the label and it's not in the contraindication line. The artifacts we could talk about as a motion. That sounds like a good idea. Maybe separate out from the artifacts and other things.
DR. STARK: I don't know where we are in pediatrics. Do we need a separate motion? Are you suggesting that I bifurcate this already complicated thing? I'm just trying to point the FDA to satisfy yourself on these things or exclude them.
DR. CONANT: I think there's a difference here of what there may be data on versus what there isn't a chance in hell they are going to be able to analyze because there's no babies or kids. I think it is different. I think it's two separate issues so I would say separate it.
DR. STARK: If you don't mind, why don't you make the motion on the pediatrics.
DR. CONANT: Contraindication, no. Is it 19 and over? Eighteen. Sorry. No one under 18 should be analyzed with this.
DR. STARK: I'll amend my motion by dropping the word pediatrics. We can deal with that then and then you can have --
DR. IBBOTT: You've withdrawn. It wasn't seconded so that motion is withdrawn.
DR. STARK: I think we are still discussing it. I would like to say that until the FDA is satisfied, from the existing data set or some other data set -- I'm suggesting it as a restriction because we haven't seen the data here -- it be a condition that it be marked not demonstrated safe or effective in patients with concomitant lung disease -- known lung disease -- scanning artifacts, or with contrast media. Again, we know they have data on contrast media. I hope it will convince the FDA, but I'm asking that we require that.
DR. CONANT: Should we put pediatric under 18 first?
DR. STARK: I eliminated that from my motion hoping that you would carry forward with yours afterwards.
DR. CONANT: Okay.
DR. IBBOTT: You're seconding his motion?
DR. CONANT: No. He told me to do it independently so I just did that.
DR. IBBOTT: We need a second for the motion he just made.
DR. STARK: I think I'm trying to bargain with you.
DR. IBBOTT: You guys have to decide.
DR. CONANT: I think that is still part of the one we already passed where we've asked for further analysis of the existing data. I think we have already covered that. That's why I'm not seconding it because I think we are already asking them. I mean, if you reanalyze the data and they find that they can't support what you want, then yours is a condition on the condition that they don't find it. But if they do the analysis and find it, then your condition isn't needed.
DR. STARK: I think it's sufficiently likely that they are not going to have statistically convincing data on artifacts or post-op patients or patients with pneumonia. I am trying to attach a condition that will help the FDA simply say put in the label you should be careful and not use it in these patients because it's unproved. I believe they have data on contrast media but I'm lumping all of them in the same category.
I'm saying these are identifiable important subsets just like the pediatrics issue. I'm simply saying specifically look at the analysis for these things and assuming that there is not satisfaction here in some of them, please label the product appropriately.
DR. IBBOTT: Dr. Tripuraneni.
DR. TRIPURANENI: I think there are lumpers and splitters. I'm a lumper. I think FDA is hearing what we are saying and I think rather than go down to the final nitpicking and actually spell out everything, I would rather leave it to the broad discretion of the FDA to decide the best in their best judgment. I really don't support this element.
DR. IBBOTT: We don't have a second yet. Is anyone willing to second the motion? All right. Does someone want to make the other motion regarding pediatric patients?
MS. BROGDON: May I make a comment first? I just wanted to describe how we treat contraindications. We use the term contraindication to mean something you shouldn't do because there are data that say you must not do that. There must have been some sort of demonstration of harm. Short of that, there are warnings and there are cautions and other things that you can say in the labeling that don't reach the level of contraindication.
DR. IBBOTT: Good distinction. Thank you.
DR. CONANT: Maybe this is a post-marketing study. They've got to apply it to kids. I don't know if that -- maybe it's just a warning saying there is no data to support this use under 18.
DR. KRUPINSKI: Somewhere it has to be stated or brought out in the manual or in the warnings or somewhere there is obviously not a contraindication but it should be there somewhere.
DR. STARK: The rocket scientist in me says that why are children different than adults and it's probably going to work. But as a human, as a parent, I have a hard time saying these are just small adults. On the other hand, the admonitions of lumping and leaving it to the FDA, this is all on record, I've spoken. My conscience is satisfied. I'm going to leave it to someone else to make a motion.
DR. IBBOTT: This is certainly something that could be included in a recommendation for a post-market study and I think we have probably done that or implied that.
Any other conditions people would like to attach?
DR. CONANT: Have we figured out the pediatric one?
DR. IBBOTT: We have not. I have made the assumption that the sponsor and the FDA understand from the discussion that a post-market study would include pediatric patients.
MS. BROGDON: I'm advised that since the sponsor has not indicated that it is -- could be used in pediatric patients, FDA would in most circumstances include some sort of statement in the labeling that it has not been studied and it is not intended for use in children.
DR. CONANT: There you go. I'll second that motion.
DR. STARK: I say yes.
DR. IBBOTT: The relief is palpable. I think unless there are other motions for conditions, we are ready to vote on the main motion which is for approval with conditions, the conditions being those we've just discussed.
DR. DOYLE: The ones that were seconded and approved.
DR. IBBOTT: That's right. So we do have the motion and so unless there is any further discussion on the main motion, we'll proceed to a vote on the motion to approve with -- as approvable with conditions.
DR. KRUPINSKI: Yes.
DR. CONANT: Yes.
DR. FERGUSON: Yes.
DR. SOLOMON: Yes.
DR. BLUMENSTEIN: Yes.
DR. TRIPURANENI: Yes.
DR. STARK: Yes.
DR. IBBOTT: And with Dr. Mehta's abstention the rest of the votes are all in favor so that motion carries. We have declared this approvable with conditions and we've approved a number of conditions. At this point we go around the room and ask the voting members to explain the reasons for their vote. Dr. Krupinski, again, we'll start with you and ask you to identify the reason for your vote on the decision as approvable with conditions and also on the recommendations. You can probably summarize your reasoning.
DR. KRUPINSKI: Why doesn't somebody else start because, I mean, it seems like I would just say the entire conversation we just had all over again. I agreed with all the changes or the conditions that we brought up. I think they satisfied the questions we had throughout the day and so I voted yes.
DR. IBBOTT: I think that's fine.
DR. CONANT: That's basically the same with me. I'm just concerned about how the statistics -- how the analysis will differ with case-based versus actionable nodules and quadrants. I, again, applaud you all for the beautiful study you have done and answering the questions given to you by the FDA.
I hope you have this data to show us because I think this could be a wonderful tool. As these things go they only get better over time. I think it really could have benefit to patients. But I really need that data.
DR. IBBOTT: Dr. Ferguson.
DR. FERGUSON: I agree with everything she said.
DR. IBBOTT: Dr. Solomon.
DR. SOLOMON: I think you should be applauded for dealing with the problem that is an important clinical problem. I think there are two issues that the panel is charged with. The first one being safety. I think the issues of always and never are the issues on safety and I think there are ways you can address these and we have discussed those today.
The second issue that we are charged with is efficacy. I think the key word there is clinical efficacy and I'm not sure we were able to see exactly the clinical efficacy with the way the data was cut up and divided so that we think that if you were to look at it again with that in mind, it might be able to get through to the FDA.
DR. IBBOTT: Dr. Blumenstein.
DR. BLUMENSTEIN: I was disappointed that the sponsor did not come forward with a clinical analysis, and I'm also disappointed that the FDA didn't require that of them, especially since our criteria before approval for efficacy have clinical efficacy mentioned in them. I'm also discomforted by the unique properties of this study design that may lead to inaccurate assessment of the ROC methodology.
DR. IBBOTT: Dr. Tripuraneni.
DR. TRIPURANENI: I would like to congratulate R2 for actually coming up with this new concept. You are a pioneer in the CAD and it's good and bad. It's bad that being the first one we are going to hold you to a higher standard because we have ideas about what is right and what is wrong. Somebody else that is going to come after you their life would be a lot easier because they are going to learn from your mistakes. On the other hand, I think you have done a very good job on this.
I personally think actually any improvement in the process actually will ultimately lead to the improvement in care. I think it's important actually that we continue to pursue to improve the processes that ultimately improve the care. That is the reason why I think we attach those amendments and I firmly believe it will make a positive impact on the patients. That is the reason why I vote yes with amendments.
DR. IBBOTT: Dr. Stark.
DR. TRIPURANENI: Can I just add one thing? I really would like to see FDA asking for clinical efficacy because I participated in the Cardiovascular Devices Panel, as I say, participated on the other side of the table a couple of times, and they kept asking where is the clinical data, where is the clinical efficacy. I would ask the sponsors to give us some clinical data when appropriate.
DR. IBBOTT: Dr. Stark.
DR. STARK: Well, first, as lead clinical reviewer I would like to thank everybody on the panel, everybody in the audience, especially R2 for listening carefully and responding to my many adversarial comments. I think that was part of my role here today to be both the adversary as well as one of the voting judges. I thank the chair. It's been a very efficient, respectful proceeding.
Having said that, I, again, agree with Dr. Blumenstein's assessment as a statistician. I note that both the lead reviewers had a viewpoint strongly held that was overridden by the rest of the committee, and I can now step back and agree with Dr. Conant, who has emphasized, and those reading the transcript would not have seen her facial expression and the movement of her fist in terms of emphasizing, that we are now relying on the FDA staff to continue diligently what they have already said is a nearly overwhelming task. Not just for their manpower and resources but for their range of skills. I think this committee and the people in this room, and a larger group, I believe, needed to address this again to relook at these data, but I accept that I have been outvoted and we will now rely on what is clearly a very competent, energized and well-supported FDA staff to essentially accomplish the same thing that Dr. Blumenstein and I were pushing for, but as Dr. Conant and the majority have voted. Thank you.
DR. IBBOTT: Thank you. I would like now to ask the nonvoting representatives to comment on the recommendations that have been made. Ms. Moore.
DR. MOORE: Although I did not vote, I think I would have been in agreement with the panel on recommending this for approval. I think that any improvement in our ability to detect nodules that are not being detected is an important step forward and I commend R2 on their efforts and view of the data and trying to move this technology forward.
DR. IBBOTT: Mr. Burns.
MR. BURNS: The conditions satisfy the concerns that I had regarding the study size and the data set and the small change in the area under the ROC. I think by analyzing the data we will see if there is some better significance with the data.
DR. IBBOTT: Good. Thank you. I would like to just give Dr. Mehta a chance to make any comments he might have.
Dr. Mehta, do you have any comments?
DR. MEHTA: No. I think I just want to thank Geoff Ibbott for doing an excellent job of running the meeting. Although I didn't hear all the proceedings, I think I heard enough to concur with what actually happened. Thank you, everybody.
DR. IBBOTT: Thank you, Dr. Mehta.
DR. DOYLE: Before we adjourn for the day, I would like to remind the panel members that they are required to return all the materials that were sent pertaining to the PMA itself. Materials you have with you may be left at your table and any others should be sent back to me at the FDA as soon as possible.
DR. IBBOTT: Thank you. Finally, I would like to thank the speakers and the members of the panel for their preparation and participation in this meeting. I would like to especially thank Drs. Stark and Blumenstein for serving as lead reviewers for the panel and doing an excellent job of summarizing this and helping the rest of us understand it.
And I would like to thank the sponsors for graciously responding to the many questions that were aimed at them and for putting on an excellent presentation.
Since there is no further business, I would like to adjourn this meeting of the Radiological Devices Panel. Thank you.
(Whereupon, at 5:20 p.m. the meeting was adjourned.)