FOOD AND DRUG ADMINISTRATION
CENTER FOR DEVICES AND RADIOLOGICAL HEALTH
RADIOLOGICAL DEVICES ADVISORY PANEL
MEETING
TUESDAY, FEBRUARY 3, 2004
The Panel met at 9:00 a.m. in Salons B-D of the Gaithersburg Marriott Washingtonian Center, 9751 Washingtonian Boulevard, Gaithersburg, Maryland, Geoffrey S. Ibbott, Ph.D., Acting Chairman, presiding.
PRESENT:
GEOFFREY S. IBBOTT, Ph.D., Acting Chairman
BRENT BLUMENSTEIN, Ph.D., Temporary Voting Member
CHARLES B. BURNS, M.S., P.H., Non-Voting Consumer Rep.
EMILY F. CONANT, M.D., Voting Member
THOMAS FERGUSON, M.D., Temporary Voting Member
ELIZABETH KRUPINSKI, Ph.D., Temporary Voting Member
MINESH P. MEHTA, M.D., via teleconference, Chairman
DEBORAH J. MOORE, Non-Voting Industry Representative
STEPHEN SOLOMON, M.D., Temporary Voting Member
DAVID STARK, M.D., Temporary Voting Member
PRABHAKAR TRIPURANENI, M.D., Voting Member
ROBERT DOYLE, Executive Secretary
FDA REPRESENTATIVES:
NANCY BROGDON
NICHOLAS PETRICK, Ph.D.
ROBERT A. PHILLIPS, Ph.D.
WILLIAM SACKS, Ph.D., M.D.
ROBERT F. WAGNER, Ph.D.
SPONSOR REPRESENTATIVES:
RONALD CASTELLINO, M.D.
PABLO DELGADO, M.D.
HEBER MacMAHON, M.D.
DAVE MILLER
KATHY O'SHAUGHNESSY, Ph.D.
A-G-E-N-D-A
Open Session
Call to order and the Panel Introduction, Dr. Geoffrey Ibbott, Ph.D., Acting Chairman
FDA Introductory Remarks, Robert J. Doyle, Executive Secretary
Update on FDA Radiology Activities, Robert A. Phillips, Ph.D.
Open Public Hearing
Open Public Hearing; interested persons may present data, information, or views, orally or in writing, on issues pending before the committee
Open Committee Discussion
Charge to the Panel, Dr. Geoffrey Ibbott, Ph.D.
Overview of Contemporary ROC Methods, Robert F. Wagner, Ph.D.
Presentations on P030012 by Sponsor
Introduction, Kathy O'Shaughnessy, Ph.D.
Current Clinical Practice, Heber MacMahon, M.D.
Device Description and Clinical Trial Introduction, Ronald Castellino, M.D.
Clinical Study, Dave Miller
User Experience, Pablo Delgado, M.D.
Summary, Kathy O'Shaughnessy, Ph.D.
Lunch
Presentations on P030012 by FDA
PMA Overview, Robert Phillips, Ph.D.
Clinical Background, William Sacks, Ph.D., M.D.
Clinical Results, Nicholas Petrick, Ph.D.
PMA Review Summary, William Sacks, Ph.D., M.D.
Reports by Panel Lead Reviewers
David Stark, M.D.
Brent Blumenstein, Ph.D.
Presentation of FDA Questions
Break
Panel Discussion
Open Public Hearing
Open Public Hearing: interested persons may present data, information, or views, orally or in writing, on issues pending before the committee
Open Committee Deliberations
Panel Recommendation(s) and vote
Adjourn
P-R-O-C-E-E-D-I-N-G-S
9:06 a.m.
DR. IBBOTT: I would like to call this meeting of the Radiological Devices Panel to order. I also want to request that everyone in attendance at this meeting be sure to sign in on the attendance sheet that is available outside the door. I would note for the record that the voting members present constitute a quorum as required by 21 CFR Part 14.
At this time I would like each panel member at the table to introduce himself or herself and state his or her specialty, position title, institution, and status on the panel.
I'll begin with myself. Some of you have already figured out that I'm not Dr. Mehta. Thanks to the vagaries of air travel and weather, Dr. Mehta is unable to be here but is joining us by speaker phone.
I'm Geoff Ibbott. I'm a medical physicist. I work at the University of Texas, M.D. Anderson Cancer Center in the Department of Radiation Oncology and Radiation Physics. I'm a voting member on this panel and have been for several years. Obviously I'm standing in as chair for this meeting.
Then, Charles, let's start with you and we'll go around the table and introduce ourselves.
MR. BURNS: Charles Burns, Professor of Radiologic Science at the University of North Carolina. My primary expertise is Imaging Diagnostic Physics and I'm a nonvoting consumer representative.
DR. IBBOTT: Thank you.
DR. MOORE: I'm Deborah Moore. I'm the Vice President of Regulatory and Clinical Affairs for Proxima Therapeutics. I'm the industry representative for the panel and a nonvoting member.
DR. STARK: I'm David Stark. My current title is President of MRI of Dettum in Massachusetts. I'm a clinical radiologist. I've been a chairman for close to nine years and I know many of you. I'm pleased to be here. Thank you.
DR. TRIPURANENI: Prabhakar Tripuraneni. I'm head of Radiation Oncology at Scripps Clinic in La Jolla, California. I am a practicing, full-time clinical radiation oncologist and I am a voting member. I think this is my first or second day on the panel.
DR. DOYLE: I'm Bob Doyle. I'm the Exec. Sec. of this panel.
DR. BLUMENSTEIN: I'm Brent Blumenstein. I'm a biostatistician in private practice. I'm normally on the General and Plastic Surgery Panel.
DR. SOLOMON: I'm Steve Solomon. I'm a radiologist at Johns Hopkins Hospital. I'm a consultant to the panel.
DR. FERGUSON: I'm Tom Ferguson, professor emeritus of cardiothoracic surgery at Washington University School of Medicine, St. Louis. I'm a temporary voting member on this panel. I'm on the Cardiovascular Device Panel.
DR. CONANT: I'm Emily Conant. I'm the Chief of Breast Imaging at the University of Pennsylvania and sort of half research and half clinical at this point. I'm a voting member.
DR. KRUPINSKI: I'm Elizabeth Krupinski from the University of Arizona. I'm a research professor in the Department of Radiology. My area of expertise is observer performance and image perception studies. I'm a voting member.
MS. BROGDON: I'm Nancy Brogdon. I'm not a member of the panel. I'm the liaison to the agency. I'm the Director of the Division of Reproductive, Abdominal, and Radiological Devices.
Dr. Mehta, would you like to introduce yourself?
DR. MEHTA: Yes, please. I'm Minesh Mehta. I'm a radiation oncologist in terms of specialty and I'm the Chair of the Department of Human Oncology at the University of Wisconsin. Generally when I'm there I'm chair of the panel but today I guess I'm listening in.
DR. IBBOTT: All right. Thank you, everyone. Mr. Doyle would now like to make some introductory remarks.
DR. DOYLE: Well, first on the agenda here is appointment of the Acting Chairperson. Pursuant to authority granted under the Medical Devices Advisory Committee Charter dated October 27, 1990, and as amended August 18, 1999, I appoint Geoffrey Ibbott, Ph.D., as Acting Chairperson of the Radiological Devices Panel Meeting on February 3, 2004. This is signed by David Feigal, the Director of the Center for Devices and Radiological Health.
Now I would like to read the appointment of temporary voting status. Again pursuant to the authority granted under the Medical Devices Advisory Committee Charter dated October 27, 1990, and as amended August 18, 1999, I appoint the following individuals as voting members of the Radiological Devices Panel for the meeting on February 3, 2004, and they are as follows:
Brent Blumenstein, Ph.D., Thomas Ferguson, M.D., Elizabeth A. Krupinski, Ph.D., Stephen Solomon, M.D., and David Stark, M.D.
For the record, these individuals are special government employees and consultants to this panel under the Medical Devices Advisory Committee. They have undergone the customary conflict of interest review and have reviewed the material to be considered at this meeting. Again, signed by David W. Feigal for the Center for Devices and Radiological Health.
Finally, the conflict of interest statement. The following announcement addresses conflict of interest issues associated with this meeting and is made part of the record to preclude even the appearance of impropriety.
To determine if any conflict existed, the agency reviewed the submitted agenda for the meeting and all financial interests reported by the committee participants. The agency has no conflicts to report.
In the event that the discussions involve any other products or firms not already on the agenda for which an FDA participant has a financial interest, the participant should excuse himself or herself from such involvement and the exclusion will be noted for the record.
With respect to all other participants, we ask in the interest of fairness that all persons making statements or presentations disclose any current or previous financial involvement with any firm whose products they may wish to comment upon.
Now, if there is anyone who has anything to discuss concerning these matters which I have just mentioned, please advise me now and we can leave the room to discuss them. Seeing none. The FDA seeks communications with industry and the clinical community in a number of different ways.
First,
the FDA welcomes and encourages pre-meetings with sponsors prior to all IDE and
PMA submissions. This affords the
sponsor an opportunity to discuss issues that could impact the review
process. Second, the FDA communicates
through the use of guidance documents.
Toward this end, the FDA develops two types of guidance documents for
manufacturers to follow when submitting a premarket application.
One
type is simply a summary of the information that has historically been
requested on devices that are well understood in order to determine substantial
equivalence.
The
second type of guidance document is one that develops as we learn about new
technology. FDA welcomes and encourages
the panel and industry to provide comments concerning our guidance documents. I would also like to remind you that the
meetings of the Radiological Devices Panel for the remainder of this year are
tentatively scheduled for May 18th, August 10th, and November 16th.
You
may wish to pencil these dates in on your calendar but please recognize that
these dates are tentative at this time.
I'll repeat them in case you didn't get those. May 18th, August 10th, and November 16th.
DR. IBBOTT: Thank you, Mr. Doyle.
At this point Nancy Brogdon, who is Director of the Division of Reproductive, Abdominal, and Radiological Devices of the Office of Device Evaluation, has a few words she would like to say.
MS. BROGDON: Thank you, Dr. Ibbott. We have three panel members whose terms just expired on January 31st. They are not present today but we wanted to recognize publicly their contributions to the panel.
The first is Mr. Ernest Stern. Mr. Stern was the Chairman and CEO of Thales Components located in Totowa, New Jersey, and he was the industry rep on the panel for the past four years. He is now retired from Thales.
Mr. Stern effectively represented various industries served by this panel and used his position on the panel to apprise other panel members of commercial considerations that they should take into account when making recommendations on the various applications under review.
Second is Dr. Wendy Berg. Dr. Berg was the Director of Breast Imaging in the Department of Radiology at the University of Maryland at Baltimore. She served on the panel for four years as a voting member. Dr. Berg brought to the panel a high degree of expertise in the field of mammography.
That was continually called upon as novel mammography-related devices were reviewed by the panel. In addition, when asked, she provided written reviews of complex device applications that the agency used as part of our in-house review process.
Third is Dr. Harry Genant. Dr. Genant is Professor of Medicine and Epidemiology, Orthopedics, and Surgery at the University of California at San Francisco. He also served as a voting member for four years. Dr. Genant brought to the panel a broad spectrum of expertise with special emphasis on bone densitometry. His probing questions and insightful comments on the pros and cons of the devices being considered were very helpful to the agency as it reviewed the safety and effectiveness of new devices.
We thank all of these past panel members. Each will be sent a thank-you from the commissioner along with a mounted service plaque. Thank you.
DR. IBBOTT: Thank you.
Dr. Robert Phillips, the Chief of the Radiology Branch of the Office of Device Evaluation, will now give a brief update on the FDA radiology activities. Dr. Phillips.
DR. PHILLIPS: Well, good morning again. As you can see by the absence of meetings between December '02 and now, we have not had a whole bunch of brand new PMAs that we've brought to the panel. In fact, in the last year we have not approved any PMAs.
However,
there have been some changes in the branch itself and we have brought four new people
on board as reviewers. These are Nancy
Wersto who comes to us from industry.
She's a radiological physicist and her interest area is in radiation
therapy products.
Then
we have Kish Chakrabarti who comes to us from the mammography side of the center. He is a physicist. His area of interest is mammography and imaging systems. Kish, are you here today? No.
Dr.
Barbara Shawback comes to us from outside.
She's a medical officer and her area is study and design in
rheumatology.
And
then we just had a new employee come on board, Sophie Packerel. She is a physicist who comes from the
University of Chicago and her area is CAD systems.
Those are the four people that have come on board, and that ends my talk. Thank you.
DR. IBBOTT: Thank you. We'll now proceed with the first of two half-hour open public hearing sessions for this meeting. The second half-hour open public hearing session will follow the panel discussion this afternoon.
Both
the Food and Drug Administration and the public believe in a transparent
process for information gathering and decision making. To ensure such transparency at the open
public hearing session of the advisory committee meeting, FDA believes that it
is important to understand the context of an individual's presentation.
For
this reason, FDA encourages you, the open public hearing speaker, at the
beginning of your written or oral statement to advise the committee of any
financial relationship that you may have with the sponsor, its product and, if
known, its direct competitors.
For
example, this financial information may include the sponsor's payment of your
travel, lodging, or other expenses in connection with your attendance at the
meeting. Likewise, FDA encourages you
at the beginning of your statement to advise the committee if you do not have
any such financial relationships. If
you choose not to address this issue of financial relationships at the
beginning of your statement, it will not preclude you from speaking.
No
individual has given advance notice of wishing to address the panel. If there is anyone now wishing to address
the panel, please identify yourselves at this time.
Seeing
none, I would like to remind public observers at this meeting that while this
portion of the meeting is open to public observation, public attendees may not
participate except at the specific request of the chair.
We can now begin the first open public portion of the meeting. We will now, as I said, proceed with the open committee discussion portion of this meeting, which has been called for the consideration of PMA 030012 for a computer-aided detection (CAD) device that assists a physician in identifying actionable, solid nodules in CT images of the lung.
The
first presentation will be by Dr. Robert F. Wagner of the FDA who will give an
overview of contemporary ROC methods such as may be used in measuring the
effectiveness of the CAD and other imaging devices.
The
sponsor, R2 Technology, Inc., will then state its case for the PMA and they
will be followed by the FDA with its review of the device. We will proceed now with Dr. Wagner's
presentation.
DR. WAGNER: Cybersource as I am, let us see if I can -- okay. Progress or regress? Let's not start from the back. Marvelous.
Thank
you very much, Bob. I'm glad we planned
this together this way. Good morning to
the members of the panel, my colleagues and visitors today. I must acknowledge the fact that Dr. Bill
Sacks and I were awakened by our respective wives at our respective homes every
two hours this morning to see what the weather would be like to see if we would
be able to make it and what time we should really get up. We are working against that as our
background.
I
would also like to thank my colleagues for giving me this opportunity to
present this tutorial information on an overview of the contemporary ROC
methodology as it is used today in the field of medical imaging and computer
assisted devices.
Of
course, most of us know what the letters stand for. ROC stands for receiver operating characteristic. This is the historic name that comes down to
us from the field of radar in signal detection studies where the problem is
you're looking at a field of clutter and the question is is there an airplane
in that clutter.
In
the field of psychology and this perception in eye and brain coordination
studies, this subject is often called the relative operating
characteristic. Some people are just
weary of the R and just refer to this as the operating characteristic because
that's really what it is.
Those
of us in the field of medical imaging have retained the name of receiver
operating characteristic. I think it is
because of our devotion to the classic literature from about 30 years or so ago
that we have just retained, the conservative people that we are. I see a person who has worked in this field
looking back at us.
Well,
now here is an outline of the talk. We
will spend a few minutes talking about efforts toward consensus development on
the present issues. Then we'll move
right into the ROC paradigm. We'll talk
about how it gets complicated by the problem of reader variability. How the multiple reader multiple case, or
so-called MRMC ROC paradigm, arose to address this problem of reader
variability.
Since
the ROC is a measurement, you have to have a meter stick of some kind so we'll
talk about measurement scales. There
will be a categorical scale, patient management or action scale and a
probability scale that we'll talk about.
Then
for today's submission, and submissions like it, there are additional
complications from the problem of location uncertainty, from the problem of not
really knowing the truth and dealing with uncertainty in the truth. Since the truth is uncertain, you really
don't know how many effective number of samples you really have.
When
you have a system that's going to cue readers about the possibility of lesions
on a case, there is a problem of reader vigilance that we will discuss. Finally, we'll give a little wrap-up which I
won't have to give because Bob Phillips just presented it for me.
Let's
start off now with efforts toward consensus development on the present
issues. The fact is that at the moment
we do not have an explicit FDA guidance on how to review, how to submit and
review issues like the present one.
There's been a lot of work going on and deep background as to how did we
get here.
The
basic idea is how do you use the classic concepts of sensitivity, specificity,
and ROC analysis to assess performance of diagnostic imaging and
computer-assisted systems. Especially
since there are many new issues and levels of complexity that come to the fore
as more complex technologies emerge.
At
the moment you see there is really no software to do the assessment task of the
problem we have before us. That's why I
would like to talk about piecemeal, all the different pieces and what is known
and what does exist at the moment because the sponsor had to put together a
creative combination of these many things.
So continuing on this little laundry list. I'll give you an historical laundry list of efforts toward
consensus development on these present issues.
That's
RSNA. Most of you recognize that. That's the big Radiological Society of North
America meeting that's held every year in November in Chicago that makes this
weather look very mild today. Then
following RSNA by a few months is the big SPIE medical imaging meeting. At the SPIE meetings we generally handle the
more technical aspects of the issues that come up at the RSNA.
Then
there's a society that meets every two years called the Medical Image
Perception Society of which Elizabeth Krupinski on our panel has been president
for 40 years I think it has been. Elizabeth
is the President of the Medical Image Perception Society. We hold various workshops and literature
every two years.
In
all these meetings every few years we do note progress in this field. There is tremendous progress going on but
it's without a doubt still an evolving work in progress. We are still not at the holy grail point
that we would like to be at but a lot of progress has indeed been made.
At
the good old FDA at our center in CDRH here at the FDA. One of the methods that I'll be talking
about today is the so-called multiple reader multiple case, the MRMC scheme
which has already been used for several submissions.
It
was used to break the log jam that was holding back digital mammography from
the market place so the MRMC scheme that I'll talk about in a few minutes was
used there. It has been used for all
successful submissions of digital mammography PMAs to our center.
This
method that we'll talk about in a few moments has also been used for a
successful submission in the area of a computer aid for lung nodule detection
on chest x-ray film that is in some way analogous to the present submission but
it's just on plain film.
NCI, the National Cancer Institute, also has the Lung Image Database Consortium and workshops. This is an NCI-funded group of five universities and the principal director of that project, I thought I saw him come in a moment ago. There he is, Larry Clarke.
There
are five universities that work as part of this consortium and they are seeking
consensus on a number of things, one of which is how to put together a database
of annotated films of the kind that you would use, annotated CT slice images of
the kind you would use to train and test a classifier in this field of
computer-aided detection and diagnosis in lung cancer screening for
nodules.
So
that project is about half-way through its five-year history. A good two years underway right now. They are also addressing consensus on the
many issues that you have to deal with when you want to deal with such a
product.
For
example, how do you keep score statistically?
Once you know how to keep score, then you can start to design the size
of a database. How do you outline the
nodules? How do you keep score when
there's a hit when there is just finite overlap between what is known of the
lesion and what the reader marks? We'll
talk about this in a few moments.
Now, two of us here in our center have been quite active members of this LIDC from the beginning. Let me see if I have another comment here. Yeah. The thing I would like to bring to your attention this morning is that there has been a great amount of communication among all these resources here. A number of us in our center here are active members of the research community in this field.
Many of us here and sitting just behind me have been very active in this area of applying these methods to several of the submissions in the area of imaging and computer-aided diagnosis. Several of us are very active members of Larry Clarke's group here.
What we have tried to do is see this as several quarters, four quarters if you will, of a quadrangle, all holding the windows open to the others so the people who come in to us from industry at any given moment will know what is the state of the art from academia, from our own center, and from the LIDC.
We presented them all the papers, all the current drafts even, and made sure that everyone knows what's on the other people's minds methodology-wise, that is, outside the area of anything that is proprietary. Anything that is not proprietary is all strictly methodology or statistics. We have tried to keep these communication channels as open as we could.
Here
we go with the promised little tutorial and the fundamentals of the ROC
paradigm itself. The idea is, of course,
that you have two populations, one a population of actually diseased
people. You might think of these as
people with diabetes, for example, and a population of people who do not have
the disease.
You
would like to have a test that puts out a result something like a volt meter or
a biochemical assay or, in the case of a simple blood sugar test, this would
just be the blood sugar concentration.
You would love to have the world such that the two populations would be
separated and you could just drop a threshold in here and say these patients
are diseased and these patients can go home and not worry about it.
Now,
in the field of medical imaging those of us who have done work in that field
you don't have a simple meter or biochemical assay. What you get is a reader looking at about a million pixels of a
picture and trying to get the features out of it and reduce that through what
we call the subjective likelihood, subjective judgment or likelihood that case
is diseased.
Now,
as I say, this is really not quite the way the diabetes blood sugar test works
but if you think of what I am about to tell you in that context for the next
few minutes, you won't be far off base.
It's not precise but it wouldn't be misleading.
So
here is what happens more typically.
The two populations are not separated.
The diseased population and the nondiseased population as far as their
test result is concerned have a very great overlap. The idea is now who do you send home and who do you send on for
further workup or people that you want to treat for a condition.
Those
of you who have seen this before, what I've just done I've taken these two and
dropped this population down so that you won't get mixed up with the
colors. Now we have the nondiseased
cases and the diseased cases on the same axis, the same relative position. Now in a practical situation with the
overlap, now we have to set ourselves a threshold.
If this is a blood sugar test, for example, you could set it at 150 blood sugar level. If you do that, you'll pick up about half of the actual diabetic patients so we say we have a true positive fraction of 50 percent, but you have to pay a price for this. You have about a 10 percent false positive fraction so here is this point, 50 percent true positive and roughly 10 percent false positive.
We
call this a less aggressive mind set and I think you'll see the reason for that
in just a moment. So if we get a little
bit more aggressive to try to pick up more patients in our sieve, we might set
the threshold down here at 100 instead of 150.
Now we get about 80 percent of the diabetic patients and now at the
price of about 20 percent false positive or 25 percent. Here I've put this point about 80 percent
and 25 percent.
Let's get even more aggressive and what I mean by that is I want to pick up more diseased patients in my sieve, the sieve being the test. If you set the threshold in the 90's, now we might get almost 95 percent of the patients in our sieve of the actual diabetic patients but then we have to pay the price of 50 percent of the nondiabetic patients picked up so now we have a 90 percent sensitivity and roughly a 50 percent false positive fraction.
Now,
you can take this to the extreme and we talk about this particular test all the
time and I think this might not work because the threshold now -- oh, it did
work. Okay. We can put the threshold all the way to the left and call
everybody to the right of this diseased and we would get all the diabetic
patients. There's a little mark right
up here. We would get also -- the price
we would pay is we would have to call everybody who is not a diabetic a
diseased patient here so we would generate that point.
I
think you can see and let your imagination go wild that you can certainly fill
in all these points. Don't blink,
anyone. I saw Dr. Bob Doyle blink there
so I have to go back and do that again.
Instead of working up more and more levels of aggressiveness, you could
back off. You could start off with
everybody at the sick point and then just back off, move the threshold the
other way and fill in the complete ROC curve.
You can see at this time of day I'm very easily amused.
Okay. Here is the overall picture now. This is the case of the schematic of, let us
say, blood sugar as a test for diabetes.
These are these two populations and the way they overlap and here is the
corresponding ROC curve with the level of aggressiveness increasing.
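The threshold sweep just described can be sketched in a few lines of Python. This is a minimal illustration only, assuming each population's blood sugar values follow a normal distribution; the means and spreads below are hypothetical and are not taken from the slides, so the resulting fractions will not match the figures quoted in the talk exactly.

```python
from statistics import NormalDist

# Hypothetical blood sugar distributions, assumed for illustration only.
NONDISEASED = NormalDist(mu=100.0, sigma=40.0)
DISEASED = NormalDist(mu=150.0, sigma=40.0)


def operating_point(threshold):
    """Call every test result above `threshold` diseased.

    Returns (false positive fraction, true positive fraction).
    """
    fpf = 1.0 - NONDISEASED.cdf(threshold)
    tpf = 1.0 - DISEASED.cdf(threshold)
    return fpf, tpf


def roc_curve(thresholds):
    """Sweep from the least aggressive (highest) threshold to the most aggressive."""
    return [operating_point(t) for t in sorted(thresholds, reverse=True)]


if __name__ == "__main__":
    for t in (150, 100, 90):
        fpf, tpf = operating_point(t)
        print(f"threshold {t}: TPF = {tpf:.2f}, FPF = {fpf:.2f}")
```

Lowering the threshold moves the operating point up and to the right along the curve, which is the "more aggressive" direction in the talk. Setting both populations to the same distribution would put every point on the chance line.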
Now,
it can happen and, in fact, we've seen things like this in our center and you
see this in the laboratory once in a while, the two populations could fall
right on top of one another so that a test cannot actually discriminate between
the two conditions so what we've done here is just drop this population and
this population on top of each other.
Now if you generate an ROC curve the way I just showed you, you would
generate what we call the chance line or guessing line.
Toward
the other extreme you could have a test that separates the two populations very
well. In that case, as we move the
threshold across from less aggressive to more aggressive, we'll generate this
ROC curve. Now we have the guessing
line, we have the ROC curve corresponding to almost typical clinical laboratory
test, and we have the ROC curve here for a very good test. We call this the level of increasing -- we
call this direction the direction of increasing reader skill or increasing
level of technology.
Now,
many people like to have a single summary measure of ROC curve performance and
what has traditionally been used is you take the area under the curve so the
area under this curve, say the diabetic discrimination test, is something in
the high 70s. Let's call it 78 percent
or something like that.
If you use the area under the curve as a summary measure of performance, in effect, remember if you think of calculus, you're getting this area by just integrating, you are effectively replacing the curve with a line that is flat at the level of that area.
In effect, what you've done is you have averaged the sensitivity, or true positive fraction, over all false positive fractions. In effect, if you use the area under the curve you are given the sensitivity averaged over all false positive fractions, or sensitivity averaged over all specificities, specificity coming from the other direction.
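The reading of the area as an average sensitivity can be checked numerically. The sketch below is illustrative only: it assumes two normal populations with hypothetical parameters (not the speaker's data), averages the true positive fraction over evenly spaced false positive fractions, and compares the result with the standard closed-form area for two normal populations.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical populations, assumed for illustration only.
NONDISEASED = NormalDist(mu=100.0, sigma=40.0)
DISEASED = NormalDist(mu=150.0, sigma=40.0)


def sensitivity_at(fpf):
    """True positive fraction at the threshold that produces this FPF."""
    threshold = NONDISEASED.inv_cdf(1.0 - fpf)
    return 1.0 - DISEASED.cdf(threshold)


def average_sensitivity(n=10_000):
    """Average sensitivity over n evenly spaced FPF values (midpoint rule)."""
    return sum(sensitivity_at((i + 0.5) / n) for i in range(n)) / n


def binormal_auc():
    """Closed-form area under the ROC curve for two normal populations."""
    separation = DISEASED.mean - NONDISEASED.mean
    spread = sqrt(DISEASED.variance + NONDISEASED.variance)
    return NormalDist().cdf(separation / spread)


if __name__ == "__main__":
    print(f"average sensitivity: {average_sensitivity():.4f}")
    print(f"closed-form area:    {binormal_auc():.4f}")
```

The two numbers agree to within the error of the numerical average, which is the sense in which the area replaces the curve with a flat line at the average sensitivity.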
Well,
I hope it gets interesting now. That
was the easy part. That's the
idea. Let's see what really happens in
the real world. In the real world in
the last decade those of us who work in this field have been made acutely aware
of the complication of reader variability.
I'm
going to show you some very famous data.
I think Emily Conant knows this like the back of her hand from having
worked with Craig Beam. For those of
you who have not seen this before, I have to give a little build up to
this.
This
is a set of data from Beam, Layde and Sullivan that I'm going to show you in
which they studied 108 mammographers randomly chosen from around the United
States. The mammographers in this study
were given a set of mammograms. They
were asked to set their threshold for action.
Remember
when we were talking about this ROC paradigm we were moving a threshold and we
wanted to set it at some place and the question is in a clinical laboratory
test you could just dial that in somehow.
How do you do it in medical imaging?
You don't have a dial.
You
have to deal with the human reader and they were asked to set their threshold
between their sense of the boundary on the BIRADS scale, Breast Imaging and
Reporting and -- Reporting or Recording?
Anyway, Reporting and Data System.
That's the American College of Radiology Scale that is used for managing
patients in mammography.
These
readers were asked to set their sense of the boundary between category 3, which
is generally six-month follow-up recommendation, and category 4 which is highly
suspicious and recommend consideration of biopsy. I'm sure I'm garbling that but you get the general idea. I wasn't asked to leave the room so I
couldn't be too far off there.
Here's
what happened. This is a true positive
fraction versus a false positive fraction for 108 readers. There are 108 points here. Each one of these people thinks that they
had set the boundary between category 3 and category 4.
If
you try to do public policy based on category 3 and category 4 and thinking
that people have optimized that, the optimum is very broad. People have not figured out how to optimize
that. That's a big problem.
Let's
look at this reader. This is one out of
108 people. This person has a
sensitivity of 70 percent and a false positive rate of about 25 percent. Now, this person thinks they are being as
aggressive as they should be in the context but this person is more aggressive
than this one, this reader is more aggressive than this one, this reader is the
most aggressive on this bottom curve here, and these readers are less
aggressive.
Now,
as we go in the other direction, we now see the variability due to the range of
reader skill. We can say that these
readers have a greater skill at this task than these readers and these readers
have the greatest skill yet.
At
any level of reader skill we have different readers thinking that they have
optimally set their threshold. This is
a tremendous range of reader variability.
There are 108 mammographers represented on this graph. This is classic work from Craig Beam, Peter
Layde and Dan Sullivan.
What
have I just told you? There is no
unique ROC operating point. Each one of
these people is set to be at a certain operating point. There is no unique ROC operating point. There is not even a unique ROC curve. There is only a band or region of ROCs as
you can see. There is a very broad
band.
I
hope I've convinced you all now that this gets to be a more complex issue. In particular, here is the question. Suppose we have two technologies that
manifest themselves in readers' hands with this level of variability?
How
do you compare those two technologies?
That's the issue before us with a whole class of problems that we've
been discussing over the last few years and we'll be seeing more of over the
next few years. How do you do it?
This
is not an isolated example. People have
gotten used to this and said this is really an extreme example. This is not the most extreme example we've
ever seen.
In
our group we have actually looked at over a dozen real world publicly available
data sets and the example I just showed you is sort of in the middle. Sometimes things are a little bit
better. Sometimes they are even much
worse than what I just showed you.
The following is an example from
Dr. Jim Potchen from plain chest x-ray picking up the disease on chest
films. These are ROC curves. Dr. Potchen looked at over 100 radiologists
and 71 residents. He averaged the score
card ROC-wise of his top 20 radiologists.
Here they are.
Then
he presents here the average ROC curve for his radiology residents. There are 71 of them here representing this
average line. The bottom 20
radiologists in the study performed here.
The range that we see here is comparable to what we saw in the Beam, et
al. study for mammography. So this is
the real world.
Well,
you can imagine that if you wanted to keep score under that setting you have to
use a lot of readers and a lot of cases.
The paradigm that has emerged to address this is, thus, called, almost
eponymously, I guess, if I could pronounce that word, the multiple reader
multiple case, or MRMC paradigm.
There
are a lot of designs for this. There
are many ways to do it. Today we will
just talk about something that is called the fully -- oh, I forgot my prop. We'll talk about the fully-crossed
design. The fully-crossed design is one
of many but it is the most efficient in some way so we will talk about it.
You
match cases across modalities and you match readers across modalities. If I can pull this off. I'm used to having leaves of paper
here. Okay. You have a bunch of patients who have been imaged with modality A
here. The same patients imaged with
modality B so we say that the cases are matched across modalities.
If
we were working with computer-aided diagnosis, modality A would be readers
reading without the computer aid and modality B would be readers with the use
of the computer aid. There is a stack
of images here. Same patients.
We
recruit a panel of radiologists, something like 15 of you people here. All of you read every patient case in both
modalities. What we have then is we
have the cases matched across modalities and we have the readers matched across
modalities.
This
design has the most statistical power for a given number of readers and for a
given number of cases with verified truth.
Thus, we say it's the least demanding of these resources. Around here in Rockville we speak of this as
the least burdensome paradigm because you probably heard in previous meetings
that the FDA has been commissioned by Congress to enable sponsors to seek and
to find, if possible, the least burdensome path to the marketplace through the
review process.
So
what we've done is we've always called this to the attention of incoming
sponsors that this design is most powerful.
You can use alternative designs and you can come close sometimes to the
efficiency of this scheme but this is the most powerful in terms of the ground
rules I have on the slide right there.
Well,
if you are familiar with the literature in this field, you will say, you know,
this is no modern big deal. This stuff
has been known for a good 20 years or so.
If you read the classic book by Swets and Pickett the whole idea is laid
out there. The trouble is there was no
practical way to implement this scheme 20 years ago until people started to
understand what's called the statistical approach of resampling strategies.
I
probably shouldn't spend any time on the past history but the fact of the
matter is in past years before they realized about resampling they just started
to stratify the data and then you give up a lot of statistical power. In modern times in the last 10 years people
realized if you use the statistical resampling, you can use the data over and
over again in a well-pedigreed way and get statistically valid inputs.
So
the two most famous resampling schemes are called the statistical jackknife or
the statistical bootstrap. The big
breakthrough came in this field in 1992.
This is the classic so-called DBM paper. That's Donald Dorfman of happy memory, whom our
community very sadly lost two years ago. His
colleague, Kevin Berbaum, and the well-known Charles Metz at the University of
Chicago.
This
paper broke the log jam in this field.
They suggested using the statistical jackknife in combination with
classical ANOVA and the statistical jackknife just being a leave-one-out method
where you leave Mrs. Jones out one time and you leave Mrs. Smith out the next
time and you generate a lot of data sets that way, submit it to classical
ANOVA, and you can do your inference about the difference between these two
competing technologies.
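A minimal sketch of the leave-one-out step, in Python. It assumes a single reader, made-up ratings, and a Mann-Whitney estimate of the ROC area; the pseudovalue formula is the standard jackknife one, and the ANOVA step that would follow is not shown. None of the names or numbers below come from the DBM paper itself.

```python
def auc_mann_whitney(neg, pos):
    """Fraction of (diseased, non-diseased) pairs ranked correctly; ties count half."""
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def jackknife_pseudovalues(neg, pos):
    """Leave each case out once and convert the result to a pseudovalue."""
    cases = [(score, 0) for score in neg] + [(score, 1) for score in pos]
    n = len(cases)
    full = auc_mann_whitney(neg, pos)
    pseudo = []
    for k in range(n):
        kept = cases[:k] + cases[k + 1:]  # leave case k out this time
        neg_k = [s for s, truth in kept if truth == 0]
        pos_k = [s for s, truth in kept if truth == 1]
        leave_one_out = auc_mann_whitney(neg_k, pos_k)
        pseudo.append(n * full - (n - 1) * leave_one_out)
    return full, pseudo


# Hypothetical ratings on a 0-100 scale for one reader in one modality.
neg = [10, 25, 30, 45, 50]
pos = [40, 55, 60, 80, 90]

full_auc, pseudovalues = jackknife_pseudovalues(neg, pos)
print(f"AUC = {full_auc:.2f}, pseudovalues = {[round(v, 3) for v in pseudovalues]}")
```

In a full analysis there would be one such pseudovalue per case, per reader, per modality, and those would be the inputs to the ANOVA.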
Well,
it turns out this is a little bit more difficult to explain in any more detail
than that. But the bootstrap method is
very trivial to explain in some detail so I'm going to ask you to sit through
that with me for the next minute or so.
The
idea with the statistical bootstrap is that we are going to -- the bootstrap
itself means you are going to resample from a set of data points with
replacement. I'll show you that in a
moment. We are going to bootstrap the
experiment of interest. We'll draw
random readers, random cases, and then carry out the experiment of interest
many times.
Here
is an example of some possible bootstrap samples from a set of -- suppose there
are 15 of you here. We might have a set
of numbers one through 15. We start
drawing them with replacement. If you
wait long enough, you might get a list that has one, two, three, four, five,
six, seven -- you have to wait a long time before that happens.
In
the meantime you get more random looking samples like this. When I was thinking about this, you know, if
you did this with letters this reminds you of that proverbial experiment where
they have the monkeys trying to type out the soliloquy of Polonius or
something like that. It's going to
happen but you may have to wait a long time.
Instead
what you do is you get random samples like this. The number one never showed up in this group. The number two showed up once. Number three showed up a couple times. Number 14 showed up three times and so on. You randomly sample a number and then put it
back. Write it down. This can go on for an astronomical number of
times.
Then
another example, the number one shows up, number 15 shows up and so on. You get a lot of these, a very great number
of these but you don't have time to do them all so, in practice, people use
about 1,000. It depends on the
complexity of the problem.
So
you draw about 1,000 bootstraps of readers and cases. The number of cases you draw is comparable
to the experiment you are trying to mock up.
Then what you do is, with that bootstrap sample of readers and that random case sample,
you have all the readers in their bootstrap sample read all the cases in both
modalities in that bootstrap sample, carry out the experiment of interest so
you would get the performance measure.
That's
called area under the ROC curve for the one.
You get that number for the other.
You take the difference. You do
that 1,000 times and then you put them in order from the lowest difference to
the highest. Then it's very easy to get
the mean and then you can take out the central 95 percent chunk and that would
give you a 95 percent confidence interval.
That's a simple way to explain the story.
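The recipe just described can be sketched in Python as follows. This is an illustration under stated assumptions, not the software used in any submission: the ratings are synthetic, the reader and case counts are arbitrary, the ROC area is a simple Mann-Whitney estimate, and the interval is the plain percentile interval. All function names are hypothetical.

```python
import random


def auc(neg, pos):
    """Mann-Whitney estimate of the area under the ROC curve."""
    wins = 0.0
    for p in pos:
        for n in neg:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(pos) * len(neg))


def mean_auc(scores, truth, readers, cases):
    """Average AUC over the sampled readers, each reading all sampled cases."""
    total = 0.0
    for r in readers:
        neg = [scores[r][c] for c in cases if truth[c] == 0]
        pos = [scores[r][c] for c in cases if truth[c] == 1]
        total += auc(neg, pos)
    return total / len(readers)


def bootstrap_auc_difference(scores_a, scores_b, truth, n_boot=1000, seed=0):
    """Resample readers and cases with replacement; return mean and 95% interval."""
    rng = random.Random(seed)
    n_readers, n_cases = len(scores_a), len(truth)
    diffs = []
    while len(diffs) < n_boot:
        readers = [rng.randrange(n_readers) for _ in range(n_readers)]
        cases = [rng.randrange(n_cases) for _ in range(n_cases)]
        if len({truth[c] for c in cases}) < 2:
            continue  # a usable sample needs diseased and non-diseased cases
        diffs.append(mean_auc(scores_b, truth, readers, cases)
                     - mean_auc(scores_a, truth, readers, cases))
    diffs.sort()  # lowest difference to highest
    tail = (n_boot * 25) // 1000  # 2.5 percent trimmed from each end
    return sum(diffs) / n_boot, diffs[tail], diffs[n_boot - tail - 1]


# Synthetic fully-crossed data: 5 readers, 30 cases, both modalities.
data_rng = random.Random(1)
truth = [0] * 15 + [1] * 15


def simulate(separation):
    return [[data_rng.gauss(separation * t, 1.0) for t in truth] for _ in range(5)]


scores_a = simulate(1.0)  # e.g. unaided reading
scores_b = simulate(1.5)  # e.g. reading with the aid
mean_diff, ci_low, ci_high = bootstrap_auc_difference(scores_a, scores_b, truth)
print(f"mean AUC difference {mean_diff:.3f}, 95% interval [{ci_low:.3f}, {ci_high:.3f}]")
```

The same reader indices and case indices are applied to both modalities within each bootstrap draw, which is what keeps the readers and cases matched across modalities.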
In
the jackknife plus ANOVA it's a little bit more elaborate than that but you can
actually think of the jackknife as the first order of approximation to the
bootstrap. So these two approaches are
sort of in the same spirit but one is completely nonparametric and the other is
-- the classical ANOVA is heavily based on the multi-variate normal so it's
highly parametric.
As
I just said, you obtain a mean performance over readers and cases but it's much
more interesting. The mean is always
easy to get no matter how you approach a problem. Well, it can be tricky.
But the big thing you want is error bars that account for both the
variability of readers and cases.
You
know, in the DBM paper they quoted a quote that has become very famous from Jim
Hanley. Many of us know Jim Hanley from
McGill University in Montreal.
Jim
Hanley says, "When you report the results of your experiment to your
readership, it's not so important just to report the mean performance or the
results you got in the very experiment at hand because, after all, this
experiment will never be done again. No
one will ever do this particular experiment.
What
readers want is they want a sense of the range of performance to be expected if
this experiment could be repeated many, many times drawing randomly, one hopes,
from the same population from which the current samples were drawn." So that is the idea.
You
ought to be able to report to your readership not just a p-value because we all
know it takes a p-value to get a paper published in a medical journal. You want to actually be able to explain the
range of variability you expect to see if this experiment is done over and over
again. That's what you get when you
keep score this way.
Okay. We said that the ROC curve is a
measurement. Above all else it is a
measurement so you have to think about a measurement science. You have to think about the scale you'd be
using for reporting and doing the measurements.
Historically
-- I should just stop for moment to tell those of you who were not around in
the late '70s and early '80s that the National Cancer Institute gave a contract
to people in Cambridge, Massachusetts, Bolt, Beranek and Newman, where John
Swets, David Getty, and Ronald Pickett and colleagues were working to develop a
protocol for how to do ROC experiments and how to keep score and how to do the
data analysis.
That
is published in a paper in Science in 1979.
The book came out in 1982 and many of us have that book on our shelf. The protocol used at that time was so-called
historic ordered category scales. There
was no does this patient go to biopsy or not.
You just looked at the case and you said this patient -- you use five or
six categories.
One
patient you might say this patient almost definitely does not have
disease. There are several intermediate
levels. The patient probably does not
have disease, might have disease, probably does have disease, or almost
definitely has the disease. That scheme
of five or six categories was almost exclusively used and there was software
for analyzing that for 25 years.
I'm
being a little defensive because people may say why do people use that. That was approved by -- the experts in the
field put it out and it was supported by NCI.
There was a lot of science underneath it and today people say, "Why
did people do that?" Well, that's
what they had.
In
the last 10 years in the field of mammography we have this BIRADS scale which
is what we call an action item or a patient management oriented scale. In that idea you don't categorize the
data. People think of the BIRADS scheme
as a categorization scheme. Let's just
put that to the side for a moment.
We'll
just think of using the BIRADS scale to dichotomize patients. We'll say these patients will not be
followed up at all versus these patients who will get a six-month
follow-up. That's one way to
dichotomize the data.
Another
way to dichotomize the data is to say we will try to make the break as we did
with the Beam, et al. data. We'll make
the cut in this dichotomization between those patients who would get six-month
follow-up versus those who we think should be biopsied right now. So this is a patient management
scheme. This is just a dichotomization scheme.
About
10 years ago people realized for very technical reasons that it would be useful
to use what they called the continuous probability rating scale, or
quasi-continuous. It's a hundred-point
scale, one, two, three, four, five, but you wouldn't get 1.5 for example so
they call it quasi-continuous, hundred-point scale.
Nobody
expects anybody literally to use probability 13 or probability 17 or anything,
but the idea is to scale your probability or your sense of the likelihood of
disease along a probability scale. That
seems natural to use something if it's a probability on a scale from zero to
100.
So
this is the most popular scheme that's been used to generate ROC data in the
last five or seven years or so. This
felt strange to many people, especially people who are used to using the
categorical scale. But I've talked to a
lot of people about this and very few people outside of the mammographers have
read the BIRADS document.
If
you go through the BIRADS document and you go to category four, which is
suspicious and recommend for biopsy, it actually tells you there that the
radiologist should tell the referring physician their sense of the probability
of cancer. There is actually a culture
already existing in which you can use this kind of patient management action
items like a BIRADS three, four, five, and at the same time give a continuous
probability of disease rating.
I
see some puzzled looks. I'm trying to
figure out just what I should comment on next.
So to make a long story short then, this continuous probability rating
scale has been used for most ROC curves generated in this community for the
last eight or so years. In the breast
imaging --
Oh,
I remember what I was going to say.
That's why I'm stalling here. In
the breast imaging community many people, it may not be more than half, but
people do use this BIRADS scale. But
it's really important to realize that this BIRADS scale was not generated --
was not designed to generate ROC curves.
People who have tried to
use a five-category scale in this scheme and the BIRADS scale at the same time
have met with a lot of confusion. It
does not work out very well and I see somebody who may have witnessed people
having that experience.
Well,
I gave a lot of background here because I would like people to understand that
this is a real issue for the community. You would really like to have both because
every clinician says, "I want to know the patient management and I want to
know the score card of the patient management." Every clinician you talk to, that's what they want.
Everybody
who measures ROC curves says, "I want to measure it as finely as I
can. I want to use this
quasi-continuous reporting scale."
The best of both worlds would be to get both the quasi-continuous rating
to get the ROC curve and the patient management action item to get a single
sensitivity specificity point.
I'll
get a little dramatic for a moment here.
I've talked to many friends. I'm
very familiar with the literature. I
could find one example in all the literature at the moment that's in print
where both of these were done. I could
only find one example of where the best of both worlds was done. This
is a paper on classification, what Bill Sacks and others called CADx using a
computer not to detect but to classify lesions on a film that are already
known. I know that I have a stack of
films here that have microcalcification clusters on them. My
task is just to say which ones are benign and which ones are malignant. That's the task. But I'm going to keep score ROC wise and I'm also going to keep
score patient management wise. I'll
show you what they got in a moment.
These
authors -- Yulei Jiang, I guess, was expected here today from a group in
Chicago under Kunio Doi. They studied
this test and they had 10 readers and they studied the complete ROC
curves. They studied all the summary
measures and they also studied the patient management or the action item,
sensitivity specificity point.
Here
are the results. Here is the average of
10 ROC curves for 10 readers trying to make this dichotomy, trying to make this
distinction between benign and malignant lesions. Here is the ROC curve in the unaided by computer condition. This curve was generated using the
hundred-point probability scale.
This
is the curve in the computer-aided condition, again generated by the
hundred-point probability scale. This
point is the mean sensitivity specificity point generated just by making the
threshold, dichotomizing the data.
These patients benign, these patients malignant. This is a single dichotomy patient action
point in the unaided condition.
That's
the same point in the aided condition.
You would love these points to fall on top of the curves and, for all
statistical purposes, they do because remember the mean -- I have to remind you
of this famous joke that we use around here.
There was a six-foot statistician.
You know what happened to this fellow, right? He drowned while wading in a stream that had an average height of
five feet. You have to know about the
variability.
This
is not about means, okay? This curve
moves all over the place and this curve moves all over the place in
practice. This is the average of 10. Same thing. This point moves all over the place as does this. For all practical purposes this is a great
experiment. This point falls on that
curve.
Well,
it's the only case I could find in the literature. How come you don't see more of this? When you live with these people that I live with, it's a great
crowd of people and the clinicians say, "I want the action
point." I say, "The committee
wants to measure the ROC curve."
Everybody says, "Let's do both." We are trying to come to that position. Why don't we see more of it?
Well,
the area under the ROC curve, remember, you have your ROC curve and you've got
the area under it. You are essentially
getting the sensitivity averaged over all specificities. Right?
You're averaging. You're going
to average away a lot of noise.
The
variation -- the variance of the area under the ROC curve -- oh, my
goodness. The most important number of
my entire talk is missing. The variance
of the area under the ROC curve is the binomial variance over two. There's a two here, a very important
two. Those of you who know me know I'm
an expert in factors of two. It's the
binomial variance over two.
What's
the binomial variance? Well, I thought
if you had a group as we have here today, about a third of you -- maybe 40 percent
of you as I look around -- know what the binomial variance is. Suppose we had this meeting next week and we
drew from the same population from which you all came.
The
next time we did it we might get 32 percent of you might know what the binomial
variance is. If we do it three weeks
from now and draw another group in, maybe 49 percent or 52 percent of you will
know what the binomial variance is.
What
we've just done is what Bill Sacks refers to.
We just made a self-referential example here. The binomial variance is the variance I would experience if I did
the experiment I just discussed with you.
The area under the ROC curve experiences only half of that variance.
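As a small worked example of the numbers in that thought experiment, here is a Python sketch. The audience size of 100 is an assumption made for illustration, and the halving in the last step simply applies the speaker's approximate factor of two; it is not a derivation.

```python
def binomial_variance(p, n):
    """Variance of an observed fraction when each of n people independently
    answers yes with probability p."""
    return p * (1.0 - p) / n


p_know = 0.40   # about 40 percent of the room knows the answer
audience = 100  # hypothetical number of people in the room

var_fraction = binomial_variance(p_know, audience)
sd_fraction = var_fraction ** 0.5
var_with_factor_two = var_fraction / 2.0  # the speaker's "binomial variance over two"

print(f"variance = {var_fraction:.4f}, standard deviation = {sd_fraction:.3f}")
print(f"half of that variance = {var_with_factor_two:.4f}")
```

With these assumed numbers the observed fraction would wander by roughly five percentage points from meeting to meeting, which is in the same spirit as the 32 to 52 percent range in the example.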
If
I studied sensitivity by itself and was able to tell you ahead of time what the
specificity was so you didn't have to estimate the specificity, the variance of
sensitivity is the entire binomial variance.
In
the real world you have to estimate both the specificity and the sensitivity, so
the uncertainty in the specificity propagates into the sensitivity and adds to
its variance. So if you wanted
to estimate the uncertainty in that action item that I showed, that point, the
circle or the triangle in the previous data, if you were to estimate that, you
would have to live with an uncertainty that was greater than the binomial
variance.
If
you use area under the ROC curve you get a great reduction. You get the binomial variance over that
famous factor two. This is all
approximate but it works out very well with very practical examples.
So
what we say is that the ROC area, with its smaller variance, is the least burdensome
approach to putting quantification into this problem. I remind you that is something that we are supposed to enable
sponsors to appreciate.
Another
thing that we realize in many discussions with academics and within our house
and with the sponsors and so on is if you want to live in both of these worlds,
that requires consistent conventions.
If you want to be able to get both categorical reporting and the
BIRADS reporting, that's a lot of work to try to get people to be consistent
that way. People have dropped the
categorical scheme for all practical purposes.
Even
if you want people to be consistent between BIRADS and the quasi-continuous
scale, that's difficult. We've seen a
lot of data in our own group and from some of the universities. When you train people, this can be done but
not everybody is trainable right away to be able to do this so it's an
issue. To get data in both worlds then,
it's going to require some convention development.
My
final point here says this may require consensus bodies to promote the
practice. We would hope that the
American College of Radiology, some of the other professional societies, and
even the fact that this is of interest to NCI and the FDA, we would hope that
some of this would encourage people to try to do measurements so that we could get
both the point and the curve. Then I
think everybody would be happy.
Well,
this brings us to a little interim here.
Some of you are very familiar with the next few slides. These are what we call the most famous
slides in the ROC archives. Those of you
who know Charles Metz have seen this many times and his followers will use
these many times. Charles devised
these slides over 25 years ago.
Here's
the classic question. You have two
diagnostic modalities, modality A and modality B. Which one is better? You
look at them and you have people doing public policy thinking in their
minds. Which one of those is
better? You start calculating something
you've seen in a statistical decision theory book.
But
the way this is approached in the field of medical imaging is the
following. There are several
possibilities here. Those two points
may lie on completely different ROC curves.
In that case we say that modality B is unambiguously better than
modality A because at any false positive fraction the sensitivity of A is lower
than that of B.
There's
a different scenario. The two points
could fall on the same ROC curve. Then
you have these same people scratching their heads and saying, "Where
should they really operate?" Well,
in principle we believe that readers can move their level of
aggressiveness. Not on any fine scale
but we know that they adjust depending on the risk group they're seeing. Some people do move around on their ROC
curve so in principle these two points represent equivalent modalities.
As
I say, people will for years say, "There must be one of these operating
points that's better than the other."
Remember when I showed you that data from Craig Beam you saw people at
every level of aggressiveness. Each one
of these people in some way thinks they've optimized.
This
is what we call the expected utility function or the expected value function. Every one of those people thinks in some way
they have found the optimal operating point but they disagree with each other
so this is another reason for using the ROC method.
There's
yet another scenario. ROC curves may
actually fall in such a way that modality A is everywhere higher than modality
B. For the same reasons we would say
that modality A is the superior modality in this scheme. Three different possibilities. B higher, equivalent, A higher.
This
is the motivation for trying to get a finer measurement on this hundred-point
scale. Then if the clinicians really
want to know about the actual operating point, that is another step and we are
all for that if you can coordinate the measurements but it's very difficult to
do that.
Well,
I'm sure many of you are sitting there thinking what about if the ROC curves
cross? We know if that happens the
situation enters the world of ambiguity.
Then you can no longer necessarily use the total area under the curve as
a sufficient summary measure of performance.
Other
summary measures may be necessary.
There are any number of other ways to make a summary measure of curves
that cross. You can use partial areas. There's actually software even for that
today. Or you can use parametric
summaries of the curve and there are several other ways to look at this.
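One of those alternatives, the partial area, can be sketched in Python as below. The operating points and the false positive range are hypothetical, straight-line interpolation between measured points is an assumption, and this is not the software the speaker refers to.

```python
def tpf_at(points, x):
    """Linearly interpolate the true positive fraction at false positive fraction x."""
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= x <= x1:
            if x1 == x0:
                return max(y0, y1)
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    raise ValueError("x lies outside the measured curve")


def partial_auc(fpf, tpf, lo, hi):
    """Trapezoidal area under the ROC curve between two false positive fractions."""
    points = sorted(zip(fpf, tpf))
    xs = [lo] + [x for x, _ in points if lo < x < hi] + [hi]
    ys = [tpf_at(points, x) for x in xs]
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for x0, y0, x1, y1 in zip(xs, ys, xs[1:], ys[1:]))


# Hypothetical operating points, including the corners (0, 0) and (1, 1).
fpf = [0.0, 0.1, 0.3, 0.6, 1.0]
tpf = [0.0, 0.5, 0.8, 0.95, 1.0]

total_area = partial_auc(fpf, tpf, 0.0, 1.0)
early_area = partial_auc(fpf, tpf, 0.0, 0.3)  # a range fixed before the study
print(f"total area = {total_area:.4f}, partial area for FPF 0 to 0.3 = {early_area:.4f}")
```

The limits `lo` and `hi` stand in for whatever region a protocol would name ahead of time.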
If
you decided you're going to use other summary measures, if you anticipate this
possibility, the study protocol is expected to address this because if you wait
until after the study and say, "I was going to use the partial area in
this region," we have a name for that.
That's called data dredging. You
have to build that into your study up front.
Otherwise, when people do not expect to see the curves cross in any real
way, they tend to use the area under the curve as a summary measure.
Well,
for submissions as are coming before us in the area of computer-aided detection
schemes, there is a question of how do you keep score for the location
scored. I must remind you this is
shocking to people who have never heard this before.
The
basic ROC paradigm is an assessment of the decision making at the level of the
patient. You don't say, "Where
does the patient have diabetes?"
You say, "This patient has diabetes." Or you say, "This patient has
TB." You don't say, "The TB
is here." You say, "This
patient has TB." So the score
keeping until recent years has been based on decision making at the level of
the patient.
In
more complex imaging you want to do the assessment of the decision making at a
finer level. You would like to assess
how well the localization was done.
Well, there are little errors there that come across funny. If you do localization, of course, you will
be providing the experimenter with more information.
If
you have more information in the study, you get more statistical power. The trouble is to do all this adds
complexity to the experiment. I would
just like to review for you a couple of the highlights of the issues that have
come up when you try to do location specific ROC analysis, so-called LROC for
location specific ROC analysis.
The
biggest problem is that if you want to keep score of a hit, the measurement of
the hit depends on the criterion you use for localization. If the lesion really is here somehow and you
draw your circle and you say the lesion is here, there is a certain amount of
overlap and you would be surprised to see how sensitive the measurements are to
that degree of overlap to the criterion you use for that. That's a real issue. There's no unique result. There's no unique LROC curve at the moment
for the state of the field.
There
are a couple of subtle points here that are very technical. I would just like to mention one of
them. People have studied this for 20
or 30 years. For a certain class of
problems if you study the ROC and if you study location specific ROC, the
curves in the summary figures track with each other monotonically.
If
the one goes up, the other goes up. If
one goes down, the other comes down.
They might change at different rates but they go together
monotonically. So people haven't felt
bad about just using ROC analysis instead of LROC analysis if they were willing
to invest the extra resources because you will lose statistical power.
But
people have been willing not to go to this level of complexity and to go to
that higher level of complexity requires more elaborate models, more elaborate
assumptions. These are still debated
until today. You can see in the SPIE
handbooks that people are debating this back and forth, Charles Metz and Dave
Chakraborty.
But
I must mention that a lot of progress has been made in this field. The bottom line of this slide if you haven't
followed any of this is that essentially there's a lack of validated software
for analysis of such experiments. Now,
Elizabeth and the MIPS, Medical Image Perception Society, website actually has
software for several of these approaches.
The
writers of that software feel very good about the state of their software but
there continue to be discussions in the field about how far have they
validated. Have they checked whether
the alpha level and the reject rate are agreeing and what is the power and so
on.
The
debate goes on but I expect people to be coming down from Pittsburgh any day or
any week now saying, "You've got to start using this because it's been
validated." That's the state of
the knowledge right now. There is
software there but there are still people discussing the condition of the
validation of the software.
So
a few years ago to find some kind of a happy medium Nancy Obuchowski of the
Cleveland Clinic and colleagues said, "Why don't we just simplify the
task? Why don't we do something called
region of interest location specific ROC analysis. Let's only require localization to within a quadrant so you don't
have to say there's a lesion here or a lesion here. You just have to say I see a nodule in this quadrant. You require localization only up to a
quadrant."
Similarly
for the other quadrants you could say, "Why didn't we do it for octants or
16 fold or 32 fold?" Well, you
could. This is sort of the entry level,
this problem, but as you add number of possibilities, then you get more into
questions of overlap and ambiguity so people have decided, "Let's start at
the level of just quadrants." As I
say, sort of the entry into this problem.
Continuing
on discussing this so-called ROI approach, the location specific ROC analysis,
right away Dave Chakraborty jumps into the literature and says, "Wait a
minute. This doesn't correspond at all
to the clinical task." People have
debated that back and forth whether it does or not.
But
from the other wing of this Greek chorus comes the methodologist to say,
"Yeah, it may not be quite right but it's really straightforward to
account for correlations without getting into these assumptions that people
have debated for a while."
What
do I mean by that? Here are four
quadrants, the right side of the lung, the left side, the top, and the bottom
if you will. Whatever is going on in
this quadrant is expected to be correlated with what is going on in this
quadrant, or at least could be, and similarly across the quadrants.
After
all, this is the same person, has the same genes, experienced the same
environment, and had a picture taken with the same imaging system. One has to allow for the possibility that
these quadrants are correlated. The
nice thing is that Carolyn Rutter and others came by another year later and
said, "Wait a minute.
All
you have to do to preserve those correlations is when you resample you resample
on a patient basis. You can't start
resampling quadrants, this one from this person and this one from that
person. You have to resample on a
patient basis so if I sample you, all four quadrants from you come into that
sample and so on." When you do this, you
actually preserve the correlation structure and you are said to be using the
patient as the independent statistical unit here.
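(Illustrative sketch, not part of the spoken remarks: one minimal way to resample on a patient basis in Python. The toy data, the function names, and the choice of a pooled quadrant-level empirical AUC as the statistic are all assumptions made for this example; they are not taken from any study discussed at the meeting.)

```python
import random

def empirical_auc(pos_scores, neg_scores):
    """Mann-Whitney estimate of AUC; ties between classes count one half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

def roi_auc(patients):
    """Pool quadrant-level (truth, score) pairs over patients, then compute AUC."""
    pos = [s for quads in patients for truth, s in quads if truth == 1]
    neg = [s for quads in patients for truth, s in quads if truth == 0]
    return empirical_auc(pos, neg)

def patient_bootstrap_aucs(patients, n_boot=500, seed=0):
    """Resample whole patients with replacement, so all four quadrants of a
    sampled patient enter the replicate together."""
    rng = random.Random(seed)
    aucs = []
    for _ in range(n_boot):
        replicate = [rng.choice(patients) for _ in range(len(patients))]
        truths = {truth for quads in replicate for truth, _ in quads}
        if truths == {0, 1}:  # skip replicates that lack one of the two classes
            aucs.append(roi_auc(replicate))
    return aucs

# Hypothetical toy data: one list per patient, four (truth, score) quadrants each.
patients = [
    [(1, 0.9), (0, 0.2), (0, 0.1), (0, 0.3)],
    [(0, 0.4), (0, 0.2), (1, 0.6), (0, 0.5)],
    [(0, 0.1), (0, 0.1), (0, 0.2), (0, 0.3)],
    [(1, 0.8), (1, 0.7), (0, 0.6), (0, 0.2)],
    [(0, 0.3), (0, 0.5), (0, 0.2), (1, 0.4)],
]

aucs = patient_bootstrap_aucs(patients)
spread = (min(aucs), max(aucs))  # rough feel for the sampling variability
```

(Because each draw takes a patient's four quadrants as a block, any correlation among those quadrants is carried into every replicate. Drawing quadrants one at a time would break that structure.)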
Well,
that's all I'll be saying about location specific score keeping and now to one
of the real problematic issues in the submissions as we'll be seeing in the
next couple years. This is the problem
of uncertainty of truth state. There's
a classic paper that all of us have almost memorized by now from Revesz, Kundel,
and Bonitatibus 20 years ago.
This
is Harold Kundel known to many of us as one of the pioneers of this field, the
mentor of someone on our panel today, who was at the Temple University, and now
is at the University of Pennsylvania emeritus.
These authors, what did they say?
They included various ways of obtaining panel consensus truth.
They
actually did a study comparing three different ways of doing chest imaging and
they had the truth but they set the truth aside. They said instead of depending on the truth to keep score, let's
get a truthing panel. What they found
out was they had several ways of obtaining consensus from that panel. They
could either use unanimity. They could
use majority. They can use some kind of
expert review. They have three or four
ways of reducing this panel to truth.
They compare three imaging modalities, as I said, and here's what they
found. Any of the three imaging
modalities could be found to out perform the others depending on the rule you
used for reducing the panel to truth.
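(Illustrative sketch, not part of the spoken remarks: how the rule used to reduce a panel to truth changes which regions count as positive. The three-member panel, the votes, and the "any" rule are hypothetical; the cited study also used expert review, which is not modeled here.)

```python
def consensus_truth(votes, rule):
    """Reduce one region's 0/1 panel votes to a single truth label."""
    yes = sum(votes)
    if rule == "unanimous":
        return int(yes == len(votes))
    if rule == "majority":
        return int(2 * yes > len(votes))
    if rule == "any":  # assumed extra rule, included only for contrast
        return int(yes >= 1)
    raise ValueError(f"unknown rule: {rule}")

# Hypothetical votes from a three-member panel on six candidate regions.
panel_votes = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
    [1, 1, 1],
    [0, 1, 1],
]

positives = {
    rule: sum(consensus_truth(v, rule) for v in panel_votes)
    for rule in ("unanimous", "majority", "any")
}
print(positives)  # {'unanimous': 2, 'majority': 4, 'any': 5}
```

(The same six regions yield two, four, or five positives depending on the rule, so any score kept against that truth moves with the rule.)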
So
this sobers a lot of us in the field about using a panel as truth. However, today the target of this experiment
we'll be discussing today is not to say this is a nodule that is a cancer. It is only to say this is a target. This is a region that a panel of experts
would consider to be an actionable nodule.
We're
not trying to keep score based on the truth.
We're trying to keep score based on what would a panel of experts
do? Would they cue this region or not? Nevertheless, even though we changed the
target, this classic reference above tells us that there's going to be
additional uncertainty because of this panel.
The panel will have variability in it and if you go to RSNA over the
last few years, you'll hear papers on this subject.
What
we've said to incoming sponsors is that we strongly encourage you to resample,
to come up with some resampling schemes to resample the panel to get a feel for
the additional uncertainty that comes into this problem over and above the MRMC
paradigm, over and above due to the fact that there is noise in the panel. You can start to see why there is no canned
software to do this problem.
Well,
since the truth is uncertain, it turns out that leads to uncertainty, in
effect, in the number of samples you have.
Let's talk about designing an experiment for a moment. Suppose you want to design experiments that
are going to have very tight error bars on the sensitivity. Everybody knows that if you want to do that,
you want to have a lot of actually diseased cases to tighten up the error bars
this way.
If
you want to tighten up error bars the false positive way, you would want to have a
lot of actually non-diseased cases. If
your endpoint is the area under the ROC curve, what distribution should you have
between nondisease and disease cases?
Well, it turns out it should be some kind of average between the
two. It turns out that the number you
should be using is the harmonic mean of the numbers in the two classes.
The
numbers in the two classes is going to depend on the panel, right? Because some of the panel members will say
these are diseased and others will say those are diseased. The actual number of diseased cases depends
on the panel. We have uncertainty in
truth that leads to uncertainty in the number of samples.
This
is almost a trivial curve and I'm just going to tell you about the highlights
because we think it might factor in today.
Suppose you are told you can design an experiment with 100 patients. You say, "How should I distribute
them?"
Well,
you distribute them, let's say, at the beginning of an experiment like this so
that you have 20 that are actually nodule containing cases, 80 non-nodules, 20
nodule containing sites so we have an 80/20 break.
This
effective number, this harmonic mean of those two numbers, is 32. Whereas if I make a more even split, 60/40,
50/50, for 60/40 it would be up in the 40s the effective number. On a 50/50 split the effective number of
samples for that experiment would then be 50.
That's not surprising.
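(Illustrative sketch, not part of the spoken remarks: the effective number quoted above, computed as the harmonic mean of the two class counts. The function name is an assumption.)

```python
def effective_n(n_diseased, n_nondiseased):
    """Harmonic mean of the two class sizes."""
    return 2 * n_diseased * n_nondiseased / (n_diseased + n_nondiseased)

for split in [(20, 80), (40, 60), (50, 50)]:
    print(split, effective_n(*split))
# (20, 80) 32.0
# (40, 60) 48.0
# (50, 50) 50.0
```

(With 100 patients the effective number peaks at the even split. It follows that a panel rule that moves cases from one class to the other also moves the effective number.)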
The
reason we're showing this is suppose you start out with an experiment like this
and you are requiring unanimity in the panel to declare a nodule-present. Then suppose you relax that criterion and
say instead of requiring unanimity, we'll just require two out of three. Then you expect that whatever the number was
before you're going to move up this curve.
So
you are sampling variability, losing power, but gaining samples. You may tend to cancel. We don't know this. We are speculating about this. We'll discuss this. What I just said is if you want to get into
the realm of resampling your panel, you could start by relaxing the panel
criterion from unanimous to majority and there are several other ways of doing
this.
This
is just, again, an entry level. When
you do this, this gets you into the game.
This allows you to resample, to assess the variability, but it may also
increase the effective number of samples.
These effects may tend to cancel.
This is, again, speculation just based on the direction of these
effects.
The
last thing I want to talk about today is the problem of controlling for reader
vigilance. When you do an experiment,
with my two little pads of paper here, when you read in the unaided reading
condition versus reading the aided reading condition, there are some people in
this room who may be competitive.
If
you're reading in the unaided reading condition you say, "The computer is
about to tell me what it thinks."
If you are a little bit competitive, you are going to say, "I've
got to be careful when I read this."
You may increase your vigilance.
How
do you mock up? How do you do this
experiment? This is a challenge that
hasn't been quite sorted out. Any
measurement setting has an artificial condition compared to the actual real
world of practice. What I just
described to you is the possibility that some readers might be more vigilant in
their unaided reading because they know they are subject to the site.
Well,
when you turn a modality loose in the real world, just the opposite could
happen, right? The readers might be
less vigilant in the real world because they know, "Well, I can brush
through this. The computer is going to
give me what it thinks in just a minute."
In the real world the vigilance could go down. In some experiments it could go up and I think we've seen
experiments when the vigilance didn't change but I'm not sure you can guarantee
that.
The
only thing we've seen in the practical solution to this problem, Heang-Ping
Chan and colleagues about a dozen years ago wrote a paper in which they said,
"Look, this is a real issue, this vigilance.
How
do you do a controlled experiment controlling for reader vigilance?" They said, "Well, just simply control
the time available to readers in the unaided reading condition to mimic the
actual clinic." That was a suggestion they
made. I don't know how many people have
tried that yet but that's in the air.
Well,
you can all take a deep breath now.
We're in the summary. Here we
are. This field has been going on for
30 years. In the last 10 years the
whole issue of reader variability has complicated it and there have been ways
to promote it to address the issue of reader variability.
In
the last few years we've had to deal with the complications from location
uncertainty, from uncertainty in the truth, this issue of reader
vigilance. What we've tried to do is
this is like a quadrangle, as I said.
We're here sitting at the FDA and also doing some research here.
We
have our academic colleagues doing research in academia, industry sponsors
doing research on all these issues in another side of the quadrangle, and NCI
and the Lung Image Database Consortium that we've been very actively working
with and who are very interested in these issues.
We've
tried to hold the windows open so that this quadrangle from all quarters has been
open to everyone. Whenever industry
sponsors have come in with issues like this we've said, "Look, the windows
are open.
Here's
what is known from all these quarters.
Here are the papers. Here are
the drafts that are not even published yet.
Here's what we know at the moment.
We don't have guidance. We can't
say this is where the FDA or anyone is holding the bar but this is all the
knowledge that we have at the moment."
There
is no canned software. There's canned
software for little pieces of this problem so any industry sponsor would have
to be creative to come forth with a novel way of putting all these pieces
together.
Well,
that's the state of the world as we know it today. Thank you very much for your interest in this. Oh, there's some papers. The "tz" are obviously Charlie
Metz's papers. There are a few papers
from our own group in which we have actually worked with Charlie Metz and our
own statisticians and our clinicians try to review the state of the world.
This
is the first LIDC document. It's going
to come out in April. Then in your
notes there are many other pages of references.
DR.
IBBOTT: Thank you, Dr. Wagner. Before you go too far, I would like to ask
if there are any questions from the panel for Dr. Wagner.
DR.
KRUPINSKI: What's the consensus? I mean, the quadrant problem gets rid of the
localization problem if you end up with a nodule in each quadrant. What it still hasn't addressed, what do you
do, for example, when you've got two lesions in a quadrant?
DR.
WAGNER: That's right.
DR.
KRUPINSKI: You still have that basic
uncertainty.
DR.
WAGNER: That's right.
DR.
KRUPINSKI: The flip side of that is
what if there is a false positive in the quadrant along with a true
positive? You've just simply squished
it --
DR.
WAGNER: That's right.
DR.
KRUPINSKI: -- into a quadrant and you
still have avoided the localization problem and the problem of a false positive
and true positive.
DR.
WAGNER: That's right. That's been sidestepped. As you know, the higher levels of software
attempt to address this one way or another and I think the jury is still out on
whether we are ready to use that. I
think the inventors of those other methods think they are ready to go and they
might be but we also know there are people in the wings saying I'm not sure
about these assumptions and so on. That
software does not have general providence right now. Maybe that's too bad.
Maybe it should be. These are
real issues.
DR.
BLUMENSTEIN: I'm impressed by the MRMC
study design. I think that's a nice
step forward. I'm wondering if anybody
has ever subjected the same reader to the same image multiple times and studied
the effect of that so that you could get at this issue about how a single
reader uses their own personal scale?
DR.
WAGNER: Yes. That's a classic question.
There are experiments on that.
I'm making this up but this is the spirit in which I remember it. David Getty has shown some data on this in
mammography and I think that readers are correlated with each other in the 60
percent range and are correlated with themselves only 70 some percent on
repeats. There is, indeed, a lot of
intra-reader variability.
However,
you get more bang for buck -- if you want to spend so much time in radiology
reading-wise, there's more bang for buck to get a different reader than to use
the same reader over again because you are so correlated with yourself you get
more independent information if you bring in a sample that's not so correlated
with the preceding reads. Bang
for buck-wise people have said this is a question of reading time. People have not in the MRMC paradigm in
general tended to have readers reproduce their readings. You can do it and there are terms in the
model to accommodate that, of course.
It's just not common.
DR.
BLUMENSTEIN: Actually, you took my
question as a suggestion maybe of changing the study design. I didn't make it clear. What I'm actually concerned about is whether
the methodology that's been developed to give p-values, estimate variance,
which you rightly point out are the big issues here, whether those properly
account for intra-observer variability in their use of the scales?
DR.
WAGNER: I believe it does and I'll tell
you why. The full model has seven
terms. I won't take you all through all
of those seven terms. Pure case, pure
reader, various interactions. One of
them is a three-way interaction between modality reader and case.
That's
the sixth term. The seventh term is
what you're talking about. It's the
lack of reader reproducibility. If you
do enough experiments, you can identify so-called in statistical language. You can separate these two. If you don't do the right experiment, you
can't but they get lumped together. The
term you're trying to get at is the reader inconsistency. That is sampled in the experiment but it
cannot be identified. It cannot be
broken out but it is in there.
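(Illustrative sketch, not part of the spoken remarks: one common way to write a components-of-variance model of the kind being described. The notation is an assumption; the speaker did not show an equation. Here i indexes modality, j reader, and k case, and X is a reading-level score or pseudovalue.)

```latex
X_{ijk} = \mu + \tau_i
        + R_j + C_k
        + (\tau R)_{ij} + (\tau C)_{ik} + (RC)_{jk}
        + (\tau RC)_{ijk} + \varepsilon_{ijk}

% Seven random terms: R, C, (\tau R), (\tau C), (RC), (\tau RC), \varepsilon.
% The sixth, (\tau RC), is the modality-reader-case interaction; the seventh,
% \varepsilon, is the lack of reader reproducibility.
% Without repeated readings of the same case by the same reader in the same
% modality, only the sum of the last two variances is estimable:
\sigma^2_{\tau RC} + \sigma^2_{\varepsilon}
```

(Under this reading, reader inconsistency contributes to the variance used in the inference even though it cannot be separated from the three-way interaction.)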
In
fact, the way we do it is we do it with a family bootstrap experiment so we can
actually pull out all these effects but we cannot pull out the MRC from the
epsilon. They come together. That represents not only this three-way
interaction but represents the inconsistency of all the data sets
together. So that is actually in
there. Are you surprised?
DR.
BLUMENSTEIN: No, no, I'm not. But since you don't measure that in the
experiment, you can't estimate it obviously.
That's the issue. I guess what
I've been concerned about ever since I first heard about the use of ROC curves
where the reader is recording their result on a subjective scale either
categorical or probability or whatever it is.
It's a device to get
you to the point of being able to use ROC methodology. What has always concerned me was that there
was this underlying source of variability that wasn't taken into account in the
models that you are estimating. It's
only if you do the experiment that way that you actually get an estimate of
that intra-observer or whatever you called inconsistency or whatever.
DR.
WAGNER: Right.
DR.
BLUMENSTEIN: I just wondered whether
the degree to which this has been studied in actuality.
DR.
WAGNER: Not very much because of the
bang for buck point. As you can see, if
you are inconsistent with yourself, and everyone is, that will show up in case
to case within a given experiment but you won't be able to peel it out but it's
in there and it's accounted for in the inference. It's a subtle point but we can discuss it.
DR.
TRIPURANENI: That was an excellent
presentation, Dr. Wagner. We used the
MRMC for the intra-observation. If you
are looking at two different modalities such as a chest x-ray or a CAT scan,
have you looked at whether there is any difference in the intra-observation
between one modality to the other modality?
DR.
WAGNER: It turns out to be a really
neat point actually. Our own group has
three papers on this subject. In the
first one, you want to know if you can see the difference in the variance
structure between the two modalities.
Is that what you're asking?
DR.
TRIPURANENI: That's right.
DR.
WAGNER: There's a model that has six
terms. We were just talking about
that. There's another model that -- you
would think you would have to go to 12 terms to do that. It turns out there is a parsimonious way to
do it with just nine terms but two ways to do that.
When
you do it you find out that the extra issues brought up by the wrinkles you
were just discussing, they come in in such a way that they average and it's
only their average that goes into the inference so you can forget about the
issue. It's a really interesting
issue. We have two papers on it. But you could forget about it. You could from right off the metro just hear
about this and say, "I'm going to use the DBM software." You could forget about the difference in the
variance structure across the competing modalities and if you do, the inference
is still the same inference. It doesn't
matter. It's a really interesting
point.
DR.
IBBOTT: Dr. Solomon.
DR.
SOLOMON: How do you -- I mean, I have a
feeling this topic is going to be discussed throughout the day but how do you
translate changes in ROC curves into clinical significance? Especially since if you look at an
individual's change in the ROC one person might do worse and another person
might do better and then how do you make that determination?
DR.
WAGNER: Right. Well, you might have been a fly on the wall
in many meetings. I mean, this is a
real issue. Dr. Sacks will say
something about it later on. All I can
tell you is that the most statistically powerful method to get at these
differences is the one I've discussed today.
We really would like -- well, I take you
back to the Yulei Jiang stuff. We
really do want to see those action items.
You can't go from the curve easily to the action items if you haven't
measured those action items. Is that
what you're getting at? I'm not sure I
see what you're getting at.
You
want to know how we can go from this ROC summary and inference to an
inference to the clinic. Is that
where you're going? I think it's
difficult. What we're saying here is
what we are doing is we are making a measurement that averages over all these
variabilities that we have talked about.
It averages over all that and here's the summary.
If
you want something more clinically relevant than that, you would have to
actually measure the action item, the dichotomization, if you will, and give it
error bars. When you finish the problem
is here would be the action item sensitivity specificity for the one modality
and here it would be for another one or this way. Now, what do you do?
Suppose
they go this way? What are you going to
do at this point if they don't match up sensitivity wise or specificity? What are you going to do? There are things you can do but you have to
start getting into expected utility analysis.
I didn't mention it but I have some very strong professional opinions on
this.
I
think it's impossible to do that because to do the expected benefit analysis
you need to have an idea of the prevalence of the disease and that changes from
risk group to risk group so that is a big uncertainty. You have to have a sense of something called
the utility matrix, the number of false alarms that you are willing to trade
for a hit, if you will, different from the positive predictive value.
You
have to have a sense of that utility matrix and you have to actually know the
ROC curve already because all these things come in. I think this is almost impossible to do without this being taken
on at a national level.
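(Illustrative sketch, not part of the spoken remarks: how prevalence, a utility matrix, and an ROC curve combine in an expected utility calculation. The equal-variance binormal curve, the utility values, and the prevalences are hypothetical. As the speaker stresses, these inputs are exactly what is uncertain in practice.)

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def tpf_binormal(fpf, a, b=1.0):
    """Binormal ROC curve: TPF = Phi(a + b * Phi^{-1}(FPF))."""
    return _STD_NORMAL.cdf(a + b * _STD_NORMAL.inv_cdf(fpf))

def expected_utility(fpf, tpf, prevalence, u):
    """Average utility per case at one operating point (FPF, TPF)."""
    diseased = tpf * u["TP"] + (1 - tpf) * u["FN"]
    healthy = fpf * u["FP"] + (1 - fpf) * u["TN"]
    return prevalence * diseased + (1 - prevalence) * healthy

def best_operating_point(prevalence, u, a, b=1.0, steps=999):
    """Grid search over FPF in (0, 1) for the utility-maximizing point."""
    grid = [(i + 1) / (steps + 1) for i in range(steps)]
    points = ((f, tpf_binormal(f, a, b)) for f in grid)
    return max(points, key=lambda pt: expected_utility(pt[0], pt[1], prevalence, u))

# Hypothetical utility matrix: a hit is worth 1, a false alarm costs 0.1.
utilities = {"TP": 1.0, "FN": 0.0, "TN": 0.0, "FP": -0.1}

low_risk = best_operating_point(prevalence=0.01, u=utilities, a=1.5)
high_risk = best_operating_point(prevalence=0.30, u=utilities, a=1.5)
```

(With the curve and the utilities held fixed, the preferred operating point shifts toward fewer false positives in the low-prevalence group. This is one reason a single answer across risk groups is hard to state.)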
You
can see from the data of Beam, et al. each one of these people thought that
they were working out the optimal operating point and have completely different
points of view. What I'm saying is
that's an important question. I think it's a societal question.
I
think it's very complicated and it calls for a lot of wise people with a lot of
data to sit down with professional societies and say, "Where are we and
where do we want to be?" This is a
really big issue. I don't have an easy
answer. I insist to my colleagues there
is not an easy answer.
DR.
IBBOTT: Brent.
DR.
BLUMENSTEIN: I think it is the key
question. What we are asked to do here
is to basically judge whether this difference in the area of an ROC curve --
DR.
WAGNER: That's right.
DR.
BLUMENSTEIN: -- has any translation to the clinical setting. What we're lacking we have a measure of the
significance of the difference in the area of the ROC curve. What we don't have is a measure of
uncertainty around the clinical interpretation of the ROC curve.
This
is what is particularly bothersome to me is I don't know how to do that and I
don't see any methodology that gives me that answer. I'm concerned that we have started building a building with a
foundation using subjective scales to measure things so that we can use ROC
methodology and we are using resampling methodologies to do this.
We're
not taking into account all the various sources of variability and so forth so
we are way out there and our foundation may be collapsing and not giving us
what we need with respect to the clinical outcomes.
DR.
WAGNER: Well, if this was broadcast on
academic TV today, apoplexy would abound in the community because we all feel
we are building, as you say. We're
building on decades of people trying to measure complex perceptional
phenomenon. This is where we are right
now.
It
may not be the ending point to which you would like to be but this is about the
best of where we are at the moment. I
tried to challenge you a moment ago if you wanted to work on any action
oriented clinical endpoints, I think it's very difficult to sort that out.
It's
very difficult because you'll get bigger error bars and it's very difficult
because the expected utility problem is one that every person in this room has
a different answer to that problem. I
think it's very difficult. I agree with
you that we are constantly besieged by our clinical colleagues who would like
to have better answers to this problem.
One
case which is kind of unambiguous is the Yulei Jiang's data that I showed you
had an ROC curve that went up. The
unaided condition was lower. The action
item, the dichotomization went from a certain sensitivity to a higher
sensitivity and a lower false positive fraction.
I
think everyone loves that scenario.
Wouldn't you say? That's the
world we want to live in. Right? That doesn't happen a lot. These more ambiguous things happen more
often. So what we can do is average
over the relevant parameters and say this is what we found.
In
principle if one ROC curve is higher than the other, in principle one can
operate at a given false positive in one modality and increase the
sensitivity. For every time B is higher
than A, if the specificity is here and the curve is everywhere higher, in
principle I can operate at a higher sensitivity. In practice how to do that, wide open. This is a professional society issue that is bigger than all of
us. That is a really tough
question. I agree.
DR.
BLUMENSTEIN: And just to throw one more
complicated issue into all this is that a lot of this stuff that you presented
here assumed that the modalities were assessed independently. In other words, modality A versus Modality B
but the experiments that we are asked to look at are modality B added to
modality A.
DR.
WAGNER: Right.
DR.
BLUMENSTEIN: Where the experiment
itself has built-in constraints with respect to how one behaves in doing
that. I don't see that taken into
account.
DR.
WAGNER: No.
DR.
BLUMENSTEIN: And I'm concerned about
that.
DR.
WAGNER: This is a point of confusion. I
would disagree with you. The modality A
here is the reader unaided. Modality B
here is adjuvated, the reader aided by the computer aid. This is a standard paradigm and it actually
corresponds to an experiment in the real world that you would like to do.
It
may not line up exactly with the clinical setting but you actually would want
to know something about the performance of readers unaided and then you want to
know about how they would perform in the aided condition. That is actually the comparison of interest.
DR.
BLUMENSTEIN: I realize that but the way
in which the data are recorded is such that the judgment -- as I understand it,
the judgment under A is there and has never backed off. You could only improve.
DR. WAGNER: Oh.
DR.
BLUMENSTEIN: And that's not taken into
account in any of these models that I see.
All the models that you presented, everything that you said, is based on
having an independent assessment of the two modalities.
DR.
WAGNER: Well, you have also touched on
something that we have had a lot of discussions on. These are real issues.
I'm not making light of anything you're talking about here. One hopes the day will come when these
modalities are really good. These
computer aids are really good and then you'll be allowed to back off. You could depend more heavily on the
modality.
Today
people are being encouraged not to back off but the measurement doesn't require
them not to back off. They are just
encouraged, "Do not back off," and there is a basic reason for that I
think Dr. Sacks will explain later on so people are encouraged not to back
off.
But
when the systems are really good as they are in mammography, these
computer-aided systems in mammography are almost flawless for picking up
clusters of microcalcifications. They
are far from perfect for masses but they are almost flawless for
microcalcification clusters so readers have thrown away their eye loupes, a lot
of them that are using these systems so they are willing to depend on the
computer.
I'm
just giving you the only anecdotal evidence.
You have a really good point. I
don't have a really good answer to it but in principle it doesn't have to be
this way. At the moment it is this way.
DR.
IBBOTT: I would like to remind everyone
we will have time to discuss this specific proposal in front of us later on
this afternoon.
DR.
STARK: May I ask a question exactly the
point of the presentation, I believe?
DR. IBBOTT: Yes, please.
DR.
STARK: Using the classic -- thank
you. That was an outstanding
presentation.
DR.
WAGNER: Thanks.
DR.
STARK: Let me just get to the point
because I know we are running short on time.
With a better test the AB test in some context in terms of clinical
utility, either one that had less scatter.
You showed the Beam paper where the radiologist skills cause scatter in
the distribution of the family of curves.
It
would seem to me that there would be two criteria applicable here where we have
a different choice where the test with the larger Az is not the better test if
that test is less flexible -- I'm sorry, has a larger scatter in terms of
variability of radiology performance, radiology implementation creating a
management problem, the implementation problem and then the clinical utility
problem where all of the fabulously sophisticated group here are focused on.
The
other area where the larger Az -- so if there is more scatter in the test with
the larger Az, it will likely be an inferior test, more cumbersome, more costly,
less safe and less effective in clinical utilization.
The
other thing is that if there are two tests with comparable scatter but is
easier to train with experience or inexperience, so if you have a trained panel
of readers like you do under these study conditions under very circumscribed
conditions where they know they are in a test and are not distracted by
clinicians, by the busy realistic environment of all mammography or chest CT
practices, you can have a curve that is more pliant in the direction that you
want doctors to either start at with distractions or to move into with
experience so it does seem to me that the scatter or the flexibility of the
performance.
The
ROC curve I think is unassailable and I have learned -- I have enjoyed a ton
here learning from Dr. Blumenstein's analysis, yours, and those of you who have
seen whatever I wrote here. My group
had to do this 20 years ago. We
published papers on ROC analysis and I know we're on the right -- I believe
we're on the right foundation.
I
think this is the right place to start but the breadth of the challenge facing
us all here today is let's not get obsessed with the ROC curves. I know we have the whole day for this but
the safety and effectiveness of this is going to be what happens when you drop
into a clinical environment.
And
we have a lot of experience with breast and this panel has a lot of people
experienced on it but can you tell me if you would agree that we need to see
the scatter in these Az plots and know how they respond to inexperience or
training to really know if the larger Az is better?
DR.
WAGNER: Well, I would say that I think
there is a little bit of second order phenomena here that is important. Just because something is second order
doesn't mean it's not important. For
the practical inferences that have been -- the endpoints of studies we've seen
to date, it has been the performance in the mean.
People
have addressed that. There is
software. We have several papers on how
to do just what you say and how to split out every piece so we can see how much
variation is from the cases, from the readers, from the various
interactions. There is actually
software to do that and we are encouraging people who operate at a higher
level, say NCI or some academic consortium, to address these very issues and we
can see it. We know how to peel all
this stuff apart. As far as the
inference on the table today, it was not done.
DR.
STARK: The burdens would be huge. I mean, the sample sizes, the whole time period,
the number of people that have to be involved.
DR.
WAGNER: That's right.
DR.
STARK: That's why you talked about the
need for national studies and we would all like to do that in oncology and
everything but we have to treat people and make decisions today.
On
the other hand, let me ask my final question.
Are you aware, or is anybody aware of any evidence that a p-value or
some other statistical measure comparing your test A, B under whatever
conditions, today's conditions or the ones I am dreaming about, we hope it has
some clinical relevance but couldn't it all be counter intuitive? I mean, this is a very subtle business and
couldn't we be missing the forest for the trees here?
DR.
WAGNER: Again, that's a very wise
question and I think that is why we have several medical officers involved in
our center on the panel here so I'll defer to them.
DR.
STARK: So the p-value of .003 doesn't
necessarily mean a thing.
DR.
WAGNER: I defer to my clinical
colleagues for that.
DR.
STARK: Thank you.
DR.
IBBOTT: I want to make sure that we
give Dr. Mehta a chance to ask a question if he has one. Dr. Mehta, do you have any questions? He may not be able to hear me.
DR.
MEHTA: No, I don't have any questions.
DR.
IBBOTT: Thank you.
All
right. We are a few minutes ahead of
schedule at this point so we'll take a short break. Let's make it 10 minutes and we'll be back at 10:50.
(Whereupon,
at 10:40 a.m. off the record until 10:55 a.m.)
DR.
IBBOTT: Take your seats, please. I'd like to continue the panel now if you
will take your seats, please. For those
of you who, like me, are concerned, we are getting the heat turned down in
this room. At least in one sense.
We
will now proceed with the sponsor's presentation which will be introduced by
Dr. Kathy O'Shaughnessy who is Vice President of R2 Technology. Dr. O'Shaughnessy.
DR.
O'SHAUGHNESSY: Thank you very
much. Dr. Ibbott, we are very pleased
to be here today to present our ImageChecker CT CAD software. I would like to introduce the attendees that
are here from R2 and some consultants that we have come to -- we have asked to
be here today to both present and answer questions from the panel.
Besides
myself from R2 Technology there's Dr. Castellino, our Chief Medical Officer;
Dr. Wood who is the head of our CT Products group; and Mr. Schneider who is the
lead algorithm architect that designed the algorithm that we are reviewing
today.
In
addition, we have asked the following people to join us. Dr. Delgado was a beta user of the system so
he can describe a little bit about his experience using the system at his
facility. Dr. MacMahon is a thoracic
radiologist from Chicago with extensive experience in both CAD and ROC
research. Mr. Miller is a
biostatistician for the study. Dr.
Stanford was one of the site investigators where we collected cases from one of
the sites.
Here
is a brief overview of our agenda.
After my introduction we'll go into the current clinical practice for
some background on lung CT and, in particular, the detection and management of
nodules and lung CT images. Then we'll
describe the device both in terms of how it works and how the user uses
it.
The
clinical study will start first with how we collected the cases that were used
and then go into detail into the methods and results from the clinical
study. After that we'll have a brief
discussion, presentation about the beta test that describes a little bit about
the usability of the system. And I'll
finally summarize.
Before
we move into the presentation, I wanted to put out our proposed indications for
use of this device. I thought it was
important to go over this to sort of put what we are presenting today in
context. The ImageChecker CT is a
computer-aided detection or CAD system designed to assist radiologists in the
detection of pulmonary nodules during review of multi-detector CT scans of the
chest.
It's
intended to be used as a second reader alerting the radiologist after his or
her initial reading of the scan to regions of interest that might have been
initially overlooked.
I
would like to ask Dr. MacMahon to come to the podium, please.
DR.
MacMAHON: Thank you. Again, I'm Heber MacMahon. I should say I have a small equity in R2
Technology. The company has also paid
my time and expenses for this meeting.
I
would just like to make some brief comments about the actual clinical practice
of radiology as it relates to thoracic CT scans and the importance of detection
of pulmonary nodules.
Some
of the common indications for performing thoracic CT scans would include
characterization of an abnormal finding on a chest x-ray. In this situation an abnormality may have
been detected and the purpose of the CT scan would be to characterize it as possibly
a lung cancer. And in addition to
detect additional abnormalities that might be relevant such as metastatic
nodules.
We
also use thoracic CT scans extensively for staging and monitoring lung cancer
and other kinds of tumors. In this
situation we are looking not only for pulmonary nodules, but also for enlarged
mediastinal lymph nodes and upper abdominal abnormalities.
In
the case of extra-thoracic tumors we are commonly also looking for pulmonary
nodules and for enlarged lymph nodes in the mediastinum. Then there are a range of other applications
of thoracic CT some of which are developing and will be used more extensively
such as detection of pulmonary embolism.
However, in all these
situations, although the pulmonary nodules are not the primary focus of the
examination, there is an opportunity to detect pulmonary nodules that may be
present in the lungs of these patients.
Finally,
there is lung cancer screening, which is investigational and, depending on the outcome of
the ongoing NLST study, may be used more widely. And, of course, in lung cancer screening pulmonary nodules are
the main focus of the investigation.
But
the point I would make is that lung nodule detection is a requirement in every
chest CT scan no matter what the original clinical indication. Only when the radiologist has detected a
nodule can he or she decide what course of action is then appropriate.
There
are various management strategies that can be used to manage a pulmonary
nodule. In order to determine whether
it's an actionable nodule, we need to consider the size. Generally larger nodules are more dangerous
and more likely to be cancerous.
We
consider the shape whether it's spiculated, ground glass, and so forth, whether
there's been interval change from a previous examination in the same
institution and that would be part of the normal diagnostic process to make
that comparison. We would consider, of
course, the clinical context, the age and gender of the patient, smoking history,
and so forth. There are a number of
factors that play into that decision in addition to the image itself.
If
the nodule is considered actionable, we can recommend a number of courses of
action. One of the most common would be
to obtain outside prior imaging studies from other institutions. If we can establish stability over a period
of time, no further action may be necessary.
Follow-up
CT scan might be prudent at anything from three months to 12 months depending
on the nature of the nodule and the radiologist's level of suspicion. Other kinds of imaging studies such as a PET
scan may be applicable, especially in larger nodules that are in the range of 8
to 10 millimeters. This may distinguish
cancer from a benign nodule.
Finally, we can consider biopsy, either
transthoracic needle biopsy, bronchoscopy, or thoracoscopic resection.
Just
to illustrate the clinical problem, here is an example of a very small
pulmonary nodule which I think might easily be overlooked in clinical
practice. It's almost indistinguishable
on the single section from surrounding blood vessels but this is, in fact, a
small lung cancer which was detected one year later, as you can see, at which
time it is much more advanced.
So
this is a very challenging problem for radiologists to visually detect these
very small nodules on CT scans. We are
aware that we do miss nodules and I'll just cite two particular studies of
interest that have addressed this issue of missed nodules on CT scans.
Dr.
Hartman and others at the Mayo Clinic looked at over 1,000 screening CT scans
and compared them with prior screening CT scans one year earlier to see how
many nodules may have been overlooked.
They found that as many as 24 percent of the prior prevalence scans had
nodules that were not recorded at that time.
This
might seem an astonishingly large number but this is consistent with some other
studies. Now, a large number of these
nodules were relatively small but more than one-third of them were about three
millimeters and in the size range where they are likely to be considered
actionable.
And,
in fact, 6 percent of them had grown which would mean that they were highly
suspicious for lung cancers so there seems little doubt that nodules are being
missed even in excellent centers such as the Mayo Clinic in a study that was
focusing specifically on the detection of nodules.
One
other study performed by Gruden and others at Emory University looked at 25
patients with presumed lung metastases.
These patients had soft tissue sarcomas and melanoma and they established
truth by consensus which is a practical method using five readers. These nodules were three to nine millimeters
in size and they were solid nodules.
Two to nine solid nodules in each case by consensus.
They
found that the miss rate for individual readers ranged from 20 percent to 39
percent of all of the nodules in this size range. This was in an observer test setting where the readers were
focused on detecting nodules and presumably had no other task in mind so one
would expect a relatively good performance in that situation.
So
between these two studies we can see that there is a considerable problem with
oversight errors in reading CT scans.
Now we have a trend towards thinner CT sections with the newer
multi-detector scanners. This allows improved
ability to detect and characterize lesions.
It does allow us to do high quality off-axis reconstructions.
On
the other hand, it does present us with more image data, more opportunities for
error. In a chest CT scan performed
with a multi-detector unit we may have anything from 18 to almost 300 images of
the chest and the radiologist has to interpret those visually.
I
think that the evidence that we've seen strongly suggests that traditional
visual interpretation is no longer sufficiently reliable for detecting these
very small and potentially dangerous common nodules.
At
this point I would like to introduce Ronald Castellino, Chief Medical Officer
for R2 Technology.
DR.
CASTELLINO: Thank you. My name is Ron Castellino. I'm also a diagnostic radiologist but
currently I'm the Chief Medical Officer of R2 Technology.
At
the outset I'd like to particularly emphasize the definition of computer-aided
detection which is also called CAD as we will be using it in the presentation
today. Computer-aided detection as we
use it refers to the availability of computer algorithms that automatically
identify regions of interest on a medical image for the radiologist to
evaluate.
Its
purpose, of course, would be to decrease what I would term observational
oversights. That is, findings that are
present on the image but, in fact, are not seen by the radiologist. This is not a device to tease apart very
unusual nodules that might not be present or barely present on the image. These nodules are actually clearly visible
on the image.
The
ImageChecker CT CAD system specifically is designed to automatically detect
regions of interest with features suggestive of solid pulmonary nodules on CT
exams of the chest. It's important to
remember that it is to be used as a supplemental review. That is, after the initial assessment has
been made by the radiologist. It is not
a first reader.
The
radiologist, most importantly, remains responsible for the final interpretation
of the findings that the CAD marks may put on the image. That is, to determine if the mark is
actually a true mark or if it is a false mark.
A
brief review of the device description.
The CT scan is performed in the standard fashion. The images, or the data set, are increasingly moved to
the types of work stations on which radiologists review the images in
what we call a soft copy display.
These images may be reviewed slice by slice but increasingly they are
reviewed in some type of a melt-through or a cine mode to facilitate reviewing
these hundreds of images that are generated.
By
the same DICOM standard the data set can also go through a server computer. Various image analysis algorithms can be put
into place. In this case, I point out
segmentation. This type of information
can also be transmitted to the work station to help the radiologist further
analyze the images and this is an ImageChecker CT work station which was
cleared by the FDA in 2002. This is an
existing product that has been cleared.
The
same DICOM data set can also go through the ImageChecker CT CAD software system
and provide on the work station CAD information as well. It is this specific piece of the product
that is under review today by the panel.
I'll
show you a few screen capture images of the front end of the work station on
which the CAD marks are displayed. The
view port on the right is familiar to radiologists. This is where we can see the axial images. I guess I can't use this thing. Thank you.
We are a high-tech business as you can see.
There
we go. On the large view port on the
right we can see the axial image displayed to the radiologist which is viewed
either singly or, like I said, in a melt-through or cine mode. The smaller view port on the upper left is a
three-dimensional reconstruction of the contents of the lung.
You
can see the pulmonary vessels. In fact,
a few nodules perhaps you can see there. And the horizontal line simply indicates to the radiologist what
level on the image the axial image is displayed. We see a nodule here quite clearly in the right apex.
The
radiologist then will move down the entire sequence of the lung in the lung windows
looking for other abnormalities, nodules as well as a multitude of other
features that the radiologist searches for sometimes seeing nodules and
sometimes not seeing nodules.
When
they completely review the entire study, which I'm giving to you in a very
schematic fashion here, the radiologist then will activate with a mouse click
the CAD button we call the R2 button.
At that point in time the CAD process takes over and presents the
following.
The
circles indicate candidate nodules that the CAD system has identified shown to
the radiologist on the three-dimensional display of the lungs, as well as
brings the radiologist automatically to that specific site where the nodule is
best seen by the CAD system.
In
addition, another view port on the lower left is shown. This is a three-dimensional reconstruction
that can be rotated to separate the nodule out from adjacent vasculature. I would like to emphasize that upon the CAD
review the radiologist need not go through the entire data set once again but
simply by moving and hitting one of these little buttons here with a mouse
click which you can't read here. It
automatically jumps the image. By the
way, the size is automatically shown as well.
It
automatically jumps the image to the next CAD detected nodule and the next and
so forth. For example, this nodule, as
I showed you and, for example, a nodule at the right base which is clearly a
nodule but, in this case, had been overlooked by the radiologist on the set of
images.
That
is the CAD display on the work station.
What does the CAD search for? It
is specifically designed to search for solid lung nodules that are 4 mm. or
greater in size and we define that further as follows. They should have an approximately spherical shape.
The
margins can be smooth, lobulated or spiculated and should have soft tissue
density which we define as having average density of minus 100 Hounsfield units
or greater. Some of the typical CAD
marks you've seen already. They circle
the nodule. We consider this a true
mark if it actually encompasses the size of the nodule sometimes quite small,
moderate in size.
I
would like to emphasize also that although we look for spherical nodules, if, in
fact, the nodule is adjacent to a pleural surface where a portion of the sphere
is obliterated by contact with the pleural surface, the algorithm tries to find these as well.
Secondly,
this image perhaps some of you can see, although it is easier for the
radiologist and the CAD system to detect a nodule that is surrounded by
completely normally aerated lung, if there is adjacent modest non-aerated lung
as we see here in the appended edema, the CAD algorithm often is successful in
teasing out the nodule as well.
There
are a multitude of other parenchymal abnormalities within the lung tissue that
the CAD algorithm does not search for.
The radiologist must look for these but the CAD algorithm does not
search for. For example, linear strands
which do not fit the criteria. I would
like to point out importantly although this fits the criteria of being a
spherical nodule, we call these ground glass opacities.
They
are increasingly noted to be of importance, particularly for lung cancer
screening programs, but because of the Hounsfield density cutoff that we have,
this type of nodule currently is not searched for with our set of algorithms.
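The stated search criteria can be sketched as a simple scope check. This is only an illustration of the two numeric thresholds quoted in the presentation (4 mm or greater, average density of minus 100 Hounsfield units or greater); it is not the sponsor's algorithm, the shape criteria are omitted, and the `Candidate` type, the function name, and the example values are hypothetical.

```python
from dataclasses import dataclass

# Thresholds quoted in the presentation.
MIN_DIAMETER_MM = 4.0
MIN_MEAN_DENSITY_HU = -100.0


@dataclass
class Candidate:
    """A hypothetical candidate region; field names are illustrative."""
    diameter_mm: float
    mean_density_hu: float


def in_cad_search_scope(candidate: Candidate) -> bool:
    """True if the candidate meets the stated size and density cutoffs.

    Shape criteria (approximately spherical, margin type) are not modeled.
    """
    return (candidate.diameter_mm >= MIN_DIAMETER_MM
            and candidate.mean_density_hu >= MIN_MEAN_DENSITY_HU)


# Hypothetical example values, chosen only to exercise the cutoffs.
solid_nodule = Candidate(diameter_mm=6.0, mean_density_hu=30.0)
ground_glass = Candidate(diameter_mm=8.0, mean_density_hu=-600.0)
tiny_nodule = Candidate(diameter_mm=3.0, mean_density_hu=20.0)

for name, cand in [("solid", solid_nodule),
                   ("ground glass", ground_glass),
                   ("tiny", tiny_nodule)]:
    print(name, in_cad_search_scope(cand))
```

Under these assumed values, the low-density ground glass example falls below the density cutoff regardless of its size, which matches the limitation described in the presentation.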
All
CAD systems have false marks. We see a
few here such as this one here where a branching vessel exists. The CAD algorithm thought this was a nodule
and marked it incorrectly. Pleural tags
are at times marked incorrectly. I can
tell you that our experience internally as well as with users indicates that the
vast majority of these false marks can be readily dismissed as you see here.
As
an aside, we have found in our regulatory database a median of three false
marks per exam. I would like to
emphasize this is per exam. There is a
median of 160 images per exam so we're talking about approximately one false
positive mark for every 50 to 55 individual images.
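The per-image figure follows from the two medians quoted above. A minimal arithmetic sketch, assuming it is acceptable to divide one median by the other as a rough approximation (the function name is illustrative):

```python
# Medians quoted in the presentation.
median_false_marks_per_exam = 3
median_images_per_exam = 160


def images_per_false_mark(images_per_exam: float,
                          false_marks_per_exam: float) -> float:
    """Approximate number of images reviewed per false CAD mark."""
    return images_per_exam / false_marks_per_exam


ratio = images_per_false_mark(median_images_per_exam,
                              median_false_marks_per_exam)
print(f"about one false mark per {ratio:.1f} images")
```

The result, about 53 images per false mark, sits inside the 50 to 55 range stated in the presentation.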
Now,
the clinical study was designed around an ROC study as you've heard from Dr.
Wagner. It was done in close
collaboration and support with the people from the FDA. The ROC study to a large extent does provide
a combined measure of efficacy and safety.
There is some discussion about that and Dave Miller will fill you in on
that as we see it, at least.
There
are three parts. We've collected
cases. I'll review that. These cases were sent to a reference truth
panel and finally to the MRMC ROC study which you'll hear about from Dave
Miller.
I
would like to spend only a brief comment upon the target of nodules. You've heard from Dr. MacMahon that we are
increasingly seeing smaller nodules on our CT scans in our clinical practice. We wanted to design a CAD system to help
radiologists detect all solid nodules between 4 and 30 mm. That was the focus of our research effort.
And,
as you are well aware, those of you in clinical practice will recognize that
most lung nodules most of the time are typically sampled by biopsy or thoracic
resection if they are 8 or 10 mm. or so or greater in size. There are obviously exceptions to this but,
in general, they are.
A
biopsy proven so-called gold standard to evaluate nodules in
this smaller size range was just not available to us. We settled on a reference standard of a consensus on
actionability as being the only practical standard that would capture all solid
nodules of clinical concern in this size range. We are really focusing on trying to help the radiologist in the
4 to 8, 10 to 12 mm. range. The larger
nodules, of course, radiologists will almost always see.
We
collected cases from five centers. They
contributed consecutive non-selected cases.
We tried to make this as representative as possible. They were all in adults. They were performed for a variety of
clinical indications. There were no screening studies in this group.
Cases
with greater than 10 nodules were excluded.
We felt that there were a multiplicity of nodules. The issues of searching for nodule where the
radiologist has already seen 8, 10, 12, 15 would be reported. The images, of course, have to reach certain
technical parameters.
These
cases were divided into two categories to begin with by report. The nodule-present cases had in the report
the presence of one nodule or more described by the reviewing radiologist. These patients by definition had a history
of biopsy proven documented cancer either primary to the lung or in an extra-thoracic
site.
We
did this to try to increase the likelihood that nodules in this group might
have clinical significance because they were in patients with cancer but I
would like to point out that the specific nodules themselves were not biopsy
proven. In the nodule absent cases, once
again by report, no nodules were described within the context of the
report. These patients could have a
history of cancer or not.
The
final truth was determined by the reference panel which you'll hear about from
Mr. Miller. Five sites contributed to
the study. Three of these are community
imaging centers, two are university centers.
They were from the east coast, mid-west, west coast. There were 63 cases that had nodule-present
by report, 88 nodule absent by report.
You
can see the distribution between males and females was similar. The age range was similar in the two
groups. There was a slight increase in
median age in the nodule-present cases perhaps because they all had documented
histories of cancer as compared to this group. The type of cancer in the nodule-present case, 38 percent had a
documented primary lung cancer and 62 percent had documented extra-thoracic
primaries.
Here
are some of the parameters of the technical aspects of the case
characteristics, the median number of slices you see here. There is a slight predominance of thinner
slice sections in the nodule absent cases mainly because one of the centers was
doing much thinner slices routinely and they contributed a larger amount of
nodule absent cases.
The CT vendors used at these five sites were
General Electric or Toshiba.
I
would like to ask Dave Miller to present the methods and the results of the
study.
MR.
MILLER: Thank you. My name is Dave Miller and I am currently
the Director of Statistical Analysis at Ovation Research Group. At the time that this study was conducted I
was the Director of Biostatistics at R2 Technology. R2 is paying for my time and travel. However, I do not have any financial interest in R2 Technology.
Just
want to quickly go through an outline of what I'm going to discuss because I'll
be up here for a little while. I'm
going to go through some definitions that I'll be using during the talk. Then I'll talk about the reference truth
panel. I'll talk about the ROC study
design, our primary analysis. Then we
did a large set of robustness analyses.
Then finally the study conclusions.
So
gold standard, and these are definitions that I'm going to use. They are not necessarily dictionary
definitions of these but gold standard is something that I'll define as an
objective and definite measure of truth.
The
reference truth is a truth standard for a subjective construct. It is a term that is fairly widely used and
it's a term that I'll be using here as a standard that's used in lieu of an
available gold standard. The kind of
thing that reference truths are used for are things like actionability where
actionability is something I'm defining as a subjective point-of-care decision
which is really what we're targeting with actionable nodules.
Nodule
also is a subjective definition. It's a
subjective characterization of a lung abnormality. Finally, a panel is a group of radiologists with a given task. In this case, their task was to identify and
characterize actionable nodules. Consensus is a term I'll use only for unanimous agreements. When you hear we use consensus, that means
unanimous agreement as opposed to majority agreement.
Then,
finally, a few study definitions. I'll
run through these very quickly because you've got a very nice tutorial from Bob
Wagner this morning. The ROC curve is
the receiver operating characteristics curve.
AZ is the area under the ROC curve, the measure of interest in the
study.
MRMC
stands for multi-reader, multi-case.
I'll use the term primary analysis for our protocol specified primary
analysis and the term ANOVA-after-jackknife.
The ANOVA there is analysis of variance and you've got a nice
description of both the jackknife and the bootstrap earlier.
So
under the reference truth panel the goal of the reference truth panel was to
fully identify all nodules in the case sets.
These are the cases that Ron described how they were collected. We wanted them to rate the actionability of
any nodules that they found.
Specifically we are defining actionable as a nodule that requires
surveillance or intervention so it could be follow-up or it could be more of an
intervention.
We
define the reference truth so that we could use it in the ROC study. The method was to have a panel of three
radiologists independently review the cases and we followed a two-pass process to
reduce observational oversights.
The
reference truth panel qualifications were that they needed to be board
certified radiologists, that they had at least six months of reading thin slice
which we defined as less than or equal to 3 mm. collimation CT of the chest,
and they needed to have experience with reading soft copy.
A
total of 11 panelists participated in at least one of the three-member panels
that were convened. Just to be clear,
we didn't have a single three-member panel because it just would have taken
weeks for three people to review the set of cases that we had. We had a succession of panels and there were
a total of 11 different panelists that participated in at least one of those
panels.
Nobody
participated in more than three and obviously nobody participated in less than
one. This is how the panels
worked. We brought the radiologists in
and we put them in three different rooms.
This is after a brief sort of training that we gave them prior to going
to the three different rooms. They had
three different work stations set up and they each independently reviewed a set
of cases. In a typical session we had
about 20 cases reviewed.
After
they had reviewed all of the cases for a given day, and this usually took maybe
four to six hours or so, we took the computer files of all of their findings and
these are findings of the exact locations and we brought them together to get
the union of all findings so that redundant findings were captured and we knew
every finding that any panelist had found.
This
is a little hard to see up there but we also at this stage excluded nodules
that were less than 4 mm. in size or greater than 30 mm. in size. Those were protocol exclusions and we had
asked the radiologists not to spend too much time taking precise measurements
as they were doing this.
After
this there were 95 findings where three out of three of the panelists agreed
that it was a consensus actionable nodule.
I couldn't say consensus. Three
out of three agreed and, thus, there was a consensus that it was an actionable
nodule.
Now,
there was also a large set where there was disagreement. Either one out of three or two out of three
of the radiologists had identified the finding and the other radiologist either
had overlooked the finding or didn't feel that it was an actionable
nodule. These went to a second
pass.
The
way the second pass worked is that after about a half hour of prep or so they
went back into their individual rooms so they didn't come together and talk
about the cases. They each went back to
their individual rooms and they had the locations of each of these disagreement
findings identified for them. So the
second pass went fairly quickly because they didn't need to go through the
whole case. They were just looking at
and being directed to specific spots and being asked to rate the actionability.
After
this there were 47 additional nodules that went into our truth set of unanimous
nodules. There was also a fair number
that went into what we call the majority group, that two out of three felt that
it was actionable, and a minority group that one out of three felt that it was
actionable.
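The two-pass process described above can be sketched as follows. This is a simplified illustration under stated assumptions: findings are represented by hypothetical string IDs rather than image locations, the function names are invented for the example, and the vote counts are made up.

```python
PANEL_SIZE = 3


def union_of_findings(per_panelist):
    """Merge first-pass findings: map each finding ID to the set of
    panelists who marked it as an actionable nodule."""
    union = {}
    for panelist, findings in per_panelist.items():
        for finding_id in findings:
            union.setdefault(finding_id, set()).add(panelist)
    return union


def needs_second_pass(first_pass_votes, panel_size=PANEL_SIZE):
    """A finding marked by some but not all panelists is a disagreement
    and goes back to every panelist for a directed second look."""
    return 0 < first_pass_votes < panel_size


def classify_finding(actionable_votes, panel_size=PANEL_SIZE):
    """Label a finding by how many panelists finally rated it actionable."""
    if not 0 <= actionable_votes <= panel_size:
        raise ValueError("vote count out of range")
    if actionable_votes == panel_size:
        return "unanimous"
    if actionable_votes * 2 > panel_size:
        return "majority"
    if actionable_votes >= 1:
        return "minority"
    return "none"


# Hypothetical first-pass marks from three panelists reading independently.
first_pass = {
    "reader_1": {"f1", "f2"},
    "reader_2": {"f1", "f3"},
    "reader_3": {"f1", "f3"},
}
union = union_of_findings(first_pass)
second_pass = [f for f in sorted(union) if needs_second_pass(len(union[f]))]

# Hypothetical final vote counts after the second pass.
final_votes = {"f1": 3, "f2": 1, "f3": 3}
labels = {f: classify_finding(v) for f, v in final_votes.items()}
print(second_pass, labels)
```

Only the "unanimous" group feeds the primary analysis described in the presentation; the "majority" and "minority" groups are the ones used for the robustness analyses.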
Our
primary analysis focuses on consensus agreement but we did do some robustness
analyses around the majority and minority.
I'll be talking about that later but for now I'm focused on the
unanimous nodules.
So
as a result of this process the eight three-radiologist panels -- I told you there was a series of panels; there were, in fact, eight of them -- identified 142 consensus nodules in 65
nodule present cases. You might notice
that number 65 is slightly different than the 63 number that you saw
earlier. That's because now our
consensus panel is the definition of truth for this study.
You
can see the size of these findings. The
median size was 7.9 mm. and there were a lot of them that were in the 5, 6, 7
millimeter range. The remaining 86
cases were categorized as nodule absent by virtue of not having any of the
unanimous nodules in them.
So
moving onto the MRMC ROC study, the objective of this study per protocol was to
demonstrate that review of CAD output improves performance of radiologists
reviewing MDCT with respect to their ability to accurately identify actionable
nodules.
Our
outcome measures were AzB. That is, the
before CAD area under the curve, AzA, that is the after CAD, the area under the
curve and, most importantly, Azdelta.
This is basically the difference between the two curves. And the hypothesis in a formal statistical
sense -- the null hypothesis was that the mean change in the area under the
curve was zero and the alternative hypothesis, of course, is that Azdelta is
greater than zero meaning the CAD did have a benefit.
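The three outcome measures can be illustrated with a small calculation. This sketch uses the nonparametric (Mann-Whitney) estimate of the area under the curve, which is an assumption made for simplicity and not necessarily the estimator used in the study; the ratings are invented for one hypothetical reader.

```python
def empirical_auc(present_scores, absent_scores):
    """Nonparametric AUC: the fraction of (present, absent) pairs in which
    the nodule-present unit gets the higher rating, with ties counted half."""
    wins = 0.0
    for p in present_scores:
        for a in absent_scores:
            if p > a:
                wins += 1.0
            elif p == a:
                wins += 0.5
    return wins / (len(present_scores) * len(absent_scores))


# Hypothetical 0-100 confidence ratings for one reader.
pre_present = [80, 60, 30, 90]    # nodule-present units, before CAD
post_present = [85, 70, 55, 90]   # the same units, after CAD
pre_absent = [20, 40, 10, 50]     # nodule-absent units, before CAD
post_absent = [20, 40, 10, 50]    # the same units, after CAD

az_b = empirical_auc(pre_present, pre_absent)      # AzB, before CAD
az_a = empirical_auc(post_present, post_absent)    # AzA, after CAD
az_delta = az_a - az_b                             # Azdelta

print(az_b, az_a, az_delta)
```

The null hypothesis corresponds to a mean `az_delta` of zero across readers; a positive value points in the direction of a CAD benefit, though a single reader's difference says nothing by itself about statistical significance.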
The
study was conducted in two phases. We
first did a 32-patient study and then after doing that study we had some
discussions with FDA and we outlined what would be the appropriate methodology
to use for a second study, what the appropriate size for the second study would
be based on the type of methodology that was suggested. So I'm going to be talking about that second
90-case study as the focus of this talk.
The
reader qualifications for the ROC study, so this is, again, a new set of
readers. Don't confuse them with
the reference truth panel. Completely
different people. It would be wrong to
have the same people. These people had
reader qualifications that they be board-certified radiologists and have at
least three months of reading MDCT of the chest.
The
basics of the study are that we had 15 readers read all cases. We had 90 cases. Of the 90 cases 48 had at least one actionable nodule and 42 did
not have any actionable nodules and that was based on a stratified random
sample of our complete set of cases.
There
were, of course, four quadrants per case by definition but the important point
is that these quadrants, all four of them, were rated pre-CAD and then
sequentially post-CAD. The ratings were
finally evaluated against the reference truth so the ROC curves were drawn by
comparing the ratings which were on a continuous scale to the reference truth
established by the panel.
I
want to clarify what the unit of analysis is because I know people have a
tendency to want to sort of track the numbers as they go through the slides and
see where things add up so, just to be clear, nodules were the unit of analysis
for the reference truth. The reference
panel was supposed to identify every nodule.
Quadrants
-- the quadrant truth was computed from the nodule truth. For instance, if there was a quadrant that
had one actionable nodule and one non-actionable nodule, the quadrant was,
nonetheless, considered nodule-present quadrant because it had at least one.
On
the other hand, if there was a quadrant that had a minority nodule in it, in
other words, a nodule that at least one person on the panel thought was a
nodule but not unanimous, that was considered a nodule absent quadrant. Every quadrant counted in every analysis
that we did.
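The rule for deriving quadrant truth from nodule truth can be written down directly. A minimal sketch, assuming each nodule in a quadrant is summarized by the number of panelists (out of three) who rated it actionable; the function name is illustrative.

```python
PANEL_SIZE = 3


def quadrant_is_nodule_present(actionable_votes_per_nodule,
                               panel_size=PANEL_SIZE):
    """A quadrant is nodule-present only if it holds at least one nodule
    that every panelist rated actionable (a unanimous nodule)."""
    return any(votes == panel_size for votes in actionable_votes_per_nodule)


# One unanimous nodule plus one non-actionable finding: nodule-present.
print(quadrant_is_nodule_present([3, 0]))
# Only a minority or majority nodule: counted as nodule-absent.
print(quadrant_is_nodule_present([1]))
print(quadrant_is_nodule_present([2]))
# No findings at all: nodule-absent.
print(quadrant_is_nodule_present([]))
```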
Now,
the reason that we went with this quadrant approach is that the LROC methods
were not developed at the time that we embarked on this for multi-reader,
multi-case studies. I think they
probably will be in time and they may even be right now but at the time we
began the study, they were not.
Bob
Wagner described it a little bit as these being sort of competing fields, the
people that went with the ROI approach versus the people that go with full
localization. I think really there are
two camps that are going after the same thing of trying to get some measure of
localization added to the ROC method.
We
felt that for this particular case where you might have a nodule that was quite
large in one lung and then a smaller nodule in the contralateral lung, that
smaller nodule in some cases might be the really important one that actually
drove the care. We felt that getting at
localization in some way was important.
We went with the quadrant approach.
The
quadrants were rated by the ROC readers but then the case, not the quadrant, is
the unit of analysis for the computation of the p-values and the confidence
intervals based on the jackknife and the bootstrap. You heard these references mentioned earlier but Obuchowski
specifically is the reference for using this region of interest or quadrant
approach. Carolyn Rutter is the person
that developed the method of using the bootstrap to sample cases.
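The idea of rating quadrants but resampling whole cases can be sketched with a case-level bootstrap. This is a single-reader simplification with invented data, not the MRMC analysis used in the study: the nonparametric AUC estimate, the percentile interval, and all names below are assumptions made for illustration.

```python
import random


def empirical_auc(present_scores, absent_scores):
    """Nonparametric AUC with ties counted as one half."""
    wins = 0.0
    for p in present_scores:
        for a in absent_scores:
            wins += 1.0 if p > a else 0.5 if p == a else 0.0
    return wins / (len(present_scores) * len(absent_scores))


def delta_auc(cases):
    """Post-CAD minus pre-CAD AUC over all quadrants of the given cases.
    Each case is a list of (truth, pre_rating, post_rating) quadrants."""
    pre_p, pre_a, post_p, post_a = [], [], [], []
    for case in cases:
        for truth, pre, post in case:
            if truth:
                pre_p.append(pre)
                post_p.append(post)
            else:
                pre_a.append(pre)
                post_a.append(post)
    if not pre_p or not pre_a:
        return None  # AUC undefined without both classes
    return empirical_auc(post_p, post_a) - empirical_auc(pre_p, pre_a)


def bootstrap_delta_auc_ci(cases, n_boot=1000, seed=0):
    """Percentile 95% interval, resampling cases (not quadrants) with
    replacement so quadrants from one patient always stay together."""
    rng = random.Random(seed)
    deltas = []
    for _ in range(n_boot):
        sample = [rng.choice(cases) for _ in cases]
        d = delta_auc(sample)
        if d is not None:
            deltas.append(d)
    if not deltas:
        raise ValueError("no valid bootstrap replicates")
    deltas.sort()
    n = len(deltas)
    return deltas[int(0.025 * n)], deltas[min(n - 1, int(0.975 * n))]


# Hypothetical data: six cases, four quadrants each.
cases = [
    [(True, 40, 80), (False, 10, 10), (False, 20, 15), (False, 5, 5)],
    [(True, 30, 70), (True, 90, 95), (False, 35, 20), (False, 15, 10)],
    [(False, 25, 20), (False, 10, 10), (False, 45, 30), (False, 5, 5)],
    [(True, 60, 85), (False, 50, 40), (False, 20, 20), (False, 10, 5)],
    [(False, 15, 15), (False, 30, 25), (False, 5, 5), (False, 20, 10)],
    [(True, 20, 65), (False, 40, 35), (False, 10, 10), (False, 0, 0)],
]
observed = delta_auc(cases)
low, high = bootstrap_delta_auc_ci(cases, n_boot=500, seed=1)
print(observed, low, high)
```

Resampling at the case level keeps the four correlated quadrants of a patient together, which is the reason the presentation gives for treating the case, not the quadrant, as the unit of analysis.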
The
reading environment for our study is that readers were trained on work station
use and we really tried to create a reading environment that was as similar to
their individual practices as possible.
So the usual work
station controls were available to them.
If any individual reader had a particular window or leveling
preferences, they were allowed to modify that.
We didn't have it in the protocol that they had to read a particular way
that would take them out of their reading environment.
They were allowed to practice on three cases with
the trainer present. The ambient
lighting was adjusted to the radiologist's preference. There was no hard time limit.
The
instructions given to the readers were to search only for 4 to 30 mm actionable
solid nodules, to rate each case post-CAD immediately after the pre-CAD rating
so they had to go through the entire case pre-CAD and provide the ratings
before the computer would even allow them to turn on CAD and then provide the
post-CAD ratings.
They
were instructed to consider age, gender, and clinical indication. These were taken from the radiology
report. We did not provide them with
the full radiology report as that obviously would have provided too much
information for them to be able to make up their own decisions.
So
the basic study work flow here -- let's see which of these works. Yeah, this one works. When you saw the work station earlier, there
was no blue line. The blue line is
separating the upper quadrant from the lower quadrant. We didn't feel like we needed a line to
separate left and right. The yellow
line is indicating where they are in the exam.
As
they were reading the case, they had the opportunity to bring up a pop-up menu
to rate the quadrants at which point they would get this little cartoon of
sorts with these slider bars. They
would move the slider bars either all the way over -- you can't see. There's a little 100 there -- to indicate complete
confidence that there was at least one actionable solid nodule present in the
quadrant, or zero to indicate complete confidence that there were none.
In
this particular case you can see that the reader has gone through and given a
pretty low confidence or, I should say, a high confidence that there are no
nodules present in any of the quadrants.
Having
done that they then have the opportunity to click this button up here and turn
on CAD. It's a little bit hard to see
here but there is a potential nodule.
I'm not a radiologist. I won't
tell you whether it is a nodule but it is located there in the upper right
quadrant. Then they would have the
opportunity to rate the case again.
In
this case they might have changed their rating. In the other quadrant since there was only a mark in the upper
right-hand quadrant, it's fairly unlikely that they would have changed any of
their other ratings but they were allowed to.
So
after doing this with our 15 readers who each read the 90 cases, both pre-CAD
and post-CAD, we were able to draw the ROC curves for each of the individual
readers. This is just an example of a
single reader and so the area under the dashed line is the pre-CAD Az and the
area under the blue line is the post-CAD Az and then the area in between the
lines is the Azdelta.
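The relationship between the two areas and their difference can be sketched in a few lines. This is a minimal illustration only: it uses the nonparametric (Mann-Whitney) area rather than whatever curve-fitting produced the Az values in the study, and the ratings below are invented for the example, not taken from the trial.

```python
def empirical_auc(present_ratings, absent_ratings):
    """Nonparametric (Mann-Whitney) area under the ROC curve.

    Counts the fraction of (present, absent) pairs in which the
    nodule-present quadrant received the higher rating; ties count half.
    """
    if not present_ratings or not absent_ratings:
        raise ValueError("need at least one rating in each group")
    score = 0.0
    for p in present_ratings:
        for a in absent_ratings:
            if p > a:
                score += 1.0
            elif p == a:
                score += 0.5
    return score / (len(present_ratings) * len(absent_ratings))


# Hypothetical 0-100 confidence ratings for one reader (not study data).
pre_present, pre_absent = [80, 30, 60], [10, 40, 20]
post_present, post_absent = [80, 70, 60], [10, 40, 20]

az_pre = empirical_auc(pre_present, pre_absent)     # 8 of 9 pairs ordered correctly
az_post = empirical_auc(post_present, post_absent)  # 9 of 9 pairs ordered correctly
delta_az = az_post - az_pre                         # area between the two curves
```

In this toy example one nodule-present quadrant moves from a rating of 30 to 70 after CAD, which is what lifts the area.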
These
are the 15 pairs of readings. I didn't
produce this plot specifically to answer some of the questions that came up
earlier this morning but I think it might answer some of them a little bit. Now, this is not the same plot that you saw
earlier. This has the pre-CAD area under the curve on the bottom and
the post-CAD area under the curve going on the Y axis. So pre-CAD the range was from about .82 up
to .96. That's the range of the 15
readers' area under the curve. Post-CAD
the low end was .86 to .96 so you can see a narrowing of the range post CAD
with respect to Az.
In
particular, these three readers who had
-- I'm trying to look for a different word than
worst -- had the worst pre-CAD Az performance of around .82 to .84 were the
ones that improved the most, or were among those who improved the most. You might wonder what about readers that did
pretty well. Well, these two readers
did very well pre-CAD, at least, measured against Az. And post-CAD they also had some improvement. It was a more modest improvement. They didn't have as much to improve.
Now,
finally, there's this reader up here.
This reader had a nearly perfect pre-CAD performance. This does just go to .96, not all the way to
1 so they weren't absolutely perfect.
What you worry about with a reader such as this is you don't want CAD to
cause them to change their impressions so they get worse and they did not.
So
moving onto the primary analysis this is the average reader ROC curve. Again, here is the pre-CAD line, the
post-CAD line, and the area in between is the Azdelta. I'm just going to focus in on this part right
here because it is an important point about whether or not the curves cross.
The
curves do not cross and so you can see that they are always apart. Especially in this area here I think is the
area where people are most likely to have their individual operating points,
although, as you saw, they might go all the way out here.
These
are the same 15 dots just plotted against a different axis so this is sort of
how far away they were from that line.
You can see individual reader improvements ranging from about .06 down to
zero, or no improvement. And then the
idea behind the Dorfman-Berbaum-Metz ANOVA-after-jackknife analysis is to
create a confidence interval and compute a p-value that would allow us to figure
out what might happen with a new reader with a new case.
I
mean, that's really the idea of this confidence interval is what kind of
performance would we expect from a new reader with a new case. You can see that both the individual readers
as well as the average delta and the confidence intervals are well on the side
of CAD better as opposed to the side of CAD worse.
Now,
we went ahead and did a number of robustness analyses and these were basically
about repeating the primary analysis varying different assumptions to
demonstrate that the primary results are not sensitive to study design. I think these are very, very important
because there is a considerable literature that you can tweak different things
and end up with different results. If
we had found that, we would have been in a difficult position because we
wouldn't have known whether or not we really did have a robust result.
I'm
going to talk about this with reference to the statistical methodology,
specifically the ANOVA approach versus the bootstrap approach. There are lots and lots of different
iterations on this but I'm just going to focus on these two. I'm going to talk about the reference
truth. I'll focus on the consensus
standard versus the majority standard but there are a number of other reference
truths that we examined and I'll just focus on those two.
And
then panel variability. I've talked about
the confidence interval being a way of getting at what would happen with a
future reader with a future case. What
you really want to know is what would happen with a future reader and a future
case evaluated against a new truth, right?
That
means that you don't just have to have the random reader and the random case
components of the ANOVA model. You also
have to have some way of evaluating your truth against the random panel if you
are going to fully capture the variability.
So
the ANOVA-after-Jackknife compared to the bootstrap, I'll run through this
quickly because you heard this earlier.
The ANOVA-after-Jackknife is based on leave one out samples. Again, the leave one out here is cases. A case is being left out of each sample as
opposed to a quadrant.
The
Az, the area under the curve, has been computed for each reader-case combination and then
an analysis of variance random effects model is fit. This is the standard analysis of variance random effects model
with full interactions described by Dorfman-Berbaum-Metz.
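The leave-one-case-out step can be sketched as follows for a single reader. This is an illustration under stated assumptions, not the sponsor's code: the area is the nonparametric Mann-Whitney estimate, the data layout (a list of cases, each a list of quadrant ratings with a truth flag) is hypothetical, and the random-effects ANOVA that is then fit to the pseudovalues is not shown.

```python
def empirical_auc(present, absent):
    """Mann-Whitney area under the ROC curve; ties count half."""
    if not present or not absent:
        raise ValueError("need at least one rating in each group")
    score = sum(1.0 if p > a else 0.5 if p == a else 0.0
                for p in present for a in absent)
    return score / (len(present) * len(absent))


def pooled_auc(cases):
    """AUC over all quadrants of the given cases.

    Each case is a list of (rating, nodule_present) pairs, one per quadrant.
    """
    present = [r for case in cases for (r, truth) in case if truth]
    absent = [r for case in cases for (r, truth) in case if not truth]
    return empirical_auc(present, absent)


def jackknife_pseudovalues(cases):
    """One pseudovalue per case: n * full - (n - 1) * leave-one-case-out."""
    n = len(cases)
    full = pooled_auc(cases)
    pseudo = []
    for i in range(n):
        reduced = cases[:i] + cases[i + 1:]  # drop the whole case, not a quadrant
        pseudo.append(n * full - (n - 1) * pooled_auc(reduced))
    return pseudo


# Hypothetical ratings for one reader: three cases, two quadrants each.
cases = [
    [(90, True), (10, False)],
    [(40, True), (60, False)],
    [(70, True), (20, False)],
]
pseudovalues = jackknife_pseudovalues(cases)
```

Dropping a case removes all of its quadrants together, which is the sense in which the case, not the quadrant, is the unit of analysis.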
The
bootstrap, I think nonstatisticians a lot of times find the bootstrap a little
bit more intuitive. The experiment is
replicated in 1,000 random samples so from our sample of readers and cases, we
generated random samples of readers and a random sample of cases and for each
sample we matched our random readers with the random cases and repeated the
entire analysis.
It
is very computationally intensive but it gives you a way of coming up with
confidence intervals that allow a nonparametric -- fully nonparametric approach
to evaluating what would happen with a future reader in a future case. I do want to point out that the
ANOVA-after-jackknife is semi-parametric.
The ANOVA piece is parametric but the jackknife piece is nonparametric.
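A rough sketch of the reader-and-case bootstrap described above, under simplifying assumptions: one rated unit per case rather than four quadrants, a nonparametric area instead of a fitted Az, and a plain percentile interval. The function names and data layout are hypothetical; only the count of 1,000 replicates comes from the testimony.

```python
import random


def empirical_auc(present, absent):
    """Mann-Whitney area under the ROC curve; ties count half."""
    score = sum(1.0 if p > a else 0.5 if p == a else 0.0
                for p in present for a in absent)
    return score / (len(present) * len(absent))


def mean_delta_auc(readers, cases, truth, pre, post):
    """Average (post - pre) AUC over the given reader and case indices.

    Returns None when the resampled cases lack a present or an absent case.
    """
    pos = [c for c in cases if truth[c]]
    neg = [c for c in cases if not truth[c]]
    if not pos or not neg:
        return None
    deltas = []
    for r in readers:
        auc_pre = empirical_auc([pre[r][c] for c in pos], [pre[r][c] for c in neg])
        auc_post = empirical_auc([post[r][c] for c in pos], [post[r][c] for c in neg])
        deltas.append(auc_post - auc_pre)
    return sum(deltas) / len(deltas)


def bootstrap_ci(truth, pre, post, n_boot=1000, alpha=0.05, seed=0):
    """Percentile interval from resampling readers and cases with replacement."""
    rng = random.Random(seed)
    n_readers, n_cases = len(pre), len(truth)
    stats = []
    for _ in range(n_boot):
        readers = [rng.randrange(n_readers) for _ in range(n_readers)]
        cases = [rng.randrange(n_cases) for _ in range(n_cases)]
        d = mean_delta_auc(readers, cases, truth, pre, post)
        if d is not None:  # skip replicates with only one truth class
            stats.append(d)
    if not stats:
        raise ValueError("no usable bootstrap replicates")
    stats.sort()
    lo = stats[int((alpha / 2) * len(stats))]
    hi = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lo, hi


# Hypothetical data: 2 readers, 4 cases (not study data).
truth = [1, 1, 0, 0]
pre = [[60, 30, 40, 10], [70, 20, 50, 30]]
post = [[80, 70, 40, 10], [90, 60, 50, 30]]
lo, hi = bootstrap_ci(truth, pre, post)
```

Because the interval is read off the sorted replicates, nothing forces it to be symmetric about the mean.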
So
these are the confidence intervals for the ANOVA versus the bootstrap. You can see that the confidence interval for
the ANOVA is a little bit tighter. For
the bootstrap it's a little bit broader.
One
of the things that the bootstrap is known for is being able to come up with
confidence intervals that are not actually symmetric about the mean because
often there is not really any reason to believe that the confidence intervals
would be symmetric about the mean. In
this case you can see it actually goes out further on the CAD better side. Even though the confidence interval is
wider, it does not in any way diminish the results.
So
returning again to the primary analysis, the primary analysis, as I showed you
earlier, is based on a delta Az of .024 and a p-value of .003. I just showed you a different methodology
using the bootstrap and came up with .0246, very close, and a p-value of less
than .001.
Then
we went on to a different reference truth.
The different reference truth that I'm talking about here, and I
apologize that it's not on the slide.
We didn't want to make it too dense, but this different reference truth
is majority so this means that a quadrant would be considered nodule present if
there was at least one majority or consensus nodule and it would be considered
nodule absent if it did not have any majority nodules in it.
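The two labeling rules can be written down compactly. This is a sketch assuming a three-member panel and a hypothetical input format (the number of panelists who called each candidate nodule in the quadrant actionable); it is not the study's truthing software.

```python
def quadrant_is_present(votes_per_nodule, standard="consensus", panel_size=3):
    """Label a quadrant nodule-present or nodule-absent.

    votes_per_nodule: for each candidate nodule in the quadrant, how many
    panelists called it actionable (hypothetical input format).
    """
    if standard == "consensus":
        needed = panel_size            # unanimous: 3 of 3
    elif standard == "majority":
        needed = panel_size // 2 + 1   # 2 of 3, which also admits unanimous nodules
    else:
        raise ValueError("unknown standard: " + standard)
    # One qualifying nodule is enough to make the quadrant nodule-present.
    return any(v >= needed for v in votes_per_nodule)


# A quadrant with one unanimous nodule and one single-vote finding.
mixed = quadrant_is_present([3, 1], "consensus")
# A quadrant whose only finding got 2 of 3 votes switches sides.
two_of_three_consensus = quadrant_is_present([2], "consensus")
two_of_three_majority = quadrant_is_present([2], "majority")
```

The last two calls show the same quadrant counted as nodule-absent under one standard and nodule-present under the other, so it is never dropped from the analysis.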
A
really important thing to point out here is that the majority quadrants, the
ones that two out of three radiologists in the panel considered to be
actionable, are included in every
single analysis so that means that when we're talking about the unanimous
truth, they go into the false positive side of things, as somebody calls
it.
On
the other hand, if we talk about this reference truth, they go into the true
positive side. We felt like we don't
know if those are nodules or not and so the most conservative approach to take
is to always put them in every analysis.
The
delta Az here is a little bit lower but the p-value is actually more
significant, to use a loaded term. This
has to do, I think, with this sample-size paradox that Bob Wagner was
describing earlier. The final step was
to do the random reference truth.
We
did the random reference -- actually, before I go to that, I want to mention on
the different reference truths in addition to majority and consensus, we also
looked at a minority reference truth which is sort of the loosest possible
standard we could come up with.
We
also did a tighter truth based on having a second panel of five people look at
the cases and define the truth more tightly.
In all four of those cases we came up with a similar statistically
significant result. So the random
reference truth is based on picking two panelists at random to review each
case.
Pretend
that the three-member panels didn't exist.
Redo the truth assuming that third person just wasn't there in their
room. When you bring together the
first-pass findings, their data doesn't come in. When you go to the second-pass it's only the two out of two
consensus. This allowed us to come up
with confidence bounds that captured that piece of the variance. It ended up being fairly similar, although
the delta Az is somewhat diminished from that of the primary analysis.
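The random reference truth can be illustrated with a short sketch. It assumes a simplified picture of the truthing process, in which each quadrant carries one actionable or not-actionable vote per panelist and the first-pass and second-pass steps are collapsed into one. The function name and data layout are hypothetical.

```python
import random


def random_panel_truth(quadrant_votes, k=2, panel_size=3, rng=None):
    """Re-derive truth for one case from k randomly chosen panelists.

    quadrant_votes maps a quadrant name to a tuple of booleans, one per
    panelist. A quadrant is nodule-present only if every chosen panelist
    called it actionable (two-out-of-two consensus when k == 2).
    """
    rng = rng or random.Random()
    chosen = rng.sample(range(panel_size), k)  # the panelists who "were in the room"
    return {q: all(votes[i] for i in chosen)
            for q, votes in quadrant_votes.items()}


# Hypothetical votes for one case (not study data).
votes = {
    "upper_right": (True, True, True),     # unanimous: present for any pair
    "upper_left": (True, False, False),    # one vote: absent for any pair
    "lower_left": (False, False, False),   # no votes: always absent
}
truth = random_panel_truth(votes, k=2, rng=random.Random(0))
```

Repeating the draw across cases and rerunning the analysis each time is what lets panel selection contribute to the variance.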
So
all variations gave statistically significant results. I'm a statistician so that's what I know
best and that's what I'm best prepared to talk to you about. I take the point of some of the panelists
that -- by panelists here I'm referring to you all as opposed to any of our
other panelists.
You
want some sense of what does it all mean.
What does this Azdelta of .02 mean?
For myself, I find it useful to think about individual operating points. This is the pooled curve where we pool all
of the readers together. You can't
really translate this to a new reader and a new case.
These
are analyses that you don't do to find statistical significance or to get a
particular confidence interval or particular estimate. They are analyses you do to try to
understand the data. There were
analyses that we put in our protocol that we would be doing but they were
secondary analyses just to try to get some sense of what's going on here.
So
this is the operating point of 20.
Recall that we have this 0 to 100 scale so 20 reflects sort of the most
aggressive end of the spectrum. We
could go all the way out to 0 but 0 is just all the way at that end. Twenty was an area where you could imagine a
fairly aggressive reader would say, "Even for a 20 I might want to do some
kind of follow-up." Fifty was indeterminate on our
scale so that is one operating point that is interesting to look at. Eighty would reflect sort of the least
aggressive reader. This is by no means
all readers. If I put this plot out
with all 15 of the readers, you get sort of that weird scatter plot similar to
what you saw earlier, but just to get a rough sense of what kinds of
improvements are maybe plausible.
So
this dotted vertical line here is the line that corresponds to having the same
false positive fraction. This is saying
that if you started out at 50, your sensitivity could increase by this much
without sacrificing your false positive fraction at all. Not one iota. If you think of the false positive fraction as your measure of
safety and you think of the true positive fraction as your measure of efficacy,
that is saying you can go up and get efficacy without any safety tradeoff.
Now,
it's probably more likely that people are going to go a little bit up and over
so maybe they are going to call more things.
That's what we see with our individual rating. You can go up and over and still have the same positive predictive
value. Even though you are giving up a
little bit on the false positive fraction, you still have the same positive
predictive value.
This
50 here is still a little bit over from that so it's not exactly the same
positive predictive value but the basic point is that you can go up and over
without having a sacrifice or without having a substantial sacrifice.
So
these are the analyses that I mentioned.
They were in our protocol as analyses that we were going to do, but I
really am very sympathetic to what Bob Wagner said about these numbers. It's so hard to say what they mean. What are these numbers? I don't want anybody to run too far with
these numbers but I do feel like it's necessary, especially for people who
aren't statisticians, to want to understand what's going on with some of the
raw data.
If
we take 20 as the threshold for where somebody -- pretend that all readers
treat 20 as their criterion for actionability, then we would have had 16 percent
of the total nodules so there were 1,125 positive quadrants that the 15
readers looked at. Sixteen percent of
those would correspond to misses. With
this very aggressive cutoff I think odds are those are, in fact, observational
oversights.
Post-CAD
that goes down to 11 percent so the 16 percent versus 11 percent, that's a 30
percent reduction in misses at that threshold.
Now, that is a very aggressive threshold. Probably most readers aren't at that threshold. Fifty might be closer to where most people
are at. It goes from 20 percent down to
16 percent. That's a 22 percent
reduction in misses.
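The percent reduction quoted here is a relative reduction in the miss rate. A small sketch of that arithmetic, using the rounded rates from the testimony, follows. The rounded inputs give roughly 31 percent and 20 percent rather than the quoted 30 and 22, presumably because the quoted figures were computed from unrounded counts; that explanation is an assumption.

```python
def relative_reduction(pre_rate, post_rate):
    """Fraction of pre-CAD misses that are no longer missed post-CAD."""
    return (pre_rate - post_rate) / pre_rate


# Rounded miss rates as quoted in the testimony.
at_threshold_20 = relative_reduction(0.16, 0.11)  # quoted as a 30 percent reduction
at_threshold_50 = relative_reduction(0.20, 0.16)  # quoted as a 22 percent reduction

print(f"threshold 20: {at_threshold_20:.1%}")
print(f"threshold 50: {at_threshold_50:.1%}")
```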
Then
finally if we imagine that 80 is sort of a higher-end threshold of what might be
called a miss, there is still a 15 percent reduction in misses. Now, these numbers are presented without
confidence intervals, without p-values.
Take them with a grain of salt.
But in terms of understanding potentially the clinical importance, I
think that maybe this may satisfy some of the desire to see a different number
than just the delta Az.
I
also wanted to show you what happens if we look at the true positive fraction
and we look at the false positive fraction in a way that is probably more similar
to the way that a lot of academic studies are done where you look at the cases
where you are most likely to see an effect on the true positive side and you
look at the unambiguous nodule absent quadrants on the other side.
Here
I really am throwing out quadrants. As
a statistician I hate to throw out data but I'm throwing them out just to get a
clearer idea of what's going on here.
So if we are looking at the true positive fraction just for the smaller
nodules, and I'm just using -- they are not really small.
I
think a lot of people would define small as less than 4 or less than 3, but the
intermediate-size nodules as a proxy for difficult to find nodules or easily
overlooked nodules. Then you can see
that you get more of a rise in the curve without quite as much of a tradeoff
early on in terms of the false positive fraction. This is analysis that was not included in our protocol. It's just something that I added to try to
get a little bit more understanding of what is taking place here.
So
the study conclusions. Again, the study
conclusions go back to the primary analyses that we did and the robustness
analysis. The study conclusions are
that the ImageChecker CT improves reader performance for the detection of
actionable nodules. That was our
objective and that's what we feel that we demonstrated. And specifically the results are robust to
the analytical methodology, to the choice of the reference truth.
Again,
it wasn't just looking at consensus and majority. We looked at minority, majority, consensus, and sort of a super
consensus. Then it is also robust to
the additional variation associated with selection of panelists. I described identifying two random
panelists. We also did it with a single
random panelist, with three random panelists and came up with very similar
results.
With
that, I'll turn it over to Dr. Delgado. Thank you.
DR.
DELGADO: Thank you and good
morning. I am Dr. Pablo Delgado. I'm clinical associate professor of
radiology at the University of Missouri, Kansas City. I also practice at St. Luke's Hospital. I'm here to describe the beta experience that we're involved
with.
First
of all, I'll tell you a little bit about where I practice in the setting, where
the beta site was performed. I am at a
private institution affiliated with the university. We have a hospital setting as well as an affiliated imaging
center adjacent to us. We practice with
residents available and we have an on-site residency training program of which
I am the program director.
Our
patient base is quite varied and I think rather commonplace for the
region. It's a typical Midwest
community base of private as well as community patients. Our CT equipment for our radiology
department, we currently have two four-channel multi-detector CT scanners which
happen to be GE LightSpeed QX/i scanners, although I don't think that's of
importance to this device as long as it's DICOM data and meets the collimation
thickness.
We
currently perform anywhere between 20 and 30 CT studies a day of the chest for
different diagnostic indications including CT pulmonary angiography, high
resolution CT of the chest, detection of other lung diseases, as well as
multi-organ disease workups.
The
beta study that we performed was between the times of June and August of 2003
for a total of eight weeks. We
processed numerous studies. However,
the goal of the study that we agreed upon and embarked upon was to assess the
functionality of this ImageChecker CAD software, and how we would work with it
to answer the R2 developmental group questions about radiologist preferred
reading practices as well as work flow issues of how this would be incorporated
into our practice. And to determine
future applications of training needs in training radiologists in how to use
this device. It should be noted that we
were not asked to assess the clinical effectiveness of the CAD system.
The
design of the system involved retrospective review of CT chest cases from our
institution from previous months that have already been acquired and already
been interpreted outside of the study and that met the collimation thickness
which, I think, was already mentioned, 3 mm or less and were contiguous slices
of the chest.
The
cases were read by faculty radiologists as well as residents so we got feedback
from both experienced radiologists and radiologists presently in training.
For
the training of utilizing the device, we had an R2 application specialist on
site for an entire day who got to work with most of the radiologists. A few that were not available for that time
were given the training subsequently by those who experienced the training from
the application specialist. That
training process involved the description of the CAD algorithm, what indeed it
does and what it doesn't with the review manual.
We
also reviewed several institutional cases.
First R2 had some cases of their own.
Then we through the DICOM hookup were able to push some of our cases to
the R2 device and process them so they were our cases. We also performed shadowing of retrospective
reading sessions where the radiologists were able to work with the CAD device
and subsequently ask questions if they felt that they were necessary or
encountered any questions.
Our
observations from using the beta product demonstrated that most radiologists,
in fact all, demonstrated a rather rapid learning curve for using the CAD
device. In a rather short period of
time most people felt very comfortable in utilizing the product as is intended.
We
encountered no specific technical errors or malfunctions. We had no difficulties. We did, indeed, use it in the way it was
intended and we asked radiologists to first look at the case in a soft copy
reading mode and then subsequently push the CAD button and activate it and then
review it immediately thereafter. We
found that all radiologists missed nodules that were detected by the CAD.
There
certainly are false CAD positive marks as Dr. Castellino pointed out. However, most of these are easily dismissed
by radiologists and that includes both faculty and residents.
Of
course, I would agree with the comments made by other panel -- excuse me, other
presenters from R2 that we feel that radiologists definitely should review all
images initially without CAD and then a subsequent read with CAD. The reason for this is that CAD is not
really made to detect every single nodule and, No. 2, the algorithm is such
that it does not detect every single lung abnormality and radiologists are
still responsible for detecting any lung abnormality.
In
conclusion, I think that this product is very timely in what radiologists are
facing on a daily basis. The
development of multi-detector CT has led to an explosion, if you will, or
significant increase in the number of images that are very detailed and
radiologists are asked to interpret.
Numerous
published studies have already documented there are limitations in
radiologists' ability to detect lung nodules.
I believe the detection really is the limiting factor of eventually determining
actionability whether it is related to further diagnostic or therapeutic or
interventional workups. We found CAD to
be an effective tool in assisting the radiologist in the detection of lung
nodules with multi-detector CT.
I
will now reintroduce Dr. O'Shaughnessy of R2 Technology.
DR.
O'SHAUGHNESSY: Thank you very
much. I just have a couple of summary
slides kind of to bring it all together at the end. I just wanted to reiterate the main conclusion from our clinical
study for multi-detector CT exams of the chest, that the ImageChecker CT CAD
software system significantly at a p-value of .003 improves radiologist ROC
performance for detecting solid pulmonary nodules between 4 and 30 millimeters
in size.
And
as both Mr. Miller and Dr. Castellino talked about and Dr. Wagner this morning,
we feel that is a good measure for -- a reasonable measure for evaluating both
a safety and efficacy aspect of the product.
Also from the safety aspect, the product is intended to be used as an
adjunctive device and with appropriate training we don't think there are any
issues there.
Just
to summarize, I'll put up again the same slides of the proposed indications for
use. We thank you very much for your
attention.
DR.
IBBOTT: Thank you, Dr. O'Shaughnessy.
We
are going to have time this afternoon for detailed discussion of this
presentation but let's take a few minutes now to see if there are any questions
for the previous speakers or clarification that's needed.
DR.
STARK: I have a few questions. Other panelists, please jump in. Dr. O'Shaughnessy, thank you. By the way, it was a fabulous
presentation.
DR.
O'SHAUGHNESSY: Thank you.
DR.
STARK: Very interesting subject and I
think everyone is interested in seeing this technology succeed. Certainly I am, so forgive me. Some of my questions are, I guess, by nature
going to be -- are intended to be challenging.
Mr.
Miller talked about, as the panel did, what the word significant -- he used the
term significance is a very loaded term.
Later on when we discuss the marketing materials and things like that,
I'm worried about the pressures on radiologists to buy and use a technology and
I want to shift the significance to what really is clinically significant. In
your presentation you pointed out -- I believe several of your experts pointed
out that the real clinical problem is that we're missing about 24 percent of
nodules or we are missing nodules at a significant rate. I think it was something like 24 percent or
something, perhaps you can refresh me, were seen in retrospect.
One
significant figure of merit here would be what fraction of those nodules that
are missed, that 24 percent that are detectable in retrospect, are now detected
with this technology given that the technology by itself has a sensitivity of
about 50 percent for detecting majority and unanimous nodules and a 50 percent
detection rate? I'm just asking. It's very, very low.
That
would suggest to me that at best the technology is going to reduce that 24
percent miss rate to about a 12 percent miss rate at the cost of generating
100 percent false positives and then having a radiologist groom through and
sort all this out by basically being told, "Do it again."
I
wonder if we had a placebo in this FDA trial of, "Radiologist, just do it
again," or, "Here is the sugar pill. Just read it again," would we achieve the same presumptive
50 percent improvement in finding half of the lesions we know the current
standard of care is to miss?
DR.
O'SHAUGHNESSY: Right. I would like to answer that sort of in two
parts. The first part I would like Mr.
Miller to go over what we measured in our study and then have Dr. Castellino
talk about translating that to the clinical environment if that's okay.
MR.
MILLER: I guess there were a number of
questions there. Is there one you would
like for me to start out with?
DR.
STARK: I think you will do a great job.
MR.
MILLER: Okay. So the analyses that I showed at the end with the percent
reduction in misses, or sort of approximated percent reduction in misses,
were an attempt to get at that very issue. I
suppose that it is to some degree your job and, to some degree, our job to
determine what is clinically significant.
Now,
the numbers that I showed you were sort of in the range of a percent reduction
in misses of somewhere close to 20 percent.
Actually more like 20 percent on the low end. That is similar to what the experience has been with CAD for
mammography.
For
CAD in mammography the percent reduction in misses has been in that range. I think if you are a person that's affected
-- I guess I'm drifting off from statistics here. I should have handed it over to a clinician but, I mean, my hunch
is that is a number that would be meaningful.
As
far as the stand-alone sensitivity, I do want to sort of bring us back to the
fact that we evaluated two modalities here.
The two modalities that we evaluated were the reader's stand-alone performance
and the reader plus CAD. The whole MRMC
framework is developed around those particular modalities.
CAD
as a stand-alone modality is not something that anybody is recommending that
people use. Therefore, those
stand-alone numbers, I think, are less valuable but are more valuable if they
pick up some of the more important things.
Also
I think some of those things in the 4 to 10 millimeter range that readers react
to and say, "Oh, I missed that.
I'm glad CAD pointed out."
It's more about what did CAD find than it is about exactly what the
percentage is.
DR.
STARK: Did you answer the core question
of if the radiologist right now standard of care I would suggest, and
clinicians can debate this, is that we miss a quarter of the lesions that are
actually there in retrospect. If we can
accept that as a statement, then as you design the experiment, what data are
there to suggest we would cut that miss rate and by how much?
MR.
MILLER: Will you permit me to go back
to the slide? Sorry. I'll get there soon. Okay.
This, again, is presented as an analysis that was specified in the
protocol that we would do, but you don't have confidence intervals there so
these are numbers that you would want to put confidence intervals on if you
were going to put a lot of weight behind them.
Also,
they make the presumption that readers all read with the same threshold cutoff
and we know that's not the case. At a
threshold cutoff of 50, let's focus on 50 for just a second, there were 228
missed quadrants. In other words, out
of the total number of quadrants that the radiologist looked at, 75 positive
quadrants times 15 so there are 1,125 times that one of the readers looked at a
positive quadrant.
They
gave a rating less than 50, 20 percent of the time. That is actually kind of a nice number because that number is not
radically different from I think what we see in the literature. It may be a little bit lower. I think there's a little bit of a relaxed
environment in the readings that they may be a little bit more likely to
identify things. But in 20 percent of the
quadrants something is missed.
Post-CAD
it goes to 16 percent so that's a 22 percent reduction in the misses. That is, I think, the number that is closest
to answering the question that you raised.
Is that correct?
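The counts quoted in this answer can be checked with a few lines of arithmetic. The sketch uses only figures stated in the testimony (75 positive quadrants, 15 readers, 228 missed reads at a threshold of 50, a 22 percent reduction). The post-CAD count is not stated, so the post-CAD rate below is derived from the quoted reduction rather than from data.

```python
readers = 15
positive_quadrants = 75
positive_reads = readers * positive_quadrants  # reader-by-quadrant readings

missed_pre = 228                               # ratings below 50, pre-CAD
pre_rate = missed_pre / positive_reads         # about 20.3 percent

quoted_reduction = 0.22
post_rate = pre_rate * (1 - quoted_reduction)  # about 15.8 percent, i.e. "16 percent"

print(positive_reads, f"{pre_rate:.1%}", f"{post_rate:.1%}")
```

On these figures the quoted 20 percent, 16 percent and 22 percent are mutually consistent once rounding is taken into account.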
DR.
STARK: I think so. Let me see if I understand it and then I'll
ask you about the effect on this analysis of the quadrant versus the lesion
methodology.
MR.
MILLER: Okay.
DR.
STARK: I think that prejudices things in
favor of the technology. I'm not
sure. So you're saying if the standard
of care currently is to miss a quarter of lesions, then of that 25 percent
we'll miss one-fifth less so now we'll miss 20 percent of the lesions.
MR.
MILLER: Yes. There, a miss is defined loosely as you are not actioning a nodule
that a consensus panel believes should be actioned. I don't think that they are actually missing it in every
case. Sometimes they are giving it a
low rating.
DR.
STARK: Correct. But as far as --
MR.
MILLER: Yeah.
DR.
STARK: You can debate the inference but
the literature talks about a miss rate of 25 percent which we are going to
equate with actionable nodules. As we
talk about the apparent efficacy of this, and I appreciate your honesty, is that
we are taking a standard of care of a 25 percent miss rate that juries and
patients think is horrible in retrospect and we are going to cut that to a 20
percent miss rate. We can judge the
-- that's the efficacy.
MR.
MILLER: I should also add this is just
based on jumping from one 50 to the other 50 on the curve. We did another set of analyses based on what
happens if you jump from 50 to the other point on the curve where you -- I'm
sorry.
I
should say jump from 20 from one point on the curve to the other point with the
same PBD and jump from 20 to the same point without sacrificing the false
positive fraction. That also was a
protocol specified analysis and the numbers go down a little bit. I don't remember how much but it may be five
or 10 percentage points.
DR.
CONANT: May I interrupt or just jump in
for a second because you are on the slide that I'm curious about. You mentioned it's similar to mammography. It is but it's so different. I'm very interested in the by-case analysis
of this compared to by quadrant. The
reason being I think you have a little bias in your case selection and I'm not
sure if that is okay or not.
You
have the majority of your cases, 62 percent of the nodule present cases, as
people with extra-thoracic disease. I'm
not sure I really care about the absolute number of quadrants you've missed
because once you've got three nodules in both lung fields, who really
cares? It's metastatic disease so I
would want to see these numbers by case.
I
also think the comparison to mammography is very different because I think
that, again, chest analysis is much more multi-focal and reflective of systemic
disease than mammography in terms of a bilateral fairly somewhat independent
process. I would just like your
comments on that if you could take this another step and then do it by case.
MR.
MILLER: We did not do these analyses by
case. I suppose the data are there to
do it. I think the challenge with doing
it by case is that the way -- I should let a physician get up here in just a
second but the way that one would action a case where you had one lung where
you had a very high likelihood of it being something bad, using my simple
statistical language, and you had the contralateral lung where you had something
that was probably bad. That one that's
probably bad may actually be the one that drives the care of the patient.
Figuring
out how you sort of wrap this all up and do something like this at the patient
level was something that was sort of beyond the scope of what I was able to
imagine. I absolutely do not disagree
that it's something that would be useful to try to investigate in some
way. Having said that, I think I really
need a physician to answer the question.
DR.
CONANT: I'm not sure what the answer
is, though. However, in your cases it's
very different if a person -- if you're looking for a primary lung carcinoma
versus metastatic disease so they are very different clinical questions.
MR.
MILLER: Yes. Let me let Dr. Castellino answer that.
DR.
CASTELLINO: I'm not going to answer any
statistical questions. I can guarantee
you that. It is hard to answer that
question. I would like to put it more
in a clinical context of how we read cases every day.
I
agree that if you have a patient with a soft-tissue sarcoma and you find three,
four, five nodules, unless you are in a setting where you have surgeons who
aggressively pursue that, as I was at Sloan-Kettering, at times it is important
to find a sixth or seventh nodule. There
is a spectrum of surgical behavior.
Let's
assume that you find six or seven you don't have to find the last three. We had very few cases like that. The second thing is that we are not
positioning this product as a lung cancer detection product, although it does
work that way. Patients with lung
cancer who had a nodule, it was not necessarily the primary lung cancer. They may have had lung cancer before treated
post-op, post-radiation.
We
accepted those cases and had a lung nodule in the lung for whatever reason so
it wasn't really as a primary detection issue.
I'm not sure I answered that completely and I do recognize that certainly
mammography is quite different, as I think we have discussed before, than chest
CT.
I
would like to go back to a couple of comments you made. If I understood you correctly, I think you
said, Dr. Stark, that the issue was that we had a 50 percent sensitivity for
consensus nodules. As I recall from
looking at that, I think, with consensus we were closer to 80 or 83 with the
classic nodule definition. I'm looking
at the -- you'll see that later with Petrick.
If
you stratify those nodules with what would be more definition that radiologists
would call classic nodule. It ranges
from 83 to 59 I think is the number. Is
that correct?
DR.
STARK: We can study it but I'm trying
to draw data from table 10. When I
suggested 50 percent, it was based on this so maybe over lunch you can --
DR.
CASTELLINO: We can go through it. I thought it was about 59. But I think it's a good point. We would love to have developed an
algorithm, to be very honest, that was 100 percent sensitive but this is the
best we've come up with so far. I think the
issue to me as a clinical radiologist is how would this affect me or my
colleagues in practice to find more nodules that we look at a year later and
say, "My goodness. How did I miss
that? Why did I miss that?" The
ROC study, to some extent, I think, approaches that. I think this table here to some extent also would address
that. These are nodules potentially
that could be missed or are missed that the radiologist would say, "I
would have liked to have seen that nodule to make a decision as to whether or
not it's actionable or not." I
don't know if I'm addressing the myriad of questions that you had but I would
like to try to -- if you can rephrase some of them I would like to try to
answer them.
DR.
STARK: If the chair and the panel think
we have time.
DR.
IBBOTT: Let's wait until after lunch
and we'll have that detailed discussion this afternoon.
DR.
CASTELLINO: Can you write them out so I
can think about them?
DR.
STARK: I'm not sure of the
protocol. I'll ask for advice.
DR.
IBBOTT: I don't think there is any
reason why you shouldn't present those questions and let them think about them
over lunch.
DR.
CASTELLINO: That would be very helpful
because they are a lot and I think they are important questions. Thank you.
DR.
IBBOTT: Again, I'll take this
opportunity to ask Dr. Mehta if he has any questions that require clarification
at this point.
DR.
MEHTA: No, I don't.
DR.
IBBOTT: All right. Thank you.
DR.
SOLOMON: Do we have time for any more
questions?
DR.
IBBOTT: Well, certainly. Especially if it's appropriate now to get
clarification on something before we break.
DR.
SOLOMON: I guess I have a couple of
questions for Dr. Delgado. I guess they
start off by asking you a little bit more about what your experience was with
the system and then, more specifically, did you find that you as a radiologist
or any of your colleagues were using the CAD system or becoming more dependent on the CAD system and not quite giving
it the same kind of read that you would give ordinarily? Also, what was the impact on the time that
you spent on a case? Did it make it
longer or shorter? Why don't you answer
those.
DR.
DELGADO: Okay. Thank you.
I think those are good questions.
First of all, we did not do any time analysis with and without CAD or
separate, just soft-copy interpretation and then soft-copy interpretation
without CAD and then subsequently with CAD.
I
think it goes to say that if you are doing the second review that there might
be a time factor that would be slightly increased and that may be something to
be quantified. However, in my
experience I think, first of all, the first question is people were instructed
through the training phase that this device was to be utilized through a
primary read in which you make decisions on whether you see or detect a lesion
and then there is a way for you to mark it.
Then you activate the CAD and then you go through, as Dr. Castellino
said, really not the whole entire study again but only those images that
identified a lung nodule. It might be
on average three per case or so where you might click on a button and that
would take you immediately to that axial image and show you a lesion of which
then the radiologist would make a decision, "Did I miss this? Is this a significant mark that I would
consider actionable?"
Or,
if not, then easily discharge and be done with it. If it was a mark that is considered a false positive, that would be
discarded easily. I think we did have a
few of our radiologists who initially asked the question, "Well, is this
benign or malignant?"
Yet,
we made sure, and I as the principal doctor in charge of this made sure, to
remind them that this was not the purpose of this device. It's really only to present you with a
nodule that you may have missed and give you the ability to either add that to
your findings or completely discard it.
Does that answer your question perhaps?
DR.
KRUPINSKI: This will probably be more
for Dave. On point of clarification,
you've got a quadrant and suppose during the initial view the reader
says there's nothing there. There
really is a nodule and then the CAD comes up and points out the nodule and a
false positive.
Now
the reader increases their confidence and now do you consider that in the
analysis and how can you be sure? Do
you consider that a true positive and an increase in behavior when, in fact,
the radiologist was looking at the false positive? Is there any way without localization to establish that? If
you were then to take your cases and throw away any instances where the CAD
marked a true and a false positive and the reader went from "false
negative to true positive" what then happens to the ROC curves? Admittedly, although you've got statistical
significance, those curves are pretty darn close and you've got these ambiguous
cases now. How do you deal with that?
MR.
MILLER: Well, the short answer is that
we don't know precisely what happens in those instances. It was not captured. Bob Wagner talked about this best of both
worlds scenario. We really tried in the
way that we did the study not to take the readers out of their normal reading
environment.
We
felt that was very important and so capturing additional data was something
that we thought could take them outside of their reading environment and create
some kind of placebo effect essentially.
We don't have that data on which one of the nodules or which one of the
findings, I should say, which one of the CAD marks they are reacting to.
Now,
having said that, we did after we completed the ANOVA-after-jackknife analysis
you can pull out from that analysis which cases are the ones that were most
favorable in terms of producing a CAD effect and which cases are least
favorable in terms of producing a CAD worse effect.
I
sat down with a dozen or so of those cases with Ron Castellino, our chief
medical officer, and went through them and said, "Is it obvious what
they're reacting to here?" In the
overwhelming majority of the cases it was obvious what they were reacting
to.
The
number of marks per case is small enough that it is fairly unlikely -- I should
say fairly. The case where you have
multiple close to positive findings in a quadrant is not very common. It's common to have two in a quadrant but
most of the false marks are very easily dismissable.
I
mean, our engineers hate it when I say this but there are some vessels. I mean, not a statistician I look at it and
I say, "That's a vessel." So
the radiologist, it's really easy for them to dismiss those.
I
guess the short answer is we did not do the analysis that you are suggesting
but I completely take your point that it's important to figure out what was
really going on in the ratings. I think
I have a pretty good feel for it that they were reacting to true positives.
DR.
KRUPINSKI: So you rate them all as true
positives?
MR.
MILLER: Yeah. I mean, the only thing that -- I mean, just from a programming
perspective, the only thing that is fed into the analysis is the truth for the
quadrants and the ratings. Whether
there were or were not CAD marks there is not actually in the analysis.
You
could do an analysis that was more of a parametric model and a fixed effect
model where you tried to capture whether it was the quadrants with CAD marks
that were causing the increase, but I think it's reasonably obvious that they
are, and in trying to model that it gets pretty messy building that on top of the
models that we already did.
Just
while I'm up here, I did really quickly want to comment on the issue about the
sensitivity, the back and forth about that table. I think you were doing a weighted average of some numbers in a
table and we'll come back to that later, I think.
The
sensitivity number -- I mean, it's just incredibly variable depending on sort
of which reference truth you use and so if you hear different numbers going
back and forth, it's not necessarily inconsistent. Two people may actually be both reading sort of off the same page
but in a slightly different spot on the page.
Thanks.
DR.
IBBOTT: Thank you. At this point Dr. Stark has a couple of
questions he's going to raise now to be discussed later this afternoon.
DR.
STARK: Actually, it's a response to Dr.
Castellino's question which I respect and it's fair. I have been working very, very hard for this because, as we'll
discuss later, I have spent 15 years wondering why my ROC based prediction that
MRI for detection of liver cancer in 1985 was significantly better than
CT. That was wrong. I think I know why and I think this group
here, the industry group and the panel, I think, we're at the nub of it.
Dr.
Castellino, rather than have us, given the formality and the importance of this,
scratching on pieces of paper, I've asked the chair to allow me to read. I've formed a question and I'm going to read
it into the record and I'll give you my handwritten copy of what I'm going to
read just so that we're clear on this.
Forgive me. You've seen me
scrambling over three minutes here. If
any of this is unclear, I'll rephrase it.
Thank you for offering to do this.
Would you please calculate
from the data and/or literature discussed or presented here today, and in your
submission, the net decrease in false negative rate which we have here today
estimated to be 24 percent for practicing radiologists working by themselves
when those radiologists in the future, we're projecting, are to add this
technology and these results, these data to their practice, specifically
accounting for what Dr. Conant was just asking about, accounting for and not
crediting as a detection or improvement with the addition of CAD those
quadrants or patients as you compile the data where CAD marked a false positive
lesion in a quadrant where the radiologist alone had a false negative.
Where
that radiologist, in other words, failed to recognize a true lesion false
negative for the radiologist that was not subsequently marked by the CAD.
I
have this written down. I think that
translates into English and I would be happy to clarify. Feel free to grab me during lunch if there
is some nuance of that that would make a better question.
DR.
IBBOTT: All right. Thank you.
At this point then, we'll call this session to a close and break for
lunch and we will reconvene at 1:15, just a little less than an hour. Thank you.
(Whereupon,
at 12:21 p.m. off the record until 1:18 p.m.)
A-F-T-E-R-N-O-O-N S-E-S-S-I-O-N
1:18
p.m.
DR.
IBBOTT: Could I get you to take your
seats, please, and we'll continue.
Thank you. I would like now to
call the meeting back to order and I would like to remind public observers of
the meeting that while this portion of the meeting is open to public observation,
public attendees may not participate unless specifically requested to do so by
the chair. At this point Mr. Doyle has
a statement to make.
DR.
DOYLE: Yes. The R2 has approached me and indicated that they have developed
answers to the questions that Dr. Stark proposed at the end of the morning
session. In an effort to keep the
meeting moving with the schedule we have, I have asked them to present those
answers at the beginning of the discussion section this afternoon. They have the answers ready and I would just
ask for the flow of the meeting to present those at that time. Thank you.
DR.
IBBOTT: Thank you. We will now continue with the FDA's
presentation on this PMA which will be introduced by Dr. Phillips.
Dr.
Phillips.
DR.
PHILLIPS: Well, in case you forgot what
we're doing over lunch, we are discussing the ImageChecker CT CAD by R2
Technology. It is a system that
analyzes and displays to assist radiologists in review of multi-slice CT exams
of the chest and in the detection of solid pulmonary tumors.
It
is composed of several items. It's a
combination of software and a computer.
The system is a work station which is the ImageChecker CT Model LN-500. This was approved for marketing under a
510(k) K023003, the software which is the operating system for the product that
we are looking at today.
Again,
the indications for use, and I don't need to read those. Then this was reviewed within FDA by a
rather extensive team. Michael
Kuchinski was the team leader; William Sacks was the clinical reviewer; Teng
Weng was the statistics reviewer; Robert Wagner and Nicholas Petrick
reviewed the analysis methodology; Joseph Jorgens reviewed the software; Larry
Stevens did bioresearch monitoring; Fleadia Farrah did the manufacturing. That's the quality systems regulation; and
Ronald Kaczmarek reviewed it from epidemiological basis.
Two
people will present to you today, Bill Sacks and Nicholas Petrick, discussing
the PMA. The other reviews were all
found to be satisfactory and we are moving on from there.
With
that, Bill Sacks.
DR.
SACKS: I apologize for the jaundiced
look of that. It wasn't so bad in the
rooms we were testing this in. Okay. I'm going to just give some background. Then Nick Petrick will present the data from
the clinical study and then I'll come back and draw some conclusions.
The
outline of my introductory comments, I'll say something about the character of
the device for those of you who did, in fact, forget over lunch; something about
the clinical utility; a point about the instructions for use; and some issues
that are new to this particular PMA.
First
on the character of the device. Just to
remind you, this is for chest CT scans and for CTs that are done for any
indication the algorithm is trained to detect solid lung nodules, not, for
example, ground glass opacities. It is
trained to detect nodules between 4 and 30 mm.
Also
there was a Hounsfield unit cutoff which is just CT numbers, the amount of
radiographic attenuation that needs to be above -100. In particular, this is a computer-aided detector. Just to say a word about the difference
between computer-aided detection and computer-aided diagnosis, a point I made
earlier.
The
difference between detection and discrimination lies not in the instrument but
in the clinical use to which it's being put.
The detector system, which is what we're talking about today, this
left-hand column, scans entire images whereas a discriminator only scans portions
that are selected by the user. The
detector marks the images whereas a discriminator will give a level of suspicion
that is just a number. As I say, the
same device will do both but it is thresholded to give you marks when it's
acting as a detector.
On
clinical utility, as we've heard, many nodules are missed in clinical practice
for two major reasons. One, other
pathology distracts and hundreds of images are present in one CT of the
chest. Indeed, you may start out as a
board certified radiologist and after reading 500 images you are certifiably
bored.
A
CAD is intended to reduce the missed nodules, this CAD. That is, it is intended to increase the
user's sensitivity to detecting lung nodules.
We will come back to this point.
Instructions
for use. The important points are that
the reader should review the films unaided first. Then the CAD marks the candidate nodules. Then the reader looks again in the vicinity
of those marks.
If
the CAD fails to mark a nodule that was judged actionable on the initial
unaided review, the instruction in the labeling reads that the reader should
retain that initial judgment, not back off just because the CAD failed to mark
it. We will come back to this in my
closing comments.
Issues
that are new to this PMA are the particular choice of target for the CAD
algorithm, the definition of truth, the unit of analysis and endpoints. I'll say something about each of those.
First,
on the CAD target, the target is not malignant nodules but actionable nodules
as we've heard which, among other things, means that the definition of truth is
not based on biopsy or tissue histology which would be an external standard,
but rather based on the judgment of an expert panel that is an internal
standard based on the very images that are being evaluated here.
The
unit of analysis, as we've seen, at one level the statistical unit is the
person but it's further broken down into lung quadrants and Nick Petrick will
say more about that.
Finally,
the end points. One could do entire
ROC curves, as was done, and one could, as Bob Wagner explained this morning, in
addition, or instead of, do the sensitivity and specificity of a particular
action recommendation which was not, in fact, done in this particular study.
In
summary, again, just to remind you, the clinical study consisted of three
expert radiologists drawn from a group of 11 but three at a time on a panel to
determine what was called by the company reference truth for each nodule. Then there were 15 completely different
radiologists with a range of experience, not necessarily experts, that were
called the readers and they all 15 read all 90 cases and the 90 subjects were
divided into 360 lung quadrants. Those
15 readers used a 100 point scale for a confidence and actionability rating for
each case.
Now
I'll introduce Nick Petrick who will give you the clinical data.
DR.
PETRICK: Okay. So my name is Nick Petrick and I will go
through -- let me see which one of these work.
I'll go through the clinical results that were done by the sponsor and
some of our perspective. The outline of
my talk will be first to talk about the applicability of Az in the
analysis. Here I'm using the term Az
which is somewhat more of a technical term but this is the same as the area
under the curve or AUC. Other people may
call it area under the curve or AUC but I'm going to use that as meaning the
same thing here.
I
will also talk about and somewhat review what the sponsor presented on the pool
of cases used for the clinical study.
I'll talk about the definition of actionable nodules by the panel of
experts. Then I'll go into the
particulars of the clinical study.
In
particular, I'll talk about the primary analysis which was analysis using a
fixed panel of experts and then what is somewhat of importance here, the secondary
analysis which was the analysis using random panels of experts.
Then
I'll finish up my presentation by talking about the measurement of CAD
stand-alone performance. When I'm
talking about stand-alone performances this is the algorithm performance with
no reader involvement.
Okay. So for the applicability of the Az here,
I show one of the sponsor's curves for the average reader ROC from pre to post
CAD and this had a change in the area under the curve of .024 and a p-value as
shown there of .003.
What's
important to note about the applicability of the Az is that the green curve
here is the pre-CAD and the reddish curve is the post-CAD. And what we're looking for is that the two
curves don't cross. That is an
important measure if we are going to use Az as an overall performance measure
for ROC analysis. What we find from
this average curve is that generally the post-CAD curve is higher or on the
same order as the pre-CAD curve.
So
just to summarize this, the pre and post-CAD curves did not cross in the average
performance I showed before. I think,
more importantly, there was no substantial pre or post-CAD crossing in either
the average or individual ROC curves.
This is important. That makes the Az a statistically
appropriate performance measure for this type of analysis. If they had a significant crossing, we would
have had to look at some sort of partial area or some other measure of
performance in that situation. Because
of this conclusion the sponsor had used an Az as a figure of merit in all their
analysis that follows.
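The area measure discussed above can be illustrated with a minimal sketch. Strictly, Az usually refers to the area under a fitted (binormal) ROC curve; the code below swaps in the simpler nonparametric area, which is the fraction of positive/negative pairs that the ratings order correctly, with ties counted as one half. The ratings are invented for illustration and are not study data.

```python
def empirical_auc(positive_ratings, negative_ratings):
    """Nonparametric area under the ROC curve (Mann-Whitney form).

    Counts, over every positive/negative pair, how often the positive
    item received the higher rating; a tie counts as half a win.
    """
    wins = 0.0
    for p in positive_ratings:
        for n in negative_ratings:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_ratings) * len(negative_ratings))


# Hypothetical 100-point ratings for three positive and three negative quadrants.
negatives = [10, 50, 30]
pre_cad_positives = [80, 40, 60]
post_cad_positives = [80, 55, 60]  # one rating raised after seeing a CAD mark

pre_auc = empirical_auc(pre_cad_positives, negatives)
post_auc = empirical_auc(post_cad_positives, negatives)
print(f"pre-CAD area:  {pre_auc:.3f}")
print(f"post-CAD area: {post_auc:.3f}")
print(f"change:        {post_auc - pre_auc:.3f}")
```

A single area number summarizes the whole curve, which is why it is only a fair comparison when the two curves do not cross substantially.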
Okay. Now to talk about the pool of cases. Again, just sort of a summary of what the
sponsor had talked about before. There
is a pool of cases. There was a subset
of that which was made of nodule cases.
These were documented cancer cases so the primary neoplasm or
extra-thoracic neoplasm with presumptive spread to the lungs. That is the set of nodule cases. The cases were allowed to contain non-nodule
pathologic processes, things like pneumonia or emphysema and so forth were
allowed to be part of that subgroup.
They
took another set of cases. These were
considered the non-nodule cases and what they term or what can be termed as
normal cases where there was no nodule deemed present by the site PI and that
site PI primarily relied upon original radiology reports in coming to that
determination.
These
cases could include a history of cancer, radiation therapy, or even previous
thoracotomy were allowed to be in this data set. This is a pool of cases that now the sponsor will pull out cases
to run their ROC reading studies from.
At
this point we're not going to talk about -- we are going to talk about
actionable nodules or the object of interest in this application. In particular, there is a panel of expert
radiologists that identified the actionable nodules. This was done in a two-stage process, again, just as a review as
before.
In
the first reading the cases were read independently and blinded by three expert
radiologists. The information provided
to the radiologists was the subject's age, gender, and indication for the
exam, obviously along with the exam as well.
Each
individual radiologist marked all findings deemed to be lung nodules. Then the radiologist provided ratings for
each of those nodules so there is a detection test and then there's a rating of
the actionability of that nodule. It
could have fallen into an interventional category. That is an actionable finding where further workup was advised.
A
surveillance which is, again, considered an actionable finding which was
monitored with follow-up studies and this would probably be more typically
additional CTs. Also, they could have
rated as probably benign calcified.
Again, no action required here, or probably benign noncalcified, no
action required.
After
the first pass was done, findings that lack 100 percent consensus after that
first pass were reviewed unblinded by all three radiologists and basically they
are going to reevaluate locations where either two out of three of the panel or
one out of three of the panel call the location a nodule. Then the radiologist would rate or rerate
these on the actionability of the nodule candidates.
Along
with this, thresholding was applied to match the general range
where the algorithm should be performing, and so thresholds of greater
than 4 mm in diameter for each nodule candidate and a peak density of greater
than -100 Hounsfield units. This
is considered a CT number and is related to the attenuation coefficient in
grayscales in the CT exam.
Then
after each nodule was identified, each lung quadrant was categorized based on
the highest actionable finding within that quadrant. Then subsequently the quadrants will be used in the observer
studies.
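The thresholding and quadrant categorization just described can be sketched as follows. This is an illustration under stated assumptions, not the sponsor's procedure: the `Finding` record, the category labels, and in particular the ranking of the two "probably benign" categories relative to each other are hypothetical. Only the 4 mm and -100 Hounsfield unit cutoffs and the rule of taking the highest actionable finding per quadrant come from the testimony.

```python
from dataclasses import dataclass

# Assumed ordering, least to most actionable. The testimony names these
# categories but does not rank the two "probably benign" ones.
RATING_ORDER = [
    "probably_benign_calcified",
    "probably_benign_noncalcified",
    "surveillance",
    "interventional",
]
ACTIONABLE = {"surveillance", "interventional"}


@dataclass
class Finding:
    quadrant: int           # 0..3, one of four lung quadrants
    diameter_mm: float
    peak_density_hu: float  # peak CT number in Hounsfield units
    rating: str             # one of RATING_ORDER


def passes_thresholds(f: Finding) -> bool:
    """Size and density cutoffs quoted in the presentation."""
    return f.diameter_mm > 4.0 and f.peak_density_hu > -100.0


def categorize_quadrants(findings, n_quadrants=4):
    """Label each quadrant with its highest-ranked qualifying finding."""
    result = {q: None for q in range(n_quadrants)}
    for f in findings:
        if not passes_thresholds(f):
            continue
        current = result[f.quadrant]
        if current is None or RATING_ORDER.index(f.rating) > RATING_ORDER.index(current):
            result[f.quadrant] = f.rating
    return result


def is_positive(category) -> bool:
    """A quadrant counts as positive if its top finding is actionable."""
    return category in ACTIONABLE


# Hypothetical findings for one case.
findings = [
    Finding(0, 6.0, 20.0, "surveillance"),
    Finding(0, 8.0, 30.0, "interventional"),
    Finding(1, 3.0, 50.0, "interventional"),      # too small
    Finding(2, 5.0, -300.0, "surveillance"),      # too low in density
    Finding(3, 5.0, 0.0, "probably_benign_calcified"),
]
quadrant_labels = categorize_quadrants(findings)
print(quadrant_labels)
```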
Now,
just to summarize what was found in that initial pass, again, this is three
experts per panel. I'll show in this
column the unanimous actionable. That's
three out of three finding. Majority
actionable two out of three. Minority
actionable one out of three. You can
see that for unanimous actionable there were 142 findings. For majority there were 168. For minority there were 149 findings.
This
gives you somewhat of an indication that panel variability is an important
component here. There's a lot of cases,
almost a third -- only about a third of the cases were unanimously actionable
and another third or so were two out of three, and another third were one out
of three. This gave the FDA an
indication that panel variability was an important component and probably
should be taken into account in the clinical study.
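The "about a third each" description can be checked against the three counts quoted above. This is a small sketch using only those counts; the dictionary labels are illustrative.

```python
# Findings rated actionable on the first pass, by level of panel agreement.
counts = {
    "unanimous (3 of 3)": 142,
    "majority (2 of 3)": 168,
    "minority (1 of 3)": 149,
}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

for label, share in shares.items():
    print(f"{label}: {counts[label]} findings, {share:.1%} of {total}")
```

The shares come out close to one third each, which is the basis for treating panel variability as a component of the analysis.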
Now
to go into the clinical study, there were multi-reader, multi-case ROC observer
studies. Again, the test statistic was
the Az or area under the curve. I'll
present net results based on analysis of the 90-case data set, 360 quadrants. The sponsor also performed a 32-case study
and also presented pooled results of the 32 and 90 cases. I'll just limit myself to the 90-case study.
What's
important is that the MRMC allows us to look at the variability, confidence intervals,
and significance testing and we can take those into account. That is important obviously in this case to
determine significance and then to try to get an idea of what the separation is
between the reading without CAD and reading with the CAD device.
In
order to analyze the variability confidence intervals and significance two
approaches were used, ANOVA-after-jackknife and bootstrap analysis. So here is just the general flow chart to
the clinical study and this will be followed for all the clinical studies. The study starts out with a pool of
readers. These are going to be the group
of radiologists that are going to actually read the cases and give rankings for
each quadrant.
There's
a pool of cases and there's a pool of experts and the experts will be used to
define truth. There will be a sample
pulled out of cases. It will be used by
the pool of experts to define nodules.
There will be a set of readers picked out. Those cases will then be read using multi-reader multi-case ROC
observer study and an estimate of the Az will be calculated. This could then be redone for different case
sets, different reader sets, and potentially different experts on a panel.
So
the important components here are how to measure the variability confidence
intervals and do significance testing.
Again, two approaches were taken. First, ANOVA-after-jackknife analysis. This is a parametric type of analysis and
jackknife is just a leave-one-case-out type of analysis.
Again,
we're talking about leaving out a whole case so you're leaving out all four
quadrants together and then performing a quadrant-based analysis on that. So just as a quick example, if we had a case
set of case one, two, and three, when jackknifing is performed or leave one
case out, the first partition is going to be one and two. We've left out case three. The second partition may be set case one and
three, case two has been left out.
The final
partition would be two and three, leaving case one out. Then using those partitions and looking at
the pseudo values that come out of that you can use ANOVA to estimate the
variability confidence intervals and significance. The analysis assumes modality as a fixed effect and readers,
cases, and all interactions as random effects in the ANOVA.
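The leave-one-case-out example above can be written out as a short sketch. The partition function reproduces the three-case example in the order given; the pseudovalue function uses the standard jackknife formula, n times the full-sample estimate minus (n - 1) times the leave-one-out estimate. In the study the statistic would be Az computed on the reduced case set, with the pseudovalues then passed to the ANOVA; here a plain mean stands in for it, and the numbers are invented.

```python
from statistics import mean


def leave_one_out_partitions(cases):
    """All partitions with one whole case removed, last case dropped first."""
    return [
        [c for j, c in enumerate(cases) if j != i]
        for i in reversed(range(len(cases)))
    ]


def jackknife_pseudovalues(values, statistic):
    """Pseudovalue i = n * stat(all) - (n - 1) * stat(all but item i)."""
    n = len(values)
    full = statistic(values)
    pseudo = []
    for i in range(n):
        reduced = values[:i] + values[i + 1:]
        pseudo.append(n * full - (n - 1) * statistic(reduced))
    return pseudo


partitions = leave_one_out_partitions([1, 2, 3])
print(partitions)

# Stand-in per-case quantities; with the mean as the statistic, each
# pseudovalue simply recovers the corresponding original value.
per_case = [0.8, 0.9, 1.0]
pseudovalues = jackknife_pseudovalues(per_case, mean)
print(pseudovalues)
```

Removing a whole case takes all four of its quadrants out together, which keeps correlated quadrants from the same patient on the same side of each partition.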
A
second approach to doing this is bootstrap analysis and this becomes important
to look at variability of the truth panel.
This, again, just to repeat, is a nonparametric analysis. What happens is randomly generated data sets
are created based on the original data using replacement. Just as another quick example, with a case
set of one, two, and three again, when you run bootstrap you use replacement. For
the first partition, you randomly pick maybe case three, case two, and case
three.
When
you do the analysis you assume that case three and case three are really
separate events and we bootstrap across those to get those potential
partitions. The second partition you
may pick case three, case one and case two.
Here all the cases have shown up equally. Then a third partition may be case one, case one, and case two
and so forth.
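The resampling with replacement described above can be sketched as follows. The function names, the seed, and the use of a plain mean as the statistic are assumptions for illustration; in the study the resampled quantity would be Az, and readers and panel members could be resampled in the same way as cases.

```python
import random


def bootstrap_partitions(cases, n_partitions, seed=0):
    """Draw case sets of the original size, sampling with replacement."""
    rng = random.Random(seed)
    return [rng.choices(cases, k=len(cases)) for _ in range(n_partitions)]


def bootstrap_standard_error(values, statistic, n_partitions=1000, seed=0):
    """Spread of a statistic across bootstrap resamples of the data."""
    rng = random.Random(seed)
    estimates = [
        statistic(rng.choices(values, k=len(values)))
        for _ in range(n_partitions)
    ]
    center = sum(estimates) / len(estimates)
    variance = sum((e - center) ** 2 for e in estimates) / (len(estimates) - 1)
    return variance ** 0.5


def average(xs):
    return sum(xs) / len(xs)


# A case may appear more than once in a partition, or not at all.
partitions = bootstrap_partitions([1, 2, 3], n_partitions=5, seed=1)
for p in partitions:
    print(p)

# Invented per-case values, used only to show the spread calculation.
se = bootstrap_standard_error([0.8, 0.9, 1.0], average, n_partitions=500, seed=1)
print(f"bootstrap standard error: {se:.3f}")
```

Repeated draws of the same case are treated as separate events, which is what lets the spread of the resampled estimates stand in for sampling variability.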
So
the primary analysis, again, the same basic diagram as before but now there's a
resampling scheme introduced into the analysis. The resampling is used for the pool of readers, again, the people
that are going to -- the radiologists that are going to rank the quadrants and
the pool of cases.
The
truth is based on a fixed three-member nodule definition panel, again, based on
unanimous consensus. The analysis will
be based on ANOVA-after-jackknife. A
bootstrap analysis was also performed.
What happens here is a pool of readers go in. It's resampled so it picks out a subset of readers. Likewise a subset of cases is selected using
a resampling scheme. The cases go into
the definition panel where the panel is fixed and define the actual nodules of
interest or the quadrants that are positive or those that are negative.
The
set of readers are then randomly selected and go in and perform the ROC
experiment. That gives one estimate of
Az. This process is repeated either
through jackknife or bootstrapping in order to get estimates for the
variability and allow for confidence intervals and significance testing.
So
just the result of the clinical study.
Again, this is for a fixed three-member nodule definition panel. In the first column I show the pre-CAD Az
for both jackknife and bootstrap. The
second column is post-CAD, the change in the Az, the p-value for that
particular test, and the lower and upper confidence intervals.
You
can see that the results are fairly consistent between both jackknife and bootstrap
with a pre-CAD Az of .881 or .879, post-CAD increasing to .905 or .903. With change on the order of .024 we see
fairly small p-values for both the jackknife and bootstrapping. Then the confidence intervals also fairly
consistent.
We
wouldn't necessarily expect the bootstrap and the ANOVA to give us the same
values but it's nice actually to see that there is consistency here between the
two analyses.
So
just some conclusions on the primary analysis.
The sponsor has shown a statistically significant improvement in Az from
pre to post-CAD and that is on the order of .024 change in area under the
curve.
The
ANOVA-after-jackknife and bootstrap analysis showed consistent performance in
both significance and confidence intervals.
The analysis, however, was limited because it did not take into account
any variation in the expert panel.
Variability of the panel would add uncertainty to the performance
estimates, or we anticipate that variability in the panel would add uncertainty
to the performance estimates.
This
is, I think, an important factor because we don't have this gold standard of
truth. We are dealing with a panel
truth. We expect if we sampled a new
panel, they may come up with a different set of cases. They certainly would come up with some
different nodules there.
One
of the important questions is how would performance change with a different
panel makeup. That is one of the
questions that we had talked to the sponsor about addressing. In particular, looking at a different number
of panel members so if you have a different panel makeup or a different
definition of truth potentially and different sets. What happens if another set of experts was used.
So
a secondary analysis was conducted here.
I'll show there are many different types of analysis done by the
sponsor. I'll concentrate on one set of
random panel makeup. This will be based
on a random three-, two-, or one-member nodule definition panel and
assuming the definition for truth is unanimous consensus.
Because
of this type of analysis the ANOVA-after-jackknife isn't applicable at this
point so only bootstrap analysis is possible.
It follows a similar scheme as before.
We, again, start with a pool of readers, pool of cases, pool of
experts. Here, however, bootstrapping
is applied to the pool of experts as well so that we have a different panel
makeup for defining truth. That adds
variability into that definition of truth and we can use our MRMC ROC observer
study to take into account that variability.
So
we use bootstrapping to select a group of readers, a group of cases, and a
group of experts. Again, with that
particular combination we get an estimate for Az. That study is repeated a number of times to allow us, again, to look
at variability where we have included variability of the truth.
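
One draw of the three-way resampling described above, together with the unanimous-consensus definition of truth, might look like the Python sketch below. All names and the sample marks are hypothetical. Whether the panel was drawn with or without repeats is not stated in the presentation, so drawing without repeats here is an assumption.

```python
import random


def unanimous_positive(panel, marks, quadrant):
    """A quadrant is positive only if every sampled panelist marked it."""
    return all(quadrant in marks[expert] for expert in panel)


def resample_study(readers, cases, experts, panel_size, rng):
    """One bootstrap draw of readers, cases, and a truth panel."""
    return {
        "readers": [rng.choice(readers) for _ in readers],
        "cases": [rng.choice(cases) for _ in cases],
        # Assumption: the panel is drawn without repeats from the expert pool.
        "panel": rng.sample(experts, panel_size),
    }


# Hypothetical marks: expert -> set of quadrants that expert called a nodule.
marks = {"E1": {"q1", "q2"}, "E2": {"q1"}, "E3": {"q1", "q2"}}
draw = resample_study(
    ["R1", "R2"], ["c1", "c2", "c3"], list(marks), 2, random.Random(0)
)
```

Each draw would feed one ROC experiment and give one Az estimate. Repeating the draw many times spreads the estimates out by reader, case, and panel variability together.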
So,
again, these are random three, two, and one member nodule definition
panels. When I'm talking three-member
panels I'm saying unanimous consensus.
Three out of three have to agree.
When I get results for two members that means two members.
They
both have to agree. Obviously for
one-member panel it is the opinion of one of the members. The sponsors randomly sampled that panel so
that we get the added variability from having many different experts involved.
Again,
the same layout here. The pre-CAD Az,
the post-CAD, the change, the p-value, and the lower and upper confidence
intervals. We can see from pre-CAD this
measurement of performance was .845 increasing to .868.
For
the three-member random panel a change of .022. For a two-member panel it was .832 increasing to .854, again a
change of about .022. For a one-member panel it was
.817 increasing to .838. Again, a
change of about .021, but very
similar, .022 on average.
We
also see fairly consistent upper and lower confidence intervals for all
different definitions of the truth.
Then we see the significance values which are fairly small as well. That's sort of interesting because what I
talked about before was that we expected when we incorporate randomness of the
panel in here, we would see an increase or a decrease in the statistical
significance, that this would be a harder -- that it would be harder to show
statistical significance.
Really
we see similar p-values to what we saw when we had a fixed-member panel. One of the possibilities or one of the
trade-offs that may have occurred was something that Dr. Wagner talked about
this morning where when the definition of truth is varied, we have also varied
the case mix or the differentiation between negative and positive findings so
we have now moved ourselves potentially more off the curve where we have a more
closely balanced study, which gives us effectively a larger number of cases or a
larger number of effective cases.
That
was traded off against the variation in the truth. Those seem to potentially have traded each other off where we
don't see a big difference in the performance.
This is one possibility. It's
certainly not conclusive in any way but it is somewhat surprising that we didn't
see a larger variation in the truth when we randomize it.
So
just some conclusions on the secondary analysis. This analysis took into account the random nature of the expert
panel for defining actual nodules. In
particular, it took into account different number of panel members and
different panel makeup using a bootstrap selection of the panel.
All
variations of the panel makeup confirmed a statistically significant
improvement in the Az from pre to post-CAD and this change was on the order of .02. And just a more general conclusion, this
type of analysis where we actually tried to randomize the panel makeup is
likely to be a more appropriate type of analysis for assessment of devices when
panel truth -- when only panel truth is available. That's obviously the case here but we can anticipate other
devices potentially coming in where this will again be an issue.
Finally,
I would like to talk about CAD stand-alone performance. In particular, this is a performance of the
CAD algorithm alone and it's the algorithm's sensitivity and specificity with
no reader involvement so we are just going to measure the performance of the
algorithm on some set of cases or defined nodules.
Why
may this be important? Well, it's
generally important because the radiologist can use this information to
appropriately weigh their confidence in the CAD marking so this is a
measure. If you are a reader or a
radiologist trying to purchase this device, you generally like to know how it
would work. Or if you have the device
to use, to get a feel for how it's performing and what it might be marking.
Likewise,
it potentially can be used as a benchmark for future revisions of the algorithm
so from an FDA perspective knowing some benchmark of performance may help us to
determine how to evaluate new revisions of this particular algorithm when it
comes in.
The
question becomes what's an appropriate performance measure for this particular
device and this isn't necessarily an easy question to answer. Anecdotally the sponsor went back and looked,
for the unanimous three-out-of-three fixed-member panel, at
the appearance of the nodules that the radiologists marked.
What
they found was that many of those 142 findings did not meet the criteria of
solid discrete spherical density. They
subsequently went back and reconvened a second panel to reevaluate the nodules
but only based on appearance. Not to
find new nodules but just look at the appearance of those nodules defined.
They
put together a set of five independent radiologists and they were asked to
categorize the nodules into two categories, either what they define as classic
nodules, which are discrete, solid,
spherical or ovoid nodules, or as nonclassic nodules. These would be nodules that may not be discrete. They may be hyperdense, irregular in
shape. They may be potentially normal
structures that for whatever reason may not be considered nodules at all. This new panel is only going to look at the
appearance of the nodules and determine whether they are classic or nonclassic
in appearance.
This
is the performance. In the first column
I'll show the number of panelists defining the nodule as classic. Again, there was a total of five. I'll just group together zero, one, and two
out of five. I'll give the number of
findings, then the true positive fraction,
the sensitivity of the CAD algorithm for that particular subset of cases.
In
general I'll just summarize the CAD false marker rate. Then I'll give a final column to the median
diameter of the true positives detected.
This is just to give an idea if there is any bias on the size of the
nodule based on how many panelists defined it as classic.
So
in the first category less than three out of five there was a total of about 65
findings. The sensitivity was on the
order of about 32 percent. For three
out of five there was a total of 13 findings, sensitivity of approximately 70
percent. Four out of five of the
panelists saying this is classic in appearance the performance jumps up to
about 82 percent. All five the
performance is about 83 percent.
If
you just combined all these findings together a total, again, of 142 based on
the definition of truth. The
sensitivity is on the order of about 59 percent. The CAD false marker rate, it varied between two and three
depending on whether the sponsor incorporated or didn't the equivocal
nodule. If you had a five-out-of-five rating, what you did with the
zero, one, two, three, four out of fives whether you included those or not as
false positives would change the median false marker rate but it's on the order
of two or three per case.
In
the final column we see that this is a range of the diameter to those true
positives. You can see that it ranges
from about eight to nine. For the less
than three out of five it was 7.4. For
three out of five it jumped up to about 11 and fell down to seven again. The idea of this column is just to show
there doesn't really seem to be a bias associated with how large the lesion was
based on how they rated it as classic or not.
Just
as a final summary, if fewer than three out of five panelists rated a finding as classic, there
were approximately 65 findings and the sensitivity was about 32 percent. If it was three or more out of five,
there were about 77 findings. This is
about half and half -- relatively close to half and half for the data set. The sensitivity jumped up to about 81
percent.
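
As a rough arithmetic check on the figures quoted above, the two groups can be pooled in Python. Since 65 + 77 = 142, the second group is taken here to be the findings rated classic by three or more of the five panelists, which is an assumption about how the summary was split. The sensitivities are the approximate values as spoken, so the result is approximate as well.

```python
# (number of findings, approximate CAD sensitivity) as quoted in the presentation
groups = {
    "classic by fewer than 3 of 5": (65, 0.32),
    "classic by 3 or more of 5": (77, 0.81),
}

total_findings = sum(n for n, _ in groups.values())
expected_detected = sum(n * sens for n, sens in groups.values())
pooled_sensitivity = expected_detected / total_findings  # roughly 0.59
```

The pooled value agrees with the overall sensitivity of about 59 percent quoted for all 142 findings.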
So
just in summary for the CAD stand-alone performance, what was found by the
sponsor was there was a large variation in performance of the CAD based on the
physician's assessment of the nodule's appearance as classic. Whether it was classic or not would make a
big difference on how well the CAD performed.
Just
a note, generally the CAD -- the sponsors talked about the CAD being associated
with these discrete spherical types of lesions and not necessarily some of the
other types of lesions that were potentially marked.
So
just in summary for this part of the presentation, what the sponsor found was
that the -- what we found was that the Az was an appropriate test statistic for
the clinical analysis and this was based on the fact that there was no
substantial crossing of the pre and post-CAD ROC curves.
The
primary analysis, this was based on a fixed three-member expert panel. It showed a statistically significant Az
improvement in the detection with the CAD.
What was also found was the ANOVA-after-jackknife and bootstrap showed
comparable significance testing and confidence intervals.
The
secondary analysis, this was with a variable number of panel members where the
sponsor varied the number of panel members.
They also varied the panel makeup using a bootstrap selection of the
panel members so this is a random panel mix now. This confirms statistically significant Az improvement in the
detection with CAD.
Then,
finally, for this CAD stand-alone performance what was found was that there was
a large variation in CAD performance based on the reassessment of the nodule's
appearance. A more general conclusion
from stand-alone performances is that this type of analysis is necessary for
appropriate utilization of the device by the clinicians in the field and for
potentially reassessment of future algorithm revisions.
Now
I'll turn it over to Dr. Sacks again to make some conclusions.
DR.
SACKS: Okay. I want to then draw some clinical conclusions about this
statistically significant gain.
Granting the statistical significance of a gain in Az of .02, what is
the clinical significance and this is a point that was discussed somewhat this
morning.
Let
me recall for you an earlier slide that I have excerpted this from. That is, that the clinical utility of this
device is that the CAD is intended to reduce the number of missed nodules. That is, it is intended to increase the
user's sensitivity, not increase the area under the curve, although that is
related.
A
gain of .02 in Az understates the relative gain in sensitivity. Why is that? When the CAD is used according to instructions to retain all
judgments of actionability, even if unmarked by the CAD, the user always
necessarily maintains or increases his or her sensitivity and, indeed, always
maintains or increases the false positive fraction as well. They both have to go up. They could stay the same but that would be
an extreme case that wouldn't likely happen, but they cannot go down either
one.
What
that means in ROC space is that -- let me walk you through this slide -- the
blue curve is intended to be a representation of the unaided initial
reading. The red curve is the aided
reading. We've been talking about the
difference between the area under the red curve and the area under the blue curve.
But
if you talk about a particular operating point on the blue curve unaided and
ask what happens when you use the CAD, you move to some point on the red curve
and if you obey those instructions not to back off when the CAD fails to mark
something that you thought was actionable, you necessarily move up and to the
right somewhere in that quadrant such as this arrow here so you move to some
point here.
Now,
Dave Miller showed you a number of representative arrows if you were to use a
particular point on the rating scale on the blue curve and keep that same point
on the rating curve -- on the red curve, the same rating, 80 or 50 or 20.
The
problem is that radiologists while they could read by assigning a number to a
study and always obeying a preset range for themselves saying, "If I
assign any case 70 or more, then I am always going to act on it the same
way.
If
I assign between 40 and 70, I'm always going to act on it the same way. If I assign under 40, I'm always going to
act on it in the same way," then those points might be relevant. Radiologists could do that but I'm a
radiologist and I can tell you radiologists don't do that.
What
they do do is they look at a case and they decide, "Do I act on this or do
I not?" Or if there is a
trichotomy such as in mammography where there is biopsy or short-term follow-up
or return in a year for screening, that is the decision you make. That gives you an operating point that may
or may not lie on the curve that you would construct if you gave a rating.
It
wouldn't necessarily lie on that curve.
It would lie on that curve if you always assigned your action based on a
preset fixed range of ratings. But
because those are done independently, those modes of thinking, the point that
you operate on in terms of actual sensitivity and specificity may or may not
lie on the ROC curve.
For
this particular clinical study we don't know but what we do know is if you
maintain that rule, and you are free to violate it if you are going to, but in
this clinical study people did not violate it and what we can see is if we put
this in the labeling and say to potential users out there, "Stick with
this rule and you are not going to lose sensitivity," then what you're
going to be doing is moving up and to the right.
And
you can see from this gain in sensitivity, this increment here which is along
-- TPR is just true positive rate or fraction.
It's just another word for sensitivity -- that increase is a little more
impressive than .02. I can't quantify
it but you can expect that your gain in sensitivity is going to be
greater. The utility of knowing that
the red curve is higher than the blue is that you know that you're not so
greatly increasing your false positive rate as to fall to a lower curve.
Now,
here is an example. For example, if I
start here and I maintain that rule, I'll go up and to the right but if I
don't, I could fall even though I'm going to a higher curve from blue to red. Those are the same two curves as in the
previous slide. Nevertheless, I could
drop my sensitivity if I don't follow that instruction.
So
any statistically significant improvement in Az means an even greater relative
gain in sensitivity and one achieved without falling to a lower ROC curve if
the reader maintains that rule not to back off if the CAD fails to mark
something that he or she thought was actionable to begin with.
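
The reason the operating point can only move up and to the right under that rule can be shown with a small Python sketch. The quadrant ids and the sets of calls are hypothetical. The only point being illustrated is that when the aided calls are the unaided calls plus any accepted CAD marks, neither the true positive fraction nor the false positive fraction can decrease.

```python
def fractions(called, positives, negatives):
    """Return (true positive fraction, false positive fraction) for a set of calls."""
    return (
        len(called & positives) / len(positives),
        len(called & negatives) / len(negatives),
    )


def aided_calls(unaided, accepted_cad_marks):
    """Never-back-off rule: keep every unaided call and only add CAD-prompted ones."""
    return unaided | accepted_cad_marks


# Hypothetical quadrant ids, for illustration only.
positives = {1, 2, 3, 4}
negatives = {5, 6, 7, 8, 9, 10}
unaided = {1, 2, 5}
aided = aided_calls(unaided, accepted_cad_marks={3, 6})

tpf_before, fpf_before = fractions(unaided, positives, negatives)
tpf_after, fpf_after = fractions(aided, positives, negatives)
```

Here the true positive fraction goes from 0.5 to 0.75 while the false positive fraction also rises. Dropping an unaided call because the CAD did not mark it would break the superset relation and could lower sensitivity.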
Now,
another point. The real question for
judging the safety and effectiveness of any device is how does its introduction
into general use compare to what we have today where it doesn't exist? The same question applies to a CAD.
Can
we infer from the fact that there was an improvement in the average user
performance measured in terms of Az in a clinical study that the average user
will improve his or her performance, again measured in terms of Az, with the
CAD in clinical practice?
That
is, improve over his or her current clinical performance which is in the
absence of any CAD for miles around. To
put it another way, is the unaided reading in a clinical study a good surrogate
for current CADless clinical practice?
What
I'm showing here is let's suppose it is a good surrogate. Current clinical practice may have a CADless
reading Az somewhere here along some Az scale.
In a clinical study if the unaided reading is a good surrogate for that,
then the fact that the aided reading is higher than the unaided reading, then
the aided reading is also higher than current CADless clinical practice.
But,
for example, in actual clinical practice with CAD, that is, in the future, the
unaided Az could be lowered potentially by failure to read first as one would
normally read. That is, with adequate
vigilance. If this were to happen, then
the aided Az could also be lower than the current CADless practice. And to show that in a diagram, in other
words, if this unaided reading had an Az that was significantly lower than the Az
in current clinical practice, it could pull down with it the aided reading so
that it was below your current practice.
In
order to avoid that, well, what would be the implications of such lowering of
vigilance for judging the safety and effectiveness of the CAD? Can labeling help prevent this? Labeling issues. Two rules if followed by CAD users in future clinical practice
with the CAD will help prevent missing more nodules than former reading without
a CAD.
The
first rule is an always rule. Always
read unaided first and as carefully as if you had no CAD. This would help keep the Az of the aided
reading higher than the Az of the current CADless reading. We can't make users of this follow these
instructions but we can guarantee that it's in the labeling.
Secondly,
the never rule. Never back off from
unaided judgment of actionability of a nodule if the CAD fails to mark it. This would prevent the sensitivity from
falling below that of current CADless sensitivity. That is, it would prevent the radiologist from missing more
rather than fewer nodules.
DR.
IBBOTT: Thank you. At this point Dr. Wagner has a short
presentation to make.
DR.
WAGNER: Yes. This is just a trivial comment but it is along the lines, I
think, of where Bill Sacks was going just a moment ago. The number .024 may sound small and he
showed how it may have a bigger impact than that small number sounds.
If
you look at an area under the ROC curve, here is the good stuff, .85. Here is the bad stuff, .15 or .12 or
whatever. That .024 is also the
correction improving the false negative piece, and all the inference that was
done on the area under the curve difference carries over, because it's just a difference from
one. Here is the curve and here is the
area under and here is the difference.
The difference is just one minus everything else we've been discussing
today.
The
statistics of one minus something are the same as the statistics of that
something, so the gain in area under the curve is also the reduction in false negatives,
with all the statistics in there, averaged over all the false positive rates, so
that is another interpretation of that.
So
.024 may not sound like a lot compared to .86 or something like that, but it
also is to be compared to .15 or .14 or whatever the missing piece is. All the statistics if you consider them
tight for the previous part, it's the same statistics. I don't know if that helps but .024 looks a
lot better compared to .12 or .13 than it does to .85. That is statistically robust. Thank you.
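
The comparison being made here is simple arithmetic and can be written out in Python. The pre-CAD Az of .881 is the jackknife figure quoted earlier in the presentation. Using it as the baseline for this comparison is an assumption made for illustration, since the speaker quotes only round numbers.

```python
gain = 0.024      # change in Az quoted in the presentation
pre_az = 0.881    # pre-CAD Az from the jackknife analysis

missing_area = 1 - pre_az                # the "bad stuff" above the curve
gain_vs_area = gain / pre_az             # relative to the area under the curve
gain_vs_missing = gain / missing_area    # relative to the area still missing
```

With these inputs the gain is roughly 3 percent of the area under the curve but roughly 20 percent of the area that was missing.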
DR.
IBBOTT: Thank you. Before we go onto the lead reviewers, I'll
take a moment to see if people have any questions of these recent speakers,
particularly questions in the nature of clarification again before we get onto
the real discussion later.
Yes,
Elizabeth.
DR.
KRUPINSKI: Can you clarify or explain,
without getting into all the gory mathematical details, how, when you go from looking
at quadrants, the Az is then based on the patient?
For example, suppose you've got your four quadrants and you've got a
true positive here, a false negative here, a false positive here, and a true
negative here as quadrants. When you
then go to Az on a patient, is that patient true positive, false negative,
false positive, or some weighted combination there? Anybody who knows the answer.
MR.
MILLER: I think it's a quick
answer. So when you compute the Az all
four quadrants are in there for computing the Az but that is to compute it
originally. Then when you do the
jackknife you pull out all four quadrants so, therefore, the jackknifed Az,
which is the unit of analysis for the ANOVA, is based on the case because you
pulled out the four quadrants together.
But
when you compute the Az, you do have each of the four quadrant ratings compared
against each of the four quadrant truths.
This is discussed in Nancy Obuchowski's paper from Biometrics in
'95. I ran those programs as well as
the ones that we did just to make sure that we were getting the same estimates.
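
The point of this answer, that every quadrant rating enters the area estimate while the jackknife removes a case's four quadrants together, can be sketched in Python. The ratings are hypothetical, and the empirical (Mann-Whitney) area is swapped in for the Az estimate purely for illustration. This is not the sponsor's program or the program from the paper mentioned.

```python
def empirical_auc(pairs):
    """Mann-Whitney estimate of the area under the ROC curve.

    pairs is a list of (rating, truth) tuples, one per quadrant.
    """
    pos = [r for r, t in pairs if t]
    neg = [r for r, t in pairs if not t]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def all_quadrants(cases):
    """Flatten a case -> quadrants mapping into one list of (rating, truth)."""
    return [pair for quadrants in cases.values() for pair in quadrants]


def leave_one_case_out_auc(cases):
    """Recompute the area with all four quadrants of one case removed at a time."""
    return {
        left_out: empirical_auc(
            all_quadrants({k: v for k, v in cases.items() if k != left_out})
        )
        for left_out in cases
    }


# Hypothetical ratings (0-100) and truths for three cases, four quadrants each.
cases = {
    "A": [(90, 1), (10, 0), (20, 0), (30, 0)],
    "B": [(80, 1), (40, 1), (15, 0), (25, 0)],
    "C": [(5, 0), (35, 0), (60, 0), (70, 1)],
}
full_auc = empirical_auc(all_quadrants(cases))
jackknifed = leave_one_case_out_auc(cases)
```

The full estimate uses all twelve quadrant decisions, and each jackknifed estimate uses eight, so the unit that is left out is the case rather than the quadrant.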
DR.
KRUPINSKI: So all decisions are
preserved basically.
MR.
MILLER: All decisions are preserved.
DR.
IBBOTT: All right. Then we will now have some brief
presentations by the panel's lead clinical reviewer, Dr. David Stark, and the
lead statistical reviewer, Dr. Brent Blumenstein.
Dr.
Stark.
DR.
STARK: Thank you. I would like to begin by congratulating the
applicant, the industry in general, the FDA staff, and the panelists. This discussion and the record of it, I
think, documents substantial progress that's been made in the methodology for
research and product development in this field of computer-aided diagnosis or
detection and, frankly, the verification of these results so that we can
apportion resources responsibly and regulate and improve overall quality of
clinical care.
As
is noted in what I've read from the FDA's notes from last year, in particular,
this application, this issue, is a very prodigious task and the technology is
really more similar to putting Spirit and Opportunity on Mars than to
most other things that we clinicians face or have historically faced in our
training in how to decide how to care for patients.
Nothing
really could be further from the way a surgeon decides whether they are going
to start doing laparoscopic cholecystectomies without any review or oversight as
opposed to exploratory laparotomies.
I'm a little bit concerned about the fastidiousness and the zeal with
which we are putting -- are obsessed
with technology. I'm obsessed with
technology.
Some
of the panelists here are devotees of little PDAs like I am and things but there
are many red herrings here. There are
many unintended consequences and this is an extremely important task we have in
front of us, not to move too quickly and not to move too slowly.
I
just want to remind everybody that with those two space landers, spaceships
that we've been following with our families and children, the unintended
consequences for something that is manmade and simply mechanical like an
overheated solar panel or a flash memory that's choked with data, this is more
complicated than the Challenger accident or putting Spirit and Opportunity on
Mars in my humble submission.
That
is because we are limited. We chose to
go into biology but there are enormous biological variations here. Even as a group I doubt we have the
collective wisdom and strength to recognize all of them given the time that
we've had.
There
are numerous coincidental clinical issues and this panel has focused largely on
what I believe is a red herring of the statistic of Az. I implore people not to think that just
because we can launch a bottle rocket we can reach the moon.
My
own papers which I have cited to the committee have shown a larger and more
convincing increase in the detection of liver metastases with magnetic
resonance imaging using exactly this same methodology with some of the same
authors and we were wrong because of some of the issues that have been raised
here today.
A
statistically significant phenomenon in the laboratory with all of these little
nuisances tearing and pulling at it, and these are only a fraction of the
issues, can give us the enthusiasm that we can reach the moon but there will be
problems with insulation flying off rockets and things like that. The unintended consequences are what I'm
concerned about as we talk about safety and efficacy.
First
about some of the red herrings. The
burden of the 300 scans, it is a burden to have to look at so many images but
that's a bit of a myth. One of the ways
that we have improved our efficiency as radiologists in reading these images is
we no longer tile them.
We
melt through a stack with a trackball so the soft-copy reading to a certain
extent mitigates the number of scans.
It really doesn't matter if you have 50 or 500 to a large degree if you
are trackballing through a stack.
Furthermore,
this product doesn't address that issue because it really is largely asking the
radiologist to still do his conventional work and then add additional readings
to it, albeit slice by slice computer selected.
The
problem that we're here to face, to solve, and the industry is trying to
address, as the physicians are, is that we have a false negative rate in
detecting nodules in the lungs that is unacceptable to clinicians, to the
public, and to healthcare providers and those who fund it.
This
study population does not reflect the fact that the false negative rate of 24
percent is a number we've all been, I think, using by assent here today but
it's different depending on the study group, I'm sure. That might be plus or minus 10 percent, 15
percent easily.
But
that's 24 percent of one in 100 that's positive. The radiologist faces 99 negative scans for every one that he has
to find. Under these study conditions,
two out of three of the cases the radiologists faced were positive, and positive perhaps in
multiple quadrants.
The
false positive fraction, which is quite large here, is over a very large
denominator. 99 out of those 100 patients
who have truly negative scans will bear the burden of the false positives, the
patients and the radiologists caring for them.
The
question of efficacy can be as simple as if we assume the radiologist is
perfect and just plays by the rules and adds no false positives, then he has
the ability perhaps to improve his false negative rate which is embarrassing
but it's the state of the art of medicine from perhaps 24 percent or worse to
perhaps 10 percent better. Still
horribly embarrassing so we have some meager pickings and still an unacceptable
result, I think, from a final objective.
Nonetheless, a step in the right direction would be a step in that
direction.
The
false positives, though, while I believe we do have in one of the curves, I
think it's figure 11 on page 53 in one of the two studies, did show some
degradation of performance where the radiologist somehow managed to not be
perfect and to eliminate all the false positives which is unbelievable that
they can do that.
I
believe some radiologists are going to be induced to call things positive. It's just realistic that with another look,
when you are prompted to ask, you are going to cause some more false positives
and there is the possibility of degradation with scatter around the ROC curve. Whether they are due to
distraction or error, there will be, maybe unmeasurable, but as happens in
medicine, unnecessary biopsies.
Dr.
Castellino talked about the effect on treatment if you call five lesions
instead of three. Some surgeons will
say, "I won't operate on three pulmonary nodules and do a
metastasectomy. I will go to
chemotherapy." Or if there's
two. One pulmonary nodule, we'll excise
it out. If there's two and it's a false
positive, no chemotherapy. Unilateral
versus bilateral disease, no surgery.
So
the consequences of a mistake, a false positive, are huge because they add to
that minority, that one in 100. And
there are, of course, the complications of follow-up CT scans with or without
contrast. I'll get to contrast media
later. We haven't discussed it today
but one of the claims is that this is effective with or without contrast media
but we haven't seen, I believe, data on that point.
So
one of my concerns is the nonclinical circumstances in terms of the patient mix,
the circumstances of the readers. We
had at least one reader who read 90 cases in a day. There were more than one.
They may have been exceptionally strong readers but we know they weren't
reading under clinical conditions. They
were ignoring many of the things that radiologists are obligated to worry
about. Radiologists
are not limited in their obligation to work with this machine. They have to look at the neck and look at
the spine, look at the ribs, look at the chest wall, look at the abdomen, and
look at the adrenal glands, especially for lung cancer.
So
these radiologists in this study had a very, very narrow task in front of them
not even looking at the pulmonary vessels or the mediastinum. Not even looking at lymph nodes. They were just looking at airspace for
nodules abutting airspaces trying to match the technology.
This
technology forces the radiologist, in effect, to work for it even though we
insist the radiologist first do his own job.
He then has to come back, read in a skewed way, and correct the numerous
false positives, protect the patient from the numerous three per study false
positives that this technology causes.
Now,
I'm ignoring cost in this analysis.
I've been instructed that effectiveness here comes at any cost so I'll
leave that for other people to address, but there are still risks because
the radiologist has a certain amount of time and he is going to make mistakes.
The
fast readings may, as we've heard from the statisticians, make this very
small, though statistically significant,
increase in Az evaporate.
I submit that even a larger Az, my own papers have shown, is often not
clinically significant for the innumerable reasons that we've touched upon,
albeit quickly, because we've mostly spent our time on the red herring on ROC
methodology. Red herring for a decision
today, I believe, but extremely important for the future of this technology.
If
we do move on to the next phase and improve this product, I did not see that it
calls for significant training of the radiologists. I think the warnings that will be given to the radiologists are
limited and I think the temptation and the ability to misuse the product is
significant.
I
think that very significant discussion, or substantial interaction
with the FDA, is needed about what would be appropriate warnings and training and,
importantly, post-market surveillance to see how this actually performs with
realistic clinical readings, not in the unrealistic setting that here was
designed to feed an ROC study.
These
radiologists who were safe here were diligent, paid, and focused on eliminating
this false positive rate. They did not
have to deal with coincident chronic obstructive pulmonary disease, artifacts
from patients having the arms by their side, or
contrast agent given in large boluses, which can cause artifacts and change
the appearance of the blood vessels throughout the lungs.
We
didn't discuss how the algorithm operates.
It sounds to me much like it does not use a maximum intensity
projection. It does not identify the
vessels per se. It's really looking at
ovoid intrusions on airspace. Product
development is not -- I don't have enough information to comment further.
Let
me see if I have more from my notes and I'll try to wrap up. Well, I've been asked to state my views and
I hope it's clear that I am sincerely impressed with the progress that has been
made but I think this is an extremely ambiguous and complex project and I am
really worried about the real world pressures on the radiologists in that I
don't think -- I do not believe that we have shown that we have effectively --
that we have demonstrated effectiveness in that -- effectiveness can come in
two ways.
Either
improving our accuracy and assuming that we show that we do not increase the
false positive rate and that we can effect significantly in a clinical setting
the 24 percent false negative rate for real lesions. I think there is evidence there that it's going in the right
direction but I really am not persuaded that we are looking at much more than a
statistical trend that because of the way the study was conducted statistically
reached significance for the ROC.
The
other way to reach effectiveness would be to improve the efficiency of the
radiologist working so the radiologist would have time to read more
carefully. I really do believe that a
careful re-read or a second read of these scans might be more effective,
accurate, and efficient than the use of this modality.
I
believe that we need a placebo study.
There is no placebo study where we see the effect of simply introducing
the random false positives in a population that is 99 percent negative and see
if we do any better at finding that one in 100 who has a true positive.
I
believe this is such a statistically based application and we have such a
skewed set of circumstances for collecting the data, the data set that we
looked at, the way the examinations were done and the very narrow statistical
analysis that was done that I do think we have to look at the history of the
ROC which is unproven that a p-value for an ROC should justify as proof
sufficient effectiveness for FDA approval.
And
in terms of safety ignoring cost, I think that we have seen in at least one of
the graphs provided that there is possible degradation. We have an intuitive understanding that
there is possible degradation and I have no doubt this product will help some
patients but I think it may hurt others in direct and indirect ways. I, myself, would recommend -- I, myself,
would -- I think I'm supposed to say what I think and I think that -- I would
say that I would not think that this is at this point ready for approval. If
the panel disagrees with me and it is approved, I would have numerous comments
about the labeling that we see in the proposed commercial materials. If I'm supposed to, I would be happy -- I've
made some notes on that and I could comment here or leave that for later.
DR.
IBBOTT: I think we get to that later.
DR.
STARK: Okay. Thank you. Thank you,
everybody.
DR.
IBBOTT: Good. Thank you.
Dr.
Blumenstein.
DR.
BLUMENSTEIN: Amazing. It worked.
I wanted to say a few words about my thoughts on the statistical
concerns, some of which you've heard already a bit this morning.
First
of all, I want to say that it appears that the sponsors have done a really
excellent study according to today's standards. Nonetheless, I can't escape concerns about the success and impact
of the device. These concerns are
related simply to assessing the significance of it. Most of the concerns that I have are rooted
in the unique features of the study design rather than the methodology that I
think has come to be accepted and used in this area. In other words,
there are unique features of this study design that may make this
difficult. I'm not concerned about the
general statistical methodology and, in particular, the resampling part of it,
but I do have concerns about whether all the important features of this study
have been taken into account in the resampling methodologies. Let me explain that.
The
first major class of discomfort I have is the accuracy of the measures of
success. In particular, its
translation to the clinical measures of success. In particular, we see no measures of uncertainty for the clinical
measures.
In
other words, the Az measures something about device performance and not clinical
performance. While we have been given
some indications of clinical performance by showing ROC curves, little arrows,
and performance points and so forth, we don't have any measures of uncertainty
with respect to those clinical measures.
I'm
concerned about the sampling for the cases that were included in this
study. They were artificially
sampled. Population prevalence is
likely not reflected in the data set that was analyzed and, therefore, it's
difficult to assess a clinical impact of these results without some kind of an
assumed prevalence. This is just sort
of fundamental in any kind of a diagnostic evaluation. I'm not sure this could be avoided. I'm not sure how to deal with it but it does
leave me with some concerns.
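The prevalence point can be made concrete with a small sketch. The sensitivity and specificity below are illustrative assumptions, not figures from the PMA; the sketch only shows how the same reader performance yields very different predictive values once a prevalence is assumed.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a binary test using Bayes' rule.

    All three inputs are proportions between 0 and 1.
    """
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    false_neg = (1.0 - sensitivity) * prevalence
    true_neg = specificity * (1.0 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


if __name__ == "__main__":
    # Hypothetical operating point; only the prevalence is varied.
    for prevalence in (0.01, 0.20, 0.50):
        ppv, npv = predictive_values(0.80, 0.90, prevalence)
        print(f"prevalence={prevalence:.2f}  PPV={ppv:.3f}  NPV={npv:.3f}")
```

Under these assumed values, a positive call in a 1-in-100 population is far less likely to be a true positive than the same call in an enriched study sample, which is why an assumed prevalence is needed before the study results can be read clinically.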
Perhaps
one of my major concerns is this, that there is a correlation structure having
to do with this quadrant implementation which was some kind of a partial
localization methodology. I'm concerned
that the correlation -- well, add a parentheses here.
That
the correlation between the upper and lower quadrants on the right lung, that
is the results from these quadrants, is likely to be larger than the
correlation between, say, the upper right and the upper left quadrants in the
same patient. In other words, there's more
correlation within a lung than there is between quadrants of opposite lung.
I
didn't see that the computations took this into account in any explicit way. I'm not sure how you would. I'm just expressing a concern here. There's a lack of complete understanding of
the methods used to analyze this kind of a partial localization maneuver to get
to these quadrants.
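One common way to respect correlation among quadrants from the same patient is to resample whole patients rather than individual quadrants. The sketch below is an illustration of that general idea only, not the sponsor's method; the data generator, function names, and parameter values are all hypothetical. It also treats every quadrant within a patient alike, so it does not model the stronger within-lung correlation described above.

```python
import random


def auc(scores_pos, scores_neg):
    """Mann-Whitney estimate of the area under the ROC curve."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


def auc_from_quadrants(quadrants):
    """quadrants: list of (truth, score) pairs, where truth is 0 or 1."""
    pos = [s for t, s in quadrants if t == 1]
    neg = [s for t, s in quadrants if t == 0]
    if not pos or not neg:
        return None  # AUC is undefined without both classes
    return auc(pos, neg)


def patient_bootstrap_auc(patients, n_boot=1000, seed=0):
    """Resample patients with replacement, keeping each patient's quadrants together."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_boot):
        sample = [rng.choice(patients) for _ in patients]
        pooled = [q for patient in sample for q in patient]
        value = auc_from_quadrants(pooled)
        if value is not None:
            values.append(value)
    return values


def make_hypothetical_patients(n_patients=40, seed=1):
    """Synthetic data: a shared per-patient shift makes quadrant scores correlated."""
    rng = random.Random(seed)
    patients = []
    for _ in range(n_patients):
        shift = rng.gauss(0.0, 0.5)
        quadrants = []
        for _ in range(4):
            truth = 1 if rng.random() < 0.3 else 0
            score = 1.5 * truth + shift + rng.gauss(0.0, 1.0)
            quadrants.append((truth, score))
        patients.append(quadrants)
    return patients


if __name__ == "__main__":
    values = sorted(patient_bootstrap_auc(make_hypothetical_patients()))
    lower = values[int(0.025 * len(values))]
    upper = values[int(0.975 * len(values)) - 1]
    print(f"approximate 95% percentile interval for AUC: {lower:.3f} to {upper:.3f}")
```

Resampling at the patient level generally widens the interval relative to treating quadrants as independent, which is the direction of the concern raised here.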
I'm
also concerned about whether the panel, the expert panel, had knowledge of the
patient's identity -- I assume that they did.
I don't see any evidence otherwise -- so that when they were making a
judgment as to the status of the quadrants within a patient, that the results
of one quadrant may have led them to feel differently about the results in the
other quadrants as they were looking at these things. I don't see that taken into account. I'm not sure how to do it and so on. I'm just concerned about it.
Then
I'm concerned about the incremental structure of the study. The instructions to the readers were
definitely additive. In other words,
they were supposed to use traditional methods and then add the CAD. The computations apparently didn't take into
account the correlation between methods.
That is, this is a correlation between methods, not the correlation
between quadrants of the lung and I didn't see that.
I'm
not sure this makes a difference but I'm left with a feeling that it should
make a difference and it should have been taken into account because the
computational methods of ROC curves and comparing areas of ROC curves and so
forth seem to be based on having done independent assessments of the two
methods. Therefore, I'm left to
wondering whether the p-value would be different had the correlation between
methods been taken into account.
And,
finally, in this area of concern is the intra-reader variability. The experiment didn't measure intra-reader
variability by giving a given reader multiple opportunities to read an image
from the same patient and, therefore, you don't know how that reader is going to
perform. How much variability there is
going to be from seeing that same patient over and over and you would want to
do that in a way that they wouldn't know it was the same patient if you
separated in time and so forth.
But
how much would a measure of intra-reader variability modify the p-value
associated with Az? I don't know. I was trying to get at that this morning and
apparently there's not much understanding of that yet. But my intuition is that the intra-reader
variability would be particularly important in the computations of variability
for clinical measures. It kind of goes
like this, that the artificial scaling of measuring on a probability scale, or
however you do it, in order to be able to use ROC methodology depends on
assumptions about the performance of the reader with respect to their consistency
over use. Yet, the clinical measures
that depend on that ROC don't take that into account so I'm left with a bunch
of concerns about whether had intra-reader variability been taken into account
whether we might be seeing different results.
Then
I have just another concern or two, this business about truth. I think it's important to note that the
statistical methods absolutely depend on a definition of truth, but I feel that
the sponsor did the best that can be done.
I have no criticism of that.
But
it's important to realize that the results are conditional on acceptance of the
definition of truth as they got from this panel. Then what was going on was that they degenerated truth and I
found that really weird. I couldn't
think of a better word for it.
Sorry. I wondered why the impact
on the variations in readings couldn't have been done.
For
example, I would have liked to have seen some co-variate analyses or some
sampling of quadrants or sides of the lung.
I don't know how to do these.
I'm just throwing these up. I
hope some statistics students are listening.
Maybe it's an area of methodologic research.
The
readers are using readers with smaller or larger areas. This could be like a covariate or readers
with more or less experience. Or what I
think is particularly promising is maybe you perturb some of the thresholds
that individual readers are doing and this might be some kind of a Bayesian
analysis whereby you throw in some kind of a distribution of thresholds getting
back at that intra-reader variability.
At
any rate, I'm of mixed mind. I'm trying
to be here but I think I'm here. Where
I can read that, it says, "I am a bomb technician. If you see me running, try to keep up."
DR.
IBBOTT: Thank you, Dr. Blumenstein.
All
right. At this point we will see the
questions that the FDA is going to ask of the panel. We will take a break shortly after that. When we come back we'll consider those
questions. I believe Dr. Sacks is going
to project those questions. When we
come back from our break, the first thing we will do is address the questions
to the sponsor and hear your responses.
DR.
SACKS: Okay. I'll go through these slowly but I think you have printed copies
of them. This is more for the audience.
First,
please discuss whether the data in the PMA support the conclusion that the CAD
can reduce observational errors by helping to identify overlooked actionable
lung nodules on chest CTs. In
particular, given that use of the CAD produced a statistically significant
improvement in ROC performance, please discuss whether:
(A)
The use of an expert panel is appropriate for determining actionable nodules
given that a tissue gold standard is not feasible.
(B)
Actionable nodules are a reasonable target for a lung CT CAD to be judged safe
and effective.
(C)
The achieved gain in ROC performance in terms of the area under the curve
demonstrates safety and effectiveness of the CAD.
Second,
please discuss whether the labeling of this device including the indications
for use is appropriate based on the data provided in the PMA.
Third,
please discuss whether the sponsor's proposed training plan for radiologists is
adequate. If not, what other training
would you recommend?
Four,
if the PMA were to be approved, please discuss whether the above or any other
issues not fully addressed in the PMA, (A) require post-market surveillance
measures in addition to the customary medical device reporting, etc., and (B)
suggest a need for a post-approval study.
DR.
IBBOTT: Thank you, Dr. Sacks.
All
right. We will take a 15-minute break
and we'll reconvene at 10 minutes to 3:00.
Thank you.
(Whereupon,
at 2:36 p.m. off the record until 2:55 p.m.)
DR.
IBBOTT: Thank you. We'll continue now with the discussion and
we are going to go straight to a response from the sponsor to the questions
that were raised before lunch. I
believe Dr. MacMahon is going to start with that response.
DR.
MacMAHON: Thank you. Again, I'm Heber MacMahon from the
University of Chicago. I would just
like to start out by making a few points that may clarify some of the issues
that have been raised. Let me start by
just mentioning a few smaller issues that have received a lot of attention.
Briefly,
the question of the placebo effect. Dr.
Stark has raised the question whether the need for the observer to review the
case a second time after being prompted by the CAD may have actually improved
performance because anytime there's a second read, there's reason to believe
that additional nodules may be noticed.
However,
I think it's worth emphasizing that the average false positive rate of this
system is three per entire examination.
We're talking about examinations with up to several hundred
sections. What the observer does in
those situations is not re-read the entire study but go directly to those
sections on which he or she is prompted an average of three sections and just
look at that particular mark and decide is that a nodule or not.
I
would suggest that the opportunity for picking up additional true positives in
that situation is really pretty small if one looks at the number of sections
and the number of false positives with this system. I would just like to make that comment.
But the larger issue I would like to talk
about, and I think it touches on all of the questions that have been raised, is
why is the difference in Az so small in this experiment? I think there is a sense of disappointment
with what looks like a very strong CAD detection system. We didn't see a larger improvement.
I
would suggest if we had a larger improvement that a lot of the questions about
the statistical methodology and the design of the experiment would become moot
because it would become apparent that such a large improvement could not be
accounted for by some of these issues.
We
have had a discussion, and both Dr. Wagner and Dr. Stark mentioned, in observer
performance tests it's not like real life.
I can attest to this. I've
conducted several observer performance tests myself, mostly related to digital
chest radiography and image processing.
I have to say if I had conducted my experiments in this way, I don't
think I would have achieved even statistical significance in most cases and I
probably wouldn't be here now.
Let
me explain why. There are a number of
factors going on in an observer test.
We've already heard how the observers are working in an undisturbed
environment. They are highly
motivated. They are highly
vigilant. These are radiologists.
We
are sitting them down and we are saying, "All you have to do is find
nodules. We are going to measure your
performance and see how good you are.
You don't have to look at the mediastinum. You don't have to look at the pleurae.
You
don't have to consider interstitial lung disease. We are not going to disturb you.
The telephone is not going to go off.
The technologist isn't going to tell you there is a patient on the table
for a biopsy. A clinician will not stop
by and ask you to look for a study."
This is an ideal reading environment.
For
these and many reasons, the performance in an observer test is extremely
high. Basically observers do not miss
obvious abnormalities by and large in an observer test. But we know from our own experience and from
studies that have been done that radiologists miss relatively obvious
abnormalities all the time every day.
That
is actually the issue we are trying to address and that is the difficulty in
trying to extrapolate from an observer test to clinical practice. I would put it to you that these observers
were working on an extremely high level.
If we look at the average Az before CAD, the average was 0.88. Some of the observers were over .9.
In
my experience when you start at this level in your unaided situation whether
it's some kind of image processing or energy subtraction or whatever, it is
very difficult to show a substantial improvement. There is not a lot of room left for improvement when the
observers are right up there. The
situation where we do see a large improvement is when they start out at a lower level
missing a lot of abnormalities that then they can pick up in the second reader
situation.
So
what happened here and why did they perform so well? Not only the observer situation but the selection of the
cases. In many observer tests and the
ones that we quoted in the literature that show a large difference, we tend to
go to difficult cases because we know it's only in those difficult cases that
our CAD or whatever will make a difference so we go to selected cases, perhaps
cases that were missed on the original reading, or perhaps a panel goes through
and selects chest radiographs that have subtle nodules.
This
is a very well accepted way of doing it because we know in those kinds of cases
whatever the modality is, it is likely to make an impact. Although these are selected cases, we know it is usually
impractical to take a random selection of the whole population and expect that
there will be enough of those subtle cases for the difference to be
statistically significant. We do some
kind of selection in most cases.
However,
here although the cases were selected for having a high probability of nodules,
and there was a high incidence of nodules there, they were not selected for
having subtle abnormalities. We have to
assume that most of the nodules were easy.
Most of the observers detected them and, therefore, there was no
opportunity for the CAD to show an improvement.
So
I think that this is a critical point and to me this explains why that apparent
improvement is small in the observer test.
I strongly believe that if this kind of a system were implemented in
clinical practice where we were subject to these various distractions where
obvious abnormalities are missed, there would be a much larger improvement and
this would be a useful clinical system.
In
that regard, even if the amount of improvement shown in the observer test were
going to be the amount of improvement in clinical practice, I would say in my
own practice where I encounter a high proportion of patients with pulmonary
nodules, certainly larger than 1 in 100, I don't know exactly what the number
is but I would say up to half of all the CT scans that I read have either a
nodule or a question of a pulmonary nodule.
This
is a very pervasive issue that affects almost every CT scan we read. In some screening studies the incidence of
nodules has been over 20 percent.
Indeed, the incidence of even cancer in some screening studies has been
up to 2.7 percent in the initial prevalence screen so nodules are not rare
abnormalities.
If
I can reduce my miss rate by 15 percent or anything in that area, I would be
very happy because that is going to benefit a lot of my patients. I'm going to see a benefit multiple times,
probably at least once a day. I would
say throughout the whole country the magnitude of that improvement is not at all
meager or insubstantial.
On
that point I would like to hand off to Dr. Castellino who has some more
comments.
DR.
CASTELLINO: I just have a few. I think Dr. MacMahon has addressed some of
the issues that I was going to talk about but he certainly can do it better and
with more authority.
I
would like to clear up the issue of consecutive cases. These were consecutive cases. They were not selected for nodules. It turned out that the practice where we got
them from had a distribution of cases by report with nodules and cases by report
without nodules so there was no selection whatsoever.
I
would agree that I guess it depends on your practice but if you're in a
standard community hospital or hospital setting of some nature, the number of
nodules you see on routinely performed CT scans every day on a variety of
patients, many of which, by the way, happen to be oncology patients,
out-patient or in-patient, is high.
There
was a comment made something like, "We don't want to have the radiologist
work for CAD." I agree. We don't want to have the radiologist work
for CAD. In fact, I don't think the
radiologists do work for CAD if the nature of our product is correctly
understood.
The
only additional work that is required by the radiologist is to go back and
review those several slices, two, three, four, five, whatever it might be, look
at the circle on the image and determine if it's a true positive or false
positive study.
Now,
we have not quantified how much additional time that would take but it probably
takes in the order of anywhere from -- if there are no marks, of course, it
would take no seconds to maybe 15 or 20 seconds. There may be a nodule that is pointed out that the radiologist
has to think about and make a clinical decision.
That
often takes time but that's perfectly fine.
That's the whole point of the product is to get the radiologist's
attention directed at something that may be important and then to tease it out
and decide what has to be done.
I
would echo the fact that a 20 percent or 30 percent reduction in nodules that
are missed might represent only a five percent increase in the nodules that I
detect. I personally think that is a
very substantial improvement in my performance. That is a very important issue.
I
think that I would like to say that perhaps when residents of radiology finish
four years of training and they go on to a year of fellowship, if we can improve their performance in
their subspecialty by five or seven percent compared to their general radiology
training, that is probably a significant improvement. I don't denigrate the number
whatsoever. In fact, I think it's an
important number in clinical practice.
There
was a comment that the radiologists may not follow the rules. I think it's an important comment. We don't expect that to be the case. Certainly when we introduced a breast CAD
product, as far as we could tell they were following the rules pretty
assiduously.
Certainly
for the masses which the code is not anywhere nearly as perfect as it could be
or as robust as it could be. But I
think that is probably true of any device you have to consider. Often if there are physicians out there that
will use the device incorrectly, I don't know how you address that but certainly
that is not the point of how our device is supposed to be used.
Lastly,
I think it's important to note that should we gain approval and if there are
post-market follow-up studies that are recommended, that should be done to
further investigate the performance in the real world. We obviously would discuss this with the FDA
and would be very happy to do that.
Thank you.
DR.
IBBOTT: Very good. Thank you.
We
are now going to consider the questions and also the panel's questions
regarding the presentation. What I
would like to do given the time is ask that the first question be projected again. I would like to ask the panel to consider
the questions one at a time -- we have the four questions and these have been
distributed to us -- and use this as our opportunity to ask further questions
of the sponsor as they are relevant to the questions that we've been asked to
consider by the FDA.
While
Dr. Phillips is getting that up, I'll remind you the first question is to
discuss whether the data in the PMA support the conclusion that the CAD can
reduce observational errors by helping to identify overlooked actionable lung
nodules on chest CTs.
In
particular, given that the use of the CAD produced a statistically significant
improvement in ROC performance, please discuss whether (A) the use of an expert
panel is appropriate for determining actionable nodules given that a tissue
gold standard is not feasible.
I
would like to invite the panel now to discuss this question. I throw it open to anyone who would like to
lead off.
DR.
KRUPINSKI: Offhand I would say yes, it
is. I mean, I don't see really that
many other ways to do it and I think the analysis where they broke down and
showed the different ways of doing it, leave one of the observers out, put them
back in. I honestly don't think there
is any other way at this point in time that you could get at some other truth
than using an expert panel. I think it
was appropriate.
DR.
IBBOTT: I would be interested in
knowing from the radiologists on the panel and other people in the room if the
variability among the panel that developed the reference standard if that sort
of variability seems to you to be typical.
I know radiologists don't agree with each other 100 percent of the
time. I'm not naive but I do want to
know if the sort of differences that we're seeing here if you believe those are
representative.
DR.
CONANT: On a good day or a bad
day? I think definitely. I think they did a very eloquent job of
creating the expert panel and coming up with really the best situation possible
in this case.
DR.
TRIPURANENI: I echo the same comments. As a clinician that is the vagaries of the
clinical practice and I think what they defined as actionable nodule I think is
probably the best that we can do today.
DR.
IBBOTT: There seems to be agreement
then.
DR.
STARK: One question along those lines,
Mr. Chairman, is that making the most benevolent presumption, I mean, that on
its face it looks like they've done absolutely everything that could be done
but this is a very, very complicated business of selecting images. All sorts of selection biases, even
selecting in the institution and the CAT scanners. There can always be more information on this that I think the FDA
should consider.
We
have an able group of FDA staffers and so I think how these patients were
selected, the institutions that were selected, why not more scans from certain
institutions that are clearly generating them if these were consecutively
obtained.
I
think for any further studies whether they are done for a PMA revision or
post-market surveillance, I think more information on why this number of exams
from these institutions. I think it
should be offered because it will lead to more questions which if nothing else
will advance the science that has already become quite sophisticated here.
The
other thing is that I do believe that we should learn whether the truth in this
case that we are all saying was reasonable when these cases were gathered two
years ago, did it work out that way because we know more, or the industry, the
applicant knows more about these patients today. I'm very keen to know whether these people with nodules have had
follow-up. These actionable nodules we
have proof on.
I
don't know if I missed it but I would be keen to know how many of these were
reasonably deemed actionable but turned out to be benign and did not change and
did not require treatment and how many that were considered not actionable
turned out to be cancer.
That's
not only important for this product but for its post-market surveillance and
the development of new algorithms for improved products in the future.
DR.
IBBOTT: Are you asking the sponsor that
question?
DR.
STARK: If I'm permitted. I really would like to know what using a
different -- using the real world clinical definition how many of the
actionable nodules were actionable and vice versa.
DR.
IBBOTT: Yes, Dr. O'Shaughnessy.
DR.
O'SHAUGHNESSY: Yeah, I think that's a
very good point. Basically we designed
the protocol in consultation with FDA. We
identified who the people were that would qualify for inclusion in the
study. Because of both IRB and other
issues, the sponsor is blinded to who the patients are and the follow-up, we
collected a prespecified certain amount of information.
If
necessary, we can work with FDA to determine if it's possible and if we could
go forward to find out what happened with these patients. Again, they were collected with a certain
concept in mind to do the study.
DR.
STARK: Thank you.
DR.
IBBOTT: All right. Are we ready to go on to the next
question? Okay. The next question asks us whether actionable
nodules are a reasonable target for a lung CT CAD to be judged safe and
effective.
DR.
KRUPINSKI: Again, I would say it's
reasonable given the caveat that Dr. Stark brought up. If you could follow up on these and find out
if they truly were actionable versus not, that would certainly be a benefit. I think that is the most reasonable thing
for it to be looking at.
DR.
IBBOTT: Yes, Prabhakar.
DR.
TRIPURANENI: It's interesting. It all depends on how you define safety and
efficacy. I think Dr. Stark called it
on this one. As a clinician, to me the
effectiveness is ultimately related to whether it has any clinical
impact. To me, it's really up to the
management of the patient and ultimately what he's going to do.
I
really can't answer this question at this point in time because I just don't
have enough information to say that it is actually effective at this point in
time. Yes, the statistics and you
picked up a few extra nodules but I really would like to see the clinical
data. I do understand that's not how
the protocol was designed but I strongly recommend that we really need to look
at the information, what is the ultimate clinical impact and the clinical
significance.
As
far as the safety is concerned, I think Dr. Bill Sacks already raised this
question. I think it keeps bothering me
that even though once the product is approved, if the product is approved, when
it goes into the real world, it's quite possible that there may be somebody
might actually get a little slack and actually not use the proper methodology
that was recommended. That is, read the
whole CT scan unaided first, followed by using the CAD system. If the system is used as it is actually
described, I think it is actually safe.
But,
on the other hand, I keep thinking whether there is a way you can actually come
back and make sure that the people do it the way they are supposed to do but I
can't think of any other way. I don't
have an answer. I'm just raising the
question. If somebody is not going to
use the system as it is supposed to be used, could it be potentially
unsafe? I don't know the answer.
I
actually have a question for the sponsor.
In your 90 patients, 43 patients had nodules. Were there any instances where the radiologist unaided picked up
the nodule but the CAD missed the nodule completely?
DR.
CASTELLINO: I don't have the number for
that but the answer is, of course, it did.
The CAD system is not 100 percent sensitive unfortunately. In fact, it doesn't mark a certain set of
nodules that the radiologist clearly sees.
That's why it is really viewed as an adjunctive review.
To
sort of get at the prior comment, which I think is a very good one, let me
remind everybody how the radiologist looks at the CT scan of the chest. We give it at least two passes of the entire
image, maybe three. One is what we call
mediastinal or soft tissue windows looking for abnormalities in the mediastinal
chest wall, etc.
One
perhaps is bone windows. Sometimes is,
sometimes not. And one is at lung
windows. At the lung windows we can see
abnormalities within lung parenchyma.
Now, as we look through those 100, 250 images in a melt-through fashion,
cine fashion, I don't know of any radiologist who looks through the entire
data set saying, "I'm looking for nodules.
I'm
looking for airspace disease. I'm
looking for bronchial wall abnormalities.
I'm looking for emphysema."
I'm looking for this, looking for this, and looking for this. But instead we look at the lung images
globally and we see if there are any features within the lung parenchyma that
shouldn't be there.
Nodules,
infiltrates, pulmonary infarction, etc., etc.
Just that alone means that the radiologist has to look at every lung
image either individually or sequentially in some sort of more efficient
mode. No. 1 is that that's how it has
to be used. In the process a
radiologist will detect nodules.
Secondly,
the radiologist knows that it's not going to detect all nodules. If it ever got to the point of 100 percent
sensitivity, they could use it only the first time as the first reader. We are a long ways away from that. But they still would have to look at all of
the lung images to see everything else.
I hope I have answered that question.
DR.
TRIPURANENI: I guess as humans we are
good at pattern recognition. That's
what I do. Even though I'm a
radiation oncologist I keep looking at the CTs and all those things and
we are good at recognizing patterns. I
guess the computer is not quite there yet.
I
have another question which is the flip side of the other one. How many patients did the radiologist
actually say there are no nodules unaided?
What percent of those patients did the computer actually say there is a
real nodule, that the CAD really helped them to turn the negative nodule
patient into a positive nodule patient?
MR.
MILLER: I think I'm probably the one
with your answer but I didn't quite get it.
Would you mind repeating? I
think you're looking for a fraction but what's the numerator and what's the
denominator?
DR.
TRIPURANENI: What I'm looking for is if
the radiologist read the scan and he basically said there are no nodules in any
of the four quadrants.
MR.
MILLER: Yes.
DR.
TRIPURANENI: And when you use the CAD
what percentage of those patients were turned into positive nodule patients?
MR.
MILLER: Right. That's this issue of the percent reduction
in misses, I think. In order to answer
the question, you have to make assumptions about what an individual reader's
true threshold would be. We really
can't do that. We can speculate at what
the number would be if everybody's true threshold was 20.
If
everybody's true threshold was 20, then they missed things on the first read 16
percent of the time, then on the second read only 11 percent of the time and
that's a 30 percent reduction. If their
missed threshold was 80, then it's a different number that I don't have at my
fingertips. Is that answering your
question?
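[Editor's illustration, not part of the spoken record: the arithmetic behind the "30 percent reduction" quoted above can be sketched as follows. The 16 percent and 11 percent miss rates are the figures stated by Mr. Miller for the assumed threshold of 20; the function and variable names are hypothetical and chosen only for this sketch.]

```python
def relative_reduction(before: float, after: float) -> float:
    """Return the fractional reduction when a rate drops from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


# Miss rates quoted in the testimony for an assumed reader threshold of 20.
first_read_miss = 0.16   # unaided read
second_read_miss = 0.11  # read after reviewing CAD marks

# Absolute change is in percentage points; relative change is a fraction of the
# original miss rate, which is the "roughly 30 percent" figure in the transcript.
absolute_change = first_read_miss - second_read_miss
relative_change = relative_reduction(first_read_miss, second_read_miss)

print(f"absolute reduction: {absolute_change:.2%} (percentage points)")
print(f"relative reduction: {relative_change:.1%}")
```

[The sketch only separates the absolute change of about five percentage points from the relative change of roughly 30 percent; it does not say how either number would differ at another assumed threshold.]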
DR.
TRIPURANENI: Partly. The absolute number that you picked up was
about 4 to 5 percent. I think the
improvement whether it is 20 or 80 percent threshold is approximately 16 to 28
percent or something like that.
I'm
actually going back to the actual number of patients right there. If somebody has three nodules in one lung,
it doesn't matter if you will pick up two more nodules on that lung. What I'm really interested in is the patient
who never had any nodules in both lungs that the CAD helped to pick up an extra
nodule that would really make all the difference in that particular patient.
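[Editor's illustration, not part of the spoken record: the distinction Dr. Tripuraneni draws between counting extra nodules and counting patients who change from nodule-negative to nodule-positive can be sketched as below. The patient records are invented for illustration and are not study data; the type and function names are hypothetical.]

```python
from typing import NamedTuple, Sequence, Tuple


class PatientRead(NamedTuple):
    """Nodule counts reported for one patient (hypothetical structure)."""
    unaided_nodules: int  # nodules reported before viewing CAD marks
    aided_nodules: int    # nodules reported after viewing CAD marks


def summarize(reads: Sequence[PatientRead]) -> Tuple[int, int, int]:
    """Return (extra nodules, converted patients, unaided-negative patients)."""
    extra_nodules = sum(max(r.aided_nodules - r.unaided_nodules, 0) for r in reads)
    unaided_negative = sum(1 for r in reads if r.unaided_nodules == 0)
    converted = sum(
        1 for r in reads if r.unaided_nodules == 0 and r.aided_nodules > 0
    )
    return extra_nodules, converted, unaided_negative


# Invented example data, not taken from the sponsor's study.
reads = [
    PatientRead(3, 5),  # already positive; two more nodules found
    PatientRead(0, 1),  # negative unaided, positive with CAD
    PatientRead(0, 0),  # negative both ways
    PatientRead(1, 1),  # unchanged
    PatientRead(0, 2),  # negative unaided, positive with CAD
]

extra, converted, negatives = summarize(reads)
print(f"extra nodules reported with CAD: {extra}")
print(f"patients converted from negative to positive: {converted} of {negatives}")
```

[A per-nodule tally and a per-patient tally can move independently, which is the point of the question. The sketch does not separate true findings from false ones; that would need a reference standard for each patient.]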
MR.
MILLER: I don't know the number on that. I can tell you that there were a fair number
of patients like that. I mean, maybe
about half of the cases in our study only had a single nodule so for that
nodule to be identified caused the ratings to go up. Again, I don't know the percentage but there were quite a few
cases like that.
DR.
KRUPINSKI: Do you know the flip side to
that? How many of the absolutely normal
patients that the radiologist called normal and then the CAD pointed something
out and turned their totally true negative into a false positive and now you've
got a false positive patient. How many
of those?
MR.
MILLER: I don't know that number. Again, I know that there were patients like
that but I don't know the number. I
would be speculating.
DR.
SOLOMON: I think you're hearing the
question essentially of how do you translate the statistics into clinically
significant issues. That is, changing
the patient who is negative into a positive or whether it is significant if you
add one nodule as the sixth nodule.
MR.
MILLER: I am hearing that and I think
that is something we can probably work with FDA on from the data that we
collected.
DR.
CONANT: May I just say something
quickly in terms of answering this question?
I think actionable nodules are really the target that we have
clinically. It's wonderful to look for
a two-year follow-up or biopsy proof but that is not what the task is at
hand.
It's
are we going to say short-term follow-up.
We need that stuff eventually and, yeah, we're all curious about it but
in terms of the detection task, it's really an actionable nodule. I agree that this is a good target but,
again, I'm concerned about looking at the data by patient, not just by nodule
or quadrant because it does make a difference in patient management whether
it's nine or 10 nodules versus zero to one so I agree and disagree with that.
The
other thing I just really quickly need to comment on is this comparison, I hate
to do it, with mammography. But, you
know, I see CAD in place of mammography and, yeah, people cheat. That's not what this is about. This is about marketing and education and
you can't prevent people from cheating.
That's
not really our task here. It does
happen but hopefully, you know, people will be better at that. The thing about a chest CT is that this is
one task in that chest CT that they are being asked and that this company is
addressing so that this idea of cheating, "I'm going to look at the whole
CT but I'm not going to look at nodules until I have my prompt." I don't think we as a panel can really go
there but I've seen it happen. I don't
do it.
DR.
BLUMENSTEIN: How do you cheat?
DR.
CONANT: I actually have not used --
DR.
BLUMENSTEIN: It just doesn't seem --
DR.
CONANT: I have to admit I have not used
CAD in clinical practice. I am waiting
for it to come off the direct digital images in my clinic. It used to be you digitize the images, you
had your film screen there, and you pushed a button and your little prompt came
up and you didn't have to wait until after you saw the images.
You
just pushed the button and you never had to look at the images. Your answer was there. Now, one thing that has, or I think
potentially could be built into a soft-copy review of digital mammography and
chest CTs is a lag time before the information is available, or the requirement
to go through the image with multiple window levels and mediastinal and all
that other stuff that chest people do.
Potentially
in mammography to prevent cheaters you could say, "Okay, you've got to
scroll through every image on all the resolutions and stuff before your CAD
prompts will come up." Again,
that's not what we're being asked to create a safeguard against cheating here,
I don't think.
DR.
SOLOMON: It's important for safety
issues and maybe even a warning that you had to click before you actually --
you know, just a reminder to the average user that this is something that could
be dangerous unless you looked at the scan already.
DR.
CONANT: But that's education and
training and eventually you're liable anyway.
DR.
FERGUSON: My question is tangential to
this because as I listen to you describe the instrument and its use, you said
that -- I thought I heard you say that the radiologist had to go through the
scan before he could click on your button.
I
mean, is there a fail safe there which keeps the radiologist in -- little or
none of these people are around, you understand, but where he could go in and
click and get your imaging for the whole lung scan for nodules and then use
those as his reference points?
DR.
IBBOTT: Actually, Dr. O'Shaughnessy, I
was going to invite you to come up and your colleagues to come up to this table
so you don't have to keep jumping up and down.
If you pull up a couple more chairs, perhaps three or four of you could
sit at that table.
DR.
O'SHAUGHNESSY: Thank you.
DR.
CASTELLINO: I perhaps was misleading
when I made that comment or wasn't understood.
First of all, let me emphasize there is no fail safe mechanism. We thought about building that in in some
fashion. We feel that labeling and
training will address it. There are
work-arounds if you made everybody look at the lung windows first.
You
go through the whole lung windows and push the button so, I mean, we are very
-- radiologists are very clever people but I don't think it would work. What I was trying to get across -- I see you
would agree with me.
What
I'm trying to get across in looking at, you have to look at all the lung
windows for a whole host of other abnormalities that are within the lung of
which nodules are one feature, let's say, of maybe eight or 10 features that
you're looking for. Even if you push
the button first and said there's a nodule or two, you still are required to
look at everything because you have to do that.
I
think radiologists will use it -- will be more likely to use it in the
prescribed fashion. With mammography
it's different. With the calc code being about 98 percent accurate, it's almost
approaching 100 percent, yes, I think some radiologists probably do use it as a
first reader for calcs but certainly not for masses.
DR.
IBBOTT: Thank you. It, again, appears that we have consensus on
this second question, that actionable nodules are an appropriate target for
this question.
So
then the third question is the achieved gain in ROC performance demonstrates
safety and effectiveness of the CAD.
We've already been discussing this to some extent. Clearly it does seem to depend on how
rigorously the radiologist followed the always and the never rules.
Being
people I'm sure that not everybody will always follow the always and never
rules. The question is has the company
done the appropriate things to encourage people to use this device correctly?
We've
seen some of the information that they have provided us today and there is a
fair amount more in the information we've reviewed with the labeling that
describes the warnings. I would like to
ask how you feel if you haven't already volunteered your opinions about the
labeling and the adequacy of these warnings if you consider that they are
acceptable.
I
don't mean to swing us away if you view that question as asking something a
little different, but certainly I think that the safety question is at least
partly dependent on people following the never rule, not changing their
diagnosis based on the response of the CAD system.
DR.
CONANT: Just real quickly, I'm very
positive about the first two. This one
I have problems with, though, because I don't think that we've really
definitely shown the effectiveness without looking at this by case. You're actually specifically asking here
about ROC performance as the measure of effectiveness. Until I have it broken out by patient, I'm
not really sure of that.
DR.
BLUMENSTEIN: I see there's two measures
we have. We have ROC performance which,
I think, is a measure of device performance.
Then what we've been talking around and we all seem to have some degree
of discomfort with is whether it performs clinically the way that we would
expect it to or would hope that it would.
I
have misgivings about whether the ROC performance measures are accurate and I
have expressed those but I definitely have issues about whether there's
clinical safety and effectiveness demonstrated because we don't have measures
of confidence bounds on sensitivity or any other kind of measure that shows us
an estimate of the clinical efficacy.
Now,
I don't know whether the FDA is inclined to give a device approval based on
device performance or whether there is a need for demonstration of clinical
effectiveness. But as a panel member
given the data that I have, I have to say that the answer to C is no for me.
DR.
SOLOMON: I have two questions for you
that are related to this. The first is
that we weren't presented with any data on reproducibility of the system. I don't know if you have anything to say
about that. If I ran an R2 on the same
scan or same patient, is it going to always give the same result?
DR.
O'SHAUGHNESSY: In this particular case
-- this is Kathy O'Shaughnessy -- the images are digital images so the
algorithm will perform exactly the same on the same digital image. Reproducibility isn't an issue.
DR.
SOLOMON: Okay. And then the second question has to do with
the fact that I guess you are currently selling the product in Europe and I'm
not sure how many months now it's been that way but do you have any feedback
from the physicians in Europe who are using the system? How is it working as far as safety and
efficacy goes?
DR.
O'SHAUGHNESSY: It hasn't been on the
market very long in Europe so we only have a limited number of sites. In terms of safety there's been no adverse
events certainly that have occurred with the device. I believe that physicians are very happy with the use of the system. They are not collecting clinical data, as
far as I know, that could be supporting this application.
DR.
SOLOMON: Do you have any post-market
studies of data that you are collecting right now in Europe?
DR.
O'SHAUGHNESSY: No, we're not.
DR.
IBBOTT: Yes.
DR.
TRIPURANENI: Regarding the clinical
effectiveness, even though that is not the topic of the discussion, we heard
from Dr. MacMahon about what he felt about this. I would like to ask, if the Chairman lets me indulge, Dr. Delgado
about his particular clinical impressions.
I'm
not talking about the protocol per se.
What is your feel having looked at 20 or 30 patients in your
institution? Do you think it's going to
have an impact on the clinical practice?
Perhaps it's not a fair question.
DR.
DELGADO: Well, we did not do a
dedicated analytical study but we did get basically comments from different
radiologists, of which I'm one, that worked with them. We do handle a large volume of CTs per day
and multi-slice CT cases.
Like
I said, most radiologists found that there were nodules that we missed and
increasing nodule detection is something that I think is only a good thing so I
think it's effective in terms of what it's stated to be, that is, increasing
detection rate of nodules. I felt that
it's effective in what it's stated to do.
DR.
STARK: If I can touch on a couple of
things on this one, C. I mean, we see
effectiveness where the radiologists are limited so much in their tasks and
safe because they are constrained to just looking at airspace without the
distractions under these conditions that we all agree are designed to ask a
very focused question designed for this ROC study. But we don't know if the radiologists given whether -- I
certainly agree with Dr. MacMahon's suggestion that a more reasonable study
group would have 80 out of 100 scans be completely normal and maybe 18 out of
that 100 have some other abnormality like COPD, some atelectasis or pneumonia
or pleural effusion. And 2.7 out of
that hundred should have perhaps a solitary pulmonary nodule because we can make
arguments here that you have perhaps undersold the technology that it might be
particularly useful at helping the radiologist find a needle in a haystack when
he's distracted, but it also has to show that is the efficacy argument that has
not been proved and it might be better than what you say. It might be worse. The safety argument is under those conditions can you prevent
these radiologists from falsely causing additional scans, biopsies, etc., to
fight off these false positives when you do have to look at the mediastinum and
there is an infiltrate and there is some adenopathy or some post-operative
changes. That's one issue in terms of
the study population.
I
also wanted to mention that I think my colleague, Dr. Solomon's reproducibility
question is particularly important.
What happens after a patient has been operated on? We all agree that the computer is going to
run the same file the same way twice absent, again, your flash card got
overloaded with photographs of Mars.
But what about the patient who is scanned on another day and breathed
differently or had their arms by their side or had a contrast injection? There must be data available to you that doesn't
even require -- each patient serves as their own control. I mean, just go into the archives at
Sloan-Kettering and you can come up with 100 scans digitally, run them through
your computers, and show here are patients where we have six scans. We have 100 patients that have had six scans
and how many of those, if it's 20 percent have an abnormality, did this machine
treat that abnormality. That is a very
simple, not labor -- not even -- there's no physician work at all. That would really answer the reproducibility
question in a clinical context and it would show that doctors can rely on this
from day to day.
Lastly,
I am concerned to hear that this product has been in Europe. Clinical radiologists, especially when
something is this -- like surgeons deciding lap cholecystectomy works. It's good for patients. We decide based on word of mouth, anecdotes,
and I very much appreciate Dr. Delgado's excellent presentation of his
anecdotal experience. It brings this to life but where are
the European papers saying, "This has changed my practice. This has made my life easier. I feel more comfortable." There are usually anecdotal reports at
levels that have less of a standard than we have here like the RSNA or
national meetings and why aren't they appended as written testimonials at a
higher level than, forgive me but, you know, from one user at a beta test
site. Where are the published
testimonials or anecdotes or clinical case reports in the literature of Europe?
DR.
IBBOTT: Yes, please.
DR.
MacMAHON: I think there were a number
of issues. One was a suggestion of
doing the observer test in a different way, perhaps with more normals and with
multiple kinds of abnormalities in the scan.
I agree that would be ideal in a sense.
I
should point out there were multiple abnormalities in the scans that were
used. These were not just pristine
normals versus typical nodules. I
think, in fact, you saw in the really typical classical nodules the results
were much more impressive.
A
lot of the disagreement among the radiologists in nodule detection, I think,
although I didn't participate, was not so much is this a nodule or a
vessel. It was does this qualify
according to these very specific criteria as an actionable nodule above a
certain size and above a certain density, when does it become a scar or when
does it become an airspace opacity.
Those
are the things we struggle with every day.
That was partly a matter of definition.
But I think the mix of normals and abnormals was used to maximize the
statistical power in the experiment. Of
course, one could do more ideal experiments if time and money are no object but
this was already pretty extensive. I
think that was probably a reasonable approach.
There
are some other issues. Perhaps I'll
have the other people address them.
DR.
CASTELLINO: Well, just a couple
comments. It turns out, it just so
happens, that half of the patients in the 90 group study, 45, were done with
bolus IV contrast injection and the other half were not so we didn't design it
that way. It just happened to fall out
that way. We saw no difference in the
appearance of the nodules. In fact,
with contrast you may expect some of these things might be easier to detect.
To
answer one of the questions, I would like to reemphasize we didn't cherrypick
for clean lungs. We had an independent
radiologist come in and rate the lungs as clean, intermediate, or dirty. I don't know the exact numbers but I think
something like 15 percent would be dirty lungs, about 30 or 40 percent
intermediate, and the other whatever remained would be relatively pristine
lungs. As I said before, a number of
these patients did have prior surgery or radiation therapy. They were included in the study group.
I
would dearly like to go into Sloan-Kettering's Radiology Department or any
other radiology department and get a bunch of cases like I used to do and do
clinical studies. I can tell you trying
to get cases from institutions to do this type of research work is
extraordinarily difficult. I know the
academic community is very aware of this.
We are trying to develop both databases so everybody can have access to
it. Let me tell you, this is not a
trivial issue. To identify these five
sites you've got the cooperation from these people and it is extraordinary and
we are deeply indebted to them. I think
your suggestion is great. You get me
the studies and we'll do the research on it.
Reproducibility. I think the issue with mammography, and I
don't like to keep bringing this up, but when you're scanning a film the noise
within the scanner, the digitizer, is a problem with reproducibility. We've done those with film-based
studies. With a digitally acquired
image, there is no issue for the algorithm since it always works on
exactly the same digital data set.
Going
from one patient to the next, it all depends on how that patient is. Two days later the patient may have motion
artifacts and what not. The CAD
obviously will perform differently on that type of case material.
Lastly,
there are some reports that were presented with this product. I'll be glad to get them together and ship
them out to you guys to take a look at.
They are all, of course, retrospective studies looking at cases that were
read as negative, reviewed in retrospect to see were there nodules in the lung
and CAD identified a number of nodules.
One
comes out of Brigham at Harvard.
Twenty-two percent of the cases were negative for lung nodules, not
other abnormalities. They found nodules
that they felt were important to recognize in retrospect, 22 percent of the
cases. Oh, I just answered that
question. Okay.
DR.
IBBOTT: How about Nancy since you
haven't said anything.
MS.
BROGDON: I just wanted to comment. When you mentioned shipping some information
out, please make sure that anything that you submit comes to the agency
directly. Thank you.
DR.
CASTELLINO: Absolutely.
DR.
IBBOTT: Dr. Krupinski.
DR.
KRUPINSKI: An issue sort of following
up on what was already brought up. Not
reliability but engendering trust in your users. I notice that when you're reporting the false positive rates on
the stand alone you report median. Now,
typically median is used when you have a skewed distribution so I'm assuming
that you are negatively skewed and your false positive rate the average was
higher than the median. Could you tell
me what the average was, was it skewed, and then the range of false positives
per case. Not just the median because
two to three median most people are going to look at that and say average. I think it might be a little bit misleading.
MR.
MILLER: I agree. People use the word
average sometimes to mean either a median or a mean and I think we have to be
very careful not to refer to that number as an average because the distribution
is skewed and the median and the mean are different. Because there are
some patients that could actually have 100 nodules, we don't have a cap on the
number of marks. The system could
actually find 100 true positives on a given case so we actually do have one
case out of the 151 that had 47 false marks.
Now,
I think on that case when people hit it they sort of just ignored all the marks
because it was just obviously a very, very dirty lung. I don't know the number off the top of my
head but I think the mean false marks is four if we are defining false as marks
that were not panel findings at all.
If
we include some of those equivocal findings, the one-thirds and the two-thirds,
I think it may go up to five and the number is different if it's false marks
per normal case or total number of false marks. That's in the ballpark of what it is.
There
are actually a fair number of cases with zero marks. A lot with zeros and ones and so forth so that's where it goes
back to your other questions about correctly localizing.
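[Editor's illustration, not part of the spoken record: the gap between median and mean that Dr. Krupinski and Mr. Miller discuss can be sketched with a small invented data set. Only the single extreme value of 47 false marks comes from the testimony; every other count is hypothetical, so the resulting mean is not the study's mean.]

```python
import statistics

# Hypothetical false-mark counts per case. The 47 mirrors the one extreme case
# mentioned in the testimony; all other values are invented for illustration.
false_marks_per_case = [0, 0, 1, 1, 2, 2, 3, 3, 4, 47]

median_marks = statistics.median(false_marks_per_case)
mean_marks = statistics.mean(false_marks_per_case)
mark_range = (min(false_marks_per_case), max(false_marks_per_case))

print(f"median false marks per case: {median_marks}")
print(f"mean false marks per case:   {mean_marks}")
print(f"range: {mark_range[0]} to {mark_range[1]}")

# A long tail of high counts pulls the mean above the median.
print(f"mean exceeds median: {mean_marks > median_marks}")
```

[A mean that sits above the median, driven by a few very high counts, is conventionally described as right or positive skew. Reporting the median together with the mean and the range, as requested in the exchange above, lets a reader see that tail.]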
DR.
KRUPINSKI: This is unrelated but did
you look at the stand-alone performance was very different from the ROC
analysis? You broke this now into
classic versus nonclassic. Did you go
back and look at the performance data of the observers using that breakdown
instead of what was used?
MR.
MILLER: Yes, we did, using a cut point
of the four-fifths classic so if you -- we don't have the distribution here but
it's sort of split out neatly that people are more likely to be on one end or
the other so using that four-fifths definition you actually get more of a
separation of the curves than we do with what we showed you. I think that is essentially that we have a
higher true positive percent and people are reacting more often to it.
DR.
IBBOTT: Okay. I think then we'll go on to the next question. Question No. 2 then is please discuss
whether the labeling of this device including the indications for use is
appropriate based on the data provided in the PMA. This is, again, on the question of are the instructions for use
and warnings about the always and never rule sufficient. Maybe we have discussed that enough. I'll see if there are any comments from the panel.
DR.
STARK: I have a few. If people could turn to Tab 8. I'm not sure if I've directed myself to the
most important place but this is where I've taken off. I think, by the way, since I'm a primary
reviewer I should fill in some of these things. Suffice it to say I
would like to conclude
-- I conclude from the discussion that I've heard
today that the word "significant," that if this product is approved
now or in the future, any claim to significance really should be toned
down.
I
don't know -- I'm not trying to lawyer anybody here. I know there are people in the FDA that know how to do this but I
would be offended to see the word. I
think there is a future for this technology.
I'm not sure today is going to be the biggest step forward but it's
definitely positive or negative result in terms of approval.
This
is a step forward because there is going to be this technology but I do not
think we are close to where I would feel comfortable being part of something
where a radiologist is told that this product makes a significant difference. I think this is an aid like a better light
bulb in a view box.
I
mean, I think it should be -- if you are allowed to sell this, I think the word
significant should be in a footnote and only when it's within two words, if you
put it in Google, of the word ROC so that we have a statistically significant
ROC result in a footnote.
But
to tell radiologists this is going to make a significant difference in their
practice or significantly help their patients, I think this panel and everybody
who has been candid have labored mightily to say that is not a correct claim
and it would be misleading. I would
rather be on the plaintiff side of a malpractice suit related to that.
Similarly,
for example, some of the language that I would use as an example, and, again,
I'm not trained in this and forgive me for being blunt. I'm just trying to help because I'm
presuming in these comments that there is something to be decided here and
we're just talking about language.
The
phrase under efficient detection of lung nodules, paragraph 2, second
sentence. By the way, here is an
example of the confusion. You have
clinically significant nodules here and elsewhere the word significant is used
and we talk about it being loaded, spun, twisted by our presumed innocence but
marketing people will get carried away and you would be on the edge of fraud
just due to concatenation. So forget
about that word significant, but high sensitivity and low false positive CAD
marker rates, I do not see how someone can make that concatenation. That is just to me a little bit too artful.
We
have a very high rate of false positives with CAD. I mean, to characterize what we are having as CAD marker rates as
low false positive is the exact opposite of the truth. Again, this is my opinion. I would love to parse the language if that
is what we are supposed to do here.
Let's
see. In terms of improving sensitivity
and efficiency, the sensitivity argument, I think that may pass muster with an
asterisk. I don't know that we've shown
there is any increase in efficiency at all.
I really don't. I think we have
said basically to the radiologist read it again.
I
would like to -- I appreciate the back and forth and I think everything Dr.
MacMahon said is correct and everything that I have said is correct as we,
again, focus people on this. You are
redirected to a single slice and perhaps the computer work station, whatever it
costs, leads you to that slice but no radiologist is going to decide real or not
real based on looking at that one slice.
They
are either going to tile up the adjacent slices until they are fully through
the lesion, or they are going to trackball through it and in most cases human
nature you are going to trackball through a significant fraction of the images.
All
I can say is touche, back and forth on this.
You are not just going bing, bang, boom, there are three slices it
picked out. They were all obviously
nothing. No way. No way at all that's going to happen. You are going to trackball through it and
that's going to take time. I think the
efficiency claims really would have to go.
On
the next page where it says, "Automatic CAD processing or lung nodule
detection requires no user interaction."
Again, please, my opinion is that I know some person probably was just
being enthusiastic but this requires that the radiologist be responsible for
dealing with this snow storm of false positive exams.
It's
the worst kind of user interaction.
It's the kind of user action that causes radiologists to stop doing
mammography or to leave the field entirely.
It's like I'm going to say there's all these positives here and I'm
going to be a malpractice lawyer's dream.
Now you have to bat away all of these snowflakes and take the time to
interact.
Definitely
have to interact and take the time to do it and be liable. I doubt this is something that should be
considered here but the effect on people's ability to read, the psychodynamics
that produce these ROC curves, that produce radiologists' performance really
is largely affected by people's anxiety and I know there are people here that
are expert on that and I'm not.
But
I think it's going to make people very edgy and it's going to have a lot of
unintended consequences that they are going to be thinking about what's the malpractice
lawyer going to do with or without application of this approved
technology.
That
alone might have a bigger effect on reader performance. Those of us who don't have the machine will
be more careful and those that do may or may not be more careful. I think the labeling and the training are
extremely important. I know we'll get
into the training next.
DR.
CONANT: May I say something real
quickly?
DR.
IBBOTT: Yes.
DR.
CONANT: Just a little rebuttal there,
Dr. Stark.
DR.
STARK: Please.
DR.
CONANT: Sorry, David. From experience in breast imaging, I just
have to say two marks is not a high false positive rate. When I'm looking at the task at hand which
is 300 images, I don't know that's a high false positive rate until I know how
it impacts a single patient.
It
doesn't sound that bad to me compared to what we're doing with mammography and
where we've come and where we're going.
I don't think you can jump to say -- I mean, I think I agree with all
your other things here but I would be hesitant to say that's too high until we
have the data because it doesn't sound that bad unless it impacts those single
two patients where there are those two false positives.
DR.
IBBOTT: Yes, Dr. Ferguson.
DR.
FERGUSON: Speaking of the labeling, I'm
looking here and I'm sorry. I
apologize. I have gone through here
several times -- not talking about the advertisements but the manual that you
have -- looking for a clear definition of what we saw on the slides which is what
I think should be somewhere in here up front, and that is the two slides which
we showed about what you must do and what you must not do to use this
device. Is it somewhere in here?
DR.
IBBOTT: You're talking about the always
and never rules?
DR.
FERGUSON: Yes.
DR.
O'SHAUGHNESSY: I agree it's very
important. We're looking for the advice
of the panel on this issue and labeling in general. I should comment that particular situation is at the front of
your Tab 4 where we've got preliminary warnings and precautions that would be given
to the radiologists. That's where in our mammography product we
typically -- these are gone through during your training session to make sure
that the information gets across.
Again, we would look to work with FDA with your advice from the panel to
come up with appropriate labeling for the device to affect both the manual and
any advertisement labeling. That is
part of what the job is when we finally work with FDA and get a final labeling
for the device.
DR.
SOLOMON: Two other quick questions on
the labeling. One is whether vendors
matter. I mean, you have two
vendors. There are several others out
there and whether or not there's any impact on your system. The second one, as far as labeling goes,
whether or not there's an optimal slice thickness and whether or not that
should be stressed because maybe the protocol should be changed to optimize
your system.
DR.
O'SHAUGHNESSY: I can answer that at the
high level. If you want to go into more
detail, I have the technical people here.
Although at the five sites we chose to select cases for the regulatory
study, they happened to have scanners from the two vendors mentioned, GE and
Toshiba, our separate database of cases that was gathered for training the
algorithm has representation from all the major CT vendors.
In
addition, as part of the approval for a CT machine there are very rigorous
controls on the quality of the images.
Those type of controls more than adequately make sure that the images
are adequate for CAD. I believe that's
okay. The second question again? I'm sorry.
DR.
SOLOMON: Optimal slice thickness and
protocol design for optimizing your system.
DR.
O'SHAUGHNESSY: Right. Because the system was designed to address
the issue, especially in an information overload situation, we focused the
development of the algorithm for slices of 3 mm collimation or less.
In
fact, the system won't process CT images unless they have collimation less than
that. Part of it is that's where
radiologists are most likely to miss.
The other factor is it's a more volumetric description of the lung and
so the algorithm is designed to perform in that environment.
DR.
TRIPURANENI: I heard Dr. Emily Conant
loud and clear that it's not our business to actually decide how the user is
going to use the system, but I think I have to agree with Dr. Ferguson. I really would like to see in big letters
always and never somewhere loud and clear.
When
you look at these fancy color graphics, for somebody not paying attention it
looks like you can push the button and the machine is going to tell you
everything even though it says "improves" and all those things but I
think those two points need to come out loud and clear.
DR.
STARK: Is there anything in here to
give comfort to a radiologist once this product is approved for not buying
it? Is there any justification for not
feeling bound to use this in every patient whether they have pneumonia, they
are in for a car accident, follow-up on a pleural effusion? I'm
wondering what type of marketing pressures that we haven't yet seen are going
to drive people to feel that they will be left as a wounded calf behind the
herd for the malpractice lawyers if they don't take on the burden of using this
product for every CAT scan done in America after the FDA gives this its
imprimatur.
DR.
CASTELLINO: I thought I got two
questions there. One might be, I think,
if you have this in your department would you choose to use it on patient A and
not on patient B. If they meet the
technical requirements, the CAD works in the background.
I
think it takes an average of three to five minutes to process the images for
the CAD results. If you're reading in a
standard fashion, which is not really that much on line, the CAD information
will be available to you. You can
choose to use it or not to use it.
My
suggestion as a radiologist is if it's there and you think it's worthwhile
since you have acquired the technology, you probably should use it in every
case but this is definitely up to the person who wants to use it.
The
second question is a little more difficult to address. I think your question is really saying if I
don't have one should I get one. Our
experience with mammography and, Emily, I hate to go back to that but I guess I
have to, is that the utilization of CAD mammography, which was approved
five and a half years ago or more, has been relatively slow.
I
mean, there are many mammography practices that don't have it. In fact, apparently you don't have it. I don't think this is going to force
radiologists to get it or not to get it.
Just like a 16-channel CT scanner is not a necessity if you're doing CT
if you have an 8 or a 4, and some people still have a single slice scanner.
Or
having all the probes on an ultrasound machine or having all this or all that,
radiology programs make decisions on what technology they wish to hire. If they think this is valuable, it will help
them in their practice, they will acquire it.
If they don't think it's any good, they won't. I think the marketplace will decide.
DR.
STARK: Shouldn't the labeling of
products like this -- this is perhaps a broader question but I think it
pertains here -- contain disclaimers so that someone does not make inferences
about the standard of care or what is the required minimal diligence of a
physician or a hospital who chooses not to be an early adopter of this
technology?
DR.
O'SHAUGHNESSY: I think that would be up
to the panel to discuss. Again, if
appropriate labeling is found to be important for this product, then, you know,
we'll work with the FDA to include it.
DR.
KRUPINSKI: Sort of a tangential
question. With mammography now when you
use CAD you get extra reimbursement above and beyond. Do you foresee this happening with this as well?
DR.
O'SHAUGHNESSY: I think it's a little
early at this stage of this technology to figure out what the reimbursement
situation will be.
DR.
IBBOTT: Let's move on then -- oh,
sorry. Go ahead.
DR.
CONANT: Can I ask just a real quick
technical question? Maybe this is very
naive and I didn't understand your illustrations but does the algorithm that
analyzes the images, does it come -- I guess can I hook up lots of scanners to
it? Is it one box for each scanner or
is it one box for each department? I know
there are issues with mammography. I'm
just curious.
DR.
O'SHAUGHNESSY: In this situation
depending on how many CT images you are going to feed through, the fact that
we've utilized the DICOM standard means it's just an appliance sitting on the
network so you just push them from any scanner available in your system and as
long as you don't exceed the computing capability of the computer to keep up
with your case load, there is no restriction.
DR.
IBBOTT: Well, that brings us to the
question of the training program, No. 3.
Please discuss whether the sponsor's proposed training program for
radiologists is adequate. If not, what
other training would you recommend? I
would like to start by asking my question about that. I couldn't find anything in the material here that provided a lot
of detail about the training.
In
particular, how long the training is and how closely supervised it is. You presented a bit more during your
presentations but I wasn't sure if that was the type of training you would
propose for customers or if that was training for the people who were doing the
evaluation.
DR.
O'SHAUGHNESSY: I think that is a great
issue and good question to bring up. We
didn't have the formal training program written up at the time we were submitting
the PMA and part of the goal of the training at institutions like Dr. Delgado's
was to take a first run at it, assess what changes needed to be made, and then
bring that forward.
So
the format that he described, it was very similar to what we ended up with
which is basically depending on the number of radiologists but typically a site
would have one of our specialists there for a day. They would work with the radiologist one on one to go over the
manual, in particular the algorithm description.
Every
system that ships will have demonstration cases that are good examples of what
CAD marks and what it doesn't mark and the type of false markers they are going
to see. And then as the radiologists
get more comfortable with the system, the shadowing that we talked about where
they are there, available to answer questions, like when the radiologist is reading on
their own but asks, "Why is that mark there?" The applications
person can answer that. Then in
addition to that, the application specialists usually follow up with the site
within a week or two of that training to make sure that no other issues have
come up. Of course, we are always
available by telephone or e-mail if any issues come up. The general outline of the training program
is similar to what we do in mammography and we found that to be very effective.
DR.
KRUPINSKI: As sort of a follow-up, Dr.
Delgado said that some people weren't there for the training and then some of the
other radiologists trained them. Is
that enough? Is that acceptable? Because obviously I wouldn't think they
would be able to answer some of the more technical questions so how did you
feel about that?
DR.
DELGADO: That's a good question. We were able to do it quite readily. The training experience that I had with the
application specialist was really just three or four hours in the morning. We had some lunch, they were around for the
afternoon and stuck around and watched us read and shadowed us.
I
think that perhaps that is something for R2 to consider, if they want to actually mandate
that physicians go through the training in that fashion or some kind of course
or improvement period. That was not
strictly applied in my case as a beta experiment but I see that potentially
being used in clinical practice. That
is probably a good recommendation.
MR.
BURNS: A time is not given. You used eight hours. I would suggest a super user trained at the
facility and the production of a training CD so that even though you have new
radiologists and staff coming on board, training CDs are not that hard to
produce and you have your own project.
Three to four hours sounds about right to teach someone how to use this
work station.
DR.
DELGADO: I should add that is something
that we went through. At least the
physicians that did receive the course or the small introductory application
seminar. We did process, I think,
relatively about 15 or 20 cases, some of which were provided by R2 and some of
which were from our institution. That
is some kind of case load that should be either already prefixed or from the
institutions. Definitely valid.
DR.
STARK: I think the most important part
of training is going to be identifying what causes these false positives and
cataloging them because there are going to be -- there's going to be a pattern
and frequency of artifacts or anatomic coincidences that probably the company
already has some good idea what they are that are going to be very different
than the false positives that we train our residents to recognize in the normal
practice.
The
false positives that the radiologist has to fight off on his own going through
the studies are likely to be a very different mix of appearances and locations
than the false positives that you are going to see with the device. Also with and without contrast. We have heard that 50 percent of these
patients had contrast.
It
would be reassuring to actually just see it written down if it's subject to
analysis that there are no unique issues post contrast. So in your educational material it might
even -- one could even say that someone has to deal with that at the PMA stage
but we should see atlases or a CD.
It
may not be extensive. It might just be
10 appearances. You have some examples
already in the PMA. These are the
things that you can expect that you're going to see 80 percent of the time in
eliminating these false positives and let's see the 10 or 15 most common
variants. A radiologist would train on
that in an hour. I think that is an
important supplement.
DR.
O'SHAUGHNESSY: Yeah. I think that's basically -- maybe I didn't
explain it clearly enough but that is basically what the manual does is it goes
through examples and then we use those demonstration cases that were chosen to
give a representative range of the types of both true and false positives
that you see on CAD.
DR.
IBBOTT: Dr. Delgado, in your experience
with the system did you and your colleagues -- I guess I should ask how long do
you feel it took before you became familiar with these sorts of presentations
of false positives? Did you find it a
complicated process?
DR.
DELGADO: No, I did not. First of all, one of the comments by Dr.
Stark was in my experience in the cases that we processed from our institution
many of them were CT contrast-enhanced pulmonary angiography studies. Many of them were contrast enhanced.
And
we also had many cases that were for lung nodule workups in oncology patients
where lesions were detected in chest x-rays. We noticed no significant difference in false positive rates based
on contrast or no contrast.
DR.
STARK: When you say we noticed, you're
talking about an anecdote?
DR.
DELGADO: True. That's my experience and those are my
colleagues. As far as the false
positives -- is that your question? -- recognizing artifacts or false
positives, I believe that -- I mean, those are normal things that radiologists
have to look at now on a daily basis.
We
have artifacts that are either generated from noise or from post-operative
changes, from other technical parameters such as contrast coming into the SVC
and being rather dense. I don't see a
particular difference that the CAD would present perhaps a false positive mark. The radiologist's decision making upon that
CAD mark is no different than something that he might have identified
himself. That's my perception of the
issue.
DR.
IBBOTT: Any other comments about this
question before we go on to the next one?
All right. We'll go on to the
fourth one.
MS.
BROGDON: Dr. Ibbott, could I ask the
panel to go back to question No. 2, please?
Part of our intention in asking this question was that the panel also
address the indications for use. Do you
believe as a panel that the requested indications for use are appropriate?
DR.
IBBOTT: And you are referring to the
published indications from the sponsor?
MS.
BROGDON: That's right.
DR.
CASTELLINO: Tab 1, page 1.
DR.
IBBOTT: And this is being presented
also in the sponsor's presentation. Are
there comments about this? From a
physicist's point of view it seems straightforward but perhaps that's not the
appropriate -- I'm not the appropriate reader for this. It's the person who would be using the
system.
DR.
SOLOMON: This may be a good place to
include the always and never thing that we've been talking about. I don't know if this is the appropriate
place.
DR.
KRUPINSKI: This is also why it would be
interesting if we could have seen the difference between the classic and the
not classic. Here you're talking more
about classic nodules and performance based strictly on those to see if this
truly is appropriate.
MR.
MILLER: Just to clarify quickly, the
primary analysis is based on all unanimous nodules. It was one of the sensitivity analyses that --
DR.
KRUPINSKI: Right, but not all of those
were classic.
MR.
MILLER: That's correct.
DR.
SOLOMON: The only other thing, I guess,
is to possibly emphasize the fact that somebody doesn't realize that ground
glass nodules would not be included in this.
If I just read it kind of casually, it's a solid pulmonary nodule, I
might think all nodules would be included, whereas you might want to
distinguish the fact that the system is not meant for ground glass nodules or
other things.
DR.
KRUPINSKI: You mentioned satisfaction
of search here and I'm just wondering if there is a reverse. You are going through -- there's all these
other abnormalities. You note, yeah,
there's atelectasis back here. Then you
go and you bring up the nodules. Has
anybody looked at the possibility that you are going to get a reverse SOS and
now you're all concentrated on the nodules and you forget to report the initial
findings. Has anybody looked at
that?
I
mean, if you're not going to give your report, you know, if you're not going to
sit there and dictate before you look at the CAD, there's the possibility that
now you're all wrapped up in the CAD and all of a sudden the other stuff goes
out of your mind. Clinically do you see
that happening?
DR.
MacMAHON: Well, I haven't actually used
this system so I'm just speaking from general experience. Of course, in reading CT scans, as I think
Dr. Castellino described, we go through it multiple times already.
We
go through the mediastinum and personally I make notes, or my resident makes
notes as we go through because it's really hard to remember all of the
abnormalities in all of the areas so I take a second run or a third run and
look for pulmonary nodules and abnormalities and make more notes. My instinct is that it would not be an
issue.
DR.
TRIPURANENI: As I read through this
again, I guess now that we have Dr. Stark's comments and others, the second
paragraph is an interesting paragraph.
I don't want to wordsmith. That
is certainly not my expertise.
If
you look at the first sentence in the second paragraph, it kind of ends with
the other recognized causes of suboptimal review. I'm just raising the question.
Potentially a radiologist could actually bar the system by looking at
the indications. I'm not saying he
will. The hole in the whole system
there is that he could actually say, "I can slack off a little bit.
The
system is going to pick up the nodule there." I think once again always and never are very important to really
put it on the face kind of stating it every single time. The whole system is predicated on those two.
DR.
CONANT: I think there could also be
further contraindications in the warnings and precautions. I mean, again, just emphasizing the always
and the nevers but I'm not sure you can dictate what people actually do.
DR.
STARK: But isn't it fair to say that
given the combination that they are making a claim here that it relieves you of
fatigue and distraction or other recognized causes of suboptimal review? I mean, these are bold statements that are
going to be used by marketing people to get radiologists to look at this.
DR.
CONANT: Where does it say
"relieves?"
DR.
STARK: I'm sorry. It lapses.
I misconstrued it. The chance of
observational lapses by the reader due to fatigue. Well, the next patient that the same radiologist reads after
having to deal with these false positives, one could make an argument there's
more risk to the next patient.
DR.
CONANT: One way to deal with this is
basically the second paragraph nobody really likes a lot because who wants to
read about our lapses and fatigue, right?
Maybe that's not necessary here if always and never is emphasized. Is that happy?
DR.
STARK: I think if the FDA has our point
that we are unhappy with the language, I'll leave it at that.
DR.
CONANT: We don't like to be called
tired and distractible.
DR.
IBBOTT: Mr. Burns.
MR.
BURNS: If I remember correctly earlier
during your presentation, you indicated this algorithm does not work with low
dose chest CT. Correct?
DR.
CASTELLINO: No, I did not. I said that the clinical cases that were
collected for the ROC study were all clinically indicated studies. That is, they did not contain any type of
screening low-dose exam. In our test
database a substantial number of the cases are, in fact, low-dose CT scans and
it performs quite well in that, or equivalently well in that. But specifically for the ROC study they just
happen to be clinically indicated exams like you see in most hospital practices
or out-patient practices.
MR.
BURNS: Okay. So what you have in the warnings regarding the mAs levels covers
that issue. Correct?
DR. CASTELLINO: Correct.
DR.
IBBOTT: All right. Then let's move on again to the fourth
question. I think we have an indication
where we're going on this one, too. If
the PMA were to be approved, please discuss whether the above or any other
issues not fully addressed in the PMA (A) require post-market surveillance
measures in addition to the customary medical device reporting. Several people have suggested that they
would like to see additional studies done if this device were to be
approved. Those of you who have called
for that, would you like to elaborate?
DR.
STARK: Well, I've mentioned actually
seeing data. I'm not inclined to argue
with the perceptions because I think they are likely correct about low dose and contrast
but the public needs to see this. This
needs to be written down somewhere so it's objective and hopefully some
statistics can be applied to it.
Artifacts
due to common thoracic interventions such as excision of one of these nodules,
a clip left behind, radiation damage, patients who can't put their arm over
their head. I think those are the major
things that are medical in nature. I
think one of the things -- there needs to be something negotiated with the FDA
in terms of minimum.
You've
got already minimum CAT scan or technology but as CT technology evolves what
would trigger a change in surveillance.
It may be a different category but under this if this PMA were approved,
again, the technical experts at the FDA need to negotiate what is some minimum
quantum change in the technology that would require a new PMA and review. Is it going to remain a Class III device
or what would it be? Is it going to be
a 510(k) application of substantial equivalence?
Again,
as I alluded to earlier, I don't know what algorithm is used here and I'm not a
computer scientist but what is a trivial change to a layman may be very
significant to a copyright attorney or a radiologist. If the algorithm switched entirely to being, say, a MIP of
subtraction or something like that, at some point there has to be some
disclosure and review, I would think, of the performance.
DR.
O'SHAUGHNESSY: Can I just comment on
that last point? There are very well
established guidelines that FDA has with manufacturers as to what requires a
change. Any change in the product has
to be evaluated against certain criteria and then those will be based on the
approved labeling. Everything that the
panel contributes here today will go into deciding what changes in the product
require further review by FDA.
DR.
STARK: Well, then for the FDA's sake
I'm not aware of what those are and they would do well to diligently merge that with
some of the insights we have learned today because certainly we've heard a lot
of novel things today that are novel to everybody in this room. They are going to be novel to the people
that developed those guidelines perhaps with the breast nodule detection in
mind but they may not be totally apposite here.
DR.
CONANT: The things that I raised before
just to summarize, and I'm not sure where they fit in preapproval or
post-approval because I'm not sure if we made that decision yet but, again,
it's a case-based analysis versus multiple nodules, quadrants, all that. You've heard that multiple times. A little more insight based on case-based
analysis of false positives and false negatives.
I
think that's really important. We've
been talking a lot about the false positives but I think the false negatives
are fascinating. What happens when
you've got really diffuse lung disease?
One of the exclusion criteria here was greater than 10 nodules. I mean, what about someone who has -- I
don't know what disease that would be but a gazillion -- yeah, sarcoid,
right. Granulomas everywhere, old TB,
whatever. Where can this really be used
effectively and where does it really just fall down.
Also
your cases were over 19 years of age.
What happens in the pediatric?
You know people are going to start applying this everywhere. That just came to me recently. That has to be included, I guess, in the
labeling and certainly analyzed. Whether
it's pre- or post-approval, I mean, that's what we're here for.
DR.
SOLOMON: I would just add the thoughts
on making the study more real life so collecting data maybe in a prospective
fashion that will essentially test the system in real life conditions. Real-life conditions for the doctor,
real-life conditions of diseases and everything, and I guess a real-life test
essentially.
DR.
TRIPURANENI: I would recommend the
same. I think whether it's pre or post
I think there needs to be a follow-up study of the patients that are going to
go through this to see what is the clinical impact ultimately.
DR.
IBBOTT: All right. Well, I think we're on the verge then of
deciding if it's going to be a pre or a post-approval study. Unless there are other concerns that you
want to address now, I suggest that we move on.
We
now come to a second half-hour open public hearing session. If there are any individuals wishing to
address the panel, please raise your hands and identify yourselves at this
time. Seeing none, then we move on.
Before
we move to the panel recommendations and vote, is there anything additional the
FDA would like to address?
DR.
DOYLE: Now that the panel discussion is
over, we would ask the sponsors to go back to their seats, please.
DR.
WAGNER: Fear not. I will not make a technical comment but
since Dr. Blumenstein's position is heavily influenced by some of his
statistical comments, I would just like to tell you that the issue about
correlation across modalities has been addressed in the literature by a number
of authors including myself and it's at the bottom of the third page of the
references there.
Modalities
are not a random effect but cases and readers are. The entire correlation structure is accommodated by the model
here. Also the sampling scheme does
sample the intra-reader variability, as I said this morning. Two out of three of your points are, in
fact, addressed in the literature.
Thank you.
DR.
IBBOTT: And, finally, is there anything
else the sponsor would like to address?
DR.
O'SHAUGHNESSY: No, thank you. We appreciate the questions very much.
DR.
IBBOTT: Thank you.
DR.
DOYLE: All right. We will now move to the panel's
recommendations concerning PMA P030012.
The Medical Device Amendments to the Federal Food, Drug, and Cosmetic
Act (the Act) as amended by the Safe Medical Devices Act of 1990, allows the
Food and Drug Administration to obtain a recommendation from an expert advisory
panel on designated medical device premarket approval applications, PMAs, that
are filed with the agency.
The
PMA must stand on its own merits and your recommendation must be supported by
safety and effectiveness data in the application or by applicable publicly
available information. Safety is
defined in the Act as reasonable assurance based on valid scientific evidence
that the probable benefits to health under conditions of intended use outweigh
any probable risks.
Effectiveness
is defined as reasonable assurance that in a significant portion of the
population, the use of the device for its intended uses and conditions of use
when labeled will provide clinically significant results.
Your
recommendation options for the vote are as follows: Approvable if there are no conditions attached. Approvable with conditions. The panel may recommend that the PMA be
found approvable subject to specified conditions such as physician or patient
education, labeling changes, or further analysis of existing data. Prior to voting all the conditions should be
discussed by the panel.
Finally,
not approvable. The panel may recommend
the PMA is not approvable if the data do not provide reasonable assurance that
the device is safe or if a reasonable assurance has not been given that the
device is effective under the conditions of use prescribed, recommended, or
suggested in the proposed labeling. If
the vote is for not approvable, the panel should indicate what steps the
sponsor may take to make the device approvable.
DR.
IBBOTT: All right.
DR.
TRIPURANENI: May I ask you to read the
effectiveness statement again please? I
want to listen to it again.
DR.
DOYLE: I would be happy to do
that. Effectiveness is defined as
reasonable assurance that in a significant portion of the population, the use
of the device for its intended uses and conditions of use when labeled will
provide clinically significant results.
DR.
TRIPURANENI: Thank you.
DR.
IBBOTT: Would anyone on the panel care
to make a motion?
DR.
BLUMENSTEIN: I move "not
approvable."
DR.
IBBOTT: It's been moved not
approvable. Is there a second to this
motion?
DR.
STARK: I'll offer a second.
DR.
IBBOTT: I'm sorry?
DR.
STARK: I would offer a second.
DR.
IBBOTT: All right. It's been moved and seconded. Is there discussion then of this motion?
DR.
KRUPINSKI: Can we discuss the
procedure? Do we discuss it --
DR.
STARK: And then vote.
DR.
IBBOTT: And then we will vote.
DR.
KRUPINSKI: On that motion?
DR.
IBBOTT: On this motion.
DR.
KRUPINSKI: And then it takes two-thirds
to --
DR.
STARK: Majority.
DR.
KRUPINSKI: Majority.
DR.
STARK: If that motion doesn't pass,
then we'll ask for another motion.
DR.
CONANT: I'll say something. I think there is a lot of very rich data
here. There's more data we'd like, of
course, that they don't have like follow-up studies to your follow-ups. You know, what happened to the
patients. But within the data that
they've given us, I'm sure they can look at it by case and look at false
positives, even false negatives.
I
would hesitate to jump yet to not approvable without at least getting that data
that should be obtainable without IRBs and all that stuff because you guys
should have it on those spreadsheets by patient and have a second look at
that. That's where I stand with the
non-approvable part.
DR.
SOLOMON: I agree with what you just
said. I mean, I think we're put in a
difficult position here. I think all of
us seem to be asking for more clinically relevant case data. It seems like something you might have but
we don't have that information right now.
That's difficult when the statement for efficacy says clinically
significant results and it's hard for us without having necessarily those
clinically relevant information. I
think that pretty much sums up the issue right there.
DR.
TRIPURANENI: As a clinician we have
high-tech in radiation therapy using lots of machines and equipment to
follow-up things right in there. Having
practiced for more than 20 years, I have come to believe that any process you
improve typically improves the patient outcomes. Sometimes I believe it's a leap of faith but I think most of the
things that you do in the clinic that you improve usually improves the outcome.
I
would like to believe that actually the fact that you can actually pick up a
few more nodules I think eventually will translate into some sort of positive
impact on patient management. I really
would love to see some data. In fact,
that's where I have the dilemma. I
asked Mr. Doyle to repeat the effectiveness statement right there.
I
think if you follow the rule of the law right there, I have to make a real leap
of faith that actually this is improvement.
My personal belief is that any improvement in the process will improve
the care so I have to really make the leap of faith to actually work for it but
I think it's a dilemma, as Dr. Solomon said, that we're all in. I really would love to see some clinical
data.
DR.
KRUPINSKI: Just to be specific, I think
what we're after is on a patient basis how many normals were then converted to
a false positive to abnormal and then how many false negative patients and back
and forth on each one. I mean, all
possible combinations. I think that is
specifically what we're looking for.
DR.
STARK: If I could offer another
analogy, a brief one.
DR.
IBBOTT: Are you addressing the motion?
DR.
STARK: Yes, I think so. I'll be brief. The issue of approving gadolinium DTPA for MR scanning of the
brain was obvious, as it is here, as we've heard from statisticians and
clinicians. Given the constraints of
this study it's really obvious to us that this technology likely makes things
better. But unlike the decision to
approve gadolinium at a cost of billions of dollars because we saw a few
anecdotes where it made things better, no one had an argument that it could
make things worse or make things less efficient. Here there are serious concerns that the marginal improvement in
efficacy which is perhaps buried in the statistics is offset by a much more
obvious risk to the patients here.
Forgive me if that's not on the point of the motion but I think the
panel has done a lot of soul searching and that's the reason why I think we
have hesitated -- my hesitation.
DR.
IBBOTT: It seems to me that this device
provides information that is not available otherwise and more information is
usually better. I share your concern to
some extent. Certainly not to the
degree that you do, I think, though, that people may misuse the device or take
advantage of it to relax in their own vigilance. I think the sponsor can address that.
Yes,
Dr. Ferguson.
DR.
FERGUSON: It seems to me that the
company has followed very carefully the suggestions of the FDA and I applaud
them for that. I don't think that we
should necessarily penalize them for that unless that's the will of the group
here because we are advisers only to the FDA.
I would side with those who think that more information is required and
I think it's been outlined very, very well what that information should be but
I don't think -- I would not vote for nonapprovable.
DR.
IBBOTT: Is there any more discussion
before we prepare for a vote?
MS.
BROGDON: Dr. Mehta?
DR.
MEHTA: Yes, I'm here. I can hear the conversation.
MS.
BROGDON: Do you have a comment?
DR.
MEHTA: No, actually I don't have
anything to add at this point.
DR.
IBBOTT: All right. Well, in that case we will proceed to the
vote.
MS.
BROGDON: Dr. Mehta can vote if he
wishes.
DR.
MEHTA: Actually, I'm uncomfortable
voting because quite a bit of the time it was breaking up and I feel it would do
an injustice to the sponsor for me to vote if I've not heard everything
clearly.
DR.
IBBOTT: All right. Fair enough.
DR.
SOLOMON: Can I ask one question? As far as the categories go if the
nonapprovable and the approvable with conditions, where would going back to
your data and coming up with some of this clinical evidence that we're asking
for fall into?
DR.
IBBOTT: Well, at the moment we are
voting on a motion to declare this application not approvable. If that motion passes, then that's the end
of the discussion here.
DR.
DOYLE: But I think Dr. Solomon's
question is where would reanalysis of existing data?
DR.
STARK: Yes, that was the question based
on your definition.
DR.
DOYLE: That could be part of approvable
with conditions. That comes under that
definition.
DR.
STARK: Would non-approvable also invite
the manufacturer to resubmit answering the same questions? This doesn't go away forever.
DR.
DOYLE: No. In fact, if that were the case, we would ask each one of you to
recommend what you think the sponsor should do to make the device approvable.
DR.
MOORE: Can I make a point? I would also second Dr. Conant's point that
I think a lot of the data that's being asked by the panel is in the data that
the sponsor has available. I think that
really should be taken into consideration.
Particularly if we're
thinking about additional studies here whether it be post-market or pre-market.
Obviously if it was non-approvable that would be pre-market. We really need to think about the
reasonableness and what it would take for a sponsor to do that.
I
think the company's worked very well with FDA in trying to identify what is
appropriate. I think it's not only FDA
that's kind of worked on that but sort of the industry of what's appropriate
for evaluating this. I think that needs
to be taken into consideration.
DR.
IBBOTT: All right. We will proceed to the vote then and I'll
remind you that the motion is not approvable.
I'll ask you to state whether you vote yes which means that you are in
favor of declaring not approvable, or no in which case you disagree with the
motion and would consider a different motion, or abstain. We note that Dr. Mehta has abstained. Dr. Krupinski, I would like to start with
you.
DR.
KRUPINSKI: No.
DR.
IBBOTT: No. Thank you. Dr. Conant.
DR.
CONANT: No.
DR.
IBBOTT: Thank you.
DR.
FERGUSON: No.
DR.
IBBOTT: Dr. Solomon.
DR.
SOLOMON: No.
DR.
IBBOTT: Dr. Blumenstein.
DR.
BLUMENSTEIN: Yes.
DR.
IBBOTT: Dr. Tripuraneni.
DR.
TRIPURANENI: No.
DR.
IBBOTT: Dr. Stark.
DR.
STARK: Yes.
DR.
IBBOTT: All right. Well, we have two in favor, five opposed,
and one abstention. This motion does
not carry. We now come back to
entertaining another motion. I would
like to ask if someone on the panel would like to make a motion.
DR.
KRUPINSKI: Approve with conditions.
DR.
IBBOTT: The motion is approve with
conditions. Is there a second?
DR.
FERGUSON: Second.
DR.
IBBOTT: Dr. Ferguson. Now, we've had quite a bit of discussion but
perhaps either Dr. Krupinski or Dr. Ferguson would like to speak to the
motion. I'm sorry. The next step is to establish the
conditions.
DR.
KRUPINSKI: One at a time?
DR.
IBBOTT: One at a time, yes.
DR.
KRUPINSKI: One condition would be for
the post-analysis of the by-patient data.
DR.
IBBOTT: And each condition requires a
second.
DR.
CONANT: Second.
DR.
IBBOTT: Dr. Conant seconded. Now, is there discussion about this
condition that would be attached to a motion to approve with conditions?
DR.
STARK: My question is does the motion
imply or should we specify that we are saying that's a condition where the FDA
must be satisfied before the product is permitted to be marketed?
DR.
IBBOTT: That is the meaning of
conditions, that it is approvable once the conditions are satisfied.
DR.
STARK: Okay. And approvable means it would be subject to FDA approval?
DR.
IBBOTT: That's right. We're making a recommendation to the FDA
which they then consider.
DR.
TRIPURANENI: Dr. Krupinski, could you
elaborate the condition? I didn't
understand. I'm sorry.
DR.
KRUPINSKI: Basically what we want
instead of the ROC analysis based on the quadrants is to say, okay, here is a
patient who is classified as normal.
How many times did the radiologist call that normal and then because of
the CAD called it false positive. And
vice versa where they initially called it false positive did the CAD make them
now call it true negative.
Then
how many patients no matter how many nodules they had radiologist says false
negative, the CAD correctly turns them to true positive. And vice versa how many times did the
radiologist call it true positive and the CAD made them reverse their patient
decision and call it false negative.
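The by-patient tabulation described here can be sketched in a few lines of Python. This is only an illustrative sketch, assuming each patient has a known truth label plus one radiologist call before CAD and one after; the function names (`label`, `summarize`) and the example records are hypothetical and are not data from the sponsor's study.

```python
from collections import Counter

def label(truth, call):
    """Classify one patient-level call against truth as TP, FN, FP, or TN."""
    if truth == "abnormal":
        return "TP" if call == "abnormal" else "FN"
    return "FP" if call == "abnormal" else "TN"

def summarize(cases):
    """Count how each patient's classification moved from the unaided read
    to the CAD-aided read, e.g. 'FN->TP' or 'TN->FP'.

    Each case is a (truth, before_cad, after_cad) tuple whose fields are
    either "normal" or "abnormal".
    """
    summary = Counter()
    for truth, before, after in cases:
        summary[f"{label(truth, before)}->{label(truth, after)}"] += 1
    return dict(summary)

# Made-up records for illustration only (not study data).
example_cases = [
    ("normal", "normal", "abnormal"),      # normal patient pushed to false positive
    ("normal", "abnormal", "normal"),      # false positive corrected to true negative
    ("abnormal", "normal", "abnormal"),    # false negative corrected to true positive
    ("abnormal", "abnormal", "normal"),    # true positive reversed to false negative
    ("abnormal", "abnormal", "abnormal"),  # unchanged true positive
]

if __name__ == "__main__":
    for transition, count in sorted(summarize(example_cases).items()):
        print(transition, count)
```

A table of this kind reports all possible combinations per patient, regardless of how many nodules each patient had, which is the distinction being drawn from the quadrant-based ROC analysis.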
DR.
TRIPURANENI: Are you asking for a
post-marketing analysis or a pre-market analysis?
DR.
KRUPINSKI: No, re-analysis of the
existing data.
DR.
TRIPURANENI: Okay. Thank you.
DR.
STARK: Is it also implied that the FDA
-- that's a specific question but I think it is implied -- I'm asking is that
implied that is to -- certainly not to the exclusion, I would think, of the
many other questions the FDA might have based on our discussion today, or
should we add our own conditions and try to broaden that? I think so many things have been raised here
today.
I'm
so impressed personally with the qualifications of the FDA staff, the clinical
staff, Dr. Sacks, the statisticians, that I would want to give them broad
discretion and encourage them, in fact, insist that in addition to answering
your question that they address many of the other issues that they will see fit
to recognize in the transcripts of this proceeding.
DR.
KRUPINSKI: I'm not sure how broad each
condition has to be.
DR.
DOYLE: There's no requirement either
way. Keep in mind that the FDA will
interpret these conditions so that you can state them in broad terms and we
certainly will work with the sponsor to refine them to specific actions. You don't have to spend a lot of time
wordsmithing these conditions is what I'm basically saying.
DR.
IBBOTT: Dr. Blumenstein.
DR.
BLUMENSTEIN: Let me have clarification
here. Are we talking about conditions
prior to approval or post-approval conditions?
I'm a little confused about that.
DR.
IBBOTT: These are conditions prior to
approval.
Yes,
Nancy.
MS.
BROGDON: If you have post-approval
conditions you want to include here, then you should.
DR.
IBBOTT: Thank you.
DR.
KRUPINSKI: So those would be like
follow-up on new patients. That would
be a post-approval?
DR.
IBBOTT: A post-approval for condition
for approval.
MS.
BROGDON: I'm sorry. I didn't understand your question.
DR.
IBBOTT: If we impose conditions that
cannot be met until after the device is marketed, then how can that be a
condition for approval? Or is it a
recommendation at that point?
MS.
BROGDON: These are all
recommendations. If some of them are
about post-approval data, then just identify them as such and we'll know how to
sort them out.
DR.
IBBOTT: Thank you.
MS.
BROGDON: If you have things that you
are specifically looking for, you ought to name them in your conditions.
DR.
IBBOTT: Good.
DR.
CONANT: I think things that are
pre-approval conditions before we get to post-approval.
DR.
IBBOTT: Let's deal with them one at a
time.
DR.
DOYLE: Let's try and deal with this one
condition.
DR.
IBBOTT: By the way, we need to vote on
each condition so before you --
DR.
CONANT: I seconded hers, didn't I?
DR.
IBBOTT: Yes. And are you speaking to that condition?
DR.
CONANT: No.
DR.
IBBOTT: Let's vote to make sure we're
in agreement to attach this condition and then we'll come back and add more
conditions. Is there any other
discussion about this condition? Then
let's ask Dr. Mehta again if he wishes to vote on these conditions.
MS.
BROGDON: Dr. Mehta, do you wish to vote
on any of the conditions?
DR.
MEHTA: I think I'm going to abstain on
that as well.
DR.
IBBOTT: All right.
Dr.
Krupinski.
DR.
SOLOMON: The only other thing on her
condition is to -- I mean, it was a very broad statement. Obviously the implication is that the
statistics remain favorable on the case analysis. I mean, it's implied.
DR.
IBBOTT: Good point. Yes.
DR.
KRUPINSKI: Yes.
DR.
IBBOTT: Thank you. Dr. Conant.
DR.
CONANT: Yes.
DR.
FERGUSON: Yes.
DR.
SOLOMON: Yes.
DR.
BLUMENSTEIN: Yes.
DR.
TRIPURANENI: Yes.
DR.
STARK: Yes.
DR.
IBBOTT: All right. Unanimously in favor of that condition.
Now,
at this point, Dr. Conant, you could introduce another condition.
DR.
CONANT: Always and never. Labeling issues. I think everybody agrees on that to clarify the labeling
addressing the many issues we did.
DR.
IBBOTT: Is there a second?
DR.
KRUPINSKI: Second.
DR.
IBBOTT: It's been seconded. Do you want to elaborate on just how you
would like them to do that?
DR.
CONANT: Nobody really liked the second
paragraph about fatigue and lapses and to really emphasize this always and
never and to have the radiologist be ethical and moral and all those good
things. And to really downplay the
issues of statistical significance, to try to lay off that if possible.
I
think even right now the efficiency issues we don't really know that or we haven't
quantitated that so I wouldn't go there either.
Not even soft pedal I wouldn't go there. I'm sure other people have other things to include in that
condition.
DR.
IBBOTT: Dr. Krupinski.
DR.
KRUPINSKI: I think we should maybe
consider the possibility of adding the always never to the software. Not only are you trained on it but, say,
maybe every 20th case because you can keep track of who logs in, the reminder
comes up so it's made a part of their consciousness and you just don't have
it in that initial three-hour training session because no one is going to read
the manual. We know that so if it's not
in the initial three hours. In addition
as a later reminder.
DR.
IBBOTT: Any other comments regarding
this condition? All right. Then I think we are ready to vote on this
one.
Dr.
Krupinski.
DR.
KRUPINSKI: Yes.
DR.
CONANT: Yes.
DR.
FERGUSON: Yes.
DR.
SOLOMON: Yes.
DR.
BLUMENSTEIN: Yes.
DR.
TRIPURANENI: Yes.
DR.
STARK: Yes.
DR.
IBBOTT: Unanimously in favor
again. Then we'll -- oh, I'm
sorry. Dr. Mehta. He's abstaining from all these, we
think. One abstention.
Would
someone like to entertain another condition?
DR.
FERGUSON: The issue of formalized
training for those that are going to use the device. I like the idea of a CD-ROM.
I don't have to spell those out.
Everybody knows what those would be.
Most of the panel feels that it's appropriate to spell out a time. I don't think it's necessary for this device
personally.
DR.
IBBOTT: Are you suggesting that the
condition mandate training when the device is sold?
DR.
FERGUSON: Yes, I am.
DR.
IBBOTT: Is there a second?
DR.
KRUPINSKI: Second.
DR.
IBBOTT: Dr. Krupinski. Any more discussion about this condition?
DR.
TRIPURANENI: Could you elaborate, Dr.
Ferguson, what exactly in broad context.
You want the technicians to be trained and you want a CD-ROM to be given
with some cases of false positives, false negatives?
DR.
FERGUSON: Yes. I think we've talked about all of those
things before. I can't remember all of
them or elaborate on them but I think they have a clear idea of what we need to
have rather than somebody buys the instrument and puts it in. I think we need a little more than just
having a technician, if you will, or an M.D. even. I don't know what level this person is that goes in for two or
three hours to train. This will be
protective for you as well as the patients.
DR.
IBBOTT: I'd like to comment. Also I support this and I would like to see
the sponsor consider some sort of remote review. This is digital data with DICOM.
There probably are mechanisms that a review could be done sort of
looking over the shoulder but from a distance so that it wouldn't necessarily
-- the training session wouldn't be restricted to the time that the company's
representative is on site.
Any
other comments? Okay. Then we'll vote on this motion. Dr. Krupinski, we'll start with you again.
DR.
KRUPINSKI: Yes.
DR.
CONANT: Yes.
DR.
FERGUSON: Yes.
DR.
SOLOMON: Yes.
DR.
BLUMENSTEIN: Yes.
DR.
TRIPURANENI: Yes.
DR.
STARK: Yes.
DR.
IBBOTT: One abstention and the
remaining all vote yes. All right. Are there other conditions?
DR.
TRIPURANENI: I'd like to propose a
post-marketing surveillance. The
reason for that is I think the amount of patients that they have even though
they are going to do the pre-marketing analysis of the data, I'm afraid we may
not have enough number of patients to really tell us what is going on
there. They
looked at the quadrants and the number of nodules increase and all those
things. When you look at alive human
beings and the clinical impact, the significance is going to change. I think it's going to be really small.
I
would like to propose that we give the broad description to the FDA to kind of
come up with something in their best judgment post-marketing surveillance where
they can actually track that it really have a clinical significance.
DR.
KRUPINSKI: Second.
DR.
IBBOTT: Thank you. Any discussion?
DR.
CONANT: I think this is part of
this. I'm interested in the impact of
the CAD on other disease detection. I
don't quite know how to do this so I would want panel members to help with
this. For example, ground glass
opacities and things like that. I
wonder if this might not impact one's detection of some of these other
things.
Again,
it's broadening the population and I would recommend that they do a study with
less strict criteria looking at a more prospective group and analyzing the
impact of the CAD on the interpretation.
Why you would have to look at the interpretation before application of
the CAD of all diseases and look at it after.
I don't know if that is of interest to anyone else.
DR.
SOLOMON: I think that is essentially
what the post-market study would be is to look at any changes that come about
as a result of the CAD usage.
DR.
CONANT: Very general, right?
DR.
KRUPINSKI: Not just on nodules but
other things as well.
DR.
CONANT: Yeah, like mediastinal
adenopathy. It's that distraction
aspect I think someone brought up earlier.
DR.
IBBOTT: It would be difficult for us to
design a useful study in the next 10 minutes.
DR.
STARK: But is a potential condition of
approval to limit its approval to patients like those studied and perhaps data
can be shown to the FDA so it could be approvable for use with contrast
media. We've heard that's possible and
we haven't voiced any objections to that but conditional approval that it not
be applied to patients with obvious artifacts, other lung disease such as
ground glass nodules or pneumonia. It
hasn't been studied in children and I don't know if we're obligated to point
that out and ask for that.
DR.
CONANT: They did have other diseases in
their first group but they didn't look at how the -- there were others,
emphysema, ground glass, post-op, all that stuff. I'm not sure you can restrict it.
DR.
STARK: Have they shown us enough that
they can market to all comers or is it a condition of approval that this would
be marketed to all?
DR.
IBBOTT: This would be a new condition.
DR.
STARK: Either an amendment to the
existing motion or a new one.
DR.
IBBOTT: The motion is for a
post-marketing study which would certainly address the issues that you've
mentioned.
DR.
STARK: I didn't realize we had moved to
the --
DR.
IBBOTT: Yes. This motion we are discussing now is for a post-marketing
study. Surveillance.
DR.
MOORE: Just to make a point of
clarification to Dr. Stark's comments, I think in the company's labeling they
have made it very clear that there are certain type of abnormalities that are
not appropriate for this device. I
think some of the labeling already takes into consideration some of the points
that you've raised.
DR.
IBBOTT: Let's come back to the
discussion on the post-marketing surveillance.
Then, if necessary, we'll discuss the labeling again. Further discussion? If not, let's vote on this motion for
post-marketing surveillance.
DR.
KRUPINSKI: Yes.
DR.
CONANT: Yes.
DR.
FERGUSON: Yes.
DR.
SOLOMON: Yes.
DR.
BLUMENSTEIN: Yes.
DR.
TRIPURANENI: Yes.
DR.
STARK: Yes.
DR.
IBBOTT: And one abstention. All right.
DR.
STARK: I'm sorry if I missed the
boat. I didn't realize we had closed
the window and moved on.
DR.
IBBOTT: I don't think we've closed any
windows. We jumped to a motion to
attach a condition or recommendation for post-approval surveillance but I don't
think that prevents us from considering more conditions to approval.
DR.
STARK: Well, if I can, to catch up,
I've jotted down three to consider. All
of these are, of course, subject to the FDA staff's decision.
DR.
DOYLE: Hopefully one at a time.
DR.
STARK: Yes. I would suggest that until it has been proved otherwise, which
means in the current condition it hasn't been proved, that there be no claims,
expressed or implied, of clinical significance. And that there be no use of the term significance.
I'm
not just talking about lawyering this but in spirit as well as the letter of
this recommendation, significance or the like except, as I discussed before, in
the very narrow reference to ROC statistics and even then with some type of
explicit disclaimer that that's not -- was in a nonclinical setting.
The
only thing significant we've seen are statistics that are in a nonclinical
setting and those have helped assure us of the safety and efficacy but I don't
think that should lead clinical radiologists to have to juggle claims of
significance.
DR.
SOLOMON: Do you see that being
dependent upon the results of this clinical analysis that we're talking about?
DR.
STARK: I don't think so. I would not say that satisfying anything
that we have made as a condition would release them from this condition, but if
the FDA finds additional data have established that this is clinically
significant, then I would say the FDA should be free to waive that condition as
a separate condition.
DR.
IBBOTT: All right. This is a condition you would place on the
labeling that the manufacturer must meet for approval.
DR.
STARK: Yes.
DR.
IBBOTT: Are your other -- you mentioned
that you had three items. Do they also
address the labeling?
DR.
STARK: They are labeling, yes.
DR.
IBBOTT: So perhaps we could group them
together?
DR.
STARK: Well, they might fail one at a
time.
DR.
IBBOTT: Then let's get a second on this
one.
DR.
KRUPINSKI: Can I ask is labeling the
same as advertising?
DR.
DOYLE: It comes under labeling.
DR.
KRUPINSKI: It is? Okay.
DR.
IBBOTT: Is there a second?
DR.
FERGUSON: Second.
DR.
IBBOTT: All right, Dr. Ferguson. Okay.
Any further discussion about this?
This would be another condition placed on approval to presumably modify
the labeling -- existing labeling and certainly when designing any new labeling
to avoid claims of clinical significance.
DR.
CONANT: I'm not quite sure we can do
that yet. I want to see their data
first. I think that could come later
but I don't want to close the door on their data so I would be hesitant to vote
yes. I'm sorry.
DR.
STARK: I'm just saying if there is no
more data or if the FDA finds that data insufficient.
DR.
CONANT: Yes, sure. I trust that the FDA will do that but I'm
not sure -- yeah, it's kind of a condition on a condition. It's sort of one step at a time. I think we have asked a big condition of
looking at the data and that may all not show any kind of significance,
clinical or other that we are asking for and then that becomes obvious. I don't get that really.
DR.
IBBOTT: Dr. Blumenstein.
DR.
BLUMENSTEIN: I'm going to vote no on
this because I feel that I trust the FDA to deal with that given that we have a
preapproval condition for clinical data.
DR.
IBBOTT: Any further discussion?
DR.
CONANT: One other. Sorry.
David, in spirit I agree very much with what you're saying but we
already voted and yes'ed a condition on labeling saying they had to take the
stuff out. We did that a couple steps
ago. I think we have suggested that we
really feel this is important by voting on that. And then, again, the FDA is going to take it from there.
DR.
IBBOTT: I think I feel the same way
that we have asked them to do some more analyses of the existing data. The FDA may determine that detracts from the
significance.
DR.
STARK: I'd be happy to withdraw the
motion if there is a consensus, or we better take a vote.
DR.
IBBOTT: I think we can just go ahead
and vote if that's all right. I should
ask, though, is there any more discussion before we vote? Dr. Krupinski?
DR.
KRUPINSKI: No.
DR.
CONANT: No.
DR.
FERGUSON: Yes.
DR.
SOLOMON: No.
DR.
BLUMENSTEIN: No.
DR.
TRIPURANENI: No.
DR.
STARK: Yes.
DR.
IBBOTT: There were two yeses and five
nos and one abstention. So that motion
is defeated. Are there motions for
other conditions to attach to the approval.
DR.
STARK: I have two more and I'll be
brief.
DR.
IBBOTT: Sorry.
DR.
STARK: That's okay. I would ask that it be added to the label
something to the effect or spirit of the following words. "Careful rereading or second reading
may be equally or more safe and effective in a clinical setting."
DR.
DOYLE: Could you say that again?
DR.
STARK: "Careful rereading or
second rereading may be equally or more safe and effective than a computed
second reading in a clinical setting."
DR.
IBBOTT: Is there a second for this
motion?
DR.
FERGUSON: Is that a directive to the
radiologists rather than the instrument?
DR.
STARK: It's a directive for -- I'm
intending it, and forgive me for exploring this, but what a radiologist faced
with purchasing this or using it will be told.
I am proposing that he should be told that if he simply reread the scan
himself or had a colleague double read it, that actually might be more efficient
and safe than this product.
DR.
KRUPINSKI: But you don't have any data
to support your contention.
DR.
STARK: That's why I said may be. They don't have any data to support
theirs. I'm trying. I've only got one more.
DR.
TRIPURANENI: I have difficulty with
this.
DR.
IBBOTT: We're looking for a second.
DR.
STARK: If I don't have a second it
goes. We'll move on.
DR.
IBBOTT: No seconds. All right.
DR.
STARK: Last, it's the same family. I'm just probing this boundary between nonapprovable
and approvable with conditions. Not
demonstrated safe or effective until there's data in patients with artifacts,
concomitant lung disease, contrast media use, or pediatric populations.
DR.
KRUPINSKI: Doesn't this come under the
post-surveillance type stuff that we were asking for?
DR.
STARK: I thought labeling. Condition of the labeling.
DR.
CONANT: I think we are asking again for
the data to be analyzed and included in that by case is looking at -- I mean,
there were cases with artifacts and things like that. I think that is part of what the false negative and false
positive analysis is going to provide us with.
Again, it's a limited case set but depending on what that shows, the
next set may be --
DR.
STARK: If it is understandable to the
FDA that we are assuming they are going to check this, I'm saying that we
haven't seen these data and I was asking as a condition that the FDA ask to see
it. I was just making that a motion. I mean, I know we can assume that they'll do
this anyway.
I'm
just trying to make it a specific direction.
Of course, this is all advice and they can ignore all of this but if
there is a consensus that they should do this, then that is, I think, the
purpose of the motion I'm making which is to ask them to.
DR.
IBBOTT: Go ahead.
DR.
CONANT: Could it be that we could put
this in the first condition which was the first preapproval condition that was
to go back and look at these cases and we talked about by-case compared to
by-nodule and quadrant, etc. Do you
want to step back and beef that one up a little bit?
DR.
SOLOMON: I think procedurally that will
be a problem.
DR.
CONANT: We can't do that? Okay.
DR.
IBBOTT: We can address this motion with
the understanding that, in fact, that is what will happen. We can deal with this motion independently
of the first.
DR.
CONANT: Could you reword your motion or
could you restate it again? I didn't
mean reword it. Just say it again.
DR.
STARK: Yeah, and certainly someone -- I
think all of these we are understanding that we haven't wordsmithed these. I'm simply suggesting that until the FDA
sees data, which we hope is available, it should be a condition of premarket
approval that the product will be labeled as not demonstrated safe or
effective, or safe and effective with the use of contrast media in the presence
of artifacts or concomitant lung disease or in pediatric patients.
DR.
KRUPINSKI: From a nonclinician --
DR.
FERGUSON: It's totally unexplored. I don't think we can suggest that the FDA
look at these because I don't think we can put that into a formal motion
because those things are unexplored as far as I know.
DR.
CONANT: I think that you're saying that
the labeling should read this but the point is if it doesn't get approved and
it doesn't follow this condition that we first said about reanalyzing the data,
there's no labeling here because it's not going anywhere. You're already jumping to labeling based on
the data. It's kind of contradictory.
DR.
STARK: I am suggesting that if it is
approved based on whatever, but we see no data on contrast media, artifacts,
pediatrics, or lung disease that the labeling contain these restrictions.
DR.
CONANT: If we see no data on those
things.
DR.
STARK: If the FDA is not satisfied with
the data which includes not seeing any further data.
DR.
CONANT: Okay.
DR.
IBBOTT: The sponsor has indicated that
their data do include cases with contrast and cases with artifacts. Are you suggesting that when they do the
reanalysis that we've already asked them to do that they also pay attention or
conduct an analysis to look specifically at the impact of artifacts or with
versus without contrast?
DR.
STARK: Yes. I'm saying that they say they have data that we haven't seen and
that if they -- offering them a choice of either satisfy the
FDA that when they offer statistics on their data that it's convincing and
labeling shouldn't apply or simply say we can market it and simply market it
with the warning that if your patient has artifacts we haven't demonstrated
safety and efficiency -- sorry, safety and efficacy.
DR.
IBBOTT: So, yes. I'm not going to try and rephrase your
motion but I believe that you're asking that the reanalysis we've asked them to
do contain those elements to look at artifacts, contrasts. There were no pediatric patients so we won't
include that.
Then
depending on the results the labeling should be modified to indicate that the
device is not appropriate for pediatric patients. For example, if the data don't support its use in pediatric
patients. Is that right?
DR.
STARK: That's correct.
DR.
DOYLE: We need a second.
DR.
IBBOTT: Yes. We need a second.
DR.
KRUPINSKI: The pediatric issue, I just
talked to a clinician, could be significant.
I mean, if the CAD --
DR.
CONANT: I'm not really sure about
this. I haven't looked at a pediatric
chest -- well, actually I do on the weekends.
DR.
IBBOTT: No one will ever know.
DR.
CONANT: It won't get out of this
room. Obviously kids were not
analyzed. It was 19 and above so
obviously pediatrics should be a contraindication. That should be included in the labeling. I think we all agree about that definitely.
DR.
STARK: I think if we don't make a
motion it's not obvious at all because I could wear the other hat.
DR.
CONANT: I think I brought that up
earlier when I said that has to be one of the things that we address with
looking back over the data. At least,
I'm sorry if I didn't. I don't remember
what the transcript was but that's got to be something in the label and it's
not in the contraindication line. The
artifacts we could talk about as a motion.
That sounds like a good idea.
Maybe separate out from the artifacts and other things.
DR.
STARK: I don't know where we are in
pediatrics. Do we need a separate
motion? Are you suggesting that I
bifurcate this already complicated thing?
I'm just trying to point the FDA to satisfy yourself on these things or
exclude them.
DR.
CONANT: I think there's a difference
here of what there may be data on versus what there isn't a chance in hell they
are going to be able to analyze because there's no babies or kids. I think it is different. I think it's two separate issues so I would
say separate it.
DR.
STARK: If you don't mind, why don't you
make the motion on the pediatrics.
DR.
CONANT: Contraindication no. Is it 19 and over? Eighteen. Sorry. No one under 18 should be analyzed with this.
DR.
STARK: I'll amend my motion by dropping
the word pediatrics. We can deal with
that then and then you can have --
DR.
IBBOTT: You've withdrawn. It wasn't seconded so that motion is
withdrawn.
DR.
STARK: I think we are still discussing
it. I would like to say that until the
FDA is satisfied from the existing data set or some other data set but not --
I'm suggesting it's a restriction because we haven't seen the data here that it
be a condition that it be marked not demonstrated safe or effective in patients
with concomitant lung disease or with lung disease -- known lung disease,
scanning artifacts, or with contrast media.
Again, we know they have data on contrast media. I hope it will convince the FDA but I'm
asking that we require that.
DR. CONANT: Should we put pediatric under 18 first?
DR.
STARK: I eliminated that from my motion
hoping that you would carry forward with yours afterwards.
DR.
CONANT: Okay.
DR.
IBBOTT: You're seconding his motion?
DR.
CONANT: No. He told me to do it independently so I just did that.
DR.
IBBOTT: We need a second for the motion
he just made.
DR.
STARK: I think I'm trying to bargain
with you.
DR.
IBBOTT: You guys have to decide.
DR.
CONANT: I think that is still part of
the one we already passed where we've asked for further analysis of the
existing data. I think we have already
covered that. That's why I'm not
seconding it because I think we are already asking them. I mean, if you reanalyze the data and they
find that they can't support what you want, then yours is a condition on the
condition that they don't find it. But
if they do the analysis and find it, then your condition isn't needed.
DR.
STARK: I think it's sufficiently likely
that they are not going to have statistically convincing data on artifacts or
post-op patients or patients with pneumonia.
I am trying to attach a condition that will help the FDA simply say put
in the label you should be careful and not use it in these patients because
it's unproved. I believe they have data
on contrast media but I'm lumping all of them in the same.
I'm
saying these are identifiable important subsets just like the pediatrics
issue. I'm simply saying specifically
look at the analysis for these things and assuming that there is not
satisfaction here in some of them, please label the product appropriately.
DR.
IBBOTT: Dr. Tripuraneni.
DR.
TRIPURANENI: I think there are lumpers
and splitters. I'm a lumper. I think FDA is hearing what we are saying
and I think rather than go down to the final nitpicking and actually spell out
everything, I would rather leave it to the broad discretion of the FDA to
decide the best in their best judgment.
I really don't support this element.
DR.
IBBOTT: We don't have a second
yet. Is anyone willing to second the
motion? All right. Does someone want to make the other motion
regarding pediatric patients?
MS.
BROGDON: May I make a comment
first? I just wanted to describe how we
treat contraindications. We use the
term contraindication to mean something you shouldn't do because there are data
that say you must not do that. There
must have been some sort of demonstration of harm. Short of that, there are warnings and there are cautions and
other things that you can say in the labeling that don't reach the level of
contraindication.
DR.
IBBOTT: Good distinction. Thank you.
DR.
CONANT: Maybe this is a post-marketing
study. They've got to apply it to
kids. I don't know if that -- maybe
it's just a warning saying there is no data to support this use under 18.
DR.
KRUPINSKI: Somewhere it has to be
stated or brought out in the manual or in the warnings or somewhere there is
obviously not a contraindication but it should be there somewhere.
DR.
STARK: The rocket scientist in me says
that why are children different than adults and it's probably going to
work. But as a human, as a parent, I
have a hard time saying these are just small adults. On the other hand, the admonitions of lumping and leaving it to
the FDA, this is all on record, I've spoken.
My conscience is satisfied. I'm
going to leave it to someone else to make a motion.
DR.
IBBOTT: This is certainly something
that could be included in a recommendation for a post-market study and I think
we have probably done that or implied that.
Any
other conditions people would like to attach?
DR.
CONANT: Have we figured out the
pediatric one?
DR.
IBBOTT: We have not. I have made the assumption that the sponsor
and the FDA understand from the discussion that a post-market study would
include pediatric patients.
Yes,
Nancy.
MS.
BROGDON: I'm advised that since the
sponsor has not indicated that it is -- could be used in pediatric patients,
FDA would in most circumstances include some sort of statement in the labeling
that it has not been studied and it is not intended for use in children.
DR.
CONANT: There you go. I'll second that motion.
DR.
STARK: I say yes.
DR.
IBBOTT: The relief is palpable. I think unless there are other motions for
conditions, we are ready to vote on the main motion which is for approval with
conditions, the conditions being those we've just discussed.
MR. DOYLE: The ones that were seconded and approved.
DR.
IBBOTT: That's right. So we do have the motion and so unless there
is any further discussion on the main motion, we'll proceed to a vote on the
motion to approve with -- as approvable with conditions.
DR.
KRUPINSKI: Yes.
DR.
CONANT: Yes.
DR.
FERGUSON: Yes.
DR.
SOLOMON: Yes.
DR.
BLUMENSTEIN: Yes.
DR.
TRIPURANENI: Yes.
DR.
STARK: Yes.
DR.
IBBOTT: And with Dr. Mehta's abstention
the rest of the votes are all in favor so that motion carries. We have declared this approvable with
conditions and we've approved a number of conditions. At this point we go around the room and ask the voting members to
explain the reasons for their vote. Dr.
Krupinski, again, we'll start with you and ask you to identify the reason for
your vote on the decision as approvable with conditions and also on the
recommendations. You can probably
summarize your reasoning.
DR.
KRUPINSKI: Why doesn't somebody else
start because, I mean, it seems like I would just say the entire conversation
we just had all over again. I agreed
with all the changes or the conditions that we brought up. I think they satisfied the questions we had
throughout the day and so I voted yes.
DR.
IBBOTT: I think that's fine.
Dr.
Conant.
DR.
CONANT: That's basically the same with
me. I'm just concerned about how the
statistics -- how the analysis will differ with case-based versus actionable
nodules and quadrants. I, again,
applaud you all for the beautiful study you have done and answering the
questions given to you by the FDA.
I
hope you have this data to show us because I think this could be a wonderful
tool. As these things go they only get
better over time. I think it really
could have benefit to patients. But I
really need that data.
DR.
IBBOTT: Dr. Ferguson.
DR.
FERGUSON: I agree with everything she
said.
DR.
IBBOTT: Dr. Solomon.
DR.
SOLOMON: I think you should be
applauded for dealing with the problem that is an important clinical
problem. I think there are two issues
that the panel is charged with. The
first one being safety. I think the
issues of always and never are the issues on safety and I think there are ways
you can address these and we have discussed those today.
The
second issue that we are charged with is efficacy. I think the key word there is clinical efficacy and I'm not sure
we were able to see exactly the clinical efficacy with the way the data was cut
up and divided so that we think that if you were to look at it again with that
in mind, it might be able to get through to the FDA.
DR.
IBBOTT: Dr. Blumenstein.
DR.
BLUMENSTEIN: I was disappointed that
the sponsor did not come forward with a clinical analysis, and I'm also
disappointed that the FDA didn't require that of them, especially since our
criteria before approval for efficacy has clinical efficacy mentioned in
it. I'm also discomforted by the unique
properties of this study design that may lead to inaccurate assessment of the
ROC methodology.
DR.
IBBOTT: Dr. Tripuraneni.
DR.
TRIPURANENI: I would like to
congratulate R2 for actually coming up with this new concept. You are a pioneer in the CAD and it's good
and bad. It's bad that being the first
one we are going to hold you to a higher standard because we have ideas about what
is right and what is wrong. Somebody
else that is going to come after you their life would be a lot easier because
they are going to learn from your mistakes.
On the other hand, I think you have done a very good job on this.
I
personally think actually any improvement in the process actually will
ultimately lead to the improvement in care.
I think it's important actually that we continue to pursue to improve
the processes that ultimately improve the care. That is the reason why I think we attach those amendments and I
firmly believe it will make a positive impact on the patients. That is the reason why I vote yes with
amendments.
DR.
IBBOTT: Dr. Stark.
DR.
TRIPURANENI: Can I just add one
thing? I really would like to see FDA
asking for clinical efficacy because I participated in the Cardiovascular
Devices Panel, as I say, participated on the other side of the table a couple
of times and they kept pointing the table to where is the clinical data, where
is the clinical efficacy. I would ask
the sponsors to give us some clinical data when appropriate.
DR.
IBBOTT: Dr. Stark.
DR.
STARK: Well, first, as lead clinical
reviewer I would like to thank everybody on the panel, everybody in the
audience, especially R2 for listening carefully and responding to my many
adversarial comments. I think that was
part of my role here today to be both the adversary as well as one of the
voting judges. I thank the chair. It's been a very efficient, respectful
proceeding.
Having
said that, I, again, agree with Dr. Blumenstein's assessment as a
statistician. I note that both the lead
reviewers had a viewpoint strongly held that was overridden by the rest of the
committee and I can now step back and agree with Dr. Conant who has emphasized,
and those reading the transcript would not have seen her facial expression and
the movement of her fist in terms of emphasizing that we are now relying on the
FDA staff to continue diligently what they have already said is a nearly
overwhelming task. Not just for their
manpower and resources but for their range of skills. I think this committee and the people in this room and a larger
group, I believe, needed to address this again to relook at these data but I
accept that I have been outvoted and we will now rely on what is clearly a very
competent, energized and well-supported FDA staff to essentially accomplish the
same thing that Dr. Blumenstein and I were pushing for but as Dr. Conant and
the majority have voted. Thank you.
DR.
IBBOTT: Thank you. I would like now to
ask the nonvoting representatives to comment on the recommendations that have
been made. Ms. Moore.
MS.
MOORE: Although I did not vote, I think
I would have been in agreement with the panel on recommending this for
approval. I think that any improvement
in our ability to detect nodules that are not being detected is an important
step forward and I commend R2 on their efforts and view of the data and trying
to move this technology forward.
DR.
IBBOTT: Mr. Burns.
MR.
BURNS: The conditions satisfy the
concerns that I had regarding the study size and the data set and the small
change in the area under the ROC. I
think by analyzing the data we will see if there is some better significance
with the data.
DR.
IBBOTT: Good. Thank you. I would like
to just give Dr. Mehta a chance to make any comments he might have.
Dr.
Mehta, do you have any comments?
DR.
MEHTA: No. I think I just want to thank Geoff Ibbott for doing an excellent
job of running the meeting. Although I
didn't hear all the proceedings, I think I heard enough to concur with what
actually happened. Thank you,
everybody.
DR.
IBBOTT: Thank you, Dr. Mehta.
Mr.
Doyle.
MR.
DOYLE: Before we adjourn for the day, I
would like to remind the panel members that they are required to return all the
materials that were sent pertaining to the PMA itself. Materials you have with you may be left at
your table and any other should be sent back to me at the FDA as soon as
possible.
DR.
IBBOTT: Thank you. Finally, I would like to thank the speakers
and the members of the panel for their preparation and participation in this
meeting. I would like to especially
thank Drs. Stark and Blumenstein for serving as lead reviewers for the panel and
doing an excellent job of summarizing this and helping the rest of us
understand it.
And
I would like to thank the sponsors for graciously responding to the many
questions that were aimed at them and for putting on an excellent presentation.
Since
there is no further business, I would like to adjourn this meeting of the
Radiological Devices Panel. Thank you.
(Whereupon, at 5:20 p.m., the meeting was adjourned.)