FOOD AND DRUG ADMINISTRATION
CENTER FOR DEVICES AND RADIOLOGICAL HEALTH
RADIOLOGICAL DEVICES ADVISORY PANEL
MEETING
TUESDAY, FEBRUARY 3, 2004
The Panel met at 9:00 a.m. in Salons B-D of the Gaithersburg Marriott Washingtonian Center, 9751 Washingtonian Boulevard, Gaithersburg, Maryland, Geoffrey S. Ibbott, Ph.D., Acting Chairman, presiding.
PRESENT:
GEOFFREY S. IBBOTT, Ph.D., Acting Chairman
BRENT BLUMENSTEIN, Ph.D., Temporary Voting Member
CHARLES B. BURNS, M.S., P.H., Non-Voting Consumer Rep.
EMILY F. CONANT, M.D., Voting Member
THOMAS FERGUSON, M.D., Temporary Voting Member
ELIZABETH KRUPINSKI, Ph.D., Temporary Voting Member
MINESH P. MEHTA, M.D., via teleconference, Chairman
DEBORAH J. MOORE, Non-Voting Industry Representative
STEPHEN SOLOMON, M.D., Temporary Voting Member
DAVID STARK, M.D., Temporary Voting Member
PRABHAKAR TRIPURANENI, M.D., Voting Member
ROBERT DOYLE, Executive Secretary
FDA REPRESENTATIVES:
NANCY BROGDON
NICHOLAS PETRICK, Ph.D.
ROBERT A. PHILLIPS, Ph.D.
WILLIAM SACKS, Ph.D., M.D.
ROBERT F. WAGNER, Ph.D.
SPONSOR REPRESENTATIVES:
RONALD CASTELLINO, M.D.
PABLO DELGADO, M.D.
HEBER MacMAHON, M.D.
DAVE MILLER
KATHY O'SHAUGHNESSY, Ph.D.
A-G-E-N-D-A
Open Session
Call to order and Panel Introduction, Dr. Geoffrey Ibbott, Ph.D., Acting Chairman
FDA Introductory Remarks, Robert J. Doyle, Executive Secretary
Update on FDA Radiology Activities, Robert A. Phillips, Ph.D.
Open Public Hearing
Open Public Hearing; interested persons may present data, information, or views, orally or in writing, on issues pending before the committee
Open Committee Discussion
Charge to the Panel, Dr. Geoffrey Ibbott, Ph.D.
Overview of Contemporary ROC Methods, Robert F. Wagner, Ph.D.
Presentations on P030012 by Sponsor
Introduction, Kathy O'Shaughnessy, Ph.D.
Current Clinical Practice, Heber MacMahon, M.D.
Device Description and Clinical Trial Introduction, Ronald Castellino, M.D.
Clinical Study, Dave Miller
User Experience, Pablo Delgado, M.D.
Summary, Kathy O'Shaughnessy, Ph.D.
Lunch
Presentations on P030012 by FDA
PMA Overview, Robert Phillips, Ph.D.
Clinical Background, William Sacks, Ph.D., M.D.
Clinical Results, Nicholas Petrick, Ph.D.
PMA Review Summary, William Sacks, Ph.D., M.D.
Reports by Panel Lead Reviewers
David Stark, M.D.
Brent Blumenstein, Ph.D.
Presentation of FDA Questions
Break
Panel Discussion
Open Public Hearing
Open Public Hearing: interested persons may present data, information, or views, orally or in writing, on issues pending before the committee
Open Committee Deliberations
Panel Recommendation(s) and vote
Adjourn
P-R-O-C-E-E-D-I-N-G-S
9:06 a.m.
DR. IBBOTT: I would like to call this meeting of the Radiological Devices Panel to order. I also want to request that everyone in attendance at this meeting be sure to sign in on the attendance sheet that is available outside the door. I would note for the record that the voting members present constitute a quorum as required by 21 CFR Part 14.
At this time I would like each panel member at the table to introduce himself or herself and state his or her specialty, position title, institution, and status on the panel.
I'll
begin with myself. Some of you have
already figured out that I'm not Dr. Mehta.
Thanks to the vagaries of air travel and weather, Dr. Mehta is unable to
be here but is joining us by speaker phone.
I'm
Geoff Ibbott. I'm a medical
physicist. I work at the University of
Texas, M.D. Anderson Cancer Center in the Department of Radiation Oncology and
Radiation Physics. I'm a voting member
on this panel and have been for several years.
Obviously I'm standing in as chair for this meeting.
Then,
Charles, let's start with you and we'll go around the table and introduce
ourselves.
MR. BURNS: Charles Burns, Professor of Radiologic Science at the University of North Carolina. My primary expertise is Imaging Diagnostic Physics and I'm a nonvoting consumer representative.
DR.
IBBOTT: Thank you.
DR.
MOORE: I'm Deborah Moore. I'm the Vice President of Regulatory and
Clinical Affairs for Proxima Therapeutics.
I'm the industry representative for the panel and a nonvoting member.
DR.
STARK: I'm David Stark. My current title is President of MRI of Dettum
in Massachusetts. I'm a clinical
radiologist. I've been a chairman for
close to nine years and I know many of you.
I'm pleased to be here. Thank
you.
DR. TRIPURANENI: Prabhakar Tripuraneni. I'm head of Radiation Oncology at Scripps Clinic in La Jolla, California. I am a practicing, full-time clinician radiation oncologist and I am a voting member. I think this is my first or second date on the panel.
DR.
DOYLE: I'm Bob Doyle. I'm the Exec. Sec. of this panel.
DR.
BLUMENSTEIN: I'm Brent
Blumenstein. I'm a biostatistician in
private practice. I'm normally on the
General and Plastic Surgery Panel.
DR.
SOLOMON: I'm Steve Solomon. I'm a radiologist at Johns Hopkins
Hospital. I'm a consultant to the
panel.
DR.
FERGUSON: I'm Tom Ferguson, professor
emeritus of cardiothoracic surgery at Washington University School of Medicine,
St. Louis. I'm a temporary voting
member on this panel. I'm on the
Cardiovascular Device Panel.
DR.
CONANT: I'm Emily Conant. I'm the Chief of Breast Imaging at
University of Pennsylvania and sort of half research and half clinical at this
point. I'm a voting member.
DR.
KRUPINSKI: I'm Elizabeth Krupinski from
the University of Arizona. I'm a research
professor in the Department of Radiology.
My area of expertise is observer performance and image perception
studies. I'm a voting member.
MS. BROGDON: I'm Nancy Brogdon. I'm not a member of the panel. I'm the liaison to the agency. I'm the Director of the Division of Reproductive, Abdominal, and Radiological Devices.
Dr.
Mehta, would you like to introduce yourself?
DR.
MEHTA: Yes, please. I'm Minesh Mehta. I'm a radiation oncologist in terms of specialty and I'm the
Chair of the Department of Human Oncology at the University of Wisconsin. Generally when I'm there I'm chair of the
panel but today I guess I'm listening in.
DR.
IBBOTT: All right. Thank you, everyone. Mr. Doyle would now like to make some
introductory remarks.
DR.
DOYLE: Well, first on the agenda here
is appointment of the Acting Chairperson.
Pursuant to authority granted under the Medical Devices Advisory
Committee Charter dated October 27, 1990, and as amended August 18, 1999, I
appoint Geoffrey Ibbott, Ph.D., as Acting Chairperson of the Radiological
Devices Panel Meeting on February 3, 2004.
This is signed by David Feigal, the Director of the Center for Devices
and Radiological Health.
Now
I would like to read the appointment of temporary voting status. Again pursuant to the authority granted
under the Medical Devices Advisory Committee Charter dated October 27, 1990,
and as amended August 18, 1999, I appoint the following individuals as voting
members of the Radiological Devices Panel for the meeting on February 3, 2004,
and they are as follows:
Brent
Blumenstein, Ph.D., Thomas Ferguson, M.D., Elizabeth A. Krupinski, Ph.D.,
Stephen Solomon, M.D., and David Stark, M.D.
For
the record, these individuals are special government employees and consultants
to this panel under the Medical Devices Advisory Committee. They have undergone the customary conflict
of interest review and have reviewed the material to be considered at this
meeting. Again, signed by David W.
Feigal for the Center for Devices and Radiological Health.
Finally,
the conflict of interest statement. The
following announcement addresses conflict of interest issues associated with
this meeting and is made part of the record to preclude even the appearance of
impropriety.
To determine if any conflict existed, the agency reviewed the submitted agenda for the meeting and all financial interests reported by the committee participants. The agency has no conflicts to report.
In the event that the discussions involve any other products or firms not already on the agenda for which an FDA participant has a financial interest, the participant should excuse himself or herself from such involvement and the exclusion will be noted for the record.
With
respect to all other participants we ask in the interest of fairness that all
persons making statements or presentations disclose any current or previous
financial involvement with any firm whose products they may wish to comment
upon.
Now, if there is anyone who has anything to discuss concerning these matters which I have just mentioned, please advise me now and we can leave the room to discuss them. Seeing none. The FDA seeks communications with industry and the clinical community in a number of different ways.
First,
the FDA welcomes and encourages pre-meetings with sponsors prior to all IDE and
PMA submissions. This affords the
sponsor an opportunity to discuss issues that could impact the review
process. Second, the FDA communicates
through the use of guidance documents.
Toward this end, the FDA develops two types of guidance documents for
manufacturers to follow when submitting a premarket application.
One
type is simply a summary of the information that has historically been
requested on devices that are well understood in order to determine substantial
equivalence.
The
second type of guidance document is one that develops as we learn about new
technology. FDA welcomes and encourages
the panel and industry to provide comments concerning our guidance documents. I would also like to remind you that the
meetings of the Radiological Devices Panel for the remainder of this year are
tentatively scheduled for May 18th, August 10th, and November 16th.
You
may wish to pencil these dates in on your calendar but please recognize that
these dates are tentative at this time.
I'll repeat them in case you didn't get those. May 18th, August 10th, and November 16th.
DR.
IBBOTT: Thank you, Mr. Doyle.
At
this point Nancy Brogdon, who is Director of the Division of Reproductive,
Abdominal, and Radiological Devices of the Office of Device Evaluation has a
few words she would like to say.
MS.
BROGDON: Thank you, Dr. Ibbott. We have three panel members whose terms just
expired on January 31st. They are not
present today but we wanted to recognize publicly their contributions to the
panel.
The
first is Mr. Ernest Stern. Mr. Stern
was the Chairman and CEO of Thales Components located in Totowa, New Jersey,
and he was the industry rep on the panel for the past four years. He is now retired from Thales.
Mr.
Stern effectively represented various industries served by this panel and used
his position on the panel to apprise other panel members of commercial
considerations that they should take into account when making recommendations
on the various applications under review.
Second
is Dr. Wendy Berg. Dr. Berg was the
Director of Breast Imaging in the Department of Radiology at University of
Maryland at Baltimore. She served on
the panel for four years as a voting member.
Dr. Berg brought to the panel a high degree of expertise in the field of mammography that was continually called upon as novel mammography-related devices were reviewed by the panel. In addition, when asked, she provided written reviews of complex device applications that the agency used as part of our in-house review process.
Third
is Dr. Harry Genant. Dr. Genant is
Professor of Medicine and Epidemiology, Orthopedics, and Surgery at the
University of California at San Francisco.
He also served as a voting member for four years. Dr. Genant brought to the panel a broad
spectrum of expertise with special emphasis on bone densitometry. His probing questions and insightful
comments on the pros and cons of the devices being considered were very helpful
to the agency as it reviewed the safety and effectiveness of new devices.
We thank all of these past panel members. Each will be sent a thank-you from the commissioner along with a mounted service plaque. Thank you.
DR.
IBBOTT: Thank you.
Dr.
Robert Phillips, the Chief of the Radiology Branch of the Office of Device
Evaluation will now give a brief update on the FDA radiology activities. Dr. Phillips.
DR.
PHILLIPS: Well, good morning
again. As you can see by the absence of
meetings between December '02 and now, we have not had a whole bunch of brand
new PMAs that we've brought to the panel.
In fact, in the last year we have not approved any PMAs.
However,
there have been some changes in the branch itself and we have brought four new people
on board as reviewers. These are Nancy
Wersto who comes to us from industry.
She's a radiological physicist and her interest area is in radiation
therapy products.
Then
we have Kish Chakrabarti who comes to us from the mammography side of the center. He is a physicist. His area of interest is mammography and imaging systems. Kish, are you here today? No.
Dr.
Barbara Shawback comes to us from outside.
She's a medical officer and her area is study and design in
rheumatology.
And
then we just had a new employee come on board, Sophie Packerel. She is a physicist who comes from the
University of Chicago and her area is CAD systems.
Those are the four people that have come on board, and that ends my talk. Thank you.
DR.
IBBOTT: Thank you. We'll now proceed with the first of two
half-hour open public hearing sessions for this meeting. The second half hour open public hearing
session will follow the panel discussion this afternoon.
Both
the Food and Drug Administration and the public believe in a transparent
process for information gathering and decision making. To ensure such transparency at the open
public hearing session of the advisory committee meeting, FDA believes that it
is important to understand the context of an individual's presentation.
For
this reason, FDA encourages you, the open public hearing speaker, at the
beginning of your written or oral statement to advise the committee of any
financial relationship that you may have with the sponsor, its product and, if
known, its direct competitors.
For
example, this financial information may include the sponsor's payment of your
travel, lodging, or other expenses in connection with your attendance at the
meeting. Likewise, FDA encourages you
at the beginning of your statement to advise the committee if you do not have
any such financial relationships. If
you choose not to address this issue of financial relationships at the
beginning of your statement, it will not preclude you from speaking.
No
individual has given advance notice of wishing to address the panel. If there is anyone now wishing to address
the panel, please identify yourselves at this time.
Seeing
none, I would like to remind public observers at this meeting that while this
portion of the meeting is open to public observation, public attendees may not
participate except at the specific request of the chair.
We can now close the first open public portion of the meeting. We will now, as I said, proceed with the open committee discussion portion of this meeting that has been called for the consideration of PMA P030012 for a computer-aided detection (CAD) device that assists a physician in identifying actionable, solid nodules in CT images of the lung.
The
first presentation will be by Dr. Robert F. Wagner of the FDA who will give an
overview of contemporary ROC methods such as may be used in measuring the
effectiveness of the CAD and other imaging devices.
The
sponsor, R2 Technology, Inc., will then state its case for the PMA and they
will be followed by the FDA with its review of the device. We will proceed now with Dr. Wagner's
presentation.
DR.
WAGNER: Cybersource as I am, let us see
if I can -- okay. Progress or
regress? Let's not start from the back. Marvelous.
Thank
you very much, Bob. I'm glad we planned
this together this way. Good morning to
the members of the panel, my colleagues and visitors today. I must acknowledge the fact that Dr. Bill
Sacks and I were awakened by our respective wives at our respective homes every
two hours this morning to see what the weather would be like to see if we would
be able to make it and what time we should really get up. We are working against that as our
background.
I
would also like to thank my colleagues for giving me this opportunity to
present this tutorial information on an overview of the contemporary ROC
methodology as it is used today in the field of medical imaging and computer
assisted devices.
Of course, most of us know what the letters stand for. ROC stands for receiver operating characteristic. This is the historic name that comes down to us from the field of radar in signal detection studies, where the problem is that you're looking at a field of clutter and the question is whether there is an airplane in that clutter.
In the field of psychology, in perception and eye-brain coordination studies, this subject is often called the relative operating characteristic. Some people are just weary of the R and just refer to this as the operating characteristic because that's really what it is.
Those
of us in the field of medical imaging have retained the name of receiver
operating characteristic. I think it is
because of our devotion to the classic literature from about 30 years or so ago
that we have just retained, the conservative people that we are. I see a person who has worked in this field
looking back at us.
Well,
now here is an outline of the talk. We
will spend a few minutes talking about efforts toward consensus development on
the present issues. Then we'll move
right into the ROC paradigm. We'll talk
about how it gets complicated by the problem of reader variability. How the multiple reader multiple case, or
so-called MRMC ROC paradigm, arose to address this problem of reader
variability.
Since
the ROC is a measurement, you have to have a meter stick of some kind so we'll
talk about measurement scales. There
will be a categorical scale, patient management or action scale and a
probability scale that we'll talk about.
Then
for today's submission, and submissions like it, there are additional
complications from the problem of location uncertainty, from the problem of not
really knowing the truth and dealing with uncertainty in the truth. Since the truth is uncertain, you really
don't know what effective number of samples you have.
When
you have a system that's going to cue readers about the possibility of lesions
on a case, there is a problem of reader vigilance that we will discuss. Finally, we'll give a little wrap-up which I
won't have to give because Bob Phillips just presented it for me.
Let's
start off now with efforts toward consensus development on the present
issues. The fact is that at the moment we do not have an explicit FDA guidance on how to submit and review submissions like the present one. There's been a lot of work going on, and there is deep background as to how we got here.
The
basic idea is how do you use the classic concepts of sensitivity, specificity,
and ROC analysis to assess performance of diagnostic imaging and
computer-assisted systems. Especially
since there are many new issues and levels of complexity that come to the fore
as more complex technologies emerge.
At
the moment you see there is really no software to do the assessment task of the
problem we have before us. That's why I would like to talk piecemeal about all the different pieces, what is known and what does exist at the moment, because the sponsor had to put together a creative combination of these many things. So, continuing on this little laundry list, I'll give you a historical laundry list of efforts toward consensus development on these present issues.
That's
RSNA. Most of you recognize that. That's the big Radiological Society of North
America meeting that's held every year in November in Chicago that makes this
weather look very mild today. Then
following RSNA by a few months is the big SPIE medical imaging meeting. At the SPIE meetings we generally handle the
more technical aspects of the issues that come up at the RSNA.
Then
there's a society that meets every two years called the Medical Image
Perception Society of which Elizabeth Krupinski on our panel has been president
for 40 years I think it has been. Elizabeth
is the President of the Medical Image Perception Society. We hold various workshops and literature
every two years.
In
all these meetings every few years we do note progress in this field. There is tremendous progress going on but
it's without a doubt still an evolving work in progress. We are still not at the holy grail point
that we would like to be at but a lot of progress has indeed been made.
At
the good old FDA at our center in CDRH here at the FDA. One of the methods that I'll be talking
about today is the so-called multiple reader multiple case, the MRMC scheme
which has already been used for several submissions.
It
was used to break the log jam that was holding back digital mammography from
the market place so the MRMC scheme that I'll talk about in a few minutes was
used there. It has been used for all
successful submissions of digital mammography PMAs to our center.
This
method that we'll talk about in a few moments has also been used for a
successful submission in the area of a computer aid for lung nodule detection
on chest x-ray film that is in some way analogous to the present submission but
it's just on plain film.
NCI, the National Cancer Institute, also has the Lung Image Database Consortium (LIDC) and workshops. This is an NCI-funded group of five universities, and the principal director of that project, I thought I saw him come in a moment ago. There he is, Larry Clarke.
There
are five universities that work as part of this consortium and they are seeking
consensus on a number of things, one of which is how to put together a database
of annotated films of the kind that you would use, annotated CT slice images of
the kind you would use to train and test a classifier in this field of
computer-aided detection and diagnosis in lung cancer screening for
nodules.
So
that project is about half-way through its five-year history. A good two years underway right now. They are also addressing consensus on the
many issues that you have to deal with when you want to deal with such a
product.
For
example, how do you keep score statistically?
Once you know how to keep score, then you can start to design the size
of a database. How do you outline the
nodules? How do you keep score when
there's a hit when there is just finite overlap between what is known of the
lesion and what the reader marks? We'll
talk about this in a few moments.
Now, two of us here in our center have been quite active members of this LIDC from the
beginning. Let me see if I have another
comment here. Yeah. The thing I would like to bring to your
attention this morning is that there has been a great amount of communication
among all these resources here. A
number of us in our center here are active members of the research community in
this field.
Many of us here and sitting just behind me have been very active in this area of applying these methods to several of the submissions in the area of imaging and computer-aided diagnosis. Several of us are very active members of Larry Clarke's group here.
What we have tried to do is see this as several quarters, four quarters if you will, of a quadrangle, all holding the windows open to the others so the people who come in to us from industry at any given moment will know what is the state of the art from academia, from our own center, and from the LIDC.
We
presented them all the papers, all the current drafts even, and made sure that
everyone knows what's on the other people's mind methodology wise that is
outside the area of anything that is proprietary. Anything that is not proprietary is all strictly methodology or
statistics. We have tried to keep these
communication channels as open as we could.
Here we go with the promised little tutorial on the fundamentals of the ROC
paradigm itself. The idea is, of course,
that you have two populations, one a population of actually diseased
people. You might think of these as
people with diabetes, for example, and a population of people who do not have
the disease.
You
would like to have a test that puts out a result something like a volt meter or
a biochemical assay or, in the case of a simple blood sugar test, this would
just be the blood sugar concentration.
You would love to have the world such that the two populations would be
separated and you could just drop a threshold in here and say these patients
are diseased and these patients can go home and not worry about it.
Now, in the field of medical imaging, as those of us who have done work in that field know, you don't have a simple meter or biochemical assay. What you get is a reader looking at about a million pixels of a picture and trying to get the features out of it and reduce that to what we call the subjective likelihood, the subjective judgment or likelihood that the case is diseased.
Now,
as I say, this is really not quite the way the diabetes blood sugar test works
but if you think of what I am about to tell you in that context for the next
few minutes, you won't be far off base.
It's not precise but it wouldn't be misleading.
So
here is what happens more typically.
The two populations are not separated.
The diseased population and the nondiseased population as far as their
test result is concerned have a very great overlap. The idea is now who do you send home and who do you send on for
further workup or people that you want to treat for a condition.
Those
of you who have seen this before, what I've just done I've taken these two and
dropped this population down so that you won't get mixed up with the
colors. Now we have the nondiseased
cases and the diseased cases on the same axis, the same relative position. Now in a practical situation with the
overlap, now we have to set ourselves a threshold.
If this is a blood sugar test, for example, you could set it at a blood sugar level of 150. If you do that, you'll pick up about half of the actual diabetic patients so we say we have a true positive fraction of 50 percent but you have to pay a price for this. You have about a 10 percent false positive fraction so here is this point, 50 percent true positive and roughly 10 percent false positive.
We
call this a less aggressive mind set and I think you'll see the reason for that
in just a moment. So if we get a little
bit more aggressive to try to pick up more patients in our sieve, we might set
the threshold down here at 100 instead of 150.
Now we get about 80 percent of the diabetic patients and now at the
price of about 20 percent false positive or 25 percent. Here I've put this point about 80 percent
and 25 percent.
Let's get even more aggressive and what I mean by that is I want to pick up more diseased patients in my sieve, the sieve being the test. If you set the threshold in the 90's, now we might get almost 95 percent of the actual diabetic patients in our sieve but then we have to pay the price of 50 percent of the nondiabetic patients picked up so now we have a 90 percent sensitivity and roughly a 50 percent false positive fraction.
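To make the threshold arithmetic concrete, here is a minimal Python sketch. It is an illustration added alongside the transcript, not something shown at the meeting. The blood sugar values and the `operating_point` helper are invented for the example, chosen only so that the resulting fractions land near the percentages quoted in the talk.

```python
# Illustrative sketch only: these blood sugar values are invented,
# not data from the presentation.

def operating_point(diseased, nondiseased, threshold):
    """Return (TPF, FPF) when every result >= threshold is called positive."""
    tpf = sum(x >= threshold for x in diseased) / len(diseased)
    fpf = sum(x >= threshold for x in nondiseased) / len(nondiseased)
    return tpf, fpf

# Hypothetical test results for ten diseased and ten nondiseased patients.
diseased = [85, 95, 110, 120, 135, 150, 160, 175, 190, 210]
nondiseased = [60, 70, 75, 80, 85, 90, 92, 98, 105, 155]

# Lowering the threshold is the "more aggressive" direction:
# both the true positive and false positive fractions go up.
for threshold in (150, 100, 90):
    tpf, fpf = operating_point(diseased, nondiseased, threshold)
    print(f"threshold {threshold}: TPF = {tpf:.2f}, FPF = {fpf:.2f}")
```

With these invented values, the thresholds 150, 100, and 90 give (TPF, FPF) pairs of (0.5, 0.1), (0.8, 0.2), and (0.9, 0.5), which shows the trade: each step toward a lower threshold catches more diseased patients at the cost of more false positives.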
Now,
you can take this to the extreme and we talk about this particular test all the
time and I think this might not work because the threshold now -- oh, it did
work. Okay. We can put the threshold all the way to the left and call
everybody to the right of this diseased and we would get all the diabetic
patients. There's a little mark right
up here. We would get also -- the price
we would pay is we would have to call everybody who is not a diabetic a
diseased patient here so we would generate that point.
I
think you can see and let your imagination go wild that you can certainly fill
in all these points. Don't blink,
anyone. I saw Dr. Bob Doyle blink there
so I have to go back and do that again.
Instead of working up more and more levels of aggressiveness, you could
back off. You could start off with
everybody at the sick point and then just back off, move the threshold the
other way and fill in the complete ROC curve.
You can see at this time of day I'm very easily amused.
Okay. Here is the overall picture now. This is the case of the schematic of, let us
say, blood sugar as a test for diabetes.
These are these two populations and the way they overlap and here is the
corresponding ROC curve with the level of aggressiveness increasing.
Now,
it can happen and, in fact, we've seen things like this in our center and you
see this in the laboratory once in a while, the two populations could fall
right on top of one another so that a test cannot actually discriminate between
the two conditions so what we've done here is just drop this population and
this population on top of each other.
Now if you generate an ROC curve the way I just showed you, you would
generate what we call the chance line or guessing line.
Toward
the other extreme you could have a test that separates the two populations very
well. In that case, as we move the
threshold across from less aggressive to more aggressive, we'll generate this
ROC curve. Now we have the guessing
line, we have the ROC curve corresponding to an almost typical clinical laboratory
test, and we have the ROC curve here for a very good test. We call this the level of increasing -- we
call this direction the direction of increasing reader skill or increasing
level of technology.
Now,
many people like to have a single summary measure of ROC curve performance and
what has traditionally been used is you take the area under the curve so the
area under this curve, say the diabetic discrimination test, is something in
the high 70s. Let's call it 78 percent
or something like that.
If you use the area under the curve as a summary measure of performance, in effect, remember if you think of calculus, in getting this area you're just integrating, you are effectively replacing the curve with a line that is flat at the level of that area.
In effect, what you've done is you have averaged the sensitivity, or true positive fraction, over all false positive fractions. In effect, if you use the area under the curve you are given the sensitivity averaged over all false positive fractions or sensitivity averaged over all specificities, specificity coming from the other direction.
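The "area as average sensitivity" reading can be checked numerically. The Python sketch below is an added illustration, not the method used in any submission: the operating points are invented, `area_under_curve` is a hypothetical helper, and the trapezoidal rule is assumed as the simplest way to integrate a curve given a handful of points.

```python
# Illustrative sketch only: the operating points are invented.

def area_under_curve(points):
    """Trapezoidal area under ROC points given as (FPF, TPF), sorted by FPF."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        # Width in FPF times the average TPF across that interval.
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Hypothetical ROC points and the guessing (chance) line.
curve = [(0.0, 0.0), (0.1, 0.5), (0.2, 0.8), (0.5, 0.9), (1.0, 1.0)]
chance_line = [(0.0, 0.0), (1.0, 1.0)]

print(f"area under curve: {area_under_curve(curve):.2f}")
print(f"area under chance line: {area_under_curve(chance_line):.2f}")
```

Because the false positive axis runs from 0 to 1, the area is numerically the same as the mean true positive fraction over that axis. For these invented points it comes to about 0.82, while the chance line gives 0.5.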
Well,
I hope it gets interesting now. That
was the easy part. That's the
idea. Let's see what really happens in
the real world. In the real world in
the last decade those of us who work in this field have been made acutely aware
of the complication of reader variability.
I'm
going to show you some very famous data.
I think Emily Conant knows this like the back of her hand from having
worked with Craig Beam. For those of
you who have not seen this before, I have to give a little build up to
this.
This
is a set of data from Beam, Layde and Sullivan that I'm going to show you in
which they studied 108 mammographers randomly chosen from around the United
States. The mammographers in this study
were given a set of mammograms. They
were asked to set their threshold for action.
Remember
when we were talking about this ROC paradigm we were moving a threshold and we
wanted to set it at some place and the question is in a clinical laboratory
test you could just dial that in somehow.
How do you do it in medical imaging?
You don't have a dial.
You
have to deal with the human reader and they were asked to set their threshold
between their sense of the boundary on the BIRADS scale, Breast Imaging and
Reporting and -- Reporting or Recording?
Anyway, Reporting and Data System.
That's the American College of Radiology Scale that is used for managing
patients in mammography.
These
readers were asked to set their sense of the boundary between category 3, which
is generally six-month follow-up recommendation, and category 4 which is highly
suspicious and recommend consideration of biopsy. I'm sure I'm garbling that but you get the general idea. I wasn't asked to leave the room so I
couldn't be too far off there.
Here's
what happened. This is a true positive
fraction versus a false positive fraction for 108 readers. There are 108 points here. Each one of these people thinks that they
had set the boundary between category 3 and category 4.
If
you try to do public policy based on category 3 and category 4 and thinking
that people have optimized that, the optimum is very broad. People have not figured out how to optimize
that. That's a big problem.
Let's
look at this reader. This is one out of
108 people. This person has a
sensitivity of 70 percent and a false positive rate of about 25 percent. Now, this person thinks they are being as
aggressive as they should be in the context but this person is more aggressive
than this one, this reader is more aggressive than this one, this reader is the
most aggressive on this bottom curve here, and these readers are less
aggressive.
Now,
as we go in the other direction, we now see the variability due to the range of
reader skill. We can say that these
readers have a greater skill at this task than these readers and these readers
have the greatest skill yet.
At
any level of reader skill we have different readers thinking that they have
optimally set their threshold. This is
a tremendous range of reader variability.
There are 108 mammographers represented on this graph. This is classic work from Craig Beam, Peter
Layde and Dan Sullivan.
What
have I just told you? There is no
unique ROC operating point. Each one of
these people is set to be at a certain operating point. There is no unique ROC operating point. There is not even a unique ROC curve. There is only a band or region of ROCs as
you can see. There is a very broad
band.
I
hope I've convinced you all now that this gets to be a more complex issue. In particular, here is the question. Suppose we have two technologies that
manifest themselves in readers' hands with this level of variability?
How
do you compare those two technologies?
That's the issue before us with a whole class of problems that we've
been discussing over the last few years and we'll be seeing more of over the
next few years. How do you do it?
This
is not an isolated example. People have
gotten used to this and said this is really an extreme example. This is not the most extreme example we've
ever seen.
In
our group we have actually looked at over a dozen real world publicly available
data sets and the example I just showed you is sort of in the middle. Sometimes things are a little bit
better. Sometimes they are even much
worse than what I just showed you.
The following is an example from
Dr. Jim Potchen from plain chest x-ray picking up the disease on chest
films. These are ROC curves. Dr. Potchen looked at over 100 radiologists
and 71 residents. He averaged the score
card ROC wise of his top 20 radiologists.
Here they are.
Then
he presents here the average ROC curve for his radiology residents. There are 71 of them here representing this
average line. The bottom 20
radiologists in the study performed here.
The range that we see here is comparable to what we saw in the Beam, et
al. study for mammography. So this is
the real world.
Well,
you can imagine that if you wanted to keep score under that setting you have to
use a lot of readers and a lot of cases.
The paradigm that has emerged to address this is, thus, called, almost
eponymously, I guess, if I could pronounce that word, the multiple reader
multiple case, or MRMC paradigm.
There
are a lot of designs for this. There
are many ways to do it. Today we will
just talk about something that is called the fully -- oh, I forgot my prop. We'll talk about the fully-crossed
design. The fully-crossed design is one
of many but it is the most efficient in some way so we will talk about it.
You
match cases across modalities and you match readers across modalities. If I can pull this off. I'm used to having leaves of paper
here. Okay. You have a bunch of patients who have been imaged with modality A
here. The same patients imaged with
modality B so we say that the cases are matched across modalities.
If
we were working with computer-aided diagnosis, modality A would be readers
reading without the computer aid and modality B would be readers with the use
of the computer aid. There is a stack
of images here. Same patients.
We
recruit a panel of radiologists, something like 15 of you people here. All of you read every patient case in both
modalities. What we have then is we
have the cases matched across modalities and we have the readers matched across
modalities.
This
design has the most statistical power for a given number of readers and for a
given number of cases with verified truth.
Thus, we say it's the least demanding of these resources. Around here in Rockville we speak of this as
the least burdensome paradigm because you probably heard in previous meetings
that the FDA has been commissioned by Congress to enable sponsors to seek and
to find, if possible, the least burdensome path to the marketplace through the
review process.
So
what we've done is we've always called this to the attention of incoming
sponsors that this design is most powerful.
You can use alternative designs and you can come close sometimes to the
efficiency of this scheme but this is the most powerful in terms of the ground
rules I have on the slide right there.
Well,
if you are familiar with the literature in this field, you will say, you know,
this is no modern big deal. This stuff
has been known for a good 20 years or so.
If you read the classic book by Swets and Pickett the whole idea is laid
out there. The trouble is there was no
practical way to implement this scheme 20 years ago until people started to
understand what's called the statistical approach of resampling strategies.
I
probably shouldn't spend any time on the past history but the fact of the
matter is in past years before they realized about resampling they just started
to stratify the data and then you give up a lot of statistical power. In modern times in the last 10 years people
realized if you use the statistical resampling, you can use the data over and
over again in a well-pedigreed way and get statistically valid inputs.
So
the two most famous resampling schemes are called the statistical jackknife or
the statistical bootstrap. The big
breakthrough came in this field in 1992.
This is the classic so-called DBM paper. That's Donald Dorfman of happy memory whom we lost to our
community very sadly two years ago. His
colleague, Kevin Berbaum, and the well-known Charles Metz at the University of
Chicago.
This
paper broke the log jam in this field.
They suggested using the statistical jackknife in combination with
classical ANOVA and the statistical jackknife just being a leave-one-out method
where you leave Mrs. Jones out one time and you leave Mrs. Smith out the next
time and you generate a lot of data sets that way, submit it to classical
ANOVA, and you can do your inference about the difference between these two
competing technologies.
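The leave-one-out step can be sketched as follows. This is only a hedged illustration in Python for a single reader in a single modality: the function names, the toy scores, and the use of the Mann-Whitney form of the area are assumptions of the sketch, and the ANOVA stage of the DBM method is not shown.

```python
def auc_mann_whitney(neg, pos):
    """Empirical ROC area: fraction of (diseased, non-diseased) pairs
    ranked correctly, with ties counted as one half."""
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def jackknife_pseudovalues(neg, pos):
    """Leave each case out once and form the usual jackknife pseudovalue
    n * full - (n - 1) * leave_one_out for that case."""
    cases = [(s, 0) for s in neg] + [(s, 1) for s in pos]
    n = len(cases)
    full = auc_mann_whitney(neg, pos)
    pseudo = []
    for k in range(n):
        rest = cases[:k] + cases[k + 1:]            # leave case k out
        neg_k = [s for s, t in rest if t == 0]
        pos_k = [s for s, t in rest if t == 1]
        pseudo.append(n * full - (n - 1) * auc_mann_whitney(neg_k, pos_k))
    return pseudo


# Toy ratings for one reader in one modality (hypothetical).
neg_scores = [1, 2, 3, 4]
pos_scores = [3, 5, 6, 7]
full_auc = auc_mann_whitney(neg_scores, pos_scores)
pseudovalues = jackknife_pseudovalues(neg_scores, pos_scores)
```

In a full analysis one table of such pseudovalues would be produced for every reader in every modality, and those tables are what would be submitted to the ANOVA.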
Well,
it turns out this is a little bit more difficult to explain in any more detail
than that. But the bootstrap method is
very trivial to explain in some detail so I'm going to ask you to sit through
that with me for the next minute or so.
The
idea with the statistical bootstrap is that we are going to -- the bootstrap
itself means you are going to resample from a set of data points with
replacement. I'll show you that in a
moment. We are going to bootstrap the
experiment of interest. We'll draw
random readers, random cases, and then carry out the experiment of interest
many times.
Here
is an example of some possible bootstrap samples from a set of -- suppose there
are 15 of you here. We might have a set
of numbers one through 15. We start
drawing them with replacement. If you
wait long enough, you might get a list that has one, two, three, four, five,
six, seven -- you have to wait a long time before that happens.
In
the meantime you get more random looking samples like this. When I was thinking about this, you know, if
you did this with letters this reminds you of that proverbial experiment where
they have the monkeys trying to type out the soliloquy of Polonius or
something like that. It's going to
happen but you may have to wait a long time.
Instead
what you do is you get random samples like this. The number one never showed up in this group. The number two showed up once. Number three showed up a couple times. Number 14 showed up three times and so on. You randomly sample a number and then put it
back. Write it down. This can go on for an astronomical number of
times.
Then
another example, the number one shows up, number 15 shows up and so on. You get a lot of these, a very great number
of these but you don't have time to do them all so, in practice, people use
about 1,000. It depends on the
complexity of the problem.
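Drawing with replacement from the numbers one through 15 can be sketched in Python as below. The seed is arbitrary and the helper name is hypothetical. The two probabilities put a number on "you have to wait a long time": the chance that a single bootstrap draw contains every number exactly once, and the chance that it comes out in the exact order one through 15.

```python
import math
import random


def bootstrap_sample(items, rng):
    """Draw len(items) values from items with replacement."""
    return [rng.choice(items) for _ in items]


rng = random.Random(2004)           # arbitrary seed, for repeatability only
members = list(range(1, 16))        # the numbers one through 15
sample = bootstrap_sample(members, rng)

# Chance that one draw of 15 contains each number exactly once, in any order.
p_all_distinct = math.factorial(15) / 15 ** 15

# Chance that one draw is exactly 1, 2, ..., 15 in that order.
p_in_order = 1 / 15 ** 15
```

Typical samples repeat some numbers and omit others, as in the examples on the slide.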
So
you draw about 1,000 bootstraps of readers and cases. The number of cases you draw is comparable
to the experiment you are trying to mock up.
Then what you do is, with that bootstrap sample of readers and that random case sample,
you have all the readers in their bootstrap sample read all the cases in both
modalities in that bootstrap sample, carry out the experiment of interest so
you would get the performance measure.
That's
called area under the ROC curve for the one.
You get that number for the other.
You take the difference. You do
that 1,000 times and then you put them in order from the lowest difference to
the highest. Then it's very easy to get
the mean and then you can take out the central 95 percent chunk and that would
give you a 95 percent confidence interval.
That's a simple way to explain the story.
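The recipe above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the analysis used in any submission: the data layout (`ratings[reader][case]`), the function names, and the toy Gaussian ratings are all invented for the example, and readers and cases are simply redrawn together on every replicate.

```python
import random


def auc(scores, truth):
    """Empirical area under the ROC curve (ties count one half)."""
    pos = [s for s, t in zip(scores, truth) if t == 1]
    neg = [s for s, t in zip(scores, truth) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(pos) * len(neg))


def mean_auc_difference(ratings_a, ratings_b, truth, readers, cases):
    """Reader-averaged AUC(B) minus AUC(A) on the drawn readers and cases."""
    drawn_truth = [truth[c] for c in cases]
    total = 0.0
    for r in readers:
        auc_a = auc([ratings_a[r][c] for c in cases], drawn_truth)
        auc_b = auc([ratings_b[r][c] for c in cases], drawn_truth)
        total += auc_b - auc_a
    return total / len(readers)


def bootstrap_difference(ratings_a, ratings_b, truth, n_boot=1000, seed=0):
    """Mean and central 95 percent interval of the bootstrapped difference."""
    rng = random.Random(seed)
    n_readers, n_cases = len(ratings_a), len(truth)
    diffs = []
    while len(diffs) < n_boot:
        readers = [rng.randrange(n_readers) for _ in range(n_readers)]
        cases = [rng.randrange(n_cases) for _ in range(n_cases)]
        if len({truth[c] for c in cases}) < 2:
            continue  # a replicate needs both diseased and non-diseased cases
        diffs.append(mean_auc_difference(ratings_a, ratings_b, truth, readers, cases))
    diffs.sort()                      # lowest difference to highest
    k = int(0.025 * n_boot)           # number of replicates trimmed from each tail
    return sum(diffs) / n_boot, diffs[k], diffs[n_boot - 1 - k]


# Toy fully-crossed data: 4 readers, 30 cases, both modalities (hypothetical).
toy = random.Random(1)
truth = [i % 2 for i in range(30)]
ratings_a = [[toy.gauss(1.0 * t, 1.0) for t in truth] for _ in range(4)]
ratings_b = [[toy.gauss(1.5 * t, 1.0) for t in truth] for _ in range(4)]
mean_diff, lower, upper = bootstrap_difference(ratings_a, ratings_b, truth)
```

The interval `(lower, upper)` is the central 95 percent of the sorted differences. Because both readers and cases are redrawn, its width reflects both sources of variability.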
In
the jackknife plus ANOVA it's a little bit more elaborate than that but you can
actually think of the jackknife as the first order of approximation to the
bootstrap. So these two approaches are
sort of in the same spirit but one is completely nonparametric and the other is
-- the classical ANOVA is heavily based on the multi-variate normal so it's
highly parametric.
As
I just said, you obtain a mean performance over readers and cases but it's much
more interesting. The mean is always
easy to get no matter how you approach a problem. Well, it can be tricky.
But the big thing you want is error bars that account for both the
variability of readers and cases.
You
know, in the DBM paper they quoted a quote that has become very famous from Jim
Hanley. Many of us know Jim Hanley from
McGill University in Montreal.
Jim
Hanley says, "When you report the results of your experiment to your
readership, it's not so important just to report the mean performance or the
results you got in the very experiment at hand because, after all, this
experiment will never be done again. No
one will ever do this particular experiment.
What
readers want is they want a sense of the range of performance to be expected if
this experiment could be repeated many, many times drawing randomly, one hopes,
from the same population from which the current samples were drawn."  So that is the idea.
You
ought to be able to report to your readership not just a p-value because we all
know it takes a p-value to get a paper published in a medical journal. You want to actually be able to explain the
range of variability you expect to see if this experiment is done over and over
again. That's what you get when you
keep score this way.
Okay. We said that the ROC curve is a
measurement. Above all else it is a
measurement so you have to think about a measurement science. You have to think about the scale you'd be
using for reporting and doing the measurements.
Historically
-- I should just stop for moment to tell those of you who were not around in
the late '70s and early '80s that the National Cancer Institute gave a contract
to people in Cambridge, Massachusetts, Bolt, Beranek and Newman, where John
Swets, David Getty, and Ronald Pickett and colleagues were working to develop a
protocol for how to do ROC experiments and how to keep score and how to do the
data analysis.
That
is published in a paper in Science in 1979.
The book came out in 1982 and many of us have that book on our shelf. The protocol used at that time was so-called
historic ordered category scales. There
was no does this patient go to biopsy or not.
You just looked at the case and you said this patient -- you use five or
six categories.
One
patient you might say this patient almost definitely does not have
disease. There are several intermediate
levels. The patient probably does not
have disease, might have disease, probably does have disease, or almost
definitely has the disease. That scheme
of five or six categories was almost exclusively used and there was software
for analyzing that for 25 years.
I'm
being a little defensive because people may say why do people use that. That was approved by -- the experts in the
field put it out and it was supported by NCI.
There was a lot of science underneath it and today people say, "Why
did people do that?" Well, that's
what they had.
In
the last 10 years in the field of mammography we have this BIRADS scale which
is what we call an action item or a patient management oriented scale. In that idea you don't categorize the
data. People think of the BIRADS scheme
as a categorization scheme. Let's just
put that to the side for a moment.
We'll
just think of using the BIRADS scale to dichotomize patients. We'll say these patients will not be
followed up at all versus these patients who will get a six-month
follow-up. That's one way to
dichotomize the data.
Another
way to dichotomize the data is to say we will try to make the break as we did
with the Beam, et al. data. We'll make
the cut in this dichotomization between those patients who would get six-month
follow-up versus those who we think should be biopsied right now. So this is a patient management
scheme. This is just a dichotomization scheme.
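The two ways of making the cut can be sketched in Python as below. The eight cases and their categories are invented for illustration, and `operating_point` is a hypothetical helper name. Each choice of cut yields one sensitivity and false positive fraction pair, that is, one point in ROC space.

```python
def operating_point(categories, truth, cut):
    """Call a case positive when its category is at or above `cut`;
    return (true positive fraction, false positive fraction)."""
    tp = sum(1 for c, t in zip(categories, truth) if t == 1 and c >= cut)
    fp = sum(1 for c, t in zip(categories, truth) if t == 0 and c >= cut)
    n_pos = sum(truth)
    n_neg = len(truth) - n_pos
    return tp / n_pos, fp / n_neg


# Hypothetical categories assigned by one reader to eight cases.
categories = [2, 3, 4, 5, 1, 3, 4, 2]
truth      = [0, 1, 1, 1, 0, 0, 0, 0]   # 1 = disease present

# Cut between six-month follow-up (3) and biopsy (4), as in the Beam et al. data.
tpf_biopsy, fpf_biopsy = operating_point(categories, truth, cut=4)

# Cut between no follow-up and six-month follow-up.
tpf_followup, fpf_followup = operating_point(categories, truth, cut=3)
```

Moving the cut down raises both fractions together, which is the sense in which a less strict threshold is more aggressive.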
About
10 years ago people realized for very technical reasons that it would be useful
to use what they called the continuous probability rating scale, or
quasi-continuous. It's a hundred-point
scale, one, two, three, four, five, but you wouldn't get 1.5 for example so
they call it quasi-continuous, hundred-point scale.
Nobody
expects anybody literally to use probability 13 or probability 17 or anything,
but the idea is to scale your probability or your sense of the likelihood of
disease along a probability scale. That
seems natural to use something if it's a probability on a scale from zero to
100.
So
this is the most popular scheme that's been used to generate ROC data in the
last five or seven years or so. This
felt strange to many people, especially people who are used to using the
categorical scale. But I've talked to a
lot of people about this and very few people outside of the mammographers have
read the BIRADS document.
If
you go through the BIRADS document and you go to category four, which is
suspicious and recommend for biopsy, it actually tells you there that the
radiologist should tell the referring physician their sense of the probability
of cancer. There is actually a culture
already existing in which you can use this kind of patient management action
items like a BIRADS three, four, five, and at the same time give a continuous
probability of disease rating.
I
see some puzzled looks. I'm trying to
figure out just what I should comment on next.
So to make a long story short then, this continuous probability rating
scale has been used for most ROC curves generated in this community for the
last eight or so years. In the breast
imaging --
Oh,
I remember what I was going to say.
That's why I'm stalling here. In
the breast imaging community many people, it may not be more than half, but
people do use this BIRADS scale. But
it's really important to realize that this BIRADS scale was not generated --
was not designed to generate ROC curves.
People who have tried to
use a five-category scale in this scheme and the BIRADS scale at the same time
have met with a lot of confusion. It
does not work out very well and I see somebody who may have witnessed people
having that experience.
Well,
I gave a lot of background here because I would like people to understand that
this is a real issue for the community you would really like to have because
every clinician says, "I want to know the patient management and I want to
know the score card of the patient management." Every clinician you talk to, that's what they want.
Everybody
who measures ROC curves says, "I want to measure it as finely as I
can. I want to use this
quasi-continuous reporting scale."
The best of both worlds would be to get both the quasi-continuous rating
to get the ROC curve and the patient management action item to get a single
sensitivity specificity point.
I'll
get a little dramatic for a moment here.
I've talked to many friends. I'm
very familiar with the literature. I
could find one example in all the literature at the moment that's in print
where both of these were done. I could
only find one example of where the best of both worlds was done. This
is a paper on classification, what Bill Sacks and others called CADx using a
computer not to detect but to classify lesions on a film that are already
known. I know that I have a stack of
films here that have microcalcification clusters on them. My
task is just to say which ones are benign and which ones are malignant. That's the task. But I'm going to keep score ROC wise and I'm also going to keep
score patient management wise. I'll
show you what they got in a moment.
These
authors -- Yulei Jiang, I guess, was expected here today from a group in
Chicago under Kunio Doi. They studied
this test and they had 10 readers and they studied the complete ROC
curves. They studied all the summary
measures and they also studied the patient management or the action item,
sensitivity specificity point.
Here
are the results. Here is the average of
10 ROC curves for 10 readers trying to make this dichotomy, trying to make this
distinction between benign and malignant lesions. Here is the ROC curve in the unaided by computer condition. This curve was generated using the
hundred-point probability scale.
This
is the curve in the computer-aided condition, again generated by the
hundred-point probability scale. This
point is the mean sensitivity specificity point generated just by making the
threshold, dichotomizing the data.
These patients benign, these patients malignant. This is a single dichotomy patient action
point in the unaided condition.
That's
the same point in the aided condition.
You would love these points to fall on top of the curves and, for all
statistical purposes, they do because remember the mean -- I have to remind you
of this famous joke that we use around here.
There was a six-foot statistician.
You know what happened to this fellow, right? He drowned while wading in a stream that had an average height of
five feet. You have to know about the
variability.
This
is not about means, okay? This curve
moves all over the place and this curve moves all over the place in
practice. This is the average of 10. Same thing. This point moves all over the place as does this. For all practical purposes this is a great
experiment. This point falls on that
curve.
Well,
it's the only case I could find in the literature. How come you don't see more of this? When you live with these people that I live with, it's a great
crowd of people and the clinicians say, "I want the action
point." I say, "The committee
wants to measure the ROC curve."
Everybody says, "Let's do both." We are trying to come to that position. Why don't we see more of it?
Well,
the area under the ROC curve, remember, you have your ROC curve and you've got
the area under it. You are essentially
getting the sensitivity averaged over all specificities. Right?
You're averaging. You're going
to average away a lot of noise.
The
variation -- the variance of the area under the ROC curve -- oh, my
goodness. The most important number of
my entire talk is missing. The variance
of the area under the ROC curve is the binomial variance over two. There's a two here, a very important
two. Those of you who know me know I'm
an expert in factors of two. It's the
binomial variance over two.
What's
the binomial variance? Well, I thought
if you had a group as we have here today, about a third of you -- maybe 40 percent
of you as I look around -- know what the binomial variance is. Suppose we had this meeting next week and we
drew from the same population from which you all came.
The
next time we did it we might get 32 percent of you might know what the binomial
variance is. If we do it three weeks
from now and join another group in, maybe 49 percent or 52 percent of you will
know what the binomial variance is.
What
we've just done is what Bill Sacks refers to.
We just made a self-referential example here. The binomial variance is the variance I would experience if I did
the experiment I just discussed with you.
The area under the ROC curve experiences only half of that variance.
If
I studied sensitivity by itself and was able to tell you ahead of time what the
specificity was so you didn't have to estimate the specificity, the variance of
sensitivity is the entire binomial variance.
In
the real world you have to estimate both the specificity and the sensitivity so
the uncertainty in the specificity propagates into the sensitivity and so into
the variance for that. So if you wanted
to estimate the uncertainty in that action item that I showed, that point, the
circle or the triangle in the previous data, if you were to estimate that, you
would have to live with an uncertainty that was greater than the binomial
variance.
If
you use area under the ROC curve you get a great reduction. You get the binomial variance over that
famous factor two. This is all
approximate but it works out very well with very practical examples.
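For concreteness, the arithmetic can be sketched in Python. The fraction of 40 percent comes from the example above, but the group size of 100 is an assumption made only for this illustration, and the factor of two is the approximate rule of thumb from the talk rather than an exact result.

```python
import math


def binomial_variance(p, n):
    """Variance of an observed fraction when the true fraction is p
    and n independent people (or cases) are sampled."""
    return p * (1.0 - p) / n


p = 0.40    # fraction from the example in the talk
n = 100     # assumed group size, for illustration only

var_fraction = binomial_variance(p, n)        # full binomial variance
sd_fraction = math.sqrt(var_fraction)         # about five percentage points

# Rule of thumb from the talk: the ROC area sees roughly half of this.
var_area_rule_of_thumb = var_fraction / 2.0
sd_ratio = math.sqrt(var_area_rule_of_thumb) / sd_fraction   # 1 / sqrt(2)
```

Halving the variance shrinks the standard deviation by a factor of about 0.71, so error bars of the same width need roughly half as many cases.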
So
what we say is that the variance of the ROC area is the least burdensome
approach to putting quantification into this problem. I remind you that is something that we are supposed to enable
sponsors to appreciate.
Another
thing that we realize in many discussions with academics and within our house
and with the sponsors and so on is if you want to live in both of these worlds,
that requires consistent conventions.
If you want to be able to either get categorical reporting and the
BIRADS reporting, that's a lot of work to try to get people to be consistent
that way. People have dropped the
categorical scheme for all practical purposes.
Even
if you want people to be consistent between BIRADS and the quasi-continuous
scale, that's difficult. We've seen a
lot of data in our own group and from some of the universities. When you train people, this can be done but
not everybody is trainable right away to be able to do this so it's an
issue. To get data in both worlds then,
it's going to require some convention development.
My
final point here says this may require consensus bodies to promote the
practice. We would hope that the
American College of Radiology, some of the other professional societies, and
even the fact that this is of interest to NCI and the FDA, we would hope that
some of this would encourage people to try to do measurements so that we could get
both the point and the curve. Then I
think everybody would be happy.
Well,
this brings us to a little interim here.
Some of you are very familiar with the next few slides. These are what we call the most famous
slides in the ROC archives. Those of you
who know Charles Metz have seen this many times and his followers will use
these many times. Charles started using
these slides over 25 years ago.
Here's
the classic question. You have two
diagnostic modalities, modality A and modality B. Which one is better? You
look at them and you have people doing public policy thinking in their
minds. Which one of those is
better? You start calculating something
you've seen in a statistical decision theory book.
But
the way this is approached in the field of medical imaging is the
following. There are several
possibilities here. Those two points
may lie on completely different ROC curves.
In that case we say that modality B is unambiguously better than
modality A because at any false positive fraction the sensitivity of A is lower
than that of B.
There's
a different scenario. The two points
could fall on the same ROC curve. Then
you have these same people scratching their heads and saying, "Where
should they really operate?" Well,
in principle we believe that readers can move their level of
aggressiveness. Not on any fine scale
but we know that they adjust depending on the risk group they're seeing. Some people do move around on their ROC
curve so in principle these two points represent equivalent modalities.
As
I say, people will for years say, "There must be one of these operating
points that's better than the other."
Remember when I showed you that data from Craig Beam you saw people at
every level of aggressiveness. Each one
of these people in some way thinks they've optimized.
This
is what we call the expected utility function or the expected value function. Every one of those people thinks in some way
they have found the optimal operating point but they disagree with each other
so this is another reason for using the ROC method.
There's
yet another scenario. ROC curves may
actually fall in such a way that modality A is everywhere higher than modality
B. For the same reasons we would say
that modality A is the superior modality in this scheme. Three different possibilities. B higher, equivalent, A higher.
This
is the motivation for trying to get a finer measurement on this hundred-point
scale. Then if the clinicians really
want to know about the actual operating point, that is another step and we are
all for that if you can coordinate the measurements but it's very difficult to
do that.
Well,
I'm sure many of you are sitting there thinking what about if the ROC curves
cross? We know if that happens the
situation enters the world of ambiguity.
Then you can no longer necessarily use the total area under the curve as
a sufficient summary measure of performance.
Other
summary measures may be necessary.
There are any number of other ways to make a summary measure of curves
that cross. You can use partial areas. There's actually software even for that
today. Or you can use parametric
summaries of the curve and there are several other ways to look at this.
If
you decided you're going to use other summary measures, if you anticipate this
possibility, the study protocol is expected to address this because if you wait
until after the study and say, "I was going to use the partial area in
this region," we have a name for that.
That's called data dredging. You
have to build that into your study up front.
Otherwise, when people do not expect to see the curves cross in any real
way, they tend to use the area under the curve as a summary measure.
Well,
for submissions as are coming before us in the area of computer-aided detection
schemes, there is a question of how do you keep score for the location
scored. I must remind you this is
shocking to people who have never heard this before.
The
basic ROC paradigm is an assessment of the decision making at the level of the
patient. You don't say, "Where
does the patient have diabetes?"
You say, "This patient has diabetes." Or you say, "This patient has
TB." You don't say, "The TB
is here." You say, "This
patient has TB." So the score
keeping until recent years has been based on decision making at the level of
the patient.
In
more complex imaging you want to do the assessment of the decision making at a
finer level. You would like to assess
how well the localization was done.
Well, there are little errors there that come across funny. If you do localization, of course, you will
be providing the experimenter with more information.
If
you have more information in the study, you get more statistical power. The trouble is to do all this adds
complexity to the experiment. I would
just like to review for you a couple of the highlights of the issues that have
come up when you try to do location specific ROC analysis, so-called LROC for
location specific ROC analysis.
The
biggest problem is that if you want to keep score of a hit, the measurement of
the hit depends on the criterion you use for localization. If the lesion really is here somehow and you
draw your circle and you say the lesion is here, there is a certain amount of
overlap and you would be surprised to see how sensitive the measurements are to
that degree of overlap to the criterion you use for that. That's a real issue. There's no unique result. There's no unique LROC curve at the moment
for the state of the field.
There
are a couple of subtle points here that are very technical. I would just like to mention one of
them. People have studied this for 20
or 30 years. For a certain class of
problems if you study the ROC and if you study location specific ROC, the
curves in the summary figures track with each other monotonically.
If
the one goes up, the other goes up. If
one goes down, the other comes down.
They might change at different rates but they go together
monotonically. So people haven't felt
bad about just using ROC analysis instead of LROC analysis if they were willing
to invest the extra resources because you will lose statistical power.
But
people have been willing not to go to this level of complexity and to go to
that higher level of complexity requires more elaborate models, more elaborate
assumptions. These are still debated
until today. You can see in the SPIE
handbooks that people are debating this back and forth, Charles Metz and Dave
Chakraborty.
But
I must mention that a lot of progress has been made in this field. The bottom line of this slide if you haven't
followed any of this is that essentially there's a lack of validated software
for analysis of such experiments. Now,
Elizabeth and the MIPS, Medical Image Perception Society, website actually has
software for several of these approaches.
The
writers of that software feel very good about the state of their software but
there continues to be discussions in the field about how far have they
validated. Have they checked whether
the alpha level and the reject rate are agreeing and what is the power and so
on.
The
debate goes on but I expect people to be coming down from Pittsburgh any day or
any week now saying, "You've got to start using this because it's been
validated." That's the state of
the knowledge right now. There is
software there but there are still people discussing the condition of the
validation of the software.
So
a few years ago to find some kind of a happy medium Nancy Obuchowski of the
Cleveland Clinic and colleagues said, "Why don't we just simplify the
task? Why don't we do something called
region of interest location specific ROC analysis. Let's only require localization to within a quadrant so you don't
have to say there's a lesion here or a lesion here. You just have to say I see a nodule in this quadrant. You require localization only up to a
quadrant."
Similarly
for the other quadrants you could say, "Why didn't we do it for octants or
16 fold or 32 fold?" Well, you
could. This is sort of the entry level,
this problem, but as you add number of possibilities, then you get more into
questions of overlap and ambiguity so people have decided, "Let's start at
the level of just quadrants." As I
say, sort of the entry into this problem.
Continuing
on discussing this so-called ROI approach, the location specific ROC analysis,
right away Dave Chakraborty jumps into the literature and says, "Wait a
minute. This doesn't correspond at all
to the clinical task." People have
debated that back and forth whether it does or not.
But
from the other wing of this Greek chorus comes the methodologist to say,
"Yeah, it may not be quite right but it's really straightforward to
account for correlations without getting into these assumptions that people
have debated for a while."
What
do I mean by that? Here are four
quadrants, the right side of the lung, the left side, the top, and the bottom
if you will. Whatever is going on in
this quadrant is expected to be correlated with what is going on in this
quadrant, or at least could be, and similarly across the quadrants.
After
all, this is the same person, has the same genes, experienced the same
environment, and had a picture taken with the same imaging system. One has to allow for the possibility that
these quadrants are correlated. The
nice thing is that Carolyn Rutter and others came by another year later and
said, "Wait a minute.
All
you have to do to preserve those correlations is when you resample you resample
on a patient basis. You can't start
resampling quadrants, this one from this person and this one from that
person. You have to resample on a
patient basis so if I sample you, all four quadrants from you come into that
sample and so on." When you do this, you
actually preserve the correlation structure and you are said to be using the
patient as the independent statistical unit here.
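The patient-level resampling described above can be sketched in a few lines of Python. This is a minimal illustration only: the data layout (one tuple of four quadrant ratings per patient), the example ratings, and the name `bootstrap_patients` are assumptions made for the sketch, not anything taken from the presentation or from a particular software package.

```python
import random


def bootstrap_patients(scores_by_patient, rng):
    """Draw one bootstrap replicate by resampling whole patients.

    Patients are drawn with replacement; every drawn patient brings
    all four of their quadrant ratings along, so any correlation
    between quadrants of the same patient is preserved.
    """
    patient_ids = list(scores_by_patient)
    drawn = rng.choices(patient_ids, k=len(patient_ids))
    return [(pid, scores_by_patient[pid]) for pid in drawn]


# Hypothetical reader ratings, one value per lung quadrant.
scores_by_patient = {
    "p1": (0.9, 0.1, 0.2, 0.1),
    "p2": (0.2, 0.3, 0.8, 0.7),
    "p3": (0.1, 0.1, 0.1, 0.2),
}

rng = random.Random(0)  # fixed seed so the sketch is repeatable
replicate = bootstrap_patients(scores_by_patient, rng)
```

Resampling individual quadrants instead would mix quadrants from different patients and break the within-patient correlation structure.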
Well,
that's all I'll be saying about location specific score keeping and now to one
of the real problematic issues in the submissions as we'll be seeing in the
next couple years. This is the problem
of uncertainty of truth state. There's
a classic paper that all of us have almost memorized by now from Revesz, Kundel,
and Bonitatibus 20 years ago.
This
is Harold Kundel known to many of us as one of the pioneers of this field, the
mentor of someone on our panel today, who was at the Temple University, and now
is at the University of Pennsylvania emeritus.
These authors, what did they say?
They included various ways of obtaining panel consensus truth.
They
actually did a study comparing three different ways of doing chest imaging and
they had the truth but they set the truth aside. They said instead of depending on the truth to keep score, let's
get a truthing panel. What they found
out was they had several ways of obtaining consensus from that panel. They
could either use unanimity. They could
use majority. They can use some kind of
expert review. They have three or four
ways of reducing this panel to truth.
They compare three imaging modalities, as I said, and here's what they
found. Any of the three imaging
modalities could be found to out perform the others depending on the rule you
used for reducing the panel to truth.
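To make the dependence on the reduction rule concrete, here is a small Python sketch of two of the rules mentioned above, unanimity and majority. The votes, the site names, and the functions `reduce_panel` and `count_positive` are hypothetical, chosen only to show that the same panel votes yield different truth labels under different rules.

```python
def reduce_panel(votes, rule):
    """Reduce one site's panel votes (list of booleans) to a truth label."""
    yes = sum(votes)
    if rule == "unanimity":
        return yes == len(votes)
    if rule == "majority":
        return yes > len(votes) / 2
    raise ValueError(f"unknown rule: {rule}")


def count_positive(panel_votes, rule):
    """Count the sites declared nodule-present under the given rule."""
    return sum(reduce_panel(v, rule) for v in panel_votes.values())


# Hypothetical votes from a three-member truthing panel.
panel_votes = {
    "site_a": [True, True, True],
    "site_b": [True, True, False],
    "site_c": [True, False, False],
    "site_d": [False, False, False],
}

n_unanimous = count_positive(panel_votes, "unanimity")
n_majority = count_positive(panel_votes, "majority")
```

With these votes only `site_a` is positive under unanimity, while `site_a` and `site_b` are positive under majority, so the score kept against "truth" changes with the rule.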
So
this sobers a lot of us in the field about using a panel as truth. However, today the target of this experiment
we'll be discussing today is not to say this is a nodule that is a cancer. It is only to say this is a target. This is a region that a panel of experts
would consider to be an actionable nodule.
We're
not trying to keep score based on the truth.
We're trying to keep score based on what would a panel of experts
do? Would they cue this region or not? Nevertheless, even though we changed the
target, this classic reference above tells us that there's going to be
additional uncertainty because of this panel.
The panel will have variability in it and if you go to RSNA over the
last few years, you'll hear papers on this subject.
What
we've said to incoming sponsors is that we strongly encourage you to resample,
to come up with some resampling schemes to resample the panel to get a feel for
the additional uncertainty that comes into this problem over and above the MRMC
paradigm, over and above due to the fact that there is noise in the panel. You can start to see why there is no canned
software to do this problem.
Well,
since the truth is uncertain, it turns out that leads to uncertainty, in
effect, in the number of samples you have.
Let's talk about designing an experiment for a moment. Suppose you want to design experiments that
are going to have very tight error bars on the sensitivity. Everybody knows that if you want to do that,
you want to have a lot of actually diseased cases to tighten up the error bars
this way.
If
you want to tighten up error bars the false positive way, you would want to have a
lot of actually non-diseased cases. If
your endpoint is the area under the ROC curve, what distribution should you have
between nondisease and disease cases?
Well, it turns out it should be some kind of average between the
two. It turns out that the number you
should be using is the harmonic mean of the numbers in the two classes.
The
numbers in the two classes are going to depend on the panel, right? Because some of the panel members will say
these are diseased and others will say these are diseased. The actual number of diseased cases depends
on the panel. We have uncertainty in
truth that leads to uncertainty in the number of samples.
This
is almost a trivial curve and I'm just going to tell you about the highlights
because we think it might factor in today.
Suppose you are told you can design an experiment with 100 patients. You say, "How should I distribute
them?"
Well,
you distribute them, let's say, at the beginning of an experiment like this so
that you have 20 that are actually nodule containing cases, 80 non-nodules, 20
nodule containing sites so we have an 80/20 break.
This
effective number, this harmonic mean of those two numbers, is 32. Whereas if I make a more even split, 60/40,
50/50, for 60/40 it would be up in the 40s the effective number. On a 50/50 split the effective number of
samples for that experiment would then be 50.
That's not surprising.
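The effective numbers quoted above follow directly from the harmonic mean of the two class sizes. A minimal Python sketch, in which the function name `effective_n` is an assumption for illustration:

```python
def effective_n(n_diseased, n_nondiseased):
    """Harmonic mean of the two class sizes."""
    return 2 * n_diseased * n_nondiseased / (n_diseased + n_nondiseased)


split_80_20 = effective_n(20, 80)  # 32.0
split_60_40 = effective_n(40, 60)  # 48.0
split_50_50 = effective_n(50, 50)  # 50.0
```

For a fixed total of 100 cases the effective number rises from 32 at an 80/20 split to 48 at 60/40 and peaks at 50 for an even split.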
The
reason we're showing this is suppose you start out with an experiment like this
and you are requiring unanimity in the panel to declare a nodule-present. Then suppose you relax that criterion and
say instead of requiring unanimity, we'll just require two out of three. Then you expect that whatever the number was
before you're going to move up this curve.
So
you are sampling variability, losing power, but gaining samples. These may tend to cancel. We don't know this. We are speculating about this. We'll discuss this. What I just said is if you want to get into
the realm of resampling your panel, you could start by relaxing the panel
criterion from unanimous to majority and there are several other ways of doing
this.
This
is just, again, an entry level. When
you do this, this gets you into the game.
This allows you to resample, to assess the variability, but it may also
increase the effective number of samples.
These effects may tend to cancel.
This is, again, speculation just based on the direction of these
effects.
The
last thing I want to talk about today is the problem of controlling for reader
vigilance. When you do an experiment,
with my two little pads of paper here, when you read in the unaided reading
condition versus reading the aided reading condition, there are some people in
this room who may be competitive.
If
you're reading in the unaided reading condition you say, "The computer is
about to tell me what it thinks."
If you are a little bit competitive, you are going to say, "I've
got to be careful when I read this."
You may increase your vigilance.
How
do you mock up? How do you do this
experiment? This is a challenge that
hasn't been quite sorted out. Any
measurement setting has an artificial condition compared to the actual real
world of practice. What I just
described to you is the possibility that some readers might be more vigilant in
their unaided reading because they know they are subject to the site.
Well,
when you turn a modality loose in the real world, just the opposite could
happen, right? The readers might be
less vigilant in the real world because they know, "Well, I can brush
through this. The computer is going to
give me what it thinks in just a minute."
In the real world the vigilance could go down. In some experiments it could go up and I think we've seen
experiments where the vigilance didn't change but I'm not sure you can guarantee
that.
The
only thing we've seen in the practical solution to this problem, Heang-Ping
Chan and colleagues about a dozen years ago wrote a paper in which they said,
"Look, this is a real issue, this vigilance.
How
do you do a controlled experiment controlling for reader vigilance?" They said, "Well, just simply control
the time available to readers in the unaided reading condition to mimic the
actual clinic." That was a suggestion they
made. I don't know how many people have
tried that yet but that's in the air.
Well,
you can all take a deep breath now.
We're in the summary. Here we
are. This field has been going on for
30 years. In the last 10 years the
whole issue of reader variability has complicated it and there have been ways
to promote it to address the issue of reader variability.
In
the last few years we've had to deal with the complications from location
uncertainty, from uncertainty in the truth, this issue of reader
vigilance. What we've tried to do is
this is like a quadrangle, as I said.
We're here sitting at the FDA and also doing some research here.
We
have our academic colleagues doing research in academia, industry sponsors
doing research on all these issues in another side of the quadrangle, and NCI
and the Lung Image Database Consortium that we've been very actively working
with and who are very interested in these issues.
We've
tried to hold the windows open so that this quadrangle from all quarters has been
open to everyone. Whenever industry
sponsors have come in with issues like this we've said, "Look, the windows
are open.
Here's
what is known from all these quarters.
Here are the papers. Here are
the drafts that are not even published yet.
Here's what we know at the moment.
We don't have guidance. We can't
say this is where the FDA or anyone is holding the bar but this is all the
knowledge that we have at the moment."
There
is no canned software. There's canned
software for little pieces of this problem so any industry sponsor would have
to be creative to come forth with a novel way of putting all these pieces
together.
Well,
that's the state of the world as we know it today. Thank you very much for your interest in this. Oh, there's some papers. The "tz" are obviously Charlie
Metz's papers. There are a few papers
from our own group in which we have actually worked with Charlie Metz and our
own statisticians and our clinicians to try to review the state of the world.
This
is the first LIDC document. It's going
to come out in April. Then in your
notes there are many other pages of references.
DR.
IBBOTT: Thank you, Dr. Wagner. Before you go too far, I would like to ask
if there are any questions from the panel for Dr. Wagner.
DR.
KRUPINSKI: What's the consensus? I mean, the quadrant problem gets rid of the
localization problem if you end up with a nodule in each quadrant. What it still hasn't addressed, what do you
do, for example, when you've got two lesions in a quadrant?
DR.
WAGNER: That's right.
DR.
KRUPINSKI: You still have that basic
uncertainty.
DR.
WAGNER: That's right.
DR.
KRUPINSKI: The flip side of that is
what if there is a false positive in the quadrant along with a true
positive? You've just simply squished
it --
DR.
WAGNER: That's right.
DR.
KRUPINSKI: -- into a quadrant and you
still have avoided the localization problem and the problem of a false positive
and true positive.
DR.
WAGNER: That's right. That's been sidestepped. As you know, the higher levels of software
attempt to address this one way or another and I think the jury is still out on
whether we are ready to use that. I
think the inventors of those other methods think they are ready to go and they
might be but we also know there are people in the wings saying I'm not sure
about these assumptions and so on. That
software does not have general providence right now. Maybe that's too bad.
Maybe it should be. These are
real issues.
DR.
BLUMENSTEIN: I'm impressed by the MRMC
study design. I think that's a nice
step forward. I'm wondering if anybody
has ever subjected the same reader to the same image multiple times and studied
the effect of that so that you could get at this issue about how a single
reader uses their own personal scale?
DR.
WAGNER: Yes. That's a classic question.
There are experiments on that.
I'm making this up but this is the spirit in which I remember it. David Getty has shown some data on this in
mammography and I think that readers are correlated with each other in the 60
percent range and are correlated with themselves only 70 some percent on
repeats. There is, indeed, a lot of
intra-reader variability.
However,
you get more bang for buck -- if you want to spend so much time in radiology
reading-wise, there's more bang for buck to get a different reader than to use
the same reader over again because you are so correlated with yourself you get
more independent information if you bring in a sample that's not so correlated
with the preceding reads. Bang
for buck-wise people have said this is a question of reading time. People have not in the MRMC paradigm in
general tended to have readers reproduce their readings. You can do it and there are terms in the
model to accommodate that, of course.
It's just not common.
DR.
BLUMENSTEIN: Actually, you took my
question as a suggestion maybe of changing the study design. I didn't make it clear. What I'm actually concerned about is whether
the methodology that's been developed to give p-values, estimate variance,
which you rightly point out are the big issues here, whether those properly
account for intra-observer variability in their use of the scales?
DR.
WAGNER: I believe it does and I'll tell
you why. The full model has seven
terms. I won't take you all through all
of those seven terms. Pure case, pure
reader, various interactions. One of
them is a three-way interaction between modality reader and case.
That's
the sixth term. The seventh term is
what you're talking about. It's the
lack of reader reproducibility. If you
do enough experiments, you can identify so-called in statistical language. You can separate these two. If you don't do the right experiment, you
can't but they get lumped together. The
term you're trying to get at is the reader inconsistency. That is sampled in the experiment but it
cannot be identified. It cannot be
broken out but it is in there.
In
fact, the way we do it is we do it with a family bootstrap experiment so we can
actually pull out all these effects but we cannot pull out the MRC from the
epsilon. They come together. That represents not only this three-way
interaction but represents the inconsistency of all the data sets
together. So that is actually in
there. Are you surprised?
DR.
BLUMENSTEIN: No, no, I'm not. But since you don't measure that in the
experiment, you can't estimate it obviously.
That's the issue. I guess what
I've been concerned about ever since I first heard about the use of ROC curves
where the reader is recording their result on a subjective scale either
categorical or probability or whatever it is.
It's a device to get
you to the point of being able to use ROC methodology. What has always concerned me was that there
was this underlying source of variability that wasn't taken into account in the
models that you are estimating. It's
only if you do the experiment that way that you actually get an estimate of
that intra-observer or whatever you called inconsistency or whatever.
DR.
WAGNER: Right.
DR.
BLUMENSTEIN: I just wondered about
the degree to which this has been studied in actuality.
DR.
WAGNER: Not very much because of the
bang for buck point. As you can see, if
you are inconsistent with yourself, and everyone is, that will show up in case
to case within a given experiment but you won't be able to peel it out but it's
in there and it's accounted for in the inference. It's a subtle point but we can discuss it.
DR.
TRIPURANENI: That was an excellent
presentation, Dr. Wagner. We used the
MRMC for the intra-observation. If you
are looking at two different modalities such as a chest x-ray or a CAT scan,
have you looked at whether there is any difference in the intra-observation
between one modality to the other modality?
DR.
WAGNER: It turns out to be a really
neat point actually. Our own group has
three papers on this subject. In the
first one, you want to know if you can see the difference in the variance
structure between the two modalities.
Is that what you're asking?
DR.
TRIPURANENI: That's right.
DR.
WAGNER: There's a model that has six
terms. We were just talking about
that. There's another model that -- you
would think you would have to go to 12 terms to do that. It turns out there is a parsimonious way to
do it with just nine terms but two ways to do that.
When
you do it you find out that the extra issues brought up by the wrinkles you
were just discussing, they come in in such a way that they average and it's
only their average that goes into the inference so you can forget about the
issue. It's a really interesting
issue. We have two papers on it. But you could forget about it. You could from right off the metro just hear
about this and say, "I'm going to use the DBM software." You could forget about the difference in the
variance structure across the competing modalities and if you do, the inference
is still the same inference. It doesn't
matter. It's a really interesting
point.
DR.
IBBOTT: Dr. Solomon.
DR.
SOLOMON: How do you -- I mean, I have a
feeling this topic is going to be discussed throughout the day but how do you
translate changes in ROC curves into clinical significance? Especially since if you look at an
individual's change in the ROC one person might do worse and another person
might do better and then how do you make that determination?
DR.
WAGNER: Right. Well, you might have been a fly on the wall
in many meetings. I mean, this is a
real issue. Dr. Sacks will say
something about it later on. All I can
tell you is that the most statistically powerful method to get at these
differences is the one I've discussed today.
We really would like -- well, I take you
back to the Yulei Jiang stuff. We
really do want to see those action items.
You can't go from the curve easily to the action items if you haven't
measured those action items. Is that
what you're getting at? I'm not sure I
see what you're getting at.
You
want to know how we can go from this ROC summary and inference to an
inference to the clinic. Is that
where you're going? I think it's
difficult. What we're saying here is
what we are doing is we are making a measurement that averages over all these
variabilities that we have talked about.
It averages over all that and here's the summary.
If
you want something more clinically relevant than that, you would have to
actually measure the action item, the dichotomization, if you will, and give it
error bars. When you finish the problem
is here would be the action item sensitivity specificity for the one modality
and here it would be for another one or this way. Now, what do you do?
Suppose
they go this way? What are you going to
do at this point if they don't match up sensitivity wise or specificity? What are you going to do? There are things you can do but you have to
start getting into expected utility analysis.
I didn't mention it but I have some very strong professional opinions on
this.
I
think it's impossible to do that because to do the expected benefit analysis
you need to have an idea of the prevalence of the disease and that changes from
risk group to risk group so that is a big uncertainty. You have to have a sense of something called
the utility matrix, the number of false alarms that you are willing to trade
for a hit, if you will, different from the positive predictive value.
You
have to have a sense of that utility matrix and you have to actually know the
ROC curve already because all these things come in. I think this is almost impossible to do without this being taken
on at a national level.
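The three ingredients named here (prevalence, a utility matrix, and the ROC curve) combine in a standard decision-theoretic result: expected utility is maximized at the point where the ROC curve's slope equals the ratio of non-diseased to diseased prevalence times the ratio of the utility lost on a false alarm to the utility gained on a hit. The Python sketch below illustrates that textbook relationship only; the speaker did not present it, and the function name `optimal_roc_slope` and the utility values are assumptions.

```python
def optimal_roc_slope(prevalence, u_tp, u_fn, u_tn, u_fp):
    """Slope of the ROC curve at the expected-utility-maximizing point.

    prevalence: fraction of cases that are actually diseased (0 < p < 1).
    u_tp, u_fn, u_tn, u_fp: entries of the utility matrix.
    """
    if not 0 < prevalence < 1:
        raise ValueError("prevalence must be strictly between 0 and 1")
    gain_from_hit = u_tp - u_fn          # benefit of detecting disease
    loss_from_false_alarm = u_tn - u_fp  # cost of a false positive
    if gain_from_hit <= 0:
        raise ValueError("a hit must be worth more than a miss")
    odds_against_disease = (1 - prevalence) / prevalence
    return odds_against_disease * loss_from_false_alarm / gain_from_hit


# Hypothetical utilities: a hit and a correct negative are each worth 1.
slope_even = optimal_roc_slope(0.5, u_tp=1, u_fn=0, u_tn=1, u_fp=0)
slope_rare = optimal_roc_slope(0.1, u_tp=1, u_fn=0, u_tn=1, u_fp=0)
```

With the same utilities, lowering prevalence from 50 percent to 10 percent raises the target slope roughly ninefold, which pushes the operating point toward fewer false alarms. That sensitivity to prevalence is one reason the answer changes from risk group to risk group.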
You
can see from the data of Beam, et al. each one of these people thought that
they were working out the optimal operating point and have completely different
points of view. What I'm saying is
that's an important question. I think it's a societal question.
I
think it's very complicated and it calls for a lot of wise people with a lot of
data to sit down with professional societies and say, "Where are we and
where do we want to be?" This is a
really big issue. I don't have an easy
answer. I insist to my colleagues there
is not an easy answer.
DR.
IBBOTT: Brent.
DR.
BLUMENSTEIN: I think it is the key
question. What we are asked to do here
is to basically judge whether this difference in the area of an ROC curve --
DR.
WAGNER: That's right.
DR.
BLUMENSTEIN: -- has any translation to the clinical setting. What we're lacking we have a measure of the
significance of the difference in the area of the ROC curve. What we don't have is a measure of
uncertainty around the clinical interpretation of the ROC curve.
This
is what is particularly bothersome to me is I don't know how to do that and I
don't see any methodology that gives me that answer. I'm concerned that we have started building a building with a
foundation using subjective scales to measure things so that we can use ROC
methodology and we are using resampling methodologies to do this.
We're
not taking into account all the various sources of variability and so forth so
we are way out there and our foundation may be collapsing and not giving us
what we need with respect to the clinical outcomes.
DR.
WAGNER: Well, if this was broadcast on
academic TV today, apoplexy would abound in the community because we all feel
we are building, as you say. We're
building on decades of people trying to measure complex perceptual
phenomena. This is where we are right
now.
It
may not be the ending point to which you would like to be but this is about the
best of where we are at the moment. I
tried to challenge you a moment ago if you wanted to work on any action
oriented clinical endpoints, I think it's very difficult to sort that out.
It's
very difficult because you'll get bigger error bars and it's very difficult
because the expected utility problem is one that every person in this room has
a different answer to that problem. I
think it's very difficult. I agree with
you that we are constantly besieged by our clinical colleagues who would like
to have better answers to this problem.
One
case which is kind of unambiguous is the Yulei Jiang's data that I showed you
had an ROC curve that went up. The
unaided condition was lower. The action
item, the dichotomization went from a certain sensitivity to a higher
sensitivity and a lower false positive fraction.
I
think everyone loves that scenario.
Wouldn't you say? That's the
world we want to live in. Right? That doesn't happen a lot. These more ambiguous things happen more
often. So what we can do is average
over the relevant parameters and say this is what we found.
In
principle if one ROC curve is higher than the other, in principle one can
operate at a given false positive in one modality and increase the
sensitivity. For every time B is higher
than A, if the specificity is here and the curve is everywhere higher, in
principle I can operate at a higher sensitivity. In practice how to do that, wide open. This is a professional society issue that is bigger than all of
us. That is a really tough
question. I agree.
DR.
BLUMENSTEIN: And just to throw one more
complicated issue into all this is that a lot of this stuff that you presented
here assumed that the modalities were assessed independently. In other words, modality A versus Modality B
but the experiments that we are asked to look at are modality B added to
modality A.
DR.
WAGNER: Right.
DR.
BLUMENSTEIN: Where the experiment
itself has built-in constraints with respect to how one behaves in doing
that. I don't see that taken into
account.
DR.
WAGNER: No.
DR.
BLUMENSTEIN: And I'm concerned about
that.
DR.
WAGNER: This is a point of confusion. I
would disagree with you. The modality A
here is the reader unaided. Modality B
here is adjuvated, the reader aided by the computer aid. This is a standard paradigm and it actually
corresponds to an experiment in the real world that you would like to do.
It
may not line up exactly with the clinical setting but you actually would want
to know something about the performance of readers unaided and then you want to
know about how they would perform in the aided condition. That is actually the comparison of interest.
DR.
BLUMENSTEIN: I realize that but the way
in which the data are recorded is such that the judgment -- as I understand it,
the judgment under A is there and has never backed off. You could only improve.
DR. WAGNER: Oh.
DR.
BLUMENSTEIN: And that's not taken into
account in any of these models that I see.
All the models that you presented, everything that you said, is based on
having an independent assessment of the two modalities.
DR.
WAGNER: Well, you have also touched on
something that we have had a lot of discussions on. These are real issues.
I'm not making light of anything you're talking about here. One hopes the day will come when these
modalities are really good. These
computer aids are really good and then you'll be allowed to back off. You could depend more heavily on the
modality.
Today
people are being encouraged not to back off but the measurement doesn't require
them not to back off. They are just
encouraged, "Do not back off," and there is a basic reason for that I
think Dr. Sacks will explain later on so people are encouraged not to back
off.
But
when the systems are really good as they are in mammography, these
computer-aided systems in mammography are almost flawless for picking up
clusters of microcalcifications. They
are far from perfect for masses but they are almost flawless for
microcalcification clusters so readers have thrown away their eye loupes, a lot
of them that are using these systems so they are willing to depend on the
computer.
I'm
just giving you the only anecdotal evidence.
You have a really good point. I
don't have a really good answer to it but in principle it doesn't have to be
this way. At the moment it is this way.
DR.
IBBOTT: I would like to remind everyone
we will have time to discuss this specific proposal in front of us later on
this afternoon.
DR.
STARK: May I ask a question exactly the
point of the presentation, I believe?
DR. IBBOTT: Yes, please.
DR.
STARK: Using the classic -- thank
you. That was an outstanding
presentation.
DR.
WAGNER: Thanks.
DR.
STARK: Let me just get to the point
because I know we are running short on time.
With a better test the AB test in some context in terms of clinical
utility, either one that had less scatter.
You showed the Beam paper where the radiologist skills cause scatter in
the distribution of the family of curves.
It
would seem to me that there would be two criteria applicable here where we have
a different choice where the test with the larger Az is not the better test if
that test is less flexible -- I'm sorry, has a larger scatter in terms of
variability of radiology performance, radiology implementation creating a
management problem, the implementation problem and then the clinical utility
problem where all of the fabulously sophisticated group here are focused on.
The
other area where the larger Az -- so if there is more scatter in the test with
the larger Az, it will likely be an inferior test, more cumbersome, more costly,
less safe and less effective in clinical utilization.
The
other thing is that if there are two tests with comparable scatter but is
easier to train with experience or inexperience, so if you have a trained panel
of readers like you do under these study conditions under very circumscribed
conditions where they know they are in a test and are not distracted by
clinicians, by the busy realistic environment of all mammography or chest CT
practices, you can have a curve that is more pliant in the direction that you
want doctors to either start at with distractions or to move into with
experience so it does seem to me that the scatter or the flexibility of the
performance.
The
ROC curve I think is unassailable and I have learned -- I have enjoyed a ton
here learning from Dr. Blumenstein's analysis, yours, and those of you have
seen whatever I wrote here. My group
had to do this 20 years ago. We
published papers on ROC analysis and I know we're on the right -- I believe
we're on the right foundation.
I
think this is the right place to start but the breadth of the challenge facing
us all here today is let's not get obsessed with the ROC curves. I know we have the whole day for this but
the safety and effectiveness of this is going to be what happens when you drop
into a clinical environment.
And
we have a lot of experience with breast and this panel has a lot of people
experienced on it but can you tell me if you would agree that we need to see
the scatter in these Az plots and know how they respond to inexperience or
training to really know if the larger Az is better?
DR.
WAGNER: Well, I would say that I think
there is a little bit of second order phenomena here that is important. Just because something is second order
doesn't mean it's not important. For
the practical inferences that have been -- the endpoints of studies we've seen
to date, it has been the performance in the mean.
People
have addressed that. There is
software. We have several papers on how
to do just what you say and how to split out every piece so we can see how much
variation is from the cases, from the readers, from the various
interactions. There is actually
software to do that and we are encouraging people who operate at a higher
level, say NCI or some academic consortium, to address these very issues and we
can see it. We know how to peel all
this stuff apart. As far as the
inference on the table today, it was not done.
DR.
STARK: The burdens would be huge. I mean, the sample sizes, the whole time period,
the number of people that have to be involved.
DR.
WAGNER: That's right.
DR.
STARK: That's why you talked about the
need for national studies and we would all like to do that in oncology and
everything but we have to treat people and make decisions today.
On
the other hand, let me ask my final question.
Are you aware, or is anybody aware of any evidence that a p-value or
some other statistical measure comparing your test A, B under whatever
conditions, today's conditions or the ones I am dreaming about, we hope it has
some clinical relevance but couldn't it all be counter intuitive? I mean, this is a very subtle business and
couldn't we be missing the forest for the trees here?
DR.
WAGNER: Again, that's a very wise
question and I think that is why we have several medical officers involved in
our center on the panel here so I'll defer to them.
DR.
STARK: So the p-value of .003 doesn't
necessarily mean a thing.
DR.
WAGNER: I defer to my clinical
colleagues for that.
DR.
STARK: Thank you.
DR.
IBBOTT: I want to make sure that we
give Dr. Mehta a chance to ask a question if he has one. Dr. Mehta, do you have any questions? He may not be able to hear me.
DR.
MEHTA: No, I don't have any questions.
DR.
IBBOTT: Thank you.
All
right. We are a few minutes ahead of
schedule at this point so we'll take a short break. Let's make it 10 minutes and we'll be back at 10:50.
(Whereupon,
at 10:40 a.m. off the record until 10:55 a.m.)
DR.
IBBOTT: Take your seats, please. I'd like to continue the panel now if you
will take your seats, please. For those
of you who are like me are concerned, we are getting the heat turned down in
this room. At least in one sense.
We
will now proceed with the sponsor's presentation which will be introduced by
Dr. Kathy O'Shaughnessy who is Vice President of R2 Technology. Dr. O'Shaughnessy.
DR. O'SHAUGHNESSY: Thank you very much. Dr. Ibbott, we are very pleased
to be here today to present our ImageChecker CT CAD software. I would like to
introduce the attendees that are here from R2 and some consultants that we have
come to -- we have asked to be here today to both present and answer questions
from the panel.
Besides
myself from R2 Technology there's Dr. Castellino, our Chief Medical Officer;
Dr. Wood who is the head of our CT Products group; and Mr. Schneider who is the
lead algorithm architect that designed the algorithm that we are reviewing
today.
In
addition, we have asked the following people to join us. Dr. Delgado was a beta user of the system so
he can describe a little bit about his experience using the system at his
facility. Dr. MacMahon is a thoracic
radiologist from Chicago with extensive experience in both CAD and ROC
research. Mr. Miller is a
biostatistician for the study. Dr.
Stanford was one of the site investigators where we collected cases from one of
the sites.
Here is a brief overview of our agenda. After my introduction we'll go into the
current clinical practice for some background on lung CT and, in particular, the
detection and management of nodules on lung CT images. Then we'll describe the
device both in terms of how it works and how the user uses it.
The
clinical study will start first with how we collected the cases that were used
and then go into detail into the methods and results from the clinical
study. After that we'll have a brief
discussion, presentation about the beta test that describes a little bit about
the usability of the system. And I'll
finally summarize.
Before we move into the presentation, I wanted to put out our proposed indications
for use of this device. I thought it was important to go over this to sort of put
what we are presenting today in context. The ImageChecker CT is a computer-aided
detection or CAD system designed to assist radiologists in the detection of
pulmonary nodules during review of multi-detector CT scans of the chest.
It's
intended to be used as a second reader alerting the radiologist after his or
her initial reading of the scan to regions of interest that might have been
initially overlooked.
I
would like to ask Dr. MacMahon to come to the podium, please.
DR. MacMAHON: Thank you. Again, I'm Heber MacMahon. I should say I have a small
equity in R2 Technology. The company has also paid for my time and expenses for
this meeting.
I
would just like to make some brief comments about the actual clinical practice
of radiology as it relates to thoracic CT scans and the importance of detection
of pulmonary nodules.
Some
of the common indications for performing thoracic CT scans would include
characterization of an abnormal finding on a chest x-ray. In this situation an abnormality may have
been detected and the purpose of the CT scan would be to characterize it as possibly
a lung cancer. And in addition to
detect additional abnormalities that might be relevant such as metastatic
nodules.
We also use thoracic CT scans extensively for staging and monitoring lung cancer
and other kinds of tumors. In this situation we are looking not only for pulmonary
nodules, but also for enlarged mediastinal lymph nodes and upper abdominal
abnormalities.
In the case of extra-thoracic tumors we are commonly also looking for pulmonary
nodules and for enlarged lymph nodes in the mediastinum. Then there are a range of
other applications of thoracic CT some of which are developing and will be used
more extensively such as detection of pulmonary embolism. However, in all these
situations, although the pulmonary nodules are not the primary focus of the
examination, there is an opportunity to detect pulmonary nodules that may be
present in the lungs of these patients.
Finally,
lung cancer screening which is investigational and depending on the outcome of
the ongoing NLST study may be used more widely. And, of course, in lung cancer screening pulmonary nodules are
the main focus of the investigation.
But the point I would make is that lung nodule detection is a requirement in every
chest CT scan no matter what the original clinical indication. Only when the
radiologist has detected a nodule can he or she decide what course of action is
then appropriate.
There
are various management strategies that can be used to manage a pulmonary
nodule. In order to determine whether
it's an actionable nodule, we need to consider the size. Generally larger nodules are more dangerous
and more likely to be cancerous.
We consider the shape, whether it's spiculated, ground glass, and so forth, whether
there's been interval change from a previous examination in the same institution
and that would be part of the normal diagnostic process to make that comparison.
We would consider, of course, the clinical context, the age and gender of the
patient, smoking history, and so forth. There are a number of factors that play
into that decision in addition to the image itself.
If
the nodule is considered actionable, we can recommend a number of courses of
action. One of the most common would be
to obtain outside prior imaging studies from other institutions. If we can establish stability over a period
of time, no further action may be necessary.
Follow-up CT scan might be prudent at anything from three months to 12 months
depending on the nature of the nodule and the radiologist's level of suspicion.
Other kinds of imaging studies such as a PET scan may be applicable, especially in
larger nodules that are in the range of 8 to 10 millimeters. This may distinguish
cancer from a benign nodule.
Finally, we can consider biopsy, either transthoracic needle biopsy, bronchoscopy,
or thoracoscopic resection.
Just
to illustrate the clinical problem, here is an example of a very small
pulmonary nodule which I think might easily be overlooked in clinical
practice. It's almost indistinguishable
on the single section from surrounding blood vessels but this is, in fact, a
small lung cancer which was detected one year later, as you can see, at which
time it is much more advanced.
So this is a very challenging problem for radiologists to visually detect these
very small nodules on CT scans. We are aware that we do miss nodules and I'll just
cite two particular studies of interest that have addressed this issue of missed
nodules on CT scans.
Dr.
Hartman and others at the Mayo Clinic looked at over 1,000 screening CT scans
and compared them with prior screening CT scans one year earlier to see how
many nodules may have been overlooked.
They found that as many as 24 percent of the prior prevalent scans had
nodules that were not recorded at that time.
This might seem an astonishingly large number but this is consistent with some
other studies. Now, a large number of these nodules were relatively small but more
than one-third of them were about three millimeters and in the size range where
they are likely to be considered actionable.
And, in fact, 6 percent of them had grown which would mean that they were highly
suspicious for lung cancers so there seems little doubt that nodules are being
missed even in excellent centers such as the Mayo Clinic in a study that was
focusing specifically on the detection of nodules.
One
other study performed by Gruden and others at Emory University looked at 25
patients with presumed lung metastases.
These patients had soft tissue sarcomas and melanoma and they established
truth by consensus which is a practical method using five readers. These nodules were three to nine millimeters
in size and they were solid nodules.
Two to nine solid nodules in each case by consensus.
They
found that the miss rate for individual readers ranged from 20 percent to 39
percent of all of the nodules in this size range. This was in an observer test setting where the readers were
focused on detecting nodules and presumably had no other task in mind so one
would expect a relatively good performance in that situation.
So between these two studies we can see that there is a considerable problem with
oversight errors in reading CT scans. Now we have a trend towards thinner CT
sections with the newer multi-detector scanners. This allows improved ability to
detect and characterize lesions. It does allow us to do high quality off-axis
reconstructions.
On
the other hand, it does present us with more image data, more opportunities for
error. In a chest CT scan performed
with a multi-detector unit we may have anything from 18 to almost 300 images of
the chest and the radiologist has to interpret those visually.
I think that the evidence that we've seen strongly suggests that traditional
visual interpretation is no longer sufficiently reliable for detecting these
very small and potentially dangerous common nodules.
At
this point I would like to introduce Ronald Castellino, Chief Medical Officer
for R2 Technology.
DR. CASTELLINO: Thank you. My name is Ron Castellino. I'm also a diagnostic
radiologist but currently I'm the Chief Medical Officer of R2 Technology.
At
the outset I'd like to particularly emphasize the definition of computer-aided
detection which is also called CAD as we will be using it in the presentation
today. Computer-aided detection as we
use it refers to the availability of computer algorithms that automatically
identify regions of interest on a medical image for the radiologist to
evaluate.
Its purpose, of course, would be to decrease what I would term observational
oversights. That is, findings that are present on the image but, in fact, are not
seen by the radiologist. This is not a device to tease apart very unusual nodules
that might not be present or barely present on the image. These nodules are
actually clearly visible on the image.
The ImageChecker CT CAD system specifically is designed to automatically detect
regions of interest with features suggestive of solid pulmonary nodules on CT
exams of the chest. It's important to remember that it is to be used as a
supplemental review. That is, after the initial assessment has been made by the
radiologist. It is not a first reader.
The
radiologist, most importantly, remains responsible for the final interpretation
of the findings that the CAD marks may put on the image. That is, to determine if the mark is
actually a true mark or if it is a false mark.
A brief review of the device description. The CT scan is performed in the standard
fashion. The images or the data set is moved, increasingly, to types of work
stations on which radiologists review the images in what we call a soft copy
display. These images may be reviewed slice by slice but increasingly they are
reviewed in some type of a melt-through or a cine mode to facilitate reviewing
these hundreds of images that are generated.
By the same DICOM standard the data set can also go through a server computer.
Various image analysis algorithms can be put into place. In this case, I point out
segmentation. This type of information can also be transmitted to the work station
to help the radiologist further analyze the images and this is an ImageChecker CT
work station which was cleared by the FDA in 2002. This is an existing product
that has been cleared.
The same DICOM data set can also go through an ImageChecker CT CAD software system
and provide on the work station CAD information as well. It is this specific piece
of the product that is under review today by the panel.
I'll
show you a few screen capture images of the front end of the work station on
which the CAD marks are displayed. The
view port on the right is familiar to radiologists. This is where we can see the axial images. I guess I can't use this thing. Thank you.
We are a high-tech business as you can see.
There
we go. On the large view port on the
right we can see the axial image displayed to the radiologist which is viewed
either singularly or, like I said, melt-through a cine mode. The smaller view port on the upper left is a
three-dimensional reconstruction of the contents of the lung.
You can see the pulmonary vessels. In fact, a few nodules perhaps you can see
there. And the horizontal line simply indicates to the radiologist at what level
the axial image is displayed. We see a nodule here quite clearly in the right apex.
The
radiologist then will move down the entire sequence of the lung in the lung windows
looking for other abnormalities, nodules as well as a multitude of other
features that the radiologist searches for sometimes seeing nodules and
sometimes not seeing nodules.
When
they completely review the entire study, which I'm giving to you in a very
schematic fashion here, the radiologist then will activate with a mouse click
the CAD button we call the R2 button.
At that point in time the CAD process takes over and presents the
following.
The
circles indicate candidate nodules that the CAD system has identified shown to
the radiologist on the three-dimensional display of the lungs, as well as
brings the radiologist automatically to that specific site where the nodule is
best seen by the CAD system.
In addition, another view port on the lower left is shown. This is a
three-dimensional reconstruction that can be rotated to separate the nodule out
from adjacent vasculature. I would like to emphasize that upon the CAD review the
radiologist need not go through the entire data set once again but simply by
moving and hitting one of these little buttons here with a mouse click which you
can't read here. It automatically jumps the image. By the way, the size is
automatically shown as well.
It
automatically jumps the image to the next CAD detected nodule and the next and
so forth. For example, this nodule, as
I showed you and, for example, a nodule at the right base which is clearly a
nodule but, in this case, had been overlooked by the radiologist on the set of
images.
That is the CAD display on the work station. What does the CAD search for? It is
specifically designed to search for solid lung nodules that are 4 mm. or greater
in size and we define that further as follows. They should have an approximately
spherical shape.
The
margins can be smooth, lobulated or spiculated and should have soft tissue
density which we define as having average density of minus 100 Hounsfield units
or greater. Some of the typical CAD
marks you've seen already. They circle
the nodule. We consider this a true
mark if it actually encompasses the size of the nodule sometimes quite small,
moderate in size.
I would like to emphasize also that although we look for spherical nodules, if, in
fact, the nodule is adjacent to a pleural surface where a portion of the sphere
is obliterated by contact with the pleural surface, the algorithm tries to find
these as well.
Secondly, in this image perhaps some of you can see, although it is easier for the
radiologist and the CAD system to detect a nodule that is surrounded by completely
normally aerated lung, if there is adjacent modest non-aerated lung as we see here
in the dependent edema, the CAD algorithm often is successful in teasing out the
nodule as well.
There
are a multitude of other parenchymal abnormalities within the lung tissue that
the CAD algorithm does not search for.
The radiologist must look for these but the CAD algorithm does not
search for. For example, linear strands
which do not fit the criteria. I would
like to point out importantly although this fits the criteria of being a
spherical nodule, we call these ground glass opacities.
They are increasingly noted to be of importance, particularly for lung cancer
screening programs, but because of the Hounsfield density cutoff that we have,
this type of nodule currently is not searched for with our set of algorithms.
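The search criteria described in this part of the presentation can be summarized in a minimal sketch. The function and constant names are hypothetical and are not taken from the sponsor's software; only the 4 mm lower size bound, the 30 mm upper design target, and the minus 100 Hounsfield unit density cutoff come from the presentation. The shape criterion (approximately spherical) is omitted, and the example density values are illustrative assumptions.

```python
# Thresholds quoted in the presentation; names are hypothetical.
MIN_DIAMETER_MM = 4.0          # "4 mm. or greater in size"
MAX_DIAMETER_MM = 30.0         # upper end of the stated 4 to 30 mm design target
MIN_MEAN_DENSITY_HU = -100.0   # "average density of minus 100 Hounsfield units or greater"


def is_search_target(diameter_mm: float, mean_density_hu: float) -> bool:
    """Return True if a candidate meets the stated size and density criteria."""
    size_ok = MIN_DIAMETER_MM <= diameter_mm <= MAX_DIAMETER_MM
    density_ok = mean_density_hu >= MIN_MEAN_DENSITY_HU
    return size_ok and density_ok


# Illustrative candidates (assumed values, not study data).
solid_nodule = is_search_target(diameter_mm=6.0, mean_density_hu=30.0)
ground_glass = is_search_target(diameter_mm=6.0, mean_density_hu=-600.0)
print(solid_nodule, ground_glass)
```

Under these assumptions a ground glass opacity of the right size and shape is still rejected, because its mean density falls below the cutoff.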
All CAD systems have false marks. We see a few here such as this one here where a
branching vessel exists. The CAD algorithm thought this was a nodule and marked it
incorrectly. Pleural tags are at times marked incorrectly. I can tell you that our
experience internally as well as with users indicates that the vast majority of
these false marks can be readily dismissed as you see here.
As an aside, we have found in our regulatory database a median of three false
marks per exam. I would like to emphasize this is per exam. There is a median of
160 images per exam so we're talking about approximately one false positive mark
for every 50 to 55 individual images.
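The arithmetic behind that estimate can be checked with a short sketch. The two medians are the figures quoted by the speaker; the function name is hypothetical, and dividing one median by another is only an approximation of the per-image false mark rate.

```python
# Medians quoted in the presentation.
median_false_marks_per_exam = 3
median_images_per_exam = 160


def images_per_false_mark(images_per_exam: float, false_marks_per_exam: float) -> float:
    """Approximate number of images reviewed per false CAD mark."""
    if false_marks_per_exam <= 0:
        raise ValueError("false_marks_per_exam must be positive")
    return images_per_exam / false_marks_per_exam


ratio = images_per_false_mark(median_images_per_exam, median_false_marks_per_exam)
print(f"about one false mark per {ratio:.1f} images")
```

The result, a little over 53 images per false mark, falls inside the 50 to 55 range stated in the presentation.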
Now, the clinical study was designed around an ROC study as you've heard from Dr.
Wagner. It was done in close collaboration and support with the people from the
FDA. The ROC study to a large extent does measure -- a combined measure of
efficacy and safety. There is some discussion about that and Dave Miller will fill
you in on that as we see it, at least.
There
are three parts. We've collected
cases. I'll review that. These cases were sent to a reference truth
panel and finally to the MRMC ROC study which you'll hear about from Dave
Miller.
I would like to spend only a brief comment upon the target of nodules. You've
heard from Dr. MacMahon that we are increasingly seeing smaller nodules on our CT
scans in our clinical practice. We wanted to design a CAD system to help
radiologists detect all solid nodules between 4 and 30 mm. That was the focus of
our research effort.
And,
as you are well aware, those in the clinical practice you will recognize that
most lung nodules most of the time are typically sampled by biopsy or thoracic
resection if they are 8 or 10 mm. or so greater in size. There are obviously exceptions to this but,
in general, they are.
The availability of a biopsy proven so-called gold standard to evaluate nodules in
this smaller size range was just not available to us. We settled on a reference
standard of a consensus on actionability as being the only practical standard that
would capture all solid nodules of clinical concern in this size range. We are
really focusing and trying to help the radiologist in the 4 to 8, 10 to 12 mm.
range. The larger nodules, of course, radiologists will almost always see.
We
collected cases from five centers. They
contributed consecutive non-selected cases.
We tried to make this as representative as possible. They were all in adults. They were performed for a variety of
clinical indications. There were no screening studies in this group.
Cases
with greater than 10 nodules were excluded.
We felt that there were a multiplicity of nodules. The issues of searching for nodule where the
radiologist has already seen 8, 10, 12, 15 would be reported. The images, of course, have to reach certain
technical parameters.
These cases were divided into two categories to begin with by report. The
nodule-present cases had in the report the presence of one nodule or more
described by the reviewing radiologist. These patients by definition had a history
of biopsy proven documented cancer either primary to the lung or in an
extra-thoracic site.
We
did this to try to increase the likelihood that nodules in this group might
have clinical significance because they were in patients with cancer but I
would like to point out that the specific nodules themselves were not biopsy
proven. The nodule absent cases, once
again by report, no nodules were described within the context of the
report. These patients could have a
history of cancer or not.
The
final truth was determined by the reference panel which you'll hear about from
Mr. Miller. Five sites contributed to
the study. Three of these are community
imaging centers, two are university centers.
They were from the east coast, mid-west, west coast. There were 63 cases that had nodule-present
by report, 88 nodule absent by report.
You
can see the distribution between male and females were similar. The age range was similar in the two
groups. There was a slight increase in
median age in the nodule-present cases perhaps because they all had documented
histories of cancer as compared to this group. The type of cancer in the nodule-present case, 38 percent had a
documented primary lung cancer and 62 percent had documented extra-thoracic
primaries.
Here
are some of the parameters of the technical aspects of the case
characteristics, the median number of slices you see here. There is a slight predominance of thinner
slice sections in the nodule absent cases mainly because one of the centers was
doing much thinner slices routinely and they contributed a larger amount of
nodule absent cases.
The CT vendors used at these five sites were General Electric or Toshiba.
I
would like to ask Dave Miller to present the methods and the results of the
study.
MR.
MILLER: Thank you. My name is Dave Miller and I am currently
the Director of Statistical Analysis at Ovation Research Group. At the time that this study was conducted I
was the Director of Biostatistics at R2 Technology. R2 is paying for my time and travel. However, I do not have any financial interest in R2 Technology.
Just
want to quickly go through an outline of what I'm going to discuss because I'll
be up here for a little while. I'm
going to go through some definitions that I'll be using during the talk. Then I'll talk about the reference truth
panel. I'll talk about the ROC study
design, our primary analysis. Then we
did a large set of robustness analyses.
Then finally the study conclusions.
So
gold standard, and these are definitions that I'm going to use. They are not necessarily dictionary
definitions of these but gold standard is something that I'll define as an
objective and definite measure of truth.
The
reference truth is a truth standard for a subjective construct. It is a term that is fairly widely used and
it's a term that I'll be using here as a standard that's used in lieu of an
available gold standard. The kind of
thing that reference truths are used for are things like actionability where
actionability is something I'm defining as a subjective point-of-care decision
which is really what we're targeting with actionable nodules.
Nodule
also is a subjective definition. It's a
subjective characterization of a lung abnormality. Finally, a panel is a group of radiologists with a given task. In this case, their task was to identify and
characterize actionable nodules. Consensus is a term I'll use only for unanimous agreements. When you hear we use consensus, that means
unanimous agreement as opposed to majority agreement.
Then,
finally, a few study definitions. I'll
run through these very quickly because you've got a very nice tutorial from Bob
Wagner this morning. The ROC curve is
the receiver operating characteristic curve.
Az is the area under the ROC curve, the measure of interest in the
study.
MRMC
stands for multi-reader, multi-case.
I'll use the term primary analysis for our protocol specified primary
analysis and the term ANOVA-after-jackknife.
The ANOVA there is analysis of variance and you've got a nice
description of both the jackknife and the bootstrap earlier.
So
under the reference truth panel the goal of the reference truth panel was to
fully identify all nodules in the case sets.
These are the cases that Ron described how they were collected. We wanted them to rate the actionability of
any nodules that they found.
Specifically we are defining actionable as a nodule that requires
surveillance or intervention so it could be follow-up or it could be more of an
intervention.
We define the reference truth so that we could use it in the ROC study. The method
was to have a panel of three radiologists independently review the cases and we
followed a two-pass process to reduce observational oversights.
The reference truth panel qualifications were that they needed to be board
certified radiologists, that they had at least six months of reading thin slice
which we defined as less than or equal to 3 mm. collimation CT of the chest,
and they needed to have experience with reading soft copy.
A
total of 11 panelists participated in at least one of the three-member panels
that were convened. Just to be clear,
we didn't have a single three-member panel because it just would have taken
weeks for three people to review the set of cases that we had. We had a succession of panels and there were
a total of 11 different panelists that participated in at least one of those
panels.
Nobody
participated in more than three and obviously nobody participated in less than
one. This is how the panels
worked. We brought the radiologists in
and we put them in three different rooms.
This is after a brief sort of training that we gave them prior to going
to the three different rooms. They had
three different work stations set up and they each independently reviewed a set
of cases. In a typical session we had
about 20 cases reviewed.
After they had reviewed all of the cases for a given day, and this usually took
maybe four to six hours or so, we took the computer files of all of their findings
and these are findings of the exact locations and we brought them together to get
the union of all findings so that redundant findings were captured and we knew
every finding that any panelist had found.
This
is a little hard to see up there but we also at this stage excluded nodules
that were less than 4 mm. in size or greater than 30 mm. in size. Those were protocol exclusions and we had
asked the radiologists not to spend too much time taking precise measurements
as they were doing this.
After this there were 95 findings where three out of three of the panelists agreed
that it was a consensus actionable nodule. I couldn't say consensus. Three out of
three agreed and, thus, there was a consensus that it was an actionable nodule.
Now, there was also a large set where there was disagreement. Either one out of
three or two out of three of the radiologists had identified the finding and the
other radiologist either had overlooked the finding or didn't feel that it was an
actionable nodule. These went to a second pass.
The way the second pass worked is that after about a half hour of prep or so they
went back into their individual rooms so they didn't come together and talk
about the cases. They each went back to their individual rooms and they had the
locations of each of these disagreement findings identified for them. So the
second pass went fairly quickly because they didn't need to go through the whole
case. They were just looking at and being directed to specific spots and being
asked to rate the actionability.
After
this there were 47 additional nodules that went into our truth set of unanimous
nodules. There was also a fair number
that went into what we call the majority group, that two out of three felt that
it was actionable, and a minority group that one out of three felt that it was
actionable.
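The grouping of findings by panel agreement can be sketched as follows. This is an illustration under stated assumptions, not the sponsor's procedure: the function name, the group labels other than unanimous, majority and minority, and the example findings are hypothetical. The counts of 95 first-pass and 47 second-pass unanimous nodules are the figures quoted by the speaker.

```python
def agreement_group(votes):
    """Classify one finding from three panelists' actionable/not-actionable votes."""
    if len(votes) != 3:
        raise ValueError("expected exactly three panelist votes")
    labels = {3: "unanimous", 2: "majority", 1: "minority", 0: "not actionable"}
    return labels[sum(bool(v) for v in votes)]


# Hypothetical findings after the second pass (True = rated actionable).
findings = {
    "finding_a": (True, True, True),
    "finding_b": (True, True, False),
    "finding_c": (False, True, False),
}
groups = {name: agreement_group(v) for name, v in findings.items()}
print(groups)

# Unanimous nodules reported after each pass.
first_pass_unanimous = 95
second_pass_unanimous = 47
total_unanimous = first_pass_unanimous + second_pass_unanimous
print(total_unanimous)
```

Only the unanimous group enters the truth set for the primary analysis; the total of 142 agrees with the number of consensus nodules the speaker reports for the panels.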
Our
primary analysis focuses on consensus agreement but we did do some robustness
analyses around the majority and minority.
I'll be talking about that later but for now I'm focused on the
unanimous nodules.
So as a result of this process, the eight three-radiologist panels -- I told you
there was a series of panels; there were, in fact, eight of them -- identified 142
consensus nodules in 65 nodule-present cases. You might notice that number 65 is
slightly different than the 63 number that you saw earlier. That's because now our
consensus panel is the definition of truth for this study.
You
can see the size of these findings. The
median size was 7.9 mm. and there were a lot of them that were in the 5, 6, 7
millimeter range. The remaining 86
cases were categorized as nodule absent by virtue of not having any of the
unanimous nodules in them.
So
moving onto the MRMC ROC study, the objective of this study per protocol was to
demonstrate that review of CAD output improves performance of radiologists
reviewing MDCT with respect to their ability to accurately identify actionable
nodules.
Our
outcome measures were AzB. That is, the
before CAD area under the curve, AzA, that is the after CAD, the area under the
curve and, most importantly, Azdelta.
This is basically the difference between the two curves. And the hypothesis in a formal statistical
sense -- the null hypothesis was that the mean change in the area under the
curve was zero and the alternative hypothesis, of course, is that Azdelta is
greater than zero meaning the CAD did have a benefit.
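The three outcome measures can be illustrated with a minimal sketch. It assumes an empirical (Mann-Whitney) estimate of the area under the curve, which the presentation does not specify, and it uses invented 0 to 100 confidence ratings for a single hypothetical reader rather than study data. The function and variable names are hypothetical.

```python
def empirical_auc(present_ratings, absent_ratings):
    """Empirical area under the ROC curve (Mann-Whitney form).

    Fraction of (present, absent) pairs in which the nodule-present unit
    received the higher confidence rating; ties count one half.
    """
    if not present_ratings or not absent_ratings:
        raise ValueError("need at least one rating in each truth class")
    score = 0.0
    for p in present_ratings:
        for a in absent_ratings:
            if p > a:
                score += 1.0
            elif p == a:
                score += 0.5
    return score / (len(present_ratings) * len(absent_ratings))


# Hypothetical ratings for one reader (not study data).
pre_present, pre_absent = [80, 30, 60, 10], [20, 40, 5, 15]
post_present, post_absent = [85, 35, 60, 45], [20, 40, 5, 15]

az_before = empirical_auc(pre_present, pre_absent)    # AzB, before CAD
az_after = empirical_auc(post_present, post_absent)   # AzA, after CAD
az_delta = az_after - az_before                       # Azdelta
print(az_before, az_after, az_delta)
```

The null hypothesis described above corresponds to the mean of `az_delta` across readers being zero; a positive mean would indicate a benefit from CAD.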
The
study was conducted in two phases. We
first did a 32-patient study and then after doing that study we had some
discussions with FDA and we outlined what would be the appropriate methodology
to use for a second study, what the appropriate size for the second study would
be based on the type of methodology that was suggested. So I'm going to be talking about that second
90-case study as the focus of this talk.
The
reader qualifications for the ROC study, so this is, again, new set of
readers. Don't confuse them with
reference truth panel. Completely
different people. It would be wrong to
have the same people. These people had
reader qualifications that they be board-certified radiologists and have at
least three months of reading MDCT of the chest.
The
basics of the study is that we have 15 readers read all cases. We had 90 cases. Of the 90 cases 48 had at least one actionable nodule and 42 did
not have any actionable nodules and that was based on a stratified random
sample of our complete set of cases.
There
were, of course, four quadrants per case by definition but the important point
is that these quadrants, all four of them, were rated pre-CAD and then
sequentially post-CAD. The ratings were
finally evaluated against the reference truth so the ROC curves were drawn by
comparing the ratings which were on a continuous scale to the reference truth
established by the panel.
I
want to clarify what the unit of analysis is because I know people have a
tendency to want to sort of track the numbers as they go through the slides and
see where things add up so, just to be clear, nodules were the unit of analysis
for the reference truth. The reference
panel was supposed to identify every nodule.
Quadrants
-- the quadrant truth was computed from the nodule truth. For instance, if there was a quadrant that
had one actionable nodule and one non-actionable nodule, the quadrant was,
nonetheless, considered nodule-present quadrant because it had at least one.
On
the other hand, if there was a quadrant that had a minority nodule in it, in
other words, a nodule that at least one person on the panel thought was a
nodule but not unanimous, that was considered a nodule absent quadrant. Every quadrant counted in every analysis
that we did.
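The rule for deriving quadrant truth from nodule truth can be sketched as follows. The quadrant labels, the function name, and the example case are hypothetical; the rule itself, that a quadrant is nodule-present only if it contains at least one nodule rated actionable by all three panelists, follows the description above.

```python
QUADRANTS = ("upper_left", "upper_right", "lower_left", "lower_right")


def quadrant_truth(nodules):
    """Map each quadrant to True (nodule-present) or False (nodule-absent).

    `nodules` is a list of (quadrant, actionable_votes) pairs, where
    actionable_votes is how many of the three panelists rated the
    finding as an actionable nodule.
    """
    truth = {q: False for q in QUADRANTS}
    for quadrant, actionable_votes in nodules:
        if quadrant not in truth:
            raise ValueError(f"unknown quadrant: {quadrant}")
        if actionable_votes == 3:  # only unanimous nodules count as truth
            truth[quadrant] = True
    return truth


# Hypothetical case: one unanimous and one non-actionable nodule in the same
# quadrant, plus a minority (1 of 3) nodule elsewhere.
case = [("upper_right", 3), ("upper_right", 0), ("lower_left", 1)]
truth = quadrant_truth(case)
print(truth)
```

In this example the upper right quadrant is nodule-present and the lower left quadrant, which holds only a minority nodule, is nodule-absent; all four quadrants still enter the analysis.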
Now, the reason that we went with this quadrant approach is that the LROC methods
were not developed at the time that we embarked on this for multi-reader,
multi-case studies. I think they probably will be in time and they may even be
right now but at the time we began the study, they were not.
Bob
Wagner described it a little bit as these being sort of competing fields that
people that went with the ROI approach versus the people that go with the full
localization. I think really there are
two camps that are going after the same thing of trying to get some measure of
localization added to the ROC method.
We
felt that for this particular case where you might have a nodule that was quite
large in one lung and then a smaller nodule in a contralateral, that that
smaller nodule in some cases might be the really important one that actually
drove the care. We felt that getting at
localization in some way was important.
We went with the quadrant approach.
The
quadrants were rated by the ROC readers but then the case, not the quadrant, is
the unit of analysis for the computation of the p-values and the confidence
intervals based on the jackknife and the bootstrap. You heard these references mentioned earlier but Obuchowski
specifically is the reference for using this region of interest or quadrant
approach. Carolyn Rutter is the person
that developed the method of using the bootstrap to sample cases.
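The idea of rating quadrants but resampling whole cases can be sketched with a simple percentile bootstrap. This is a simplified single-reader illustration under stated assumptions, not the sponsor's analysis and not the published method: the empirical area estimate, the function names, and the ratings are all hypothetical.

```python
import random


def empirical_auc(present, absent):
    """Mann-Whitney estimate of the area under the ROC curve; ties count one half."""
    score = sum(1.0 if p > a else 0.5 if p == a else 0.0
                for p in present for a in absent)
    return score / (len(present) * len(absent))


def az_delta(cases):
    """Post-CAD minus pre-CAD area, pooling the quadrants of the given cases.

    Each case is a list of (truth, pre_rating, post_rating), one per quadrant.
    Returns None if the cases lack present or absent quadrants.
    """
    quads = [q for case in cases for q in case]
    pre_pos = [pre for t, pre, _ in quads if t]
    pre_neg = [pre for t, pre, _ in quads if not t]
    post_pos = [post for t, _, post in quads if t]
    post_neg = [post for t, _, post in quads if not t]
    if not pre_pos or not pre_neg:
        return None
    return empirical_auc(post_pos, post_neg) - empirical_auc(pre_pos, pre_neg)


def bootstrap_ci(cases, n_boot=1000, alpha=0.05, seed=0):
    """Percentile interval for az_delta, resampling cases, not quadrants."""
    rng = random.Random(seed)
    deltas = []
    for _ in range(n_boot):
        resampled = [rng.choice(cases) for _ in cases]  # whole cases, with replacement
        d = az_delta(resampled)
        if d is not None:
            deltas.append(d)
    if not deltas:
        raise ValueError("no usable bootstrap resamples")
    deltas.sort()
    lower = deltas[int((alpha / 2) * len(deltas))]
    upper = deltas[min(len(deltas) - 1, int((1 - alpha / 2) * len(deltas)))]
    return lower, upper


# Hypothetical ratings on a 0-100 scale: (truth, pre-CAD, post-CAD) per quadrant.
cases = [
    [(True, 40, 80), (False, 10, 10), (False, 20, 15), (False, 5, 5)],
    [(False, 30, 20), (False, 10, 10), (False, 0, 0), (False, 15, 10)],
    [(True, 70, 90), (True, 20, 60), (False, 25, 25), (False, 5, 5)],
    [(False, 35, 35), (False, 10, 5), (False, 0, 0), (False, 50, 30)],
    [(True, 15, 55), (False, 20, 20), (False, 10, 10), (False, 0, 0)],
    [(True, 90, 95), (False, 5, 5), (False, 45, 40), (False, 10, 10)],
]
observed = az_delta(cases)
lower, upper = bootstrap_ci(cases)
print(observed, lower, upper)
```

Resampling cases keeps the four quadrants of a case together, so the correlation among quadrants from the same patient is carried into the interval rather than ignored.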
The
reading environment for our study is that readers were trained on work station
use and we really tried to create a reading environment that was as similar to
their individual practices as possible.
So the usual work
station controls were available to them.
If any individual reader had particular window or leveling
preferences, they were allowed to modify that.
We didn't have it in the protocol that they had to read a particular way
that would take them out of their reading environment.
They were allowed to practice on three cases with
the trainer present. The ambient
lighting was adjusted to the radiologist's preference. There was no hard time limit.
The instructions given to the readers were to search only for 4 to 30 mm actionable
solid nodules, and to rate each case post-CAD immediately after the pre-CAD rating,
so they had to go through the entire case pre-CAD and provide the ratings
before the computer would even allow them to turn on CAD and then provide the
post-CAD ratings.
They
were instructed to consider age, gender, and clinical indication. These were taken from the radiology
report. We did not provide them with
the full radiology report as that obviously would have provided too much
information for them to be able to make their own decisions.
So
the basic study work flow here -- let's see which of these works. Yeah, this one works. When you saw the work station earlier, there
was no blue line. The blue line is
separating the upper quadrant from the lower quadrant. We didn't feel like we needed a line to
separate left and right. The yellow
line is indicating where they are in the exam.
As
they were reading the case, they had the opportunity to bring up a pop-up menu
to rate the quadrants at which point they would get this little cartoon of
sorts with these slider bars. They
would move the slider bars either all the way over -- you can't see. There's a little 100 there -- to indicate complete
confidence that there was at least one actionable solid nodule present in the
quadrant, or zero to indicate complete confidence that there were none.
In
this particular case you can see that the reader has gone through and given a
pretty low confidence or, I should say, a high confidence that there are no
nodules present in any of the quadrants.
Having
done that they then have the opportunity to click this button up here and turn
on CAD. It's a little bit hard to see
here but there is a potential nodule.
I'm not a radiologist. I won't
tell you whether it is a nodule but it is located there in the upper right
quadrant. Then they would have the
opportunity to rate the case again.
In this case they might have changed their rating. In the other quadrants, since there was only a mark in the upper
right-hand quadrant, it's fairly unlikely that they would have changed any of
their other ratings but they were allowed to.
So after doing this with our 15 readers, who each read the 90 cases both pre-CAD
and post-CAD, we were able to draw the ROC curves for each of the individual
readers. This is just an example of a
single reader and so the area under the dashed line is the pre-CAD Az and the
area under the blue line is the post-CAD Az and then the area in between the
lines is the Azdelta.
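To make the "area between the lines" concrete, here is a minimal sketch of an area-under-the-curve difference for one reader. It swaps in the nonparametric (Mann-Whitney) area rather than a fitted Az, and the 0-to-100 ratings are invented purely for illustration; none of these numbers come from the study.

```python
def empirical_auc(positive_ratings, negative_ratings):
    """Nonparametric ROC area: the chance that a nodule-present quadrant is
    rated higher than a nodule-absent one, counting ties as one half."""
    wins = 0.0
    for p in positive_ratings:
        for n in negative_ratings:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_ratings) * len(negative_ratings))


# Hypothetical ratings for a single reader (not study data).
pre_pos, pre_neg = [80, 60, 30, 90], [20, 60, 10, 5]
post_pos, post_neg = [85, 70, 65, 90], [20, 60, 10, 5]

auc_pre = empirical_auc(pre_pos, pre_neg)
auc_post = empirical_auc(post_pos, post_neg)
print("pre-CAD area:", auc_pre)
print("post-CAD area:", auc_post)
print("difference (post minus pre):", auc_post - auc_pre)
```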
These
are the 15 pairs of readings. I didn't
produce this plot specifically to answer some of the questions that came up
earlier this morning but I think it might answer some of them a little bit. Now, this is not the same plot that you saw
earlier. This has the pre-CAD area under the curve on the bottom and
the post-CAD area under the curve going on the Y axis. So pre-CAD the range was from about .82 up
to .96. That's the range of the 15
readers' areas under the curve. Post-CAD
the range was from .86 to .96 so you can see a narrowing of the range post-CAD
with respect to Az.
In particular, these three readers who had -- I'm trying to look for a different word than
worst -- had the worst pre-CAD Az performance of around .82 to .84 were the
ones that improved the most, or were among those who improved the most. You might wonder what about readers that did
pretty well. Well, these two readers
did very well pre-CAD, at least, measured against Az. And post-CAD they also had some improvement. It was a more modest improvement. They didn't have as much to improve.
Now,
finally, there's this reader up here.
This reader had a nearly perfect pre-CAD performance. This does just go to .96, not all the way to
1 so they weren't absolutely perfect.
What you worry about with a reader such as this is you don't want CAD to
cause them to change their impressions so they get worse and they did not.
So
moving onto the primary analysis this is the average reader ROC curve. Again, here is the pre-CAD line, the
post-CAD line, and the area in between is the Azdelta. I'm just going to focus in on this part right
here because it is an important point about whether or not the curves cross.
The
curves do not cross and so you can see that they are always apart. Especially in this area here I think is the
area where people are most likely to have their individual operating points,
although, as you saw, they might go all the way out here.
These
are the same 15 dots just plotted against a different axis so this is sort of
how far away they were from that line.
You can see individual reader improvements ranging from about .06 down to
zero, or no improvement. And then the
idea behind the Dorfman-Berbaum-Metz ANOVA-after-jackknife analysis is to
create a confidence interval and compute a p-value that would allow us to figure
out what might happen with a new reader with a new case.
I
mean, that's really the idea of this confidence interval is what kind of
performance would we expect from a new reader with a new case. You can see that both the individual readers
as well as the average delta and the confidence intervals are well on the side
of CAD better as opposed to the side of CAD worse.
Now,
we went ahead and did a number of robustness analyses and these were basically
about repeating the primary analysis varying different assumptions to
demonstrate that the primary results are not sensitive to study design. I think these are very, very important
because there is a considerable literature showing that you can tweak different things
and end up with different results. If
we had found that, we would have been in a difficult position because we
wouldn't have known whether or not we really did have a robust result.
I'm
going to talk about this with reference to the statistical methodology,
specifically the ANOVA approach versus the bootstrap approach. There are lots and lots of different
iterations on this but I'm just going to focus on these two. I'm going to talk about the reference
truth. I'll focus on the consensus
standard versus the majority standard but there are a number of other reference
truths that we examined and I'll just focus on those two.
And
then panel variability. I've talked about
the confidence interval being a way of getting at what would happen with a
future reader with a future case. What
you really want to know is what would happen with a future reader and a future
case evaluated against a new truth, right?
That
means that you don't just have to have the random reader and the random case
components of the ANOVA model. You also
have to have some way of evaluating your truth against the random panel if you
are going to fully capture the variability.
So
the ANOVA-after-Jackknife compared to the bootstrap, I'll run through this
quickly because you heard this earlier.
The ANOVA-after-Jackknife is based on leave-one-out samples. Again, the leave-one-out here is cases. A case is being left out of each sample as
opposed to a quadrant.
The Az, the area under the curve, is computed for each reader-case combination and then
an analysis of variance random effects model is fit. This is the standard analysis of variance random effects model
with full interactions described by Dorfman-Berbaum-Metz.
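The leave-one-case-out step can be sketched as follows. This shows only the jackknife pseudovalue computation for one reader and one reading condition; the random-effects ANOVA that is then fit to the pseudovalues is not shown. The function name and the use of a simple mean as the statistic are assumptions for illustration; in the analysis described, the statistic would be the reader's Az.

```python
def jackknife_pseudovalues(cases, statistic):
    """Leave-one-case-out pseudovalues for `statistic`.

    For n cases, pseudovalue i is n * theta_all - (n - 1) * theta_without_i,
    where theta_without_i is the statistic recomputed with case i removed.
    """
    n = len(cases)
    theta_all = statistic(cases)
    pseudovalues = []
    for i in range(n):
        left_out = cases[:i] + cases[i + 1:]  # drop a whole case, not a quadrant
        pseudovalues.append(n * theta_all - (n - 1) * statistic(left_out))
    return pseudovalues


def mean(values):
    return sum(values) / len(values)


# For a plain mean, the pseudovalues reproduce the original observations.
print(jackknife_pseudovalues([1.0, 2.0, 6.0], mean))
```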
The bootstrap, I think nonstatisticians a lot of times find the bootstrap a little
bit more intuitive. The experiment is
replicated in 1,000 random samples, so from our sample of readers and cases, we
generated random samples of readers and random samples of cases and for each
sample we matched our random readers with the random cases and repeated the
entire analysis.
It
is very computationally intensive but it gives you a way of coming up with
confidence intervals that allow a nonparametric -- fully nonparametric approach
to evaluating what would happen with a future reader with a future case. I do want to point out that the
ANOVA-after-jackknife is semi-parametric.
The ANOVA piece is parametric but the jackknife piece is nonparametric.
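A rough sketch of the reader-and-case bootstrap is given below. It assumes a percentile interval and a caller-supplied `statistic(readers, cases)` function that reruns the whole analysis on the resampled readers and cases; the function name, the seed, and the percentile method are illustrative assumptions, not a description of the sponsor's code.

```python
import random


def bootstrap_ci(readers, cases, statistic, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval, resampling readers and cases with replacement."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        reader_sample = [rng.choice(readers) for _ in readers]
        case_sample = [rng.choice(cases) for _ in cases]
        # Repeat the entire analysis on this replicate of the experiment.
        estimates.append(statistic(reader_sample, case_sample))
    estimates.sort()
    lower = estimates[int((alpha / 2) * n_boot)]
    upper = estimates[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


# Toy usage with a made-up per-case improvement (not study data).
toy_gain = {0: 0.00, 1: 0.02, 2: 0.05, 3: 0.01, 4: 0.03}
low, high = bootstrap_ci(
    readers=["r1", "r2", "r3"],
    cases=list(toy_gain),
    statistic=lambda rs, cs: sum(toy_gain[c] for c in cs) / len(cs),
)
print(low, high)
```

Because the interval is read off the sorted replicate estimates, it need not be symmetric about the point estimate.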
So
these are the confidence intervals for the ANOVA versus the bootstrap. You can see that the confidence interval for
the ANOVA is a little bit tighter. For
the bootstrap it's a little bit broader.
One of the things that the bootstrap is known for is being able to come up with
confidence intervals that are not actually symmetric about the mean because
often there is not really any reason to believe that the confidence intervals
would be symmetric about the mean. In
this case you can see it actually goes out further on the CAD better side. Even though the confidence interval is
wider, it does not in any way diminish the results.
So
returning again to the primary analysis, the primary analysis, as I showed you
earlier, is based on a delta Az of .024 and a p-value of .003. I just showed you a different methodology
using the bootstrap and came up with .0246, very close, and a p-value of less
than .001.
Then
we went on to a different reference truth.
The different reference truth that I'm talking about here, and I
apologize that it's not on the slide.
We didn't want to make it too dense, but this different reference truth
is majority so this means that a quadrant would be considered nodule present if
there was at least one majority or consensus nodule and it would be considered
nodule absent if it did not have any majority nodules in it.
A really important thing to point out here is that the majority quadrants, the
ones that two out of three radiologists in the panel considered to be
actionable, are included in every
single analysis so that means that when we're talking about the unanimous
truth, they go into the false positive side of things, as somebody calls
it.
On
the other hand, if we talk about this reference truth, they go into the true
positive side. We felt like we don't
know if those are nodules or not and so the most conservative approach to take
is to always put them in every analysis.
The
delta Az here is a little bit lower but the p-value is actually more
significant, to use a loaded term. This
has to do, I think, with this sample-size paradox that Bob Wagner was
describing earlier. The final step was
to do the random reference truth.
We
did the random reference -- actually, before I go to that, I want to mention on
the different reference truths in addition to majority and consensus, we also
looked at a minority reference truth which is sort of the loosest possible
standard we could come up with.
We
also did a tighter truth based on having a second panel of five people look at
the cases and define the truth more tightly.
In all four of those cases we came up with a similar statistically
significant result. So the random
reference truth is based on picking two panelists at random to review each
case.
Pretend that the three-member panels didn't exist.
Redo the truth assuming that third person just wasn't there in the
room. When you bring together the
first-pass findings, their data doesn't come in. When you go to the second pass it's only the two out of two
consensus. This allowed us to come up
with confidence bounds that captured that piece of the variance. It ended up being fairly similar, although
the delta Az is somewhat diminished from that of the primary analysis.
So
all variations gave statistically significant results. I'm a statistician so that's what I know
best and that's what I'm best prepared to talk to you about. I take the point of some of the panelists
that -- by panelists here I'm referring to you all as opposed to any of our
other panelists.
You
want some sense of what does it all mean.
What does this Azdelta of .02 mean?
For myself, I find it useful to think about individual operating points. This is the pooled curve where we pool all
of the readers together. You can't
really translate this to a new reader and a new case.
These are analyses that you don't do to find statistical significance or to get a
particular confidence interval or particular estimate. They are analyses you do to try to
understand the data. They were
analyses that we put in our protocol that we would be doing but they were
secondary analyses just to try to get some sense of what's going on here.
So
this is the operating point of 20.
Recall that we have this 0 to 100 scale so 20 reflects sort of the most
aggressive end of the spectrum. We
could go all the way out to 0 but 0 is just all the way at that end. Twenty was an area where you could imagine a
fairly aggressive reader would say, "Even for a 20 I might want to do some
kind of follow-up." Fifty was indeterminate on our
scale so that is one operating point that is interesting to look at. Eighty would reflect sort of the least
aggressive reader. This is by no means
all readers. If I put this plot out
with all 15 of the readers, you get sort of that weird scatter plot similar to
what you saw earlier, but this is just to get a rough sense of what kinds of
improvements are maybe plausible.
So
this dotted vertical line here is the line that corresponds to having the same
false positive fraction. This is saying
that if you started out at 50, your sensitivity could increase by this much
without sacrificing your false positive fraction at all. Not one iota. If you think of the false positive fraction as your measure of
safety and you think of the true positive fraction as your measure of efficacy,
that is saying you can go up and get efficacy without any safety tradeoff.
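Reading a sensitivity gain along a vertical line of equal false positive fraction amounts to evaluating both curves at the same x value. A minimal sketch, assuming each curve is available as a list of (false positive fraction, true positive fraction) points and using straight-line interpolation between them; the two curves below are invented for illustration and are not the study's curves.

```python
def tpf_at_fpf(curve, fpf):
    """Linearly interpolate the true positive fraction at a given false positive fraction."""
    points = sorted(curve)
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= fpf <= x1:
            if x1 == x0:
                return max(y0, y1)
            return y0 + (y1 - y0) * (fpf - x0) / (x1 - x0)
    raise ValueError("fpf lies outside the range covered by the curve")


def sensitivity_gain_at_fixed_fpf(pre_curve, post_curve, fpf):
    """Vertical distance between the post-CAD and pre-CAD curves at one fpf."""
    return tpf_at_fpf(post_curve, fpf) - tpf_at_fpf(pre_curve, fpf)


# Hypothetical pooled curves (not study data).
pre_cad = [(0.0, 0.0), (0.1, 0.6), (0.3, 0.8), (1.0, 1.0)]
post_cad = [(0.0, 0.0), (0.1, 0.7), (0.3, 0.85), (1.0, 1.0)]
print(sensitivity_gain_at_fixed_fpf(pre_cad, post_cad, 0.1))
```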
Now,
it's probably more likely that people are going to go a little bit up and over
so maybe they are going to call more things.
That's what we see with our individual rating. You can go up and over and still have the same positive predictive
value. Even though you are giving up a
little bit on the false positive fraction, you still have the same positive
predictive value.
This 50 here is still a little bit over from that so it's not exactly the same
positive predictive value but the basic point is that you can go up and over
without having a sacrifice or without having a substantial sacrifice.
So
these are the analyses that I mentioned.
They were in our protocol as analyses that we were going to do, but I
really am very sympathetic to what Bob Wagner said about these numbers. It's so hard to say what they mean. What are these numbers? I don't want anybody to run too far with
these numbers but I do feel like it's necessary, especially for people who
aren't statisticians, to want to understand what's going on with some of the
raw data.
If we take 20 as the threshold -- pretend that all readers
treat 20 as their criterion for actionability -- then we would have had 16 percent
of the total nodules missed. There were 1,125 positive quadrants that the 15
readers looked at. Sixteen percent of
those would correspond to misses. With
this very aggressive cutoff I think odds are those are, in fact, observational
oversights.
Post-CAD
that goes down to 11 percent so the 16 percent versus 11 percent, that's a 30
percent reduction in misses at that threshold.
Now, that is a very aggressive threshold. Probably most readers aren't at that threshold. Fifty might be closer to where most people
are at. It goes from 20 percent down to
16 percent. That's a 22 percent
reduction in misses.
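The "percent reduction in misses" is a relative change: the drop in the miss rate divided by the pre-CAD miss rate. A small sketch using the rounded percentages quoted above follows. Note that the rounded 20 and 16 percent give about a 20 percent reduction, so the 22 percent quoted was presumably computed from unrounded counts; that is an inference, not something stated in the presentation.

```python
def relative_reduction(pre_rate, post_rate):
    """Fractional reduction in the miss rate going from pre-CAD to post-CAD."""
    if pre_rate <= 0:
        raise ValueError("pre_rate must be positive")
    return (pre_rate - post_rate) / pre_rate


# Rounded miss rates as quoted for the thresholds of 20 and 50.
print(round(relative_reduction(0.16, 0.11), 3))  # threshold 20
print(round(relative_reduction(0.20, 0.16), 3))  # threshold 50
```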
Then
finally if we imagine that 80 is sort of a higher-end threshold of what might be
called a miss, there is still a 15 percent reduction in misses. Now, these numbers are presented without
confidence intervals, without p-values.
Take them with a grain of salt.
But in terms of understanding potentially the clinical importance, I
think that maybe this may satisfy some of the desire to see a different number
than just the delta Az.
I
also wanted to show you what happens if we look at the true positive fraction
and we look at the false positive fraction in a way that is probably more similar
to the way that a lot of academic studies are done where you look at the cases
where you are most likely to see an effect on the true positive side and you
look at the unambiguous nodule absent quadrants on the other side.
Here
I really am throwing out quadrants. As
a statistician I hate to throw out data but I'm throwing them out just to get a
clearer idea of what's going on here.
So if we are looking at the true positive fraction just for the smaller
nodules, and I'm just using -- they are not really small.
I
think a lot of people would define small as less than 4 or less than 3, but the
intermediate-size nodules as a proxy for difficult to find nodules or easily
overlooked nodules. Then you can see
that you get more of a rise in the curve without quite as much of a tradeoff
early on in terms of the false positive fraction. This is an analysis that was not included in our protocol. It's just something that I added to try to
get a little bit more understanding of what is taking place here.
So
the study conclusions. Again, the study
conclusions go back to the primary analyses that we did and the robustness
analysis. The study conclusions are
that the ImageChecker CT improves reader performance for the detection of
actionable nodules. That was our
objective and that's what we feel that we demonstrated. And specifically the results are robust to
the analytical methodology, to the choice of the reference truth.
Again,
it wasn't just looking at consensus and majority. We looked at minority, majority, consensus, and sort of a super
consensus. Then it is also robust to
the additional variation associated with selection of panelists. I described identifying two random
panelists. We also did it with a single
random panelist, with three random panelists and came up with very similar
results.
With
that, I'll turn it over to Dr. Delgado. Thank you.
DR.
DELGADO: Thank you and good
morning. I am Dr. Pablo Delgado. I'm clinical associate professor of
radiology at the University of Missouri, Kansas City. I also practice at St. Luke's Hospital. I'm here to describe the beta experience that we were involved
with.
First
of all, I'll tell you a little bit about where I practice in the setting, where
the beta site was performed. I am at a
private institution affiliated with the university. We have a hospital setting as well as an affiliated imaging
center adjacent to us. We practice with
residents available and we have an on-site residency training program of which
I am the program director.
Our patient base is quite varied and I think rather commonplace for the
region. It's a typical Midwest
community base of private as well as community patients. For CT equipment in our radiology
department, we currently have two four-channel multi-detector CT scanners which
happen to be GE LightSpeed QX/i scanners, although I don't think that's of
importance to this device as long as it's DICOM data and meets the collimation
thickness.
We currently perform anywhere between 20 and 30 CT studies a day of the chest for
different diagnostic indications including CT pulmonary angiography, high
resolution CT of the chest, detection of other lung diseases, as well as
multi-organ disease workups.
The beta study that we performed was between June and August of 2003
for a total of eight weeks. We
processed numerous studies. However,
the goal of the study that we agreed upon and embarked upon was to assess the
functionality of this ImageChecker CAD software and how we would work with it,
to answer the R2 development group's questions about radiologists' preferred
reading practices as well as work flow issues of how this would be incorporated
into our practice, and to determine
future training needs in training radiologists in how to use
this device. It should be noted that we
were not asked to assess the clinical effectiveness of the CAD system.
The design of the system involved retrospective review of CT chest cases from our
institution from previous months that had already been acquired and already
been interpreted outside of the study and that met the collimation thickness
which, I think, was already mentioned, 3 mm or less, and were contiguous slices
of the chest.
The cases were read by faculty radiologists as well as residents so we got feedback
from both experienced radiologists as well as radiologists presently in training.
For the training in utilizing the device, we had an R2 application specialist on
site for an entire day who got to work with most of the radiologists. A few that were not available for that time
were given the training subsequently by those who experienced the training from
the application specialist. That
training process involved the description of the CAD algorithm, what indeed it
does and what it doesn't, with the review manual.
We
also reviewed several institutional cases.
First R2 had some cases of their own.
Then we through the DICOM hookup were able to push some of our cases to
the R2 device and process them so they were our cases. We also performed shadowing of retrospective
reading sessions where the radiologists were able to work with the CAD device
and subsequently ask questions if they felt that they were necessary or
encountered any questions.
Our
observations from using the beta product demonstrated that most radiologists,
in fact all, demonstrated a rather rapid learning curve for using the CAD
device. In a rather short period of
time most people felt very comfortable in utilizing the product as is intended.
We
encountered no specific technical errors or malfunctions. We had no difficulties. We did, indeed, use it in the way it was
intended and we asked radiologists to first look at the case in a soft copy
reading mode and then subsequently push the CAD button and activate it and then
review it immediately thereafter. We
found that all radiologists missed nodules that were detected by the CAD.
There
certainly are false CAD positive marks as Dr. Castellino pointed out. However, most of these are easily dismissed
by radiologists and that includes both faculty and residents.
Of
course, I would agree with the comments made by other panel -- excuse me, other
presenters from R2 that we feel that radiologists definitely should review all
images initially without CAD and then a subsequent read with CAD. The reason for this is that CAD is not
really made to detect every single nodule and, No. 2, the algorithm is such
that it does not detect every single lung abnormality and radiologists are
still responsible for detecting any lung abnormality.
In
conclusion, I think that this product is very timely in what radiologists are
facing on a daily basis. The
development of multi-detector CT has led to an explosion, if you will, or
significant increase in the number of images that are very detailed and
radiologists are asked to interpret.
Numerous
published studies have already documented there are limitations in
radiologists' ability to detect lung nodules.
I believe the detection really is the limiting factor of eventually determining
actionability whether it is related to further diagnostic or therapeutic or
interventional workups. We found CAD to
be an effective tool in assisting the radiologist in the detection of lung
nodules with multi-detector CT.
I
will now reintroduce Dr. O'Shaughnessy of R2 Technology.
DR.
O'SHAUGHNESSY: Thank you very
much. I just have a couple of summary
slides kind of to bring it all together at the end. I just wanted to reiterate the main conclusion from our clinical
study for multi-detector CT exams of the chest, that the ImageChecker CT CAD
software system significantly at a p-value of .003 improves radiologist ROC
performance for detecting solid pulmonary nodules between 4 and 30 millimeters
in size.
And
as both Mr. Miller and Dr. Castellino talked about and Dr. Wagner this morning,
we feel that is a good measure for -- a reasonable measure for evaluating both
a safety and efficacy aspect of the product.
Also from the safety aspect, the product is intended to be used as an
adjunctive device and with appropriate training we don't think there are any
issues there.
Just
to summarize, I'll put up again the same slides of the proposed indications for
use. We thank you very much for your
attention.
DR.
IBBOTT: Thank you, Dr. O'Shaughnessy.
We
are going to have time this afternoon for detailed discussion of this
presentation but let's take a few minutes now to see if there are any questions
for the previous speakers or clarification that's needed.
DR.
STARK: I have a few questions. Other panelists, please jump in. Dr. O'Shaughnessy, thank you. By the way, it was a fabulous
presentation.
DR.
O'SHAUGHNESSY: Thank you.
DR.
STARK: Very interesting subject and I
think everyone is interested in seeing this technology succeed. Certainly I am so forgive me. Some of my questions are, I guess, by nature
going to be -- are intended to be challenging.
Mr.
Miller talked about, as the panel did, what the word significant -- he used the
term significance is a very loaded term.
Later on when we discuss the marketing materials and things like that,
I'm worried about the pressures on radiologists to buy and use a technology and
want to shift the significance to what really is clinically significant. In
your presentation you pointed out -- I believe several of your experts pointed
out that the real clinical problem is that we're missing about 24 percent of
nodules or we are missing nodules at a significant rate. I think it was something like 24 percent or
something, perhaps you can refresh me, were seen in retrospect.
One
significant figure of merit here would be what fraction of those nodules that
are missed, that 24 percent that are detectable in retrospect, are now detected
with this technology given that the technology by itself has a sensitivity of
about 50 percent for detecting majority and unanimous nodules and a 50 percent
detection rate? I'm just asking. It's very, very low.
That would suggest to me that at best the technology is going to reduce that 24
percent miss rate to about a 12 percent miss rate at the cost of generating
100 percent false positives and then having a radiologist groom through and
sort all this out by basically being told, "Do it again."
I'm wondering, if we had a placebo in this FDA trial of, "Radiologist, just do it
again," or, "Here is the sugar pill. Just read it again," would we achieve the same presumptive
50 percent improvement in finding half of the lesions we know the current
standard of care is to miss?
DR.
O'SHAUGHNESSY: Right. I would like to answer that sort of in two
parts. The first part I would like Mr.
Miller to go over what we measured in our study and then have Dr. Castellino
talk about translating that to the clinical environment if that's okay.
MR.
MILLER: I guess there were a number of
questions there. Is there one you would
like for me to start out with?
DR.
STARK: I think you will do a great job.
MR.
MILLER: Okay. So the analyses that I showed at the end with the percent
reduction in misses, or sort of approximated percent reduction in misses, were
an attempt to get at that very issue. I
suppose that it is to some degree your job and, to some degree, our job to
determine what is clinically significant.
Now,
the numbers that I showed you were sort of in the range of a percent reduction
in misses of somewhere close to 20 percent.
Actually more like 20 percent on the low end. That is similar to what the experience has been with CAD for
mammography.
For
CAD in mammography the percent reduction in misses has been in that range. I think if you are a person that's affected
-- I guess I'm drifting off from statistics here. I should have handed it over to a clinician but, I mean, my hunch
is that is a number that would be meaningful.
As
far as the stand-alone sensitivity, I do want to sort of bring us back to the
fact that we evaluated two modalities here.
The two modalities that we evaluated were the readers stand-alone performance
and the reader plus CAD. The whole MRMC
framework is developed around those particular modalities.
CAD
as a stand-alone modality is not something that anybody is recommending that
people use. Therefore, those
stand-alone numbers, I think, are less valuable but are more valuable if they
pick up some of the more important things.
Also
I think some of those things in the 4 to 10 millimeter range that readers react
to and say, "Oh, I missed that.
I'm glad CAD pointed that out."
It's more about what did CAD find than it is about exactly what the
percentage is.
DR.
STARK: Did you answer the core question
of if the radiologist right now standard of care I would suggest, and
clinicians can debate this, is that we miss a quarter of the lesions that are
actually there in retrospect. If we can
accept that as a statement, then as you design the experiment, what data are
there to suggest we would cut that miss rate and by how much?
MR.
MILLER: Will you permit me to go back
to the slide? Sorry. I'll get there soon. Okay.
This, again, is presented as an analysis that was specified in the
protocol that we would do, but you don't have confidence intervals there so
these are numbers that you would want to put confidence intervals on if you
were going to put a lot of weight behind them.
Also,
they make the presumption that readers all read with the same threshold cutoff
and we know that's not the case. At a
threshold cutoff of 50, let's focus on 50 for just a second, there were 228
missed quadrants. In other words, out
of the total number of quadrants that the radiologist looked at, 75 positive
quadrants times 15 so there are 1,125 times that one of the readers looked at a
positive quadrant.
They gave a rating less than 50 about 20 percent of the time. That is actually kind of a nice number because that number is not
radically different from I think what we see in the literature. It may be a little bit lower. I think there's a little bit of a relaxed
environment in the readings so that they may be a little bit more likely to
identify things. But in 20 percent of the
quadrants something is missed.
Post-CAD
it goes to 16 percent so that's a 22 percent reduction in the misses. That is, I think, the number that is closest
to answering the question that you raised.
Is that correct?
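The counts behind this answer can be checked directly: 75 positive quadrants read by 15 readers gives 1,125 positive-quadrant reads, and 228 misses out of 1,125 is a little over 20 percent. A short sketch of that arithmetic, with a hypothetical helper name:

```python
def miss_rate(missed_reads, positive_quadrants, readers):
    """Fraction of positive-quadrant reads rated below the chosen threshold."""
    total_reads = positive_quadrants * readers
    return missed_reads / total_reads


total = 75 * 15
rate = miss_rate(228, 75, 15)
print(total, round(rate, 3))  # 1125 0.203
```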
DR.
STARK: I think so. Let me see if I understand it and then I'll
ask you about the effect on this analysis of the quadrant versus the lesion
methodology.
MR.
MILLER: Okay.
DR.
STARK: I think that prejudices things in
favor of the technology. I'm not
sure. So you're saying if the standard
of care currently is to miss a quarter of lesions, then of that 25 percent
we'll miss one-fifth less so now we'll miss 20 percent of the lesions.
MR.
MILLER: Yes. There, a miss is defined loosely as you are not actioning a nodule
that a consensus panel believes should be actioned. I don't think that they are actually missing it in every
case. Sometimes they are giving it a
low rating.
DR.
STARK: Correct. But as far as --
MR.
MILLER: Yeah.
DR.
STARK: You can debate the inference but
the literature talks about a miss rate of 25 percent which we are going to
equate with actionable nodules. As we
talk about the apparent efficacy of this, and I appreciate your honesty, it is that
we are taking a standard of care of a 25 percent miss rate that juries and
patients think is horrible in retrospect and we are going to cut that to a 20
percent miss rate. We can judge the
-- that's the efficacy.
MR.
MILLER: I should also add this is just
based on jumping from one 50 to the other 50 on the curve. We did another set of analyses based on what
happens if you jump from 50 to the other point on the curve where you -- I'm
sorry.
I
should say jump from 20 from one point on the curve to the other point with the
same PBD and jump from 20 to the same point without sacrificing the false
positive fraction. That also was a
protocol specified analysis and the numbers go down a little bit. I don't remember how much but it may be five
or 10 percentage points.
DR.
CONANT: May I interrupt or just jump in
for a second because you are on the slide that I'm curious about. You mentioned it's similar to mammography. It is but it's so different. I'm very interested in the by-case analysis
of this compared to by quadrant. The
reason being I think you have a little bias in your case selection and I'm not
sure if that is okay or not.
You
have the majority of your cases, 62 percent of the nodule present cases, as
people with extra-thoracic disease. I'm
not sure I really care about the absolute number of quadrants you've missed
because once you've got three nodules in both lung fields, who really
cares? It's metastatic disease so I
would want to see these numbers by case.
I
also think the comparison to mammography is very different because I think
that, again, chest analysis is much more multi-focal and reflective of systemic
disease than mammography in terms of a bilateral fairly somewhat independent
process. I would just like your
comments on that if you could take this another step and then do it by case.
MR.
MILLER: We did not do these analyses by
case. I suppose the data are there to
do it. I think the challenge with doing
it by case is that the way -- I should let a physician get up here in just a
second but the way that one would action a case where you had one lung where
you had a very high likelihood of it being something bad, using my simple
statistical language, and you had the contralateral lung where you had something
that was probably bad. That one that's
probably bad may actually be the one that drives the care of the patient.
Figuring
out how you sort of wrap this all up and do something like this at the patient
level was something that was sort of beyond the scope of what I was able to
imagine. I absolutely do not disagree
that it's something that would be useful to try to investigate in some
way. Having said that, I think I really
need a physician to answer the question.
DR.
CONANT: I'm not sure what the answer
is, though. However, in your cases it's
very different if a person -- if you're looking for a primary lung carcinoma
versus metastatic disease so they are very different clinical questions.
MR.
MILLER: Yes. Let me let Dr. Castellino answer that.
DR.
CASTELLINO: I'm not going to answer any
statistical questions. I can guarantee
you that. It is hard to answer that
question. I would like to put it more
in a clinical context of how we read cases every day.
I
agree that if you have a patient with a soft-tissue sarcoma and you find three,
four, five nodules, unless you are in a setting where you have surgeons who
aggressively pursue that, as I was at Sloan-Kettering, at times it is important
to find a sixth or seventh nodule. There
is a spectrum of surgical behavior.
Let's
assume that you find six or seven you don't have to find the last three. We had very few cases like that. The second thing is that we are not
positioning this product as a lung cancer detection product, although it does
work that way. Patients with lung
cancer who had a nodule, it was not necessarily the primary lung cancer. They may have had lung cancer before treated
post-op, post-radiation.
We
accepted those cases and had a lung nodule in the lung for whatever reason so
it wasn't really as a primary detection issue.
I'm not sure I answered that completely and I do recognize that certainly
mammography is quite different, as I think we have discussed before, than chest
CT.
I
would like to go back to a couple of comments you made. If I understood you correctly, I think you
said, Dr. Stark, that the issue was that we had a 50 percent sensitivity for
consensus nodules. As I recall from
looking at that, I think, with consensus we were closer to 80 or 83 with the
classic nodule definition. I'm looking
at the -- you'll see that later with Petrick.
If
you stratify those nodules with what would be more definition that radiologists
would call classic nodule. It ranges
from 83 to 59 I think is the number. Is
that correct?
DR.
STARK: We can study it but I'm trying
to draw data from table 10. When I
suggested 50 percent, it was based on this so maybe over lunch you can --
DR.
CASTELLINO: We can go through it. I thought it was about 59. But I think it's a good point. We would love to have developed an
algorithm, to be very honest, that was 100 percent sensitive but this is the
best we've come up with so far. I think the
issue to me as a clinical radiologist is how would this affect me or my
colleagues in practice to find more nodules that we look at a year later and
say, "My goodness. How did I miss
that? Why did I miss that?" The
ROC study, to some extent, I think, approaches that. I think this table here to some extent also would address
that. These are nodules potentially
that could be missed or are missed that the radiologist would say, "I
would have liked to have seen that nodule to make a decision as to whether or
not it's actionable or not." I
don't know if I'm addressing the myriad of questions that you had but I would
like to try to -- if you can rephrase some of them I would like to try to
answer them.
DR.
STARK: If the chair and the panel think
we have time.
DR.
IBBOTT: Let's wait until after lunch
and we'll have that detailed discussion this afternoon.
DR.
CASTELLINO: Can you write them out so I
can think about them?
DR.
STARK: I'm not sure of the
protocol. I'll ask for advice.
DR.
IBBOTT: I don't think there is any
reason why you shouldn't present those questions and let them think about them
over lunch.
DR.
CASTELLINO: That would be very helpful
because they are a lot and I think they are important questions. Thank you.
DR.
IBBOTT: Again, I'll take this
opportunity to ask Dr. Mehta if he has any questions that require clarification
at this point.
DR.
MEHTA: No, I don't.
DR.
IBBOTT: All right. Thank you.
DR.
SOLOMON: Do we have time for any more
questions?
DR.
IBBOTT: Well, certainly. Especially if it's appropriate now to get
clarification on something before we break.
DR.
SOLOMON: I guess I have a couple of
questions for Dr. Delgado. I guess they
start off by asking you a little bit more about what your experience was with
the system and then, more specifically, did you find that you as a radiologist
or any of your colleagues were using the CAD system or becoming more dependent on the CAD system and not quite giving
it the same kind of read that you would give ordinarily? Also, what was the impact on the time that
you spent on a case? Did it make it
longer or shorter? Why don't you answer
those.
DR.
DELGADO: Okay. Thank you.
I think those are good questions.
First of all, we did not do any time analysis with and without CAD or
separate, just soft-copy interpretation and then soft-copy interpretation
without CAD and then subsequently with CAD.
I
think it goes to say that if you are doing the second review that there might
be a time factor that would be slightly increased and that may be something to
be quantified. However, in my
experience I think, first of all, the first question is people were instructed
through the training phase that this device was to be utilized through a
primary read in which you make decisions on whether you see or detect a lesion
and then there is a way for you to mark it.
Then you activate the CAD and then you go through, as Dr. Castellino
said, really not the whole entire study again but only those images that
identified a lung nodule. It might be
on average three per case or so where you might click on a button and that
would take you immediately to that axial image and show you a lesion of which
then the radiologist would make a decision, "Did I miss this? Is this a significant mark that I would
consider actionable?"
Or,
if not, then easily discharge and be done with it. If it was a mark that is considered a false positive, that would be
discarded easily. I think we did have a
few of our radiologists who initially asked the question, "Well, is this
benign or malignant?"
Yet,
we made sure and I as the principal doctor in charge of this made sure to
remind them that this was not the purpose of this device. It's really only to present you with a
nodule that you may have missed and give you the ability to either add that to
your findings or completely discard it.
Does that answer your question perhaps?
DR.
KRUPINSKI: This will probably be more
for Dave. On point of clarification,
you've got a quadrant and suppose the CAD during the initial view the reader
says there's nothing there. There
really is a nodule and then the CAD comes up and points out the nodule and a
false positive.
Now
the reader increases their confidence and now do you consider that in the
analysis and how can you be sure? Do
you consider that a true positive and an increase in behavior when, in fact,
the radiologist was looking at the false positive? Is there any way without localization to establish that? If
you were then to take your cases and throw away any instances where the CAD
marked a true and a false positive and the reader went from "false
negative to true positive" what then happens to the ROC curves? Admittedly, although you've got statistical
significance, those curves are pretty darn close and you've got these ambiguous
cases now. How do you deal with that?
MR.
MILLER: Well, the short answer is that
we don't know precisely what happens in those instances. It was not captured. Bob Wagner talked about this best of both
worlds scenario. We really tried in the
way that we did the study not to take the readers out of their normal reading
environment.
We
felt that was very important and so capturing additional data was something
that we thought could take them outside of their reading environment and create
some kind of placebo effect essentially.
We don't have that data on which one of the nodules or which one of the
findings, I should say, which one of the CAD marks they are reacting to.
Now,
having said that, we did after we completed the ANOVA-after-jackknife analysis
you can pull out from that analysis which cases are the ones that were most
favorable in terms of producing a CAD effect and which cases are least
favorable in terms of producing a CAD worse effect.
I
sat down with a dozen or so of those cases with Ron Castellino, our chief
medical officer, and went through them and said, "Is it obvious what
they're reacting to here?" In the
overwhelming majority of the cases it was obvious what they were reacting
to.
The
number of marks per case is small enough that it is fairly unlikely -- I should
say fairly. The case where you have
multiple close to positive findings in a quadrant is not very common. It's common to have two in a quadrant but
most of the false marks are very easily dismissable.
I
mean, our engineers hate it when I say this but there are some vessels. I mean, not a statistician I look at it and
I say, "That's a vessel." So
the radiologist, it's really easy for them to dismiss those.
I
guess the short answer is we did not do the analysis that you are suggesting
but I completely take your point that it's important to figure out what was
really going on in the ratings. I think
I have a pretty good feel for it that they were reacting to true positives.
DR.
KRUPINSKI: So you rate them all as true
positives?
MR.
MILLER: Yeah. I mean, the only thing that -- I mean, just from a programming
perspective, the only thing that is fed into the analysis is the truth for the
quadrants and the ratings. Whether
there were or were not CAD marks there is not actually in the analysis.
You
could do an analysis that was more of a parametric model and a fixed effect
model where you tried to capture whether it was the quadrants with CAD marks
that were causing the increase, but I think it's reasonably obvious that they
are. In trying to model that, it gets pretty messy building that on top of the
models that we already did.
Just
while I'm up here, I did really quickly want to comment on the issue about the
sensitivity, the back and forth about that table. I think you were doing a weighted average of some numbers in a
table and we'll come back to that later, I think.
The
sensitivity number -- I mean, it's just incredibly variable depending on sort
of which reference truth you use and so if you hear different numbers going
back and forth, it's not necessarily inconsistent. Two people may actually be both reading sort of off the same page
but in a slightly different spot on the page.
Thanks.
DR.
IBBOTT: Thank you. At this point Dr. Stark has a couple of
questions he's going to raise now to be discussed later this afternoon.
DR.
STARK: Actually, it's a response to Dr.
Castellino's question which I respect and it's fair. I have been working very, very hard for this because, as we'll
discuss later, I have spent 15 years wondering why my ROC based prediction that
MRI for detection of liver cancer in 1985 was significantly better than
CT. That was wrong. I think I know why and I think this group
here, the industry group and the panel, I think, we're at the nub of it.
Dr.
Castellino, rather than have us, given the formality and the importance of this,
scratching on pieces of paper, I've asked the chair to allow me to read. I've formed a question and I'm going to read
it into the record and I'll give you my handwritten copy of what I'm going to
read just so that we're clear on this.
Forgive me. You've seen me
scrambling over three minutes here. If
any of this is unclear, I'll rephrase it.
Thank you for offering to do this.
Would you please calculate
from the data and/or literature discussed or presented here today, and in your
submission, the net decrease in false negative rate which we have here today
estimated to be 24 percent for practicing radiologists working by themselves
when those radiologists in the future, we're projecting, are to add this
technology and these results, these data to their practice, specifically
accounting for what Dr. Conant was just asking about, accounting for and not
crediting as a detection or improvement with the addition of CAD those
quadrants or patients as you compile the data where CAD marked a false positive
lesion in a quadrant where the radiologist alone had a false negative.
Where
that radiologist, in other words, failed to recognize a true lesion false
negative for the radiologist that was not subsequently marked by the CAD.
I
have this written down. I think that
translates into English and I would be happy to clarify. Feel free to grab me during lunch if there
is some nuance of that that would make a better question.
DR.
IBBOTT: All right. Thank you.
At this point then, we'll call this session to a close and break for
lunch and we will reconvene at 1:15, just a little less than an hour. Thank you.
(Whereupon,
at 12:21 p.m., the meeting went off the record until 1:18 p.m.)
A-F-T-E-R-N-O-O-N S-E-S-S-I-O-N
1:18
p.m.
DR.
IBBOTT: Could I get you to take your
seats, please, and we'll continue.
Thank you. I would like now to
call the meeting back to order and I would like to remind public observers of
the meeting that while this portion of the meeting is open to public observation,
public attendees may not participate unless specifically requested to do so by
the chair. At this point Mr. Doyle has
a statement to make.
MR.
DOYLE: Yes. R2 has approached me and indicated that they have developed
answers to the questions that Dr. Stark proposed at the end of the morning
session. In an effort to keep the
meeting moving with the schedule we have, I have asked them to present those
answers at the beginning of the discussion section this afternoon. They have the answers ready and I would just
ask for the flow of the meeting to present those at that time. Thank you.
DR.
IBBOTT: Thank you. We will now continue with the FDA's
presentation on this PMA which will be introduced by Dr. Phillips.
Dr.
Phillips.
DR.
PHILLIPS: Well, in case you forgot what
we're doing over lunch, we are discussing the ImageChecker CT CAD by R2
Technology. It is a system that
analyzes and displays to assist radiologists in review of multi-slice CT exams
of the chest and in the detection of solid pulmonary tumors.
It
is composed of several items. It's a
combination of software and a computer.
The system is a work station which is the ImageChecker CT Model LN-500. This was approved for marketing under a
510(k) K023003, the software which is the operating system for the product that
we are looking at today.
Again,
the indications for use, and I don't need to read those. Then this was reviewed within FDA by a
rather extensive team. Michael
Kuchinski was the team leader; William Sacks was the clinical reviewer; Teng
Weng was the statistics reviewer; Robert Wagner and Nicholas Petrick
reviewed the analysis methodology; Joseph Jorgens reviewed the software; Larry
Stevens did bioresearch monitoring; Fleadia Farrah did the manufacturing. That's the quality systems regulation; and
Ronald Kaczmarek reviewed it from epidemiological basis.
Two
people will present to you today, Bill Sacks and Nicholas Petrick, discussing
the PMA. The other reviews were all
found to be satisfactory and we are moving on from there.
With
that, Bill Sacks.
DR.
SACKS: I apologize for the jaundiced
look of that. It wasn't so bad in the
rooms we were testing this in. Okay. I'm going to just give some background. Then Nick Petrick will present the data from
the clinical study and then I'll come back and draw some conclusions.
The
outline of my introductory comments: I'll say something about the character of
the device for those of you who did, in fact, forget over lunch; something about
the clinical utility; a point about the instructions for use; and some issues
that are new to this particular PMA.
First
on the character of the device. Just to
remind you, this is for chest CT scans and for CTs that are done for any
indication the algorithm is trained to detect solid lung nodules, not, for
example, ground glass opacities. It is
trained to detect nodules between 4 and 30 mm.
Also
there was a Hounsfield unit cutoff, which is just CT numbers, the amount of
radiographic attenuation, that needs to be above -100. In particular, this is a computer-aided detector. Just to say a word about the difference
between computer-aided detection and computer-aided diagnosis, a point I made
earlier.
The
difference between detection and discrimination lies not in the instrument but
in the clinical use to which it's being put.
The detector system, which is what we're talking about today, this
left-hand column, scans entire images whereas a discriminator only scans portions
that are selected by the user. The
detector marks the images where a discriminator will give a level of suspicion
that is just a number. As I say, the
same device will do both but it is thresholded to give you marks when it's
acting as a detector.
On
clinical utility, as we've heard, many nodules are missed in clinical practice
for two major reasons. One, other
pathology distracts and hundreds of images are present in one CT of the
chest. Indeed, you may start out as a
board certified radiologist and after reading 500 images you are certifiably
bored.
A
CAD is intended to reduce the missed nodules, this CAD. That is, it is intended to increase the
user's sensitivity to detecting lung nodules.
We will come back to this point.
Instructions
for use. The important points are that
the reader should review the films unaided first. Then the CAD marks the candidate nodules. Then the reader looks again in the vicinity
of those marks.
If
the CAD fails to mark a nodule that was judged actionable on the initial
unaided review, the instruction in the labeling reads that the reader should
retain that initial judgment, not back off just because the CAD failed to mark
it. We will come back to this in my
closing comments.
Issues
that are new to this PMA are the particular choice of target for the CAD
algorithm, the definition of truth, the unit of analysis and endpoints. I'll say something about each of those.
First,
on the CAD target, the target is not malignant nodules but actionable nodules
as we've heard which, among other things, means that the definition of truth is
not based on biopsy or tissue histology which would be an external standard,
but rather based on the judgment of an expert panel that is an internal
standard based on the very images that are being evaluated here.
The
unit of analysis, as we've seen, at one level of the statistical unit is the
person but it's further broken down into lung quadrants and Nick Petrick will
say more about that.
Finally,
the end points. One could do entire
ROC curves, as was done, and one could, as Bob Wagner explained this morning, in
addition, or instead of, do the sensitivity and specificity of a particular
action recommendation which was not, in fact, done in this particular study.
In
summary, again, just to remind you, the clinical study consisted of three
expert radiologists drawn from a group of 11 but three at a time on a panel to
determine what was called by the company reference truth for each nodule. Then there were 15 completely different
radiologists with a range of experience, not necessarily experts, that were
called the readers and they all 15 read all 90 cases and the 90 subjects were
divided into 360 lung quadrants. Those
15 readers used a 100 point scale for a confidence and actionability rating for
each case.
Now
I'll introduce Nick Petrick who will give you the clinical data.
DR.
PETRICK: Okay. So my name is Nick Petrick and I will go
through -- let me see which one of these works.
I'll go through the clinical results that were done by the sponsor and
some of our perspective. The outline of
my talk will be first to talk about the applicability of Az in the
analysis. Here I'm using the term Az
which is somewhat more of a technical term but this is the same as the area
under the curve or AUC. Other people may
call it area under the curve or AUC but I'm going to use that as meaning the
same thing here.
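As an illustration of what an area under the ROC curve measures, here is a minimal sketch that computes it directly from ratings. Note the assumptions: Az conventionally refers to the area under a fitted (binormal) curve, whereas this sketch swaps in the simpler nonparametric estimate, the fraction of positive/negative pairs in which the positive quadrant received the higher rating (ties count half). The ratings and the function name are hypothetical and are not study data.

```python
def empirical_auc(positive_ratings, negative_ratings):
    """Probability that a random positive outscores a random negative (ties count half)."""
    wins = 0.0
    for p in positive_ratings:
        for n in negative_ratings:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_ratings) * len(negative_ratings))

# Hypothetical 0-100 ratings for nodule-present and nodule-absent quadrants.
positives = [90, 70, 55, 30]
negatives = [60, 40, 20, 10, 5]

# 17 of the 20 positive/negative pairs are ordered correctly.
print(empirical_auc(positives, negatives))  # 0.85
```

An area of 1.0 would mean every positive quadrant was rated above every negative one; 0.5 corresponds to chance.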
I
will also talk about and somewhat review what the sponsor presented on the pool
of cases used for the clinical study.
I'll talk about the definition of actionable nodules by the panel of
experts. Then I'll go into the
particulars of the clinical study.
In
particular, I'll talk about the primary analysis which was analysis using a
fixed panel of experts and then what is somewhat of importance here, the secondary
analysis which was the analysis using random panels of experts.
Then
I'll finish up my presentation by talking about the measurement of CAD
stand-alone performance. When I'm
talking about stand-alone performances this is the algorithm performance with
no reader involvement.
Okay. So for the applicability of the Az here,
I show one of the sponsor's curves for the average reader ROC from pre to post
CAD and this had a change in the area under the curve of .024 and a p-value as
shown there .003.
What's
important to note about the applicability of the Az is that the green curve
here is the pre-CAD and the reddish curve is the post-CAD. And what we're looking for is that the two
curves don't cross. That is an
important measure if we are going to use Az as an overall performance measure
for ROC analysis. What we find from
this average curve is that generally the post-CAD curve is higher or on the
same order as the pre-CAD curve.
So
just to summarize this, the pre and post-CAD curves did not cross in the average
performance I showed before. I think,
more importantly, there was no substantial pre or post-CAD crossing in either
the average or individual ROC curves.
This is important. That makes the Az a statistically
appropriate performance measure for this type of analysis. If they had a significant crossing, we would
have had to look at some sort of partial area or some other measure of
performance in that situation. Because
of this conclusion the sponsor used Az as a figure of merit in all their
analysis that follows.
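The crossing check described here can be sketched as below. This is an assumption-laden illustration, not the method used in the review: it supposes both curves have been sampled at the same false-positive fractions, and it treats differences smaller than a chosen tolerance as "no substantial crossing". The function name, the tolerance, and all operating points are hypothetical.

```python
def curves_cross(tpf_pre, tpf_post, tolerance=0.0):
    """True if the post-CAD curve is above pre-CAD at one FPF and below at another.

    Both lists hold true-positive fractions sampled at the same
    false-positive fractions. Differences no larger than `tolerance`
    are ignored, which is one way to express "no substantial crossing".
    """
    diffs = [post - pre for pre, post in zip(tpf_pre, tpf_post)]
    return any(d > tolerance for d in diffs) and any(d < -tolerance for d in diffs)

# Hypothetical operating points on a shared FPF grid of 0.0, 0.1, 0.2, 0.5, 1.0.
pre_cad = [0.0, 0.55, 0.70, 0.88, 1.0]
post_cad = [0.0, 0.60, 0.75, 0.90, 1.0]   # at or above pre-CAD everywhere
crossing = [0.0, 0.50, 0.75, 0.90, 1.0]   # below pre-CAD at FPF 0.1, above elsewhere

print(curves_cross(pre_cad, post_cad))                 # False
print(curves_cross(pre_cad, crossing))                 # True
print(curves_cross(pre_cad, crossing, tolerance=0.1))  # False
```

When the curves do cross, a single total area can hide the fact that one modality is better only in part of the operating range, which is why a partial area is mentioned as the fallback.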
Okay. Now to talk about the pool of cases. Again, just sort of a summary of what the
sponsor had talked about before. There
is a pool of cases. There was a subset
of that which was made of nodule cases.
These were documented cancer cases so the primary neoplasm or
extra-thoracic neoplasm with presumptive spread to the lungs. That is the set of nodule cases. The cases were allowed to contain non-nodule
pathologic processes, things like pneumonia or emphysema and so forth were
allowed to be part of that subgroup.
They
took another set of cases. These were
considered the non-nodule cases and what they term or what can be termed as
normal cases where there was no nodule deemed present by the site PI and that
site PI primarily relied upon original radiology reports in coming to that
determination.
These
cases could include a history of cancer, radiation therapy, or even previous
thoracotomy were allowed to be in this data set. This is a pool of cases that now the sponsor will pull out cases
to run their ROC reading studies from.
At
this point we're not going to talk about -- we are going to talk about
actionable nodules or the object of interest in this application. In particular, there is a panel of expert
radiologists that identified the actionable nodules. This was done in a two-stage process, again, just as a review as
before.
In
the first reading the cases were read independently and blinded by three expert
radiologists. The information provided
to the radiologists was the subject's age, gender, and indication for the
exam, obviously along with the exam as well.
Each
individual radiologist marked all findings deemed to be lung nodules. Then the radiologist provided ratings for
each of those nodules so there is a detection test and then there's a rating of
the actionability of that nodule. It
could have fallen into an interventional category. That is an actionable finding where further workup was advised.
Or
a surveillance category, which is, again, considered an actionable finding that was
monitored with follow-up studies, and this would probably be more typically
additional CTs. Also, they could have
rated it as probably benign calcified.
Again, no action required here, or probably benign noncalcified, no
action required.
After
the first pass was done, findings that lacked 100 percent consensus after that
first pass were reviewed unblinded by all three radiologists and basically they
are going to reevaluate locations where either two out of three of the panel or
one out of three of the panel call the location a nodule. Then the radiologists would rate or rerate
these on the actionability of the nodule candidates.
Along
with this, thresholding was applied to match the general performance of the
area where the algorithm should be performing, and so thresholds of greater
than 4 mm in diameter for each nodule candidate and a peak density of greater
than -100 Hounsfield units. This
is considered a CT number and is related to the attenuation coefficient in
grayscales in the CT exam.
Then
after each nodule was identified, each lung quadrant was categorized based on
the highest actionable finding within that quadrant. Then subsequently the quadrants will be used in the observer
studies.
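The thresholding and quadrant-labeling steps just described can be sketched roughly as follows. The thresholds (diameter greater than 4 mm, peak density greater than -100 Hounsfield units) and the four rating categories come from the testimony; everything else is an assumption made for illustration, including the data layout, the quadrant names, and the numeric ranking used to decide which finding is the "highest" in a quadrant.

```python
from dataclasses import dataclass

# Assumed ordering, from most to least actionable; the transcript names the
# categories but does not spell out a numeric ranking.
RANK = {
    "interventional": 3,
    "surveillance": 2,
    "probably_benign_noncalcified": 1,
    "probably_benign_calcified": 1,
}
ACTIONABLE = {"interventional", "surveillance"}

@dataclass
class Finding:
    quadrant: str            # hypothetical quadrant label, e.g. "RU" for right upper
    diameter_mm: float
    peak_density_hu: float
    category: str

def passes_thresholds(finding):
    """Size and density cutoffs quoted in the testimony."""
    return finding.diameter_mm > 4.0 and finding.peak_density_hu > -100.0

def label_quadrants(findings, quadrants):
    """Map each quadrant to True if its highest-ranked qualifying finding is actionable."""
    best = {q: None for q in quadrants}
    for f in findings:
        if not passes_thresholds(f):
            continue
        if best[f.quadrant] is None or RANK[f.category] > RANK[best[f.quadrant]]:
            best[f.quadrant] = f.category
    return {q: category in ACTIONABLE for q, category in best.items()}

# Hypothetical findings for one subject.
findings = [
    Finding("RU", 6.0, 20.0, "surveillance"),
    Finding("RU", 8.0, 30.0, "interventional"),
    Finding("LL", 3.0, 40.0, "interventional"),        # too small
    Finding("LU", 5.0, -300.0, "surveillance"),        # too low in density
    Finding("RL", 7.0, 100.0, "probably_benign_calcified"),
]
labels = label_quadrants(findings, ["RU", "RL", "LU", "LL"])
print(labels)
```

In this made-up subject only the right upper quadrant ends up positive, since the other candidates either fail a threshold or are rated as requiring no action.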
Now,
just to summarize what was found in that initial pass, again, this is three
experts per panel. I'll show in this
column the unanimous actionable. That's
three out of three finding. Majority
actionable two out of three. Minority
actionable one out of three. You can
see that for unanimous actionable there were 142 findings. For majority there were 168. For minority there were 149 findings.
This
gives you somewhat of an indication that panel variability is an important
component here. There's a lot of cases,
almost a third -- only about a third of the cases were unanimously actionable
and another third or so were two out of three, and another third were one out
of three. This gave the FDA an
indication that panel variability was an important component and probably
should be taken into account in the clinical study.
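The "about a third each" observation can be checked directly from the three counts quoted above. This is a small worked calculation, with the dictionary keys chosen only for readability.

```python
# Counts of findings by level of panel agreement, as quoted in the testimony.
counts = {
    "unanimous (3 of 3)": 142,
    "majority (2 of 3)": 168,
    "minority (1 of 3)": 149,
}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

print(total)  # 459
for label, share in shares.items():
    print(f"{label}: {share:.1%}")
# unanimous (3 of 3): 30.9%
# majority (2 of 3): 36.6%
# minority (1 of 3): 32.5%
```

So fewer than a third of the 459 findings had all three panelists agreeing that the finding was actionable, which is the basis for treating panel variability as a component of the analysis.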
Now
to go into the clinical study, there were multi-reader, multi-case ROC observer
studies. Again, the test statistic was
the Az or area under the curve. I'll
present net results based on analysis of the 90-case data set, 360 quadrants. The sponsor also performed a 32-case study
and also presented pooled results of the 32 and 90 cases. I'll just limit myself to the 90-case study.
What's
important is that the MRMC analysis allows us to look at the variability, confidence intervals,
and significance testing and we can take those into account. That is important obviously in this case to
determine significance and then to try to get an idea of what the separation is
between the reading without CAD and reading with the CAD device.
In
order to analyze the variability confidence intervals and significance two
approaches were used, ANOVA-after-jackknife and bootstrap analysis. So here is just the general flow chart to
the clinical study and this will be followed for all the clinical studies. The study starts out with a pool of
readers. These are going to be the group
of radiologists that are going to actually read the cases and give rankings for
each quadrant.
There's
a pool of cases and there's a pool of experts and the experts will be used to
define truth. There will be a sample
pulled out of cases. It will be used by
the pool of experts to define nodules.
There will be a set of readers picked out. Those cases will then be read using multi-reader multi-case ROC
observer study and an estimate of the Az will be calculated. This could then be redone for different case
sets, different reader sets, and potentially different experts on a panel.
So
the important components here are how to measure the variability confidence
intervals and do significance testing.
Again, two approaches were taken, ANOVA-after-jackknife analysis. This is a parametric type of analysis and
the jackknife is a leave-one-case-out type of analysis.
Again,
we're talking about leaving out a whole case so you're leaving out all four
quadrants together and then performing a quadrant-based analysis on that. So just as a quick example, if we had a case
set of case one, two, and three, when jackknifing is performed or leave one
case out, the first partition is going to be one and two. We've left out case three. The second partition may be cases one and
three; case two has been left out.
The final
partition would be two and three, leaving case one out. Then using those partitions and looking at
the pseudo values that come out of that you can use ANOVA to estimate the
variability confidence intervals and significance. The analysis assumes modality as a fixed effect and readers,
cases, and all interactions as random effects in the ANOVA.
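The leave-one-case-out partitions and the pseudovalues mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions: the pseudovalue formula is the standard jackknife one (n times the full-sample estimate minus n - 1 times the leave-one-out estimate), the statistic here is a simple mean of hypothetical per-case scores standing in for Az, and the ANOVA step that follows in the actual analysis is not shown.

```python
def leave_one_case_out(cases):
    """Yield (left_out_case, remaining_cases), leaving out the last case first."""
    for i in reversed(range(len(cases))):
        yield cases[i], cases[:i] + cases[i + 1:]

def pseudovalues(cases, statistic):
    """Jackknife pseudovalues: n * theta_all - (n - 1) * theta_without_case."""
    n = len(cases)
    theta_all = statistic(cases)
    return [n * theta_all - (n - 1) * statistic(rest)
            for _, rest in leave_one_case_out(cases)]

# The three-case example from the testimony.
partitions = [rest for _, rest in leave_one_case_out([1, 2, 3])]
print(partitions)  # [[1, 2], [1, 3], [2, 3]]

# Hypothetical per-case scores; in the study the statistic would be Az
# computed over all quadrants of the cases remaining in the partition.
scores = {1: 0.80, 2: 0.90, 3: 0.70}

def mean_score(case_set):
    return sum(scores[c] for c in case_set) / len(case_set)

print(pseudovalues([1, 2, 3], mean_score))
```

For a plain mean, each pseudovalue simply recovers the score of the case that was left out, which is a convenient sanity check on the formula. Leaving out a whole case, rather than a single quadrant, keeps the four quadrants of one subject together.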
A
second approach to doing this is bootstrap analysis and this becomes important
to look at variability of the truth panel.
This is, again, just to repeat, a nonparametric analysis. What happens is randomly generated data sets
are created based on the original data using replacement. Just as another quick example, with a case
set of one, two, and three again, when you run bootstrap you sample with replacement. For
the first partition, you randomly pick maybe case three, case two, and case
three.
When
you do the analysis you assume that case three and case three are really
separate events and we bootstrap across those to get those potential
partitions. The second partition you
may pick case three, case one and case two.
Here all the cases have shown up equally. Then a third partition may be case one, case one, and case two
and so forth.
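Resampling cases with replacement, as in the example above, can be sketched like this. It is an illustration only: the function names, the seed, the number of resampled sets, and the per-case scores are all hypothetical, and the statistic is again a simple mean standing in for Az. The sketch resamples cases only; resampling readers or truth panels would follow the same pattern.

```python
import random
import statistics

def bootstrap_case_sets(cases, n_sets, seed=0):
    """Draw n_sets resampled case lists, each the same size as the original."""
    rng = random.Random(seed)
    return [[rng.choice(cases) for _ in cases] for _ in range(n_sets)]

def bootstrap_standard_error(cases, statistic, n_sets=1000, seed=0):
    """Standard deviation of the statistic across the resampled case sets."""
    replicates = [statistic(s) for s in bootstrap_case_sets(cases, n_sets, seed)]
    return statistics.stdev(replicates)

# Hypothetical per-case scores standing in for a figure of merit such as Az.
scores = {1: 0.80, 2: 0.90, 3: 0.70}

def mean_score(case_set):
    return sum(scores[c] for c in case_set) / len(case_set)

# Each resampled set may repeat a case or omit one entirely.
for case_set in bootstrap_case_sets([1, 2, 3], n_sets=3):
    print(case_set)

print(bootstrap_standard_error([1, 2, 3], mean_score))
```

Because a case drawn twice is treated as two separate events, the spread of the statistic across resampled sets gives an estimate of its variability without assuming a parametric model.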
So
the primary analysis, again, the same basic diagram as before but now there's a
resampling scheme introduced into the analysis. The resampling is used for the pool of readers, again, the people
that are going to -- the radiologists that are going to rank the quadrants and
the pool of cases.
The
truth is based on a fixed three-member nodule definition panel, again, based on
unanimous consensus. The analysis will
be based on ANOVA-after-jackknife. A
bootstrap analysis was also performed.
What happens here is a pool of readers goes in. It's resampled so it picks out a subset of readers. Likewise a subset of cases is selected using
a resampling scheme. The cases go into
the definition panel where the panel is fixed and define the actual nodules of
interest or the quadrants that are positive or those that are negative.
The
set of readers are then randomly selected and go in and perform the ROC
experiment. That gives one estimate of
Az. This process is repeated either
through jackknife or bootstrapping in order to get estimates for the
variability and allow for confidence intervals and significance testing.
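One way to picture the repeated resampling of readers and cases is the Python sketch below. It covers only the bootstrap variant and uses a simple percentile interval; the function name, the `estimate` callback, and the toy data are assumptions for illustration, not the sponsor's procedure.

```python
import random


def bootstrap_interval(readers, cases, estimate, n_rep=1000, alpha=0.05, seed=0):
    """Percentile interval for estimate(readers, cases) under resampling.

    Readers and cases are each resampled with replacement on every repetition.
    estimate is a caller-supplied (hypothetical) function returning one number,
    for example the change in Az for that resampled study.
    """
    rng = random.Random(seed)
    values = []
    for _ in range(n_rep):
        r = rng.choices(readers, k=len(readers))
        c = rng.choices(cases, k=len(cases))
        values.append(estimate(r, c))
    values.sort()
    lower = values[int((alpha / 2) * n_rep)]
    upper = values[int((1 - alpha / 2) * n_rep) - 1]
    return lower, upper


# Toy usage: the "estimate" is just the fraction of resampled cases scored 1.
toy_cases = [1, 1, 0, 1, 0, 1, 1, 0]
lower, upper = bootstrap_interval(["r1", "r2"], toy_cases,
                                  lambda r, c: sum(c) / len(c))
```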
So
just the result of the clinical study.
Again, this is for a fixed three-member nodule definition panel. In the first column I show the pre-CAD Az
for both jackknife and bootstrap. The
second column is post-CAD, the change in the Az, the p-value for that
particular test, and the lower and upper confidence intervals.
You
can see that the results are fairly consistent between both jackknife and bootstrap
with a pre-CAD Az of .881 or .879, post-CAD increasing to .905 or .903. With change on the order of .024 we see
fairly small p-values for both the jackknife and bootstrapping. Then the confidence intervals also fairly
consistent.
We
wouldn't necessarily expect the bootstrap and the ANOVA to give us the same
values but it's nice actually to see that there is consistency here between the
two analyses.
So
just some conclusions on the primary analysis.
The sponsor has shown a statistically significant improvement in Az from
pre to post-CAD, a change in area under the
curve on the order of .024.
The
ANOVA-after-jackknife and bootstrap analysis showed consistent performance in
both significance and confidence intervals.
The analysis, however, was limited because it did not take into account
any variation in the expert panel.
Variability of the panel would add uncertainty to the performance
estimates, or we anticipate that variability in the panel would add uncertainty
to the performance estimates.
This
is, I think, an important factor because we don't have this gold standard of
truth. We are dealing with a panel
truth. We expect if we sampled a new
panel, they may come up with a different set of cases. They certainly would come up with some
different nodules there.
One
of the important questions is how would performance change with a different
panel makeup. That is one of the
questions that we had talked to the sponsor about addressing. In particular, looking at a different number
of panel members so if you have a different panel makeup or a different
definition of truth potentially and different sets. What happens if another set of experts was used.
So
a secondary analysis was conducted here.
I'll show there are many different types of analysis done by the
sponsor. I'll concentrate on one set of
random panel makeup. This will be based
on a random three, two, or one-member panel, nodule definition panels and
assuming the definition for truth is unanimous consensus.
Because
of this type of analysis the ANOVA-after-jackknife isn't applicable at this
point so only bootstrap analysis is possible.
It follows a similar scheme as before.
We, again, start with a pool of readers, pool of cases, pool of
experts. Here, however, bootstrapping
is applied to the pool of experts as well so that we have a different panel
makeup for defining truth. That adds
variability into that definition of truth and we can use our MRMC ROC observer
study to take into account that variability.
So
we use bootstrapping to select a group of readers, a group of cases, and a
group of experts. Again, with that
particular combination we get an estimate for Az. That study is repeated a number of times to allow again to look
at variability where we have included variability of the truth.
So,
again, these are random three, two, and one member nodule definition
panels. When I'm talking three-member
panels I'm saying unanimous consensus.
Three out of three have to agree.
When I get results for two members that means two members.
They
both have to agree. Obviously for
one-member panel it is the opinion of one of the members. The sponsors randomly sampled that panel so
that we get the added variability from having many different experts involved.
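The random-panel idea can be sketched in Python as follows. The expert marks are invented, and drawing the panel with replacement is an assumption; the transcript does not say whether the same expert could be drawn twice.

```python
import random


def unanimous_truth(marks_by_expert, panel):
    """Findings that every member of the panel marked (unanimous consensus)."""
    return set.intersection(*(set(marks_by_expert[e]) for e in panel))


def draw_panel(expert_pool, panel_size, rng):
    """Bootstrap a panel from the expert pool (with replacement, by assumption)."""
    return rng.choices(expert_pool, k=panel_size)


# Hypothetical marks: expert -> ids of findings called actionable nodules.
marks = {"A": {1, 2, 3}, "B": {2, 3}, "C": {3, 4}}

three = unanimous_truth(marks, ["A", "B", "C"])  # all three must agree
two = unanimous_truth(marks, ["A", "B"])         # both must agree
one = unanimous_truth(marks, ["A"])              # one expert's opinion

panel = draw_panel(sorted(marks), 2, random.Random(0))
```

In this toy data the set of positive findings grows as the panel shrinks, so changing the panel size also changes the mix of positive and negative cases.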
Again,
the same layout here. The pre-CAD Az,
the post-CAD, the change, the p-value, and the lower and upper confidence
intervals. We can see from pre-CAD this
measurement of performance was .845 increasing to .868.
For
the three-member random panel a change of .022. For a two-member panel it was .832 increasing to .854, again a
change of about .022. One-member panel
.817 increasing to .838. Again, a
change of about .021, but very
similar, .022 on average.
We
also see fairly consistent upper and lower confidence intervals for all
different definitions of the truth.
Then we see the significance values which are fairly small as well. That's sort of interesting because what I
talked about before was that we expected when we incorporate randomness of the
panel in here, we would see an increase or a decrease in the statistical
significance, that this would be a harder -- that it would be harder to show
statistical significance.
Really
we see similar p-values to what we saw when we had a fixed-member panel. One of the possibilities or one of the
trade-offs that may have occurred was something that Dr. Wagner talked about
this morning where when the definition of truth is varied, we have also varied
the case mix or the differentiation between negative and positive findings so
we have now moved ourselves potentially more off the curve where we have a more
closely balanced study, which gives us effectively a larger number of cases or a
larger number of effective cases.
That
was traded off against the variation in the truth. Those seem to potentially have traded each other off where we
don't see a big difference in the performance.
This is one possibility. It's
certainly not conclusive in any way but it is somewhat surprising that we didn't
see a larger variation in the truth when we randomize it.
So
just some conclusions on the secondary analysis. This analysis takes into account the random nature of the expert
panel for defining actual nodules. In
particular, it took into account different number of panel members and
different panel makeup using a bootstrap selection of the panel.
All
variations of the panel makeup confirmed a statistically significant
improvement in the Az from pre to post-CAD and this change was on the order of .02. And just a more general conclusion, this
type of analysis where we actually tried to randomize the panel makeup is
likely to be a more appropriate type of analysis for assessment of devices when
panel truth -- when only panel truth is available. That's obviously the case here but we can anticipate other
devices potentially coming in where this will again be an issue.
Finally,
I would like to talk about CAD stand-alone performance. In particular, this is a performance of the
CAD algorithm alone and it's the algorithm's sensitivity and specificity with
no reader involvement so we are just going to measure the performance of the
algorithm on some set of cases or defined nodules.
Why
may this be important? Well, it's
generally important because the radiologist can use this information to
appropriately weigh their confidence in the CAD marking so this is a
measure. If you are a reader or a
radiologist trying to purchase this device, you generally like to know how it
would work. Or if you have the device
to use, to get a feel for how it's performing and what it might be marking.
Likewise,
it potentially can be used as a benchmark for future revisions of the algorithm
so from an FDA perspective knowing some benchmark of performance may help us to
determine how to evaluate new revisions of this particular algorithm when it
comes in.
The
question becomes what's an appropriate performance measure for this particular
device and this isn't necessarily an easy question to answer. Anecdotally, the sponsor went back to the findings
of the unanimous three-out-of-three fixed-member panel and looked at
the appearance of the nodules that the radiologists marked.
What
they found was that many of those 142 findings did not meet the criteria of
solid discrete spherical density. They
subsequently went back and reconvened a second panel to reevaluate the nodule
but only based on appearance. Not to
find new nodules but just look at the appearance of those nodules defined.
They
put together a set of five independent radiologists and they were asked to
categorize the nodules into two categories, either what they define as classic
nodules, which are discrete, solid,
spherical ovoid nodules, or as nonclassic nodules. These would be nodules that may not be discrete. They may be hyperdense, irregular in
shape. They may be potentially normal
structures that for whatever reason may not be considered nodules at all. This new panel is only going to look at the
appearance of the nodules and determine whether they are classic or nonclassic
in appearance.
This
is a performance. In the first column
I'll show the number of panelists defining the nodule as classic. Again, there was a total of five. I'll just group together zero, one, and two
out of five. I'll give the number of
findings, then the true positive fraction,
the sensitivity of the CAD algorithm for that particular subset of cases.
In
general I'll just summarize the CAD false marker rate. Then I'll give a final column to the median
diameter of the true positives detected.
This is just to give an idea if there is any bias on the size of the
nodule based on how many panelists defined it as classic.
So
in the first category less than three out of five there was a total of about 65
findings. The sensitivity was on the
order of about 32 percent. For three
out of five there was a total of 13 findings, sensitivity of approximately 70
percent. Four out of five of the
panelists saying this is classic in appearance the performance jumps up to
about 82 percent. All five the
performance is about 83 percent.
If
you just combined all these findings together a total, again, of 142 based on
the definition of truth. The
sensitivity is on the order of about 59 percent. The CAD false marker rate, it varied between two and three
depending on whether or not the sponsor incorporated the equivocal
nodules. If you had a five-out-of-five rating, what you did with the
zero, one, two, three, four out of fives whether you included those or not as
false positives would change the median false marker rate but it's on the order
of two or three per case.
In
the final column we see that this is a range of the diameter to those true
positives. You can see that it ranges
from about eight to nine. For the less
than three out of five it was 7.4. For
three out of five it jumped up to about 11 and fell down to seven again. The idea of this column is just to show
there doesn't really seem to be a bias associated with how large the lesion was
based on how they rated it as classic or not.
Just
as a final summary, if there was less than three out of five panelists, there
was approximately 65 findings and the sensitivity was about 32 percent. If it was three or more out of five,
there was about 77 findings. This is
about half and half -- relatively close to half and half for the data set. The sensitivity jumped up to about 81
percent.
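As a quick consistency check on the figures quoted above, the two groups can be pooled in Python. The counts and sensitivities are the approximate values stated in the presentation, so the result is only approximate as well.

```python
# (number of findings, CAD sensitivity) as quoted approximately in the text.
groups = {
    "fewer than three of five": (65, 0.32),
    "remaining findings": (77, 0.81),
}

total_findings = sum(n for n, _ in groups.values())
expected_detected = sum(n * s for n, s in groups.values())
pooled_sensitivity = expected_detected / total_findings

print(f"{total_findings} findings, pooled sensitivity {pooled_sensitivity:.2f}")
# prints "142 findings, pooled sensitivity 0.59"
```

The pooled value agrees with the overall sensitivity of about 59 percent reported for all 142 findings.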
So
just in summary for the CAD stand-alone performance, what was found by the
sponsor was there was a large variation in performance of the CAD based on the
physician's assessment of the nodule's appearance as classic. Whether it was classic or not would make a
big difference on how well the CAD performed.
Just
a note, generally the CAD -- the sponsors talked about the CAD being associated
with these discrete spherical types of lesions and not necessarily some of the
other types of lesions that were potentially marked.
So
just in summary for this part of the presentation, what the sponsor found was
that the -- what we found was that the Az was an appropriate test statistic for
the clinical analysis and this was based on the fact that there was no
substantial crossing of the pre and post-CAD ROC curves.
The
primary analysis, this was based on a fixed three-member expert panel. It showed a statistically significant Az
improvement in the detection with the CAD.
What was also found was the ANOVA-after-jackknife and bootstrap showed
comparable significance testing and confidence intervals.
The
secondary analysis, this was with a variable number of panel members where the
sponsor varied the number of panel members.
They also varied the panel makeup using a bootstrap selection of the
panel members so this is a random panel mix now. This confirms statistically significant Az improvement in the
detection with CAD.
Then,
finally, for this CAD stand-alone performance what was found was that there was
a large variation in CAD performance based on the reassessment of the nodule's
appearance. A more general conclusion
from stand-alone performance is that this type of analysis is necessary for
appropriate utilization of the device by the clinicians in the field and for
potentially reassessment of future algorithm revisions.
Now
I'll turn it over to Dr. Sacks again to make some conclusions.
DR.
SACKS: Okay. I want to then draw some clinical conclusions about this
statistically significant gain.
Granting the statistical significance of a gain in Az of .02, what is
the clinical significance? This is a point that was discussed somewhat this
morning.
Let
me recall for you an earlier slide that I have excerpted this from. That is, that the clinical utility of this
device is that the CAD is intended to reduce the number of missed nodules. That is, it is intended to increase the
user's sensitivity, not increase the area under the curve, although that is
related.
A
gain of .02 in Az understates the relative gain in sensitivity. Why is that? When the CAD is used according to instructions to retain all
judgments of actionability, even if unmarked by the CAD, the user always
necessarily maintains or increases his or her sensitivity and, indeed, always
maintains or increases the false positive fraction as well. They both have to go up. They could stay the same but that would be
an extreme case that wouldn't likely happen, but they cannot go down either
one.
What
that means in ROC space is that -- let me walk you through this slide -- the
blue curve is intended to be a representation of the unaided initial
reading. The red curve is the aided
reading. We've been talking about the
difference between the area under the red curve and the area under the blue curve.
But
if you talk about a particular operating point on the blue curve unaided and
ask what happens when you use the CAD, you move to some point on the red curve
and if you obey those instructions not to back off when the CAD fails to mark
something that you thought was actionable, you necessarily move up and to the
right somewhere in that quadrant such as this arrow here so you move to some
point here.
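The "up and to the right" argument can be illustrated with a small Python sketch. The sites, the reader's calls, and the CAD marks are all hypothetical; the only thing taken from the text is the rule that the aided reading keeps every unaided call and may add CAD-prompted ones.

```python
def operating_point(called_positive, true_nodules, all_sites):
    """Return (FPF, TPF) for the set of sites the reader acts on."""
    negatives = all_sites - true_nodules
    tpf = len(called_positive & true_nodules) / len(true_nodules)
    fpf = len(called_positive & negatives) / len(negatives)
    return fpf, tpf


# Hypothetical study: sites 1-4 are true nodules, sites 5-8 are not.
all_sites = set(range(1, 9))
true_nodules = {1, 2, 3, 4}

unaided = {1, 2, 5}             # reader's calls before seeing CAD marks
cad_accepted = {3, 6}           # CAD marks the reader decides to act on
aided = unaided | cad_accepted  # nothing is withdrawn, per the instructions

before = operating_point(unaided, true_nodules, all_sites)
after = operating_point(aided, true_nodules, all_sites)
```

Because the aided set is a union that contains the unaided set, neither fraction can decrease; here the point moves from (0.25, 0.5) to (0.5, 0.75).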
Now,
Dave Miller showed you a number of representative arrows if you were to use a
particular point on the rating scale on the blue curve and keep that same point
on the rating curve -- on the red curve, the same rating, 80 or 50 or 20.
The
problem is that radiologists while they could read by assigning a number to a
study and always obeying a preset range for themselves saying, "If I
assign any case 70 or more, then I am always going to act on it the same
way.
If
I assign between 40 and 70, I'm always going to act on it the same way. If I assign under 40, I'm always going to
act on it in the same way," then those points might be relevant. Radiologists could do that but I'm a
radiologist and I can tell you radiologists don't do that.
What they do do is they look at a case and they decide, "Do I act on this or do I not?" Or if there is a trichotomy such as in mammography where there is biopsy or short-term follow-up or return in a year for screening, that is the decision you make. That gives you an