FOOD AND DRUG ADMINISTRATION

 

CENTER FOR DEVICES AND RADIOLOGICAL HEALTH

 

RADIOLOGICAL DEVICES ADVISORY PANEL

 

MEETING

 

TUESDAY,


FEBRUARY 3, 2004

 

      The Panel met at 9:00 a.m. in Salons B-D of the Gaithersburg Marriott Washingtonian Center, 9751 Washingtonian Boulevard, Gaithersburg, Maryland, Geoffrey S. Ibbott, Ph.D., Acting Chairman, presiding.

 

PRESENT:

GEOFFREY S. IBBOTT, Ph.D., Acting Chairman

BRENT BLUMENSTEIN, Ph.D., Temporary Voting Member

CHARLES B. BURNS, M.S., P.H., Non-Voting Consumer Rep.

EMILY F. CONANT, M.D., Voting Member

THOMAS FERGUSON, M.D., Temporary Voting Member

ELIZABETH KRUPINSKI, Ph.D., Temporary Voting Member

MINESH P. MEHTA, M.D., via teleconference, Chairman

DEBORAH J. MOORE, Non-Voting Industry Representative

STEPHEN SOLOMON, M.D., Temporary Voting Member

DAVID STARK, M.D., Temporary Voting Member

PRABHAKAR TRIPURANENI, M.D., Voting Member

ROBERT DOYLE, Executive Secretary

 

FDA REPRESENTATIVES:

NANCY BROGDON

NICHOLAS PETRICK, Ph.D.

ROBERT A. PHILLIPS, Ph.D.

WILLIAM SACKS, Ph.D., M.D.

ROBERT F. WAGNER, Ph.D.

 

SPONSOR REPRESENTATIVES:

RONALD CASTELLINO, M.D.

PABLO DELGADO, M.D.

HEBER MacMAHON, M.D.

DAVE MILLER

KATHY O'SHAUGHNESSY, Ph.D.

 

 

                    A-G-E-N-D-A

 

Open Session

 

Call to order and Panel Introduction, Geoffrey Ibbott, Ph.D., Acting Chairman

FDA Introductory Remarks, Robert J. Doyle, Executive Secretary

Update on FDA Radiology Activities, Robert A. Phillips, Ph.D.

Open Public Hearing

Open Public Hearing: interested persons may present data, information, or views, orally or in writing, on issues pending before the committee

Open Committee Discussion

Charge to the Panel, Geoffrey Ibbott, Ph.D.

Overview of Contemporary ROC Methods, Robert F. Wagner, Ph.D.

Presentations on P030012 by Sponsor

      Introduction, Kathy O'Shaughnessy, Ph.D.

      Current Clinical Practice, Heber MacMahon, M.D.

      Device Description and Clinical Trial Introduction, Ronald Castellino, M.D.

      Clinical Study, Dave Miller

      User Experience, Pablo Delgado, M.D.

      Summary, Kathy O'Shaughnessy, Ph.D.

Lunch

Presentations on P030012 by FDA

      PMA Overview, Robert Phillips, Ph.D.

      Clinical Background, William Sacks, Ph.D., M.D.

      Clinical Results, Nicholas Petrick, Ph.D.

      PMA Review Summary, William Sacks, Ph.D., M.D.

Reports by Panel Lead Reviewers

      David Stark, M.D.

      Brent Blumenstein, Ph.D.

Presentation of FDA Questions

Break

Panel Discussion

Open Public Hearing

      Open Public Hearing: interested persons may present data, information, or views, orally or in writing, on issues pending before the committee

Open Committee Deliberations

      Panel Recommendation(s) and vote

      Adjourn

               P-R-O-C-E-E-D-I-N-G-S

                                         9:06 a.m.

            DR. IBBOTT:  I would like to call this meeting of the Radiological Devices Panel to order.  I also want to request that everyone in attendance at this meeting be sure to sign in on the attendance sheet that is available outside the door.  I would note for the record that the voting members present constitute a quorum as required by 21 CFR Part 14.

            At this time I would like each panel member at the table to introduce himself or herself and state his or her specialty, position title, institution, and status on the panel.

            I'll begin with myself.  Some of you have already figured out that I'm not Dr. Mehta.  Thanks to the vagaries of air travel and weather, Dr. Mehta is unable to be here but is joining us by speaker phone.

            I'm Geoff Ibbott.  I'm a medical physicist.  I work at the University of Texas, M.D. Anderson Cancer Center in the Department of Radiation Oncology and Radiation Physics.  I'm a voting member on this panel and have been for several years.  Obviously I'm standing in as chair for this meeting.

            Then, Charles, let's start with you and we'll go around the table and introduce ourselves.

            MR. BURNS:  Charles Burns, Professor of Radiologic Science at the University of North Carolina.  My primary expertise is Imaging Diagnostic Physics and I'm a nonvoting consumer representative.

            DR. IBBOTT:  Thank you.

            DR. MOORE:  I'm Deborah Moore.  I'm the Vice President of Regulatory and Clinical Affairs for Proxima Therapeutics.  I'm the industry representative for the panel and a nonvoting member.

            DR. STARK:  I'm David Stark.  My current title is President of MRI of Dettum in Massachusetts.  I'm a clinical radiologist.  I've been a chairman for close to nine years and I know many of you.  I'm pleased to be here.  Thank you.

            DR. TRIPURANENI:  Prabhakar Tripuraneni.  I'm head of Radiation Oncology at Scripps Clinic in La Jolla, California.  I am a practicing, full-time clinical radiation oncologist and I am a voting member.  I think this is my first or second date on the panel.

            DR. DOYLE:  I'm Bob Doyle.  I'm the Exec. Sec. of this panel.

            DR. BLUMENSTEIN:  I'm Brent Blumenstein.  I'm a biostatistician in private practice.  I'm normally on the General and Plastic Surgery Panel.

            DR. SOLOMON:  I'm Steve Solomon.  I'm a radiologist at Johns Hopkins Hospital.  I'm a consultant to the panel.

            DR. FERGUSON:  I'm Tom Ferguson, professor emeritus of cardiothoracic surgery at Washington University School of Medicine, St. Louis.  I'm a temporary voting member on this panel.  I'm on the Cardiovascular Device Panel.

            DR. CONANT:  I'm Emily Conant.  I'm the Chief of Breast Imaging at University of Pennsylvania and sort of half research and half clinical at this point.  I'm a voting member.

            DR. KRUPINSKI:  I'm Elizabeth Krupinski from the University of Arizona.  I'm a research professor in the Department of Radiology.  My area of expertise is observer performance and image perception studies. I'm a voting member.

            MS. BROGDON:  I'm Nancy Brogdon.  I'm not a member of the panel.  I'm the liaison to the agency.  I'm the Director of the Division of Reproductive, Abdominal, and Radiological Devices.

            Dr. Mehta, would you like to introduce yourself?

            DR. MEHTA:  Yes, please.  I'm Minesh Mehta.  I'm a radiation oncologist in terms of specialty and I'm the Chair of the Department of Human Oncology at the University of Wisconsin.  Generally when I'm there I'm chair of the panel but today I guess I'm listening in.

            DR. IBBOTT:  All right.  Thank you, everyone.  Mr. Doyle would now like to make some introductory remarks.

            DR. DOYLE:  Well, first on the agenda here is appointment of the Acting Chairperson.  Pursuant to authority granted under the Medical Devices Advisory Committee Charter dated October 27, 1990, and as amended August 18, 1999, I appoint Geoffrey Ibbott, Ph.D., as Acting Chairperson of the Radiological Devices Panel Meeting on February 3, 2004.  This is signed by David Feigal, the Director of the Center for Devices and Radiological Health.

            Now I would like to read the appointment of temporary voting status.  Again pursuant to the authority granted under the Medical Devices Advisory Committee Charter dated October 27, 1990, and as amended August 18, 1999, I appoint the following individuals as voting members of the Radiological Devices Panel for the meeting on February 3, 2004, and they are as follows:

            Brent Blumenstein, Ph.D., Thomas Ferguson, M.D., Elizabeth A. Krupinski, Ph.D., Stephen Solomon, M.D., and David Stark, M.D.

            For the record, these individuals are special government employees and consultants to this panel under the Medical Devices Advisory Committee.  They have undergone the customary conflict of interest review and have reviewed the material to be considered at this meeting.  Again, signed by David W. Feigal for the Center for Devices and Radiological Health.

            Finally, the conflict of interest statement.  The following announcement addresses conflict of interest issues associated with this meeting and is made part of the record to preclude even the appearance of impropriety. 

            To determine if any conflict existed, the agency reviewed the submitted agenda for the meeting and all financial interests reported by the committee participants.  The agency has no conflicts to report.

            In the event that the discussions involve any other products or firms not already on the agenda for which an FDA participant has a financial interest, the participant should excuse him or herself from such involvement and the exclusion will be noted for the record.

            With respect to all other participants we ask in the interest of fairness that all persons making statements or presentations disclose any current or previous financial involvement with any firm whose products they may wish to comment upon.

            Now, if there is anyone who has anything to discuss concerning these matters which I have just mentioned, please advise me now and we can leave the room to discuss them.  Seeing none.  The FDA seeks communications with industry and the clinical community in a number of different ways.

            First, the FDA welcomes and encourages pre-meetings with sponsors prior to all IDE and PMA submissions.  This affords the sponsor an opportunity to discuss issues that could impact the review process.  Second, the FDA communicates through the use of guidance documents.  Toward this end, the FDA develops two types of guidance documents for manufacturers to follow when submitting a premarket application.

            One type is simply a summary of the information that has historically been requested on devices that are well understood in order to determine substantial equivalence.

            The second type of guidance document is one that develops as we learn about new technology.  FDA welcomes and encourages the panel and industry to provide comments concerning our guidance documents.  I would also like to remind you that the meetings of the Radiological Devices Panel for the remainder of this year are tentatively scheduled for May 18th, August 10th, and November 16th. 

            You may wish to pencil these dates in on your calendar but please recognize that these dates are tentative at this time.  I'll repeat them in case you didn't get those.  May 18th, August 10th, and November 16th.

            DR. IBBOTT:  Thank you, Mr. Doyle.

            At this point Nancy Brogdon, who is Director of the Division of Reproductive, Abdominal, and Radiological Devices of the Office of Device Evaluation has a few words she would like to say.

            MS. BROGDON:  Thank you, Dr. Ibbott.  We have three panel members whose terms just expired on January 31st.  They are not present today but we wanted to recognize publicly their contributions to the panel.

            The first is Mr. Ernest Stern.  Mr. Stern was the Chairman and CEO of Thales Components located in Totowa, New Jersey, and he was the industry rep on the panel for the past four years.  He is now retired from Thales.

            Mr. Stern effectively represented various industries served by this panel and used his position on the panel to apprise other panel members of commercial considerations that they should take into account when making recommendations on the various applications under review.

            Second is Dr. Wendy Berg.  Dr. Berg was the Director of Breast Imaging in the Department of Radiology at University of Maryland at Baltimore.  She served on the panel for four years as a voting member.  Dr. Berg brought to the panel a high degree of expertise in the field of mammography. 

            That was continually called upon as novel mammography related devices were reviewed by the panel.  In addition, when asked, she provided written reviews of complex devices applications that the agency used as part of our in-house review process.

            Third is Dr. Harry Genant.  Dr. Genant is Professor of Medicine and Epidemiology, Orthopedics, and Surgery at the University of California at San Francisco.  He also served as a voting member for four years.  Dr. Genant brought to the panel a broad spectrum of expertise with special emphasis on bone densitometry.  His probing questions and insightful comments on the pros and cons of the devices being considered were very helpful to the agency as it reviewed the safety and effectiveness of new devices.

            We thank all of these past panel members.  Each will be sent a thank-you from the commissioner along with a mounted service plaque.  Thank you.

            DR. IBBOTT:  Thank you.

            Dr. Robert Phillips, the Chief of the Radiology Branch of the Office of Device Evaluation will now give a brief update on the FDA radiology activities.  Dr. Phillips.

            DR. PHILLIPS:  Well, good morning again.  As you can see by the absence of meetings between December '02 and now, we have not had a whole bunch of brand new PMAs that we've brought to the panel.  In fact, in the last year we have not approved any PMAs.

            However, there have been some changes in the branch itself and we have brought four new people on board as reviewers.  These are Nancy Wersto who comes to us from industry.  She's a radiological physicist and her interest area is in radiation therapy products.

            Then we have Kish Chakrabarti who comes to us from the mammography side of the center.  He is a physicist.  His area of interest is mammography and imaging systems.  Kish, are you here today?  No.

            Dr. Barbara Shawback comes to us from outside.  She's a medical officer and her area is study and design in rheumatology.

            And then we just had a new employee come on board, Sophie Packerel.  She is a physicist who comes from the University of Chicago and her area is CAD systems. 

            Those are the four people that have come on board and that ends my talk.  Thank you.

            DR. IBBOTT:  Thank you.  We'll now proceed with the first of two half-hour open public hearing sessions for this meeting.  The second half hour open public hearing session will follow the panel discussion this afternoon.

            Both the Food and Drug Administration and the public believe in a transparent process for information gathering and decision making.  To ensure such transparency at the open public hearing session of the advisory committee meeting, FDA believes that it is important to understand the context of an individual's presentation.

            For this reason, FDA encourages you, the open public hearing speaker, at the beginning of your written or oral statement to advise the committee of any financial relationship that you may have with the sponsor, its product and, if known, its direct competitors.

            For example, this financial information may include the sponsor's payment of your travel, lodging, or other expenses in connection with your attendance at the meeting.  Likewise, FDA encourages you at the beginning of your statement to advise the committee if you do not have any such financial relationships.  If you choose not to address this issue of financial relationships at the beginning of your statement, it will not preclude you from speaking.

            No individual has given advance notice of wishing to address the panel.  If there is anyone now wishing to address the panel, please identify yourselves at this time.

            Seeing none, I would like to remind public observers at this meeting that while this portion of the meeting is open to public observation, public attendees may not participate except at the specific request of the chair.

            We can now begin the first open public portion of the meeting.  We will now, as I said, proceed with the open committee discussion portion of this meeting that has been called for the consideration of PMA 030012 for a computer-aided detection (CAD) device that assists a physician in identifying actionable, solid nodules in CT images of the lung.

            The first presentation will be by Dr. Robert F. Wagner of the FDA who will give an overview of contemporary ROC methods such as may be used in measuring the effectiveness of the CAD and other imaging devices.

            The sponsor, R2 Technology, Inc., will then state its case for the PMA and they will be followed by the FDA with its review of the device.  We will proceed now with Dr. Wagner's presentation.

            DR. WAGNER:  Cybersource as I am, let us see if I can -- okay.  Progress or regress?  Let's not start from the back.  Marvelous.

            Thank you very much, Bob.  I'm glad we planned this together this way.  Good morning to the members of the panel, my colleagues and visitors today.  I must acknowledge the fact that Dr. Bill Sacks and I were awakened by our respective wives at our respective homes every two hours this morning to see what the weather would be like to see if we would be able to make it and what time we should really get up.  We are working against that as our background.

            I would also like to thank my colleagues for giving me this opportunity to present this tutorial information on an overview of the contemporary ROC methodology as it is used today in the field of medical imaging and computer assisted devices. 

            Of course, most of us know what the letters stand for.  ROC stands for receiver operating characteristic.  This is the historic name that comes down to us from the field of radar in signal detection studies where the problem is you're looking at a field of clutter and the question is is there an airplane in that clutter.

            In the field of psychology, in perception and eye-brain coordination studies, this subject is often called the relative operating characteristic.  Some people are just wary of the R and just refer to this as the operating characteristic because that's really what it is.

            Those of us in the field of medical imaging have retained the name of receiver operating characteristic.  I think it is because of our devotion to the classic literature from about 30 years or so ago that we have just retained, the conservative people that we are.  I see a person who has worked in this field looking back at us.

            Well, now here is an outline of the talk.  We will spend a few minutes talking about efforts toward consensus development on the present issues.  Then we'll move right into the ROC paradigm.  We'll talk about how it gets complicated by the problem of reader variability.  How the multiple reader multiple case, or so-called MRMC ROC paradigm, arose to address this problem of reader variability.

            Since the ROC is a measurement, you have to have a meter stick of some kind so we'll talk about measurement scales.  There will be a categorical scale, patient management or action scale and a probability scale that we'll talk about. 

            Then for today's submission, and submissions like it, there are additional complications from the problem of location uncertainty, from the problem of not really knowing the truth and dealing with uncertainty in the truth.  Since the truth is uncertain, you really don't know how many effective number of samples you really have.

            When you have a system that's going to cue readers about the possibility of lesions on a case, there is a problem of reader vigilance that we will discuss.  Finally, we'll give a little wrap-up which I won't have to give because Bob Phillips just presented it for me.

            Let's start off now with efforts toward consensus development on the present issues.  The fact is that at the moment we do not have an explicit FDA guidance on how to review, how to submit and review issues like the present one.  There's been a lot of work going on and deep background as to how did we get here.

            The basic idea is how do you use the classic concepts of sensitivity, specificity, and ROC analysis to assess performance of diagnostic imaging and computer-assisted systems.  Especially since there are many new issues and levels of complexity that come to the fore as more complex technologies emerge.

            At the moment you see there is really no software to do the assessment task of the problem we have before us.  That's why I would like to talk about piecemeal, all the different pieces and what is known and what does exist at the moment because the sponsor had to put together a creative combination of these many things.  So continuing on this little laundry list.  I'll give you an historical laundry list of efforts toward consensus development on these present issues.

            That's RSNA.  Most of you recognize that.  That's the big Radiological Society of North America meeting that's held every year in November in Chicago that makes this weather look very mild today.  Then following RSNA by a few months is the big SPIE medical imaging meeting.  At the SPIE meetings we generally handle the more technical aspects of the issues that come up at the RSNA.

            Then there's a society that meets every two years called the Medical Image Perception Society of which Elizabeth Krupinski on our panel has been president for 40 years I think it has been.  Elizabeth is the President of the Medical Image Perception Society.  We hold various workshops and literature every two years.

            In all these meetings every few years we do note progress in this field.  There is tremendous progress going on but it's without a doubt still an evolving work in progress.  We are still not at the holy grail point that we would like to be at but a lot of progress has indeed been made.

            At the good old FDA at our center in CDRH here at the FDA.  One of the methods that I'll be talking about today is the so-called multiple reader multiple case, the MRMC scheme which has already been used for several submissions. 

            It was used to break the log jam that was holding back digital mammography from the market place so the MRMC scheme that I'll talk about in a few minutes was used there.  It has been used for all successful submissions of digital mammography PMAs to our center.

            This method that we'll talk about in a few moments has also been used for a successful submission in the area of a computer aid for lung nodule detection on chest x-ray film that is in some way analogous to the present submission but it's just on plain film.

            NCI, the National Cancer Institute, also has the Lung Image Database Consortium and workshops.  This is an NCI-funded group of five universities and the principal director of that project, I thought I saw him come in a moment ago.  There he is, Larry Clarke.

            There are five universities that work as part of this consortium and they are seeking consensus on a number of things, one of which is how to put together a database of annotated films of the kind that you would use, annotated CT slice images of the kind you would use to train and test a classifier in this field of computer-aided detection and diagnosis in lung cancer screening for nodules. 

            So that project is about half-way through its five-year history.  A good two years underway right now.  They are also addressing consensus on the many issues that you have to deal with when you want to deal with such a product. 

            For example, how do you keep score statistically?  Once you know how to keep score, then you can start to design the size of a database.  How do you outline the nodules?  How do you keep score when there's a hit when there is just finite overlap between what is known of the lesion and what the reader marks?  We'll talk about this in a few moments.

            Now, two of us here in our center have been quite active members of this LIDC from the beginning.  Let me see if I have another comment here.  Yeah.  The thing I would like to bring to your attention this morning is that there has been a great amount of communication among all these resources here.  A number of us in our center here are active members of the research community in this field.

            Many of us here and sitting just behind me have been very active in this area of applying these methods to several of the submissions in the area of imaging and computer-aided diagnosis.  Several of us are very active members of Larry Clarke's group here.

            What we have tried to do is see this as several corners, four corners if you will, of a quadrangle, all holding the windows open to the others so that the people who come in to us from industry at any given moment will know what is the state of the art from academia, from our own center, and from the LIDC.

            We presented them all the papers, all the current drafts even, and made sure that everyone knows what's on the other people's mind methodology wise that is outside the area of anything that is proprietary.  Anything that is not proprietary is all strictly methodology or statistics.  We have tried to keep these communication channels as open as we could.

            Here we go with the promised little tutorial and the fundamentals of the ROC paradigm itself.  The idea is, of course, that you have two populations, one a population of actually diseased people.  You might think of these as people with diabetes, for example, and a population of people who do not have the disease. 

            You would like to have a test that puts out a result something like a volt meter or a biochemical assay or, in the case of a simple blood sugar test, this would just be the blood sugar concentration.  You would love to have the world such that the two populations would be separated and you could just drop a threshold in here and say these patients are diseased and these patients can go home and not worry about it.

            Now, in the field of medical imaging, as those of us who have done work in that field know, you don't have a simple meter or biochemical assay.  What you get is a reader looking at about a million pixels of a picture and trying to get the features out of it and reduce that to what we call the subjective likelihood, the subjective judgment or likelihood that the case is diseased.

            Now, as I say, this is really not quite the way the diabetes blood sugar test works but if you think of what I am about to tell you in that context for the next few minutes, you won't be far off base.  It's not precise but it wouldn't be misleading.

            So here is what happens more typically.  The two populations are not separated.  The diseased population and the nondiseased population as far as their test result is concerned have a very great overlap.  The idea is now who do you send home and who do you send on for further workup or people that you want to treat for a condition. 

            Those of you who have seen this before, what I've just done I've taken these two and dropped this population down so that you won't get mixed up with the colors.  Now we have the nondiseased cases and the diseased cases on the same axis, the same relative position.  Now in a practical situation with the overlap, now we have to set ourselves a threshold.

            If this is a blood sugar test, for example, you could set it at 150 blood sugar level.  If you do that, you'll pick up about half of the actual diabetic patients so we say we have a true positive fraction of 50 percent but you have to pay for this price.  You have about a 10 percent false positive fraction so here is this point, 50 percent true positive and roughly 10 percent false positive.

            We call this a less aggressive mind set and I think you'll see the reason for that in just a moment.  So if we get a little bit more aggressive to try to pick up more patients in our sieve, we might set the threshold down here at 100 instead of 150.  Now we get about 80 percent of the diabetic patients and now at the price of about 20 percent false positive or 25 percent.  Here I've put this point about 80 percent and 25 percent.

            Let's get even more aggressive and what I mean by that is I want to pick up more diseased patients in my sieve, the sieve being the test.  If you set the threshold in the 90's, now we might get almost 95 percent of the patients in our sieve of the actual diabetic patients but then we have to pay the price of 50 percent of the nondiabetic patients picked up so now we have a 90 percent sensitivity and roughly a 50 percent false positive fraction.
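
            Note (illustrative sketch, not part of the spoken record): the threshold examples above can be reproduced in outline with two overlapping bell-shaped populations.  The means and spreads below are assumptions chosen only for illustration; they are not the distributions on the speaker's slide, so the printed fractions will resemble, but not match, the percentages quoted.

```python
from statistics import NormalDist

# Hypothetical blood sugar populations (assumed values, for illustration only).
NONDISEASED = NormalDist(mu=90, sigma=45)
DISEASED = NormalDist(mu=150, sigma=55)


def operating_point(threshold):
    """Return (FPF, TPF) when every result above `threshold` is called positive."""
    fpf = 1.0 - NONDISEASED.cdf(threshold)  # nondiseased patients called positive
    tpf = 1.0 - DISEASED.cdf(threshold)     # diseased patients called positive
    return fpf, tpf


# Lowering the threshold is the "more aggressive" mind set:
# both fractions go up together.
for threshold in (150, 100, 90):
    fpf, tpf = operating_point(threshold)
    print(f"threshold {threshold}: TPF = {tpf:.2f}, FPF = {fpf:.2f}")
```

            Each threshold yields one (false positive fraction, true positive fraction) pair, which is one point on the ROC plot.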

            Now, you can take this to the extreme and we talk about this particular test all the time and I think this might not work because the threshold now -- oh, it did work.  Okay.  We can put the threshold all the way to the left and call everybody to the right of this diseased and we would get all the diabetic patients.  There's a little mark right up here.  We would get also -- the price we would pay is we would have to call everybody who is not a diabetic a diseased patient here so we would generate that point.

            I think you can see and let your imagination go wild that you can certainly fill in all these points.  Don't blink, anyone.  I saw Dr. Bob Doyle blink there so I have to go back and do that again.  Instead of working up more and more levels of aggressiveness, you could back off.  You could start off with everybody at the sick point and then just back off, move the threshold the other way and fill in the complete ROC curve.  You can see at this time of day I'm very easily amused.
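
            Note (illustrative sketch, not part of the spoken record): filling in the complete curve amounts to sweeping the threshold across every observed test result.  The function name `roc_points` and the two short lists of results are hypothetical, invented here only to show the mechanics.

```python
def roc_points(nondiseased, diseased):
    """Empirical ROC points (FPF, TPF), from least to most aggressive threshold.

    A result at or above the threshold is called positive.
    """
    thresholds = sorted(set(nondiseased) | set(diseased), reverse=True)
    points = [(0.0, 0.0)]  # threshold above every result: nobody is called positive
    for t in thresholds:
        fpf = sum(x >= t for x in nondiseased) / len(nondiseased)
        tpf = sum(x >= t for x in diseased) / len(diseased)
        points.append((fpf, tpf))
    return points  # the last point is (1.0, 1.0): everybody is called positive


# Hypothetical test results, for illustration only.
nondiseased_results = [70, 80, 85, 95, 110, 160]
diseased_results = [90, 120, 150, 170, 200]

for fpf, tpf in roc_points(nondiseased_results, diseased_results):
    print(f"FPF = {fpf:.2f}, TPF = {tpf:.2f}")
```

            The sweep starts at the point where nobody is called diseased and ends at the point where everybody is, which is the same curve whether it is traced forward or backward.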

            Okay.  Here is the overall picture now.  This is the case of the schematic of, let us say, blood sugar as a test for diabetes.  These are these two populations and the way they overlap and here is the corresponding ROC curve with the level of aggressiveness increasing.

            Now, it can happen and, in fact, we've seen things like this in our center and you see this in the laboratory once in a while, the two populations could fall right on top of one another so that a test cannot actually discriminate between the two conditions so what we've done here is just drop this population and this population on top of each other.  Now if you generate an ROC curve the way I just showed you, you would generate what we call the chance line or guessing line.

            Toward the other extreme you could have a test that separates the two populations very well.  In that case, as we move the threshold across from less aggressive to more aggressive, we'll generate this ROC curve.  Now we have the guessing line, we have the ROC curve corresponding to an almost typical clinical laboratory test, and we have the ROC curve here for a very good test.  We call this the level of increasing -- we call this direction the direction of increasing reader skill or increasing level of technology.
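
            Note (illustrative sketch, not part of the spoken record): the guessing line and the better curves can be compared by fixing the false positive fraction and asking what true positive fraction each test delivers.  The unit-variance normal populations and the three separations below are assumptions for illustration only.

```python
from statistics import NormalDist


def tpf_at_fpf(nondiseased, diseased, fpf):
    """True positive fraction at the threshold that yields the given FPF."""
    threshold = nondiseased.inv_cdf(1.0 - fpf)
    return 1.0 - diseased.cdf(threshold)


nondiseased = NormalDist(mu=0.0, sigma=1.0)

# Assumed separations between the population means, in standard deviations.
separations = {"no separation (guessing)": 0.0, "moderate": 1.0, "large": 3.0}

for label, shift in separations.items():
    diseased = NormalDist(mu=shift, sigma=1.0)
    tpf = tpf_at_fpf(nondiseased, diseased, fpf=0.10)
    print(f"{label}: TPF at 10 percent FPF = {tpf:.3f}")
```

            With no separation the true positive fraction simply equals the false positive fraction, which is the chance line; as the populations pull apart, the curve moves toward the upper left corner.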

            Now, many people like to have a single summary measure of ROC curve performance and what has traditionally been used is you take the area under the curve so the area under this curve, say the diabetic discrimination test, is something in the high 70s.  Let's call it 78 percent or something like that.

            If you use the area under the curve as a summary measure of performance, in effect, remember if you think of calculus, you're getting this area you're just integrating, you are effectively replacing the curve with a line that is flat at the level of that area. 

            In effect, what you've done is you have averaged the sensitivity, or true positive fraction, over all false positive fractions.  In effect, if you use the area under the curve you are given the sensitivity averaged over all false positive fractions or sensitivity averaged over all specificities, specificity coming from the other direction.
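
            As a rough sketch of that integration, assuming the curve is available as a list of (false positive fraction, true positive fraction) points, the trapezoid rule gives the area.  Because the false positive fraction runs from zero to one, the area is also the average sensitivity over that range.  The function name and the operating points below are invented for illustration.

```python
def area_under_curve(points):
    """Trapezoidal area under (FPF, TPF) points sorted by increasing FPF."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical operating points, not taken from the talk.
curve = [(0.0, 0.0), (0.1, 0.5), (0.3, 0.8), (0.6, 0.95), (1.0, 1.0)]
auc = area_under_curve(curve)                            # 0.8075 for these points
chance_auc = area_under_curve([(0.0, 0.0), (1.0, 1.0)])  # guessing line: 0.5
```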

            Well, I hope it gets interesting now.  That was the easy part.  That's the idea.  Let's see what really happens in the real world.  In the real world in the last decade those of us who work in this field have been made acutely aware of the complication of reader variability. 

            I'm going to show you some very famous data.  I think Emily Conant knows this like the back of her hand from having worked with Craig Beam.  For those of you who have not seen this before, I have to give a little build up to this. 

            This is a set of data from Beam, Layde and Sullivan that I'm going to show you in which they studied 108 mammographers randomly chosen from around the United States.  The mammographers in this study were given a set of mammograms.  They were asked to set their threshold for action. 

            Remember when we were talking about this ROC paradigm we were moving a threshold and we wanted to set it at some place and the question is in a clinical laboratory test you could just dial that in somehow.  How do you do it in medical imaging?  You don't have a dial. 

            You have to deal with the human reader and they were asked to set their threshold between their sense of the boundary on the BIRADS scale, Breast Imaging and Reporting and -- Reporting or Recording?  Anyway, Reporting and Data System.  That's the American College of Radiology Scale that is used for managing patients in mammography.

            These readers were asked to set their sense of the boundary between category 3, which is generally six-month follow-up recommendation, and category 4 which is highly suspicious and recommend consideration of biopsy.  I'm sure I'm garbling that but you get the general idea.  I wasn't asked to leave the room so I couldn't be too far off there.

            Here's what happened.  This is a true positive fraction versus a false positive fraction for 108 readers.  There are 108 points here.  Each one of these people thinks that they had set the boundary between category 3 and category 4. 

            If you try to do public policy based on category 3 and category 4 and thinking that people have optimized that, the optimum is very broad.  People have not figured out how to optimize that.  That's a big problem.

            Let's look at this reader.  This is one out of 108 people.  This person has a sensitivity of 70 percent and a false positive rate of about 25 percent.  Now, this person thinks they are being as aggressive as they should be in the context but this person is more aggressive than this one, this reader is more aggressive than this one, this reader is the most aggressive on this bottom curve here, and these readers are less aggressive.

            Now, as we go in the other direction, we now see the variability due to the range of reader skill.  We can say that these readers have a greater skill at this task than these readers and these readers have the greatest skill yet. 

            At any level of reader skill we have different readers thinking that they have optimally set their threshold.  This is a tremendous range of reader variability.  There are 108 mammographers represented on this graph.  This is classic work from Craig Beam, Peter Layde and Dan Sullivan.

            What have I just told you?  There is no unique ROC operating point.  Each one of these people is set to be at a certain operating point.  There is no unique ROC operating point.  There is not even a unique ROC curve.  There is only a band or region of ROCs as you can see.  There is a very broad band. 

            I hope I've convinced you all now that this gets to be a more complex issue.  In particular, here is the question.  Suppose we have two technologies that manifest themselves in readers' hands with this level of variability? 

            How do you compare those two technologies?  That's the issue before us with a whole class of problems that we've been discussing over the last few years and we'll be seeing more of over the next few years.  How do you do it?

            This is not an isolated example.  People have gotten used to this and said this is really an extreme example.  This is not the most extreme example we've ever seen. 

            In our group we have actually looked at over a dozen real world publicly available data sets and the example I just showed you is sort of in the middle.  Sometimes things are a little bit better.  Sometimes they are even much worse than what I just showed you.  The following is an example from Dr. Jim Potchen from plain chest x-ray picking up the disease on chest films.  These are ROC curves.  Dr. Potchen looked at over 100 radiologists and 71 residents.  He averaged the score card ROC wise of his top 20 radiologists.  Here they are.

            Then he presents here the average ROC curve for his radiology residents.  There are 71 of them here representing this average line.  The bottom 20 radiologists in the study performed here.  The range that we see here is comparable to what we saw in the Beam, et al. study for mammography.  So this is the real world.

            Well, you can imagine that if you wanted to keep score under that setting you have to use a lot of readers and a lot of cases.  The paradigm that has emerged to address this is, thus, called, almost eponymously, I guess, if I could pronounce that word, the multiple reader multiple case, or MRMC paradigm.

            There are a lot of designs for this.  There are many ways to do it.  Today we will just talk about something that is called the fully -- oh, I forgot my prop.  We'll talk about the fully-crossed design.  The fully-crossed design is one of many but it is the most efficient in some way so we will talk about it.

            You match cases across modalities and you match readers across modalities.  If I can pull this off.  I'm used to having leaves of paper here.  Okay.  You have a bunch of patients who have been imaged with modality A here.  The same patients imaged with modality B so we say that the cases are matched across modalities.

            If we were working with computer-aided diagnosis, modality A would be readers reading without the computer aid and modality B would be readers with the use of the computer aid.  There is a stack of images here.  Same patients. 

            We recruit a panel of radiologists, something like 15 of you people here.  All of you read every patient case in both modalities.  What we have then is we have the cases matched across modalities and we have the readers matched across modalities.

            This design has the most statistical power for a given number of readers and for a given number of cases with verified truth.  Thus, we say it's the least demanding of these resources.  Around here in Rockville we speak of this as the least burdensome paradigm because you probably heard in previous meetings that the FDA has been commissioned by Congress to enable sponsors to seek and to find, if possible, the least burdensome path to the marketplace through the review process.

            So what we've done is we've always called this to the attention of incoming sponsors that this design is most powerful.  You can use alternative designs and you can come close sometimes to the efficiency of this scheme but this is the most powerful in terms of the ground rules I have on the slide right there.

            Well, if you are familiar with the literature in this field, you will say, you know, this is no modern big deal.  This stuff has been known for a good 20 years or so.  If you read the classic book by Swets and Pickett the whole idea is laid out there.  The trouble is there was no practical way to implement this scheme 20 years ago until people started to understand what's called the statistical approach of resampling strategies.

            I probably shouldn't spend any time on the past history but the fact of the matter is in past years before they realized about resampling they just started to stratify the data and then you give up a lot of statistical power.  In modern times in the last 10 years people realized if you use the statistical resampling, you can use the data over and over again in a well-pedigreed way and get statistically valid inputs.

            So the two most famous resampling schemes are called the statistical jackknife or the statistical bootstrap.  The big breakthrough came in this field in 1992.  This is the classic so-called DBM paper.  That's Donald Dorfman of happy memory whom we lost to our community very sadly two years ago.  His colleague, Kevin Berbaum, and the well-known Charles Metz at the University of Chicago.

            This paper broke the log jam in this field.  They suggested using the statistical jackknife in combination with classical ANOVA and the statistical jackknife just being a leave-one-out method where you leave Mrs. Jones out one time and you leave Mrs. Smith out the next time and you generate a lot of data sets that way, submit it to classical ANOVA, and you can do your inference about the difference between these two competing technologies.
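
            The leave-one-out step can be sketched as follows.  This shows only the jackknife pseudovalue calculation, not the ANOVA that follows it in the DBM approach, and the function names and numbers are invented for illustration.  Using the plain mean as the statistic makes the result easy to check, because each pseudovalue then equals the observation that was left out.

```python
def jackknife_pseudovalues(data, statistic):
    """Leave-one-out pseudovalues: n * theta_all - (n - 1) * theta_without_i."""
    n = len(data)
    theta_all = statistic(data)
    pseudo = []
    for i in range(n):
        left_out = data[:i] + data[i + 1:]   # leave case i out this time
        pseudo.append(n * theta_all - (n - 1) * statistic(left_out))
    return pseudo


def mean(values):
    return sum(values) / len(values)


# Hypothetical per-case values, invented for illustration.
scores = [2.0, 4.0, 6.0, 8.0]
pseudo = jackknife_pseudovalues(scores, mean)
```

            In an ROC study the statistic would be something like the area under the curve computed from all cases, and the pseudovalues are what would then be submitted to the analysis of variance.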

            Well, it turns out this is a little bit more difficult to explain in any more detail than that.  But the bootstrap method is very trivial to explain in some detail so I'm going to ask you to sit through that with me for the next minute or so.

            The idea with the statistical bootstrap is that we are going to -- the bootstrap itself means you are going to resample from a set of data points with replacement.  I'll show you that in a moment.  We are going to bootstrap the experiment of interest.  We'll draw random readers, random cases, and then carry out the experiment of interest many times.

            Here is an example of some possible bootstrap samples from a set of -- suppose there are 15 of you here.  We might have a set of numbers one through 15.  We start drawing them with replacement.  If you wait long enough, you might get a list that has one, two, three, four, five, six, seven -- you have to wait a long time before that happens. 

            In the meantime you get more random looking samples like this.  When I was thinking about this, you know, if you did this with letters this reminds you of that proverbial experiment where they have the monkeys trying to type out the soliloquy of Polonius or something like that.  It's going to happen but you may have to wait a long time.

            Instead what you do is you get random samples like this.  The number one never showed up in this group.  The number two showed up once.  Number three showed up a couple times.  Number 14 showed up three times and so on.  You randomly sample a number and then put it back.  Write it down.  This can go on for an astronomical number of times.

            Then another example, the number one shows up, number 15 shows up and so on.  You get a lot of these, a very great number of these but you don't have time to do them all so, in practice, people use about 1,000.  It depends on the complexity of the problem.

            So you draw about 1,000 bootstraps of readers and cases.  The number of cases you draw is comparable to the experiment you are trying to mock up.  Then what you do is with that bootstrap sample of readers and the random case sample, you have all the readers in their bootstrap sample read all the cases in both modalities in that bootstrap sample, carry out the experiment of interest so you would get the performance measure. 

            That's called area under the ROC curve for the one.  You get that number for the other.  You take the difference.  You do that 1,000 times and then you put them in order from the lowest difference to the highest.  Then it's very easy to get the mean and then you can take out the central 95 percent chunk and that would give you a 95 percent confidence interval.  That's a simple way to explain the story. 
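
            A minimal sketch of that procedure follows, under several assumptions that are not from the talk: the ratings are simulated on a zero-to-100 scale, the area under the curve is estimated with the Mann-Whitney statistic, and every function and variable name is invented for illustration.  Readers and cases are each drawn with replacement, the same draw is used for both modalities, and the sorted differences give a percentile interval.

```python
import random


def auc(neg, pos):
    """Mann-Whitney estimate: chance a diseased case outscores a healthy one."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_difference(ratings, truth, n_boot=1000, seed=0):
    """ratings[modality][reader][case] -> sorted bootstrap differences (B - A)."""
    rng = random.Random(seed)
    n_readers = len(ratings["A"])
    n_cases = len(truth)
    diffs = []
    while len(diffs) < n_boot:
        # Resample readers and cases with replacement.
        readers = [rng.randrange(n_readers) for _ in range(n_readers)]
        cases = [rng.randrange(n_cases) for _ in range(n_cases)]
        if len({truth[c] for c in cases}) < 2:
            continue  # need both healthy and diseased cases to compute an AUC
        mean_auc = {}
        for m in ("A", "B"):
            aucs = []
            for r in readers:
                neg = [ratings[m][r][c] for c in cases if truth[c] == 0]
                pos = [ratings[m][r][c] for c in cases if truth[c] == 1]
                aucs.append(auc(neg, pos))
            mean_auc[m] = sum(aucs) / len(aucs)
        diffs.append(mean_auc["B"] - mean_auc["A"])
    diffs.sort()
    return diffs


def percentile_interval(sorted_diffs, coverage=0.95):
    """Drop equal tails and return the ends of the central chunk."""
    tail = int(len(sorted_diffs) * (1 - coverage) / 2)
    return sorted_diffs[tail], sorted_diffs[-tail - 1]


# Simulated data (an assumption): 5 readers, 20 healthy and 20 diseased cases.
rng = random.Random(1)
truth = [0] * 20 + [1] * 20


def simulate(shift):
    return [[min(100, max(0, round(rng.gauss(40 + shift * t, 15)))) for t in truth]
            for _ in range(5)]


ratings = {"A": simulate(15), "B": simulate(30)}
diffs = bootstrap_auc_difference(ratings, truth, n_boot=200)
low, high = percentile_interval(diffs)
mean_diff = sum(diffs) / len(diffs)
```

            The simulated ratings here are drawn independently for the two modalities, so they do not reproduce the correlations a real fully-crossed study would have; the sketch is meant to show the resampling mechanics only.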

            In the jackknife plus ANOVA it's a little bit more elaborate than that but you can actually think of the jackknife as the first order of approximation to the bootstrap.  So these two approaches are sort of in the same spirit but one is completely nonparametric and the other is -- the classical ANOVA is heavily based on the multi-variate normal so it's highly parametric.

            As I just said, you obtain a mean performance over readers and cases but it's much more interesting.  The mean is always easy to get no matter how you approach a problem.  Well, it can be tricky.  But the big thing you want is error bars that account for both the variability of readers and cases.

            You know, in the DBM paper they quoted a quote that has become very famous from Jim Hanley.  Many of us know Jim Hanley from McGill University in Montreal. 

            Jim Hanley says, "When you report the results of your experiment to your readership, it's not so important just to report the mean performance or the results you got in the very experiment at hand because, after all, this experiment will never be done again.  No one will ever do this particular experiment. 

            What readers want is they want a sense of the range of performance to be expected if this experiment could be repeated many, many times drawing randomly, one hopes, from the same population from which the current samples were drawn."  So that is the idea. 

            You ought to be able to report to your readership not just a p-value because we all know it takes p-value to get a paper published in a medical journal.  You want to actually be able to explain the range of variability you expect to see if this experiment is done over and over again.  That's what you get when you keep score this way.

            Okay.  We said that the ROC curve is a measurement.  Above all else it is a measurement so you have to think about a measurement science.  You have to think about the scale you'd be using for reporting and doing the measurements.

            Historically -- I should just stop for moment to tell those of you who were not around in the late '70s and early '80s that the National Cancer Institute gave a contract to people in Cambridge, Massachusetts, Bolt, Beranek and Newman, where John Swets, David Getty, and Ronald Pickett and colleagues were working to develop a protocol for how to do ROC experiments and how to keep score and how to do the data analysis. 

            That is published in a paper in Science in 1979.  The book came out in 1982 and many of us have that book on our shelf.  The protocol used at that time was so-called historic ordered category scales.  There was no does this patient go to biopsy or not.  You just looked at the case and you said this patient -- you use five or six categories. 

            One patient you might say this patient almost definitely does not have disease.  There are several intermediate levels.  The patient probably does not have disease, might have disease, probably does have disease, or almost definitely has the disease.  That scheme of five or six categories was almost exclusively used and there was software for analyzing that for 25 years.

            I'm being a little defensive because people may say why do people use that.  That was approved by -- the experts in the field put it out and it was supported by NCI.  There was a lot of science underneath it and today people say, "Why did people do that?"  Well, that's what they had. 

            In the last 10 years in the field of mammography we have this BIRADS scale which is what we call an action item or a patient management oriented scale.  In that idea you don't categorize the data.  People think of the BIRADS scheme as a categorization scheme.  Let's just put that to the side for a moment.

            We'll just think of using the BIRADS scale to dichotomize patients.  We'll say these patients will not be followed up at all versus these patients who will get a six-month follow-up.  That's one way to dichotomize the data. 

            Another way to dichotomize the data is to say we will try to make the break as we did with the Beam, et al. data.  We'll make the cut in this dichotomization between those patients who would get six-month follow-up versus those who we think should be biopsied right now.  So this is a patient management scheme.  This is just a dichotomization scheme. 

            About 10 years ago people realized for very technical reasons that it would be useful to use what they called the continuous probability rating scale, or quasi-continuous.  It's a hundred-point scale, one, two, three, four, five, but you wouldn't get 1.5 for example so they call it quasi-continuous, hundred-point scale. 

            Nobody expects anybody literally to use probability 13 or probability 17 or anything, but the idea is to scale your probability or your sense of the likelihood of disease along a probability scale.  That seems natural to use something if it's a probability on a scale from zero to 100.

            So this is the most popular scheme that's been used to generate ROC data in the last five or seven years or so.  This felt strange to many people, especially people who are used to using the categorical scale.  But I've talked to a lot of people about this and very few people outside of the mammographers have read the BIRADS document. 

            If you go through the BIRADS document and you go to category four, which is suspicious and recommend for biopsy, it actually tells you there that the radiologist should tell the referring physician their sense of the probability of cancer.  There is actually a culture already existing in which you can use this kind of patient management action items like a BIRADS three, four, five, and at the same time give a continuous probability of disease rating.

            I see some puzzled looks.  I'm trying to figure out just what I should comment on next.  So to make a long story short then, this continuous probability rating scale has been used for most ROC curves generated in this community for the last eight or so years.  In the breast imaging --

            Oh, I remember what I was going to say.  That's why I'm stalling here.  In the breast imaging community many people, it may not be more than half, but people do use this BIRADS scale.  But it's really important to realize that this BIRADS scale was not generated -- was not designed to generate ROC curves.

            People who have tried to use a five-category scale in this scheme and the BIRADS scale at the same time have met with a lot of confusion.  It does not work out very well and I see somebody who may have witnessed people having that experience.

            Well, I gave a lot of background here because I would like people to understand that this is a real issue for the community.  You would really like to have both because every clinician says, "I want to know the patient management and I want to know the score card of the patient management."  Every clinician you talk to, that's what they want.

            Everybody who measures ROC curves says, "I want to measure it as finely as I can.  I want to use this quasi-continuous reporting scale."  The best of both worlds would be to get both the quasi-continuous rating to get the ROC curve and the patient management action item to get a single sensitivity specificity point.

            I'll get a little dramatic for a moment here.  I've talked to many friends.  I'm very familiar with the literature.  I could find one example in all the literature at the moment that's in print where both of these were done.  I could only find one example of where the best of both worlds was done.

            This is a paper on classification, what Bill Sacks and others called CADx using a computer not to detect but to classify lesions on a film that are already known.  I know that I have a stack of films here that have microcalcification clusters on them.

            My task is just to say which ones are benign and which ones are malignant.  That's the task.  But I'm going to keep score ROC wise and I'm also going to keep score patient management wise.  I'll show you what they got in a moment. 

            These authors -- Yulei Jiang, I guess, was expected here today from a group in Chicago under Kunio Doi.  They studied this test and they had 10 readers and they studied the complete ROC curves.  They studied all the summary measures and they also studied the patient management or the action item, sensitivity specificity point. 

            Here are the results.  Here is the average of 10 ROC curves for 10 readers trying to make this dichotomy, trying to make this distinction between benign and malignant lesions.  Here is the ROC curve in the unaided by computer condition.  This curve was generated using the hundred-point probability scale.

            This is the curve in the computer-aided condition, again generated by the hundred-point probability scale.  This point is the mean sensitivity specificity point generated just by making the threshold, dichotomizing the data.  These patients benign, these patients malignant.  This is a single dichotomy patient action point in the unaided condition. 

            That's the same point in the aided condition.  You would love these points to fall on top of the curves and, for all statistical purposes, they do because remember the mean -- I have to remind you of this famous joke that we use around here.  There was a six-foot statistician.  You know what happened to this fellow, right?  He drowned while wading in a stream that had an average height of five feet.  You have to know about the variability. 

            This is not about means, okay?  This curve moves all over the place and this curve moves all over the place in practice.  This is the average of 10.  Same thing.  This point moves all over the place as does this.  For all practical purposes this is a great experiment.  This point falls on that curve.

            Well, it's the only case I could find in the literature.  How come you don't see more of this?  When you live with these people that I live with, it's a great crowd of people and the clinicians say, "I want the action point."  I say, "The committee wants to measure the ROC curve."  Everybody says, "Let's do both."  We are trying to come to that position.  Why don't we see more of it?

            Well, the area under the ROC curve, remember, you have your ROC curve and you've got the area under it.  You are essentially getting the sensitivity averaged over all specificities.  Right?  You're averaging.  You're going to average away a lot of noise.

            The variation -- the variance of the area under the ROC curve -- oh, my goodness.  The most important number of my entire talk is missing.  The variance of the area under the ROC curve is the binomial variance over two.  There's a two here, a very important two.  Those of you who know me know I'm an expert in factors of two.  It's the binomial variance over two. 

            What's the binomial variance?  Well, I thought if you had a group as we have here today, about a third of you -- maybe 40 percent of you as I look around -- know what the binomial variance is.  Suppose we had this meeting next week and we drew from the same population from which you all came. 

            The next time we did it we might find that 32 percent of you know what the binomial variance is.  If we do it three weeks from now and join another group in, maybe 49 percent or 52 percent of you will know what the binomial variance is. 

            What we've just done is what Bill Sacks refers to.  We just made a self-referential example here.  The binomial variance is the variance I would experience if I did the experiment I just discussed with you.  The area under the ROC curve experiences only half of that variance. 

            If I studied sensitivity by itself and was able to tell you ahead of time what the specificity was so you didn't have to estimate the specificity, the variance of sensitivity is the entire binomial variance.

            In the real world you have to estimate both the specificity and the sensitivity so the uncertainty in the specificity propagates into that of the sensitivity and so into its variance.  So if you wanted to estimate the uncertainty in that action item that I showed, that point, the circle or the triangle in the previous data, if you were to estimate that, you would have to live with an uncertainty that was greater than the binomial variance.

            If you use area under the ROC curve you get a great reduction.  You get the binomial variance over that famous factor of two.  This is all approximate but it works out very well with very practical examples.
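
            For reference, the binomial variance of an observed fraction is p(1 - p)/n, where p is the true fraction and n is the number sampled.  The sketch below works the audience example with 40 percent and an assumed audience of 50 people, a figure that is not in the talk.  The last line simply halves that variance to illustrate the rule of thumb quoted above; the talk does not spell out which p and n apply to the area under the curve, so that line should be read as an illustration only.

```python
def binomial_variance(p, n):
    """Variance of an observed fraction when the true fraction is p and n are sampled."""
    return p * (1.0 - p) / n


p, n = 0.40, 50   # 40 percent is from the talk; n = 50 is an assumed audience size
var_fraction = binomial_variance(p, n)     # 0.0048
sd_fraction = var_fraction ** 0.5          # about 0.069, i.e. roughly 7 percentage points

# Rule of thumb from the talk: the area under the curve sees about half of this.
var_area_rule_of_thumb = var_fraction / 2.0
```

            A standard deviation of about seven percentage points is consistent with the spread in the example, where repeat meetings might give 32 percent one time and 49 percent another.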

            So what we say is that the variance of the ROC area is the least burdensome approach to putting quantification into this problem.  I remind you that is something that we are supposed to enable sponsors to appreciate.

            Another thing that we realize in many discussions with academics and within our house and with the sponsors and so on is if you want to live in both of these worlds, that requires consistent conventions.  If you want to be able to either get categorical reporting and the BIRADS reporting, that's a lot of work to try to get people to be consistent that way.  People have dropped the categorical scheme for all practical purposes.

            Even if you want people to be consistent between BIRADS and the quasi-continuous scale, that's difficult.  We've seen a lot of data in our own group and from some of the universities.  When you train people, this can be done but not everybody is trainable right away to be able to do this so it's an issue.  To get data in both worlds then, it's going to require some convention development.

            My final point here says this may require consensus bodies to promote the practice.  We would hope that the American College of Radiology, some of the other professional societies, and even the fact that this is of interest to NCI and the FDA, we would hope that some of this would encourage people to try to do measurements so that we could get both the point and the curve.  Then I think everybody would be happy.

            Well, this brings us to a little interim here.  Some of you are very familiar with the next few slides.  These are what we call the most famous slides in the ROC archives.  Those of you who know Charles Metz have seen this many times and his followers will use these many times.  Charles devised these slides over 25 years ago.

            Here's the classic question.  You have two diagnostic modalities, modality A and modality B.  Which one is better?  You look at them and you have people doing public policy thinking in their minds.  Which one of those is better?  You start calculating something you've seen in a statistical decision theory book.

            But the way this is approached in the field of medical imaging is the following.  There are several possibilities here.  Those two points may lie on completely different ROC curves.  In that case we say that modality B is unambiguously better than modality A because at any false positive fraction the sensitivity of A is lower than that of B.

            There's a different scenario.  The two points could fall on the same ROC curve.  Then you have these same people scratching their heads and saying, "Where should they really operate?"  Well, in principle we believe that readers can move their level of aggressiveness.  Not on any fine scale but we know that they adjust depending on the risk group they're seeing.  Some people do move around on their ROC curve so in principle these two points represent equivalent modalities.

            As I say, people will for years say, "There must be one of these operating points that's better than the other."  Remember when I showed you that data from Craig Beam you saw people at every level of aggressiveness.  Each one of these people in some way thinks they've optimized. 

            This is what we call the expected utility function or the expected value function.  Every one of those people thinks in some way they have found the optimal operating point but they disagree with each other so this is another reason for using the ROC method.

            There's yet another scenario.  ROC curves may actually fall in such a way that modality A is everywhere higher than modality B.  For the same reasons we would say that modality A is the superior modality in this scheme.  Three different possibilities.  B higher, equivalent, A higher.

            This is the motivation for trying to get a finer measurement on this hundred-point scale.  Then if the clinicians really want to know about the actual operating point, that is another step and we are all for that if you can coordinate the measurements but it's very difficult to do that.

            Well, I'm sure many of you are sitting there thinking what about if the ROC curves cross?  We know if that happens the situation enters the world of ambiguity.  Then you can no longer necessarily use the total area under the curve as a sufficient summary measure of performance.

            Other summary measures may be necessary.  There are any number of other ways to make a summary measure of curves that cross.  You can use partial areas.  There's actually software even for that today.  Or you can use parametric summaries of the curve and there are several other ways to look at this. 
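
            One of those alternatives, the partial area, can be sketched as the area under the curve up to a chosen false positive fraction.  The function name, the cut-off of 0.2, and the two crossing curves below are all invented for illustration; they are not from any study discussed here.

```python
def partial_area(points, fpf_max):
    """Trapezoidal area under (FPF, TPF) points for FPF in [0, fpf_max]."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= fpf_max:
            break
        if x1 > fpf_max:
            # Interpolate the TPF at the cut-off and truncate the segment there.
            y1 = y0 + (y1 - y0) * (fpf_max - x0) / (x1 - x0)
            x1 = fpf_max
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Two hypothetical curves that cross between FPF = 0.1 and FPF = 0.5.
curve_a = [(0.0, 0.0), (0.1, 0.6), (0.5, 0.7), (1.0, 1.0)]
curve_b = [(0.0, 0.0), (0.1, 0.3), (0.5, 0.9), (1.0, 1.0)]

total_a = partial_area(curve_a, 1.0)   # 0.715
total_b = partial_area(curve_b, 1.0)   # 0.730, B looks better on total area
low_a = partial_area(curve_a, 0.2)     # 0.09125, A is better at low FPF
low_b = partial_area(curve_b, 0.2)     # 0.0525
```

            With these made-up curves the ranking flips depending on which summary is used, which is why the choice has to be fixed before the data are seen.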

            If you decided you're going to use other summary measures, if you anticipate this possibility, the study protocol is expected to address this because if you wait until after the study and say, "I was going to use the partial area in this region," we have a name for that.  That's called data dredging.  You have to build that into your study up front.  Otherwise, when people do not expect to see the curves cross in any real way, they tend to use the area under the curve as a summary measure.

            Well, for submissions as are coming before us in the area of computer-aided detection schemes, there is a question of how do you keep score for the location scored.  I must remind you this is shocking to people who have never heard this before. 

            The basic ROC paradigm is an assessment of the decision making at the level of the patient.  You don't say, "Where does the patient have diabetes?"  You say, "This patient has diabetes."  Or you say, "This patient has TB."  You don't say, "The TB is here."  You say, "This patient has TB."  So the score keeping until recent years has been based on decision making at the level of the patient. 

            In more complex imaging you want to do the assessment of the decision making at a finer level.  You would like to assess how well the localization was done.  Well, there are little errors there that come across funny.  If you do localization, of course, you will be providing the experimenter with more information. 

            If you have more information in the study, you get more statistical power.  The trouble is to do all this adds complexity to the experiment.  I would just like to review for you a couple of the highlights of the issues that have come up when you try to do location specific ROC analysis, so-called LROC for location specific ROC analysis.

            The biggest problem is that if you want to keep score of a hit, the measurement of the hit depends on the criterion you use for localization.  If the legion really is here somehow and you draw your circle and you say the legion is here, there is a certain amount of overlap and you would be surprised to see how sensitive the measurements are to that degree of overlap to the criterion you use for that.  That's a real issue.  There's no unique result.  There's no unique LROC curve at the moment for the state of the field.
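
            The dependence on the criterion can be illustrated with a deliberately simplified rule.  Instead of the overlap of drawn regions, this sketch scores a mark as a hit when its centre lies within a chosen distance of the lesion centre; that substitution, the coordinates, and the function name are all assumptions made for illustration.

```python
import math


def is_hit(mark, lesion, max_distance):
    """Score a mark as a hit if its centre is within max_distance of the lesion centre."""
    return math.hypot(mark[0] - lesion[0], mark[1] - lesion[1]) <= max_distance


# Hypothetical (mark centre, true lesion centre) pairs in pixel coordinates.
pairs = [
    ((10, 10), (13, 10)),   # 3 pixels apart
    ((50, 50), (50, 58)),   # 8 pixels apart
    ((80, 20), (89, 32)),   # 15 pixels apart
]

# The same three marks give a different hit count under each criterion.
hits_by_criterion = {
    d: sum(is_hit(mark, lesion, d) for mark, lesion in pairs)
    for d in (5, 10, 20)
}
```

            The same reader marks score one, two, or three hits depending only on the distance chosen, which is the sensitivity to the criterion described above.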

            There are a couple of subtle points here that are very technical.  I would just like to mention one of them.  People have studied this for 20 or 30 years.  For a certain class of problems if you study the ROC and if you study location specific ROC, the curves in the summary figures track with each other monotonically.

            If the one goes up, the other goes up.  If one goes down, the other comes down.  They might change at different rates but they go together monotonically.  So people haven't felt bad about just using ROC analysis instead of LROC analysis if they were willing to invest the extra resources because you will lose statistical power. 

            But people have been willing not to go to this level of complexity and to go to that higher level of complexity requires more elaborate models, more elaborate assumptions.  These are still debated until today.  You can see in the SPIE handbooks that people are debating this back and forth, Charles Metz and Dave Chakraborty.

            But I must mention that a lot of progress has been made in this field.  The bottom line of this slide if you haven't followed any of this is that essentially there's a lack of validated software for analysis of such experiments.  Now, Elizabeth and the MIPS, Medical Image Perception Society, website actually has software for several of these approaches.

            The writers of that software feel very good about the state of their software but there continues to be discussions in the field about how far have they validated.  Have they checked whether the alpha level and the reject rate are agreeing and what is the power and so on. 

            The debate goes on but I expect people to be coming down from Pittsburgh any day or any week now saying, "You've got to start using this because it's been validated."  That's the state of the knowledge right now.  There is software there but there are still people discussing the condition of the validation of the software.

            So a few years ago to find some kind of a happy medium Nancy Obuchowski of the Cleveland Clinic and colleagues said, "Why don't we just simplify the task?  Why don't we do something called region of interest location specific ROC analysis.  Let's only require localization to within a quadrant so you don't have to say there's a lesion here or a lesion here.  You just have to say I see a nodule in this quadrant.  You require localization only up to a quadrant."

            Similarly for the other quadrants you could say, "Why didn't we do it for octants or 16 fold or 32 fold?"  Well, you could.  This is sort of the entry level, this problem, but as you add number of possibilities, then you get more into questions of overlap and ambiguity so people have decided, "Let's start at the level of just quadrants."  As I say, sort of the entry into this problem.

            Continuing on discussing this so-called ROI approach, the location specific ROC analysis, right away Dave Chakraborty jumps into the literature and says, "Wait a minute.  This doesn't correspond at all to the clinical task."  People have debated that back and forth whether it does or not.

            But from the other wing of this Greek chorus comes the methodologist to say, "Yeah, it may not be quite right but it's really straightforward to account for correlations without getting into these assumptions that people have debated for a while."

            What do I mean by that?  Here are four quadrants, the right side of the lung, the left side, the top, and the bottom if you will.  Whatever is going on in this quadrant is expected to be correlated with what is going on in this quadrant, or at least could be, and similarly across the quadrants. 

            After all, this is the same person, has the same genes, experienced the same environment, and had a picture taken with the same imaging system.  One has to allow for the possibility that these quadrants are correlated.  The nice thing is that Carolyn Rutter and others came by another year later and said, "Wait a minute. 

            All you have to do to preserve those correlations is when you resample you resample on a patient basis.  You can't start resampling quadrants, this one from this person and this one from that person.  You have to resample on a patient basis so if I sample you, all four quadrants from you come into that sample and so on."  When you do this, you actually preserve the correlation structure and you are said to be using the patient as the independent statistical unit here.
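            A minimal sketch of resampling on a patient basis follows.  The patient identifiers, the quadrant scores, and the choice of the mean score as the summary statistic are hypothetical, chosen only to show the mechanics; a real study would compute an ROC-based figure of merit on each replicate.

```python
import random
import statistics

# Hypothetical data: one list of four quadrant scores per patient.
patients = {
    "patient_a": [0.9, 0.2, 0.1, 0.3],
    "patient_b": [0.1, 0.1, 0.2, 0.1],
    "patient_c": [0.7, 0.8, 0.3, 0.2],
    "patient_d": [0.2, 0.1, 0.6, 0.1],
    "patient_e": [0.4, 0.5, 0.4, 0.6],
}

def resample_patients(data, rng):
    """Draw one bootstrap replicate by sampling whole patients with replacement."""
    ids = list(data)
    drawn = [rng.choice(ids) for _ in ids]
    # Every drawn patient brings all four of its quadrants along together,
    # so within-patient correlation is preserved in the replicate.
    return [data[pid] for pid in drawn]

def mean_quadrant_score(sample):
    """Placeholder summary statistic computed over all quadrants in a replicate."""
    scores = [score for quadrants in sample for score in quadrants]
    return sum(scores) / len(scores)

rng = random.Random(0)
replicates = [mean_quadrant_score(resample_patients(patients, rng)) for _ in range(2000)]
print("bootstrap standard error:", statistics.stdev(replicates))
```

            The key design choice is that the quadrants are never shuffled across patients; the patient is the unit that gets drawn.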

            Well, that's all I'll be saying about location specific score keeping and now to one of the real problematic issues in the submissions as we'll be seeing in the next couple years.  This is the problem of uncertainty of truth state.  There's a classic paper that all of us have almost memorized by now from Revesz, Kundel, and Bonitatibus 20 years ago.

            This is Harold Kundel known to many of us as one of the pioneers of this field, the mentor of someone on our panel today, who was at the Temple University, and now is at the University of Pennsylvania emeritus.  These authors, what did they say?  They included various ways of obtaining panel consensus truth. 

            They actually did a study comparing three different ways of doing chest imaging and they had the truth but they set the truth aside.  They said instead of depending on the truth to keep score, let's get a truthing panel.  What they found out was they had several ways of obtaining consensus from that panel.

            They could either use unanimity.  They could use majority.  They could use some kind of expert review.  They had three or four ways of reducing this panel to truth.  They compared three imaging modalities, as I said, and here's what they found.  Any of the three imaging modalities could be found to outperform the others depending on the rule you used for reducing the panel to truth.
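            How the reduction rule changes the truth set can be sketched as follows.  The three-member panel and its votes are invented for illustration, and only the unanimity and majority rules named above are shown.

```python
def panel_truth(votes, rule):
    """Reduce one finding's panel votes (True = positive) to a single truth label."""
    yes = sum(votes)
    if rule == "unanimous":
        return yes == len(votes)
    if rule == "majority":
        return 2 * yes > len(votes)
    raise ValueError(f"unknown rule: {rule}")

# Hypothetical votes from a three-member panel on five candidate findings.
panel_votes = [
    (True, True, True),
    (True, True, False),
    (True, False, False),
    (False, False, False),
    (True, False, True),
]

# Count how many findings are declared positive under each rule.
for rule in ("unanimous", "majority"):
    n_positive = sum(panel_truth(votes, rule) for votes in panel_votes)
    print(rule, n_positive)
```

            With these votes the unanimity rule declares one finding positive and the majority rule declares three, so any score kept against the panel depends on the rule.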

            So this sobers a lot of us in the field about using a panel as truth.  However, today the target of this experiment we'll be discussing today is not to say this is a nodule that is a cancer.  It is only to say this is a target.  This is a region that a panel of experts would consider to be an actionable nodule. 

            We're not trying to keep score based on the truth.  We're trying to keep score based on what would a panel of experts do?  Would they cue this region or not?  Nevertheless, even though we changed the target, this classic reference above tells us that there's going to be additional uncertainty because of this panel.  The panel will have variability in it and if you go to RSNA over the last few years, you'll hear papers on this subject.

            What we've said to incoming sponsors is that we strongly encourage you to resample, to come up with some resampling schemes to resample the panel to get a feel for the additional uncertainty that comes into this problem over and above the MRMC paradigm, over and above due to the fact that there is noise in the panel.  You can start to see why there is no canned software to do this problem.

            Well, since the truth is uncertain, it turns out that leads to uncertainty, in effect, in the number of samples you have.  Let's talk about designing an experiment for a moment.  Suppose you want to design experiments that are going to have very tight error bars on the sensitivity.  Everybody knows that if you want to do that, you want to have a lot of actually diseased cases to tighten up the error bars this way.

            If you want to tighten up error bars the false positive way, you would want to have a lot of actually non-diseased cases.  If your endpoint is the area under the ROC curve, what distribution should you have between nondisease and disease cases?  Well, it turns out it should be some kind of average between the two.  It turns out that the number you should be using is the harmonic mean of the numbers in the two classes.

            The numbers in the two classes is going to depend on the panel, right?  Because some of the panel members will say these are diseased and others will say these are diseased.  The actual number of diseased cases depends on the panel.  We have uncertainty in truth that leads to uncertainty in the number of samples.

            This is almost a trivial curve and I'm just going to tell you about the highlights because we think it might factor in today.  Suppose you are told you can design an experiment with 100 patients.  You say, "How should I distribute them?" 

            Well, you distribute them, let's say, at the beginning of an experiment like this so that you have 20 that are actually nodule containing cases, 80 non-nodules, 20 nodule containing sites so we have an 80/20 break.

            This effective number, this harmonic mean of those two numbers, is 32.  Whereas if I make a more even split, 60/40, 50/50, for 60/40 it would be up in the 40s the effective number.  On a 50/50 split the effective number of samples for that experiment would then be 50.  That's not surprising.
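            The arithmetic behind these effective numbers can be checked directly.  The function name is made up for the example; the formula is simply the harmonic mean of the two class sizes.

```python
def effective_number(n_diseased, n_nondiseased):
    """Harmonic mean of the number of diseased and non-diseased cases."""
    return 2 * n_diseased * n_nondiseased / (n_diseased + n_nondiseased)

# 100 patients split three different ways.
print(effective_number(20, 80))  # 32.0
print(effective_number(40, 60))  # 48.0
print(effective_number(50, 50))  # 50.0
```

            For a fixed total of 100 patients the effective number rises from 32 at an 80/20 split to 48 at 60/40 and peaks at 50 for an even split.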

            The reason we're showing this is suppose you start out with an experiment like this and you are requiring unanimity in the panel to declare a nodule-present.  Then suppose you relax that criterion and say instead of requiring unanimity, we'll just require two out of three.  Then you expect that whatever the number was before you're going to move up this curve.

            So you are sampling variability, losing power, but gaining samples.  You may tend to cancel.  We don't know this.  We are speculating about this.  We'll discuss this.  What I just said is if you want to get into the realm of resampling your panel, you could start by relaxing the panel criterion from unanimous to majority and there are several other ways of doing this. 

            This is just, again, an entry level.  When you do this, this gets you into the game.  This allows you to resample, to assess the variability, but it may also increase the effective number of samples.  These effects may tend to cancel.  This is, again, speculation just based on the direction of these effects.

            The last thing I want to talk about today is the problem of controlling for reader vigilance.  When you do an experiment, with my two little pads of paper here, when you read in the unaided reading condition versus reading the aided reading condition, there are some people in this room who may be competitive. 

            If you're reading in the unaided reading condition you say, "The computer is about to tell me what it thinks."  If you are a little bit competitive, you are going to say, "I've got to be careful when I read this."  You may increase your vigilance.

            How do you mock up?  How do you do this experiment?  This is a challenge that hasn't been quite sorted out.  Any measurement setting has an artificial condition compared to the actual real world of practice.  What I just described to you is the possibility that some readers might be more vigilant in their unaided reading because they know they are subject to the site.

            Well, when you turn a modality loose in the real world, just the opposite could happen, right?  The readers might be less vigilant in the real world because they know, "Well, I can brush through this.  The computer is going to give me what it thinks in just a minute."  In the real world the vigilance could go down.  In some experiments it could go up and I think we've seen experiments when the vigilance didn't change but I'm not sure you can guarantee that.

            The only thing we've seen in the practical solution to this problem, Heang-Ping Chan and colleagues about a dozen years ago wrote a paper in which they said, "Look, this is a real issue, this vigilance. 

            How do you do a controlled experiment controlling for reader vigilance?"  They said, "Well, just simply control the time available to readers in the unaided reading condition to mimic the actual clinic."  That was a suggestion they made.  I don't know how many people have tried that yet but that's in the air.

            Well, you can all take a deep breath now.  We're in the summary.  Here we are.  This field has been going on for 30 years.  In the last 10 years the whole issue of reader variability has complicated it and there have been ways proposed to address the issue of reader variability.

            In the last few years we've had to deal with the complications from location uncertainty, from uncertainty in the truth, this issue of reader vigilance.  What we've tried to do is this is like a quadrangle, as I said.  We're here sitting at the FDA and also doing some research here.

            We have our academic colleagues doing research in academia, industry sponsors doing research on all these issues in another side of the quadrangle, and NCI and the Lung Image Database Consortium that we've been very actively working with and who are very interested in these issues. 

            We've tried to hold the windows open so that this quadrangle from all quarters has been open to everyone.  Whenever industry sponsors have come in with issues like this we've said, "Look, the windows are open.

            Here's what is known from all these quarters.  Here are the papers.  Here are the drafts that are not even published yet.  Here's what we know at the moment.  We don't have guidance.  We can't say this is where the FDA or anyone is holding the bar but this is all the knowledge that we have at the moment."

            There is no canned software.  There's canned software for little pieces of this problem so any industry sponsor would have to be creative to come forth with a novel way of putting all these pieces together.

            Well, that's the state of the world as we know it today.  Thank you very much for your interest in this.  Oh, there's some papers.  The "tz" are obviously Charlie Metz's papers.  There are a few papers from our own group in which we have actually worked with Charlie Metz and our own statisticians and our clinicians try to review the state of the world.

            This is the first LIDC document.  It's going to come out in April.  Then in your notes there are many other pages of references.

            DR. IBBOTT:  Thank you, Dr. Wagner.  Before you go too far, I would like to ask if there are any questions from the panel for Dr. Wagner.

            DR. KRUPINSKI:  What's the consensus?  I mean, the quadrant problem gets rid of the localization problem if you end up with a nodule in each quadrant.  What it still hasn't addressed, what do you do, for example, when you've got two lesions in a quadrant?

            DR. WAGNER:  That's right.

            DR. KRUPINSKI:  You still have that basic uncertainty.

            DR. WAGNER:  That's right.

            DR. KRUPINSKI:  The flip side of that is what if there is a false positive in the quadrant along with a true positive?  You've just simply squished it --

            DR. WAGNER:  That's right.

            DR. KRUPINSKI:  -- into a quadrant and you still have avoided the localization problem and the problem of a false positive and true positive.

            DR. WAGNER:  That's right.  That's been sidestepped.  As you know, the higher levels of software attempt to address this one way or another and I think the jury is still out on whether we are ready to use that.  I think the inventors of those other methods think they are ready to go and they might be but we also know there are people in the wings saying I'm not sure about these assumptions and so on.  That software does not have general providence right now.  Maybe that's too bad.  Maybe it should be.  These are real issues.

            DR. BLUMENSTEIN:  I'm impressed by the MRMC study design.  I think that's a nice step forward.  I'm wondering if anybody has ever subjected the same reader to the same image multiple times and studied the effect of that so that you could get at this issue about how a single reader uses their own personal scale?

            DR. WAGNER:  Yes.  That's a classic question.  There are experiments on that.  I'm making this up but this is the spirit in which I remember it.  David Getty has shown some data on this in mammography and I think that readers are correlated with each other in the 60 percent range and are correlated with themselves only 70 some percent on repeats.  There is, indeed, a lot of intra-reader variability.

            However, you get more bang for buck -- if you want to spend so much time in radiology reading-wise, there's more bang for buck to get a different reader than to use the same reader over again because you are so correlated with yourself you get more independent information if you bring in a sample that's not so correlated with the preceding reads.

            Bang for buck-wise people have said this is a question of reading time.  People have not in the MRMC paradigm in general tended to have readers reproduce their readings.  You can do it and there are terms in the model to accommodate that, of course.  It's just not common.

            DR. BLUMENSTEIN:  Actually, you took my question as a suggestion maybe of changing the study design.  I didn't make it clear.  What I'm actually concerned about is whether the methodology that's been developed to give p-values, estimate variance, which you rightly point out are the big issues here, whether those properly account for intra-observer variability in their use of the scales?

            DR. WAGNER:  I believe it does and I'll tell you why.  The full model has seven terms.  I won't take you all through all of those seven terms.  Pure case, pure reader, various interactions.  One of them is a three-way interaction between modality reader and case.

            That's the sixth term.  The seventh term is what you're talking about.  It's the lack of reader reproducibility.  If you do enough experiments, you can identify so-called in statistical language.  You can separate these two.  If you don't do the right experiment, you can't but they get lumped together.  The term you're trying to get at is the reader inconsistency.  That is sampled in the experiment but it cannot be identified.  It cannot be broken out but it is in there.

            In fact, the way we do it is we do it with a family bootstrap experiment so we can actually pull out all these effects but we cannot pull out the MRC from the epsilon.  They come together.  That represents not only this three-way interaction but represents the inconsistency of all the data sets together.  So that is actually in there.  Are you surprised?

            DR. BLUMENSTEIN:  No, no, I'm not.  But since you don't measure that in the experiment, you can't estimate it obviously.  That's the issue.  I guess what I've been concerned about ever since I first heard about the use of ROC curves where the reader is recording their result on a subjective scale either categorical or probability or whatever it is.

            It's a device to get you to the point of being able to use ROC methodology.  What has always concerned me was that there was this underlying source of variability that wasn't taken into account in the models that you are estimating.  It's only if you do the experiment that way that you actually get an estimate of that intra-observer or whatever you called inconsistency or whatever.

            DR. WAGNER:  Right.

            DR. BLUMENSTEIN:  I just wondered about the degree to which this has been studied in actuality.

            DR. WAGNER:  Not very much because of the bang for buck point.  As you can see, if you are inconsistent with yourself, and everyone is, that will show up in case to case within a given experiment but you won't be able to peel it out but it's in there and it's accounted for in the inference.  It's a subtle point but we can discuss it.      

            DR. TRIPURANENI:  That was an excellent presentation, Dr. Wagner.  We used the MRMC for the intra-observation.  If you are looking at two different modalities such as a chest x-ray or a CAT scan, have you looked at whether there is any difference in the intra-observation between one modality to the other modality?

            DR. WAGNER:  It turns out to be a really neat point actually.  Our own group has three papers on this subject.  In the first one, you want to know if you can see the difference in the variance structure between the two modalities.  Is that what you're asking?

            DR. TRIPURANENI:  That's right.

            DR. WAGNER:  There's a model that has six terms.  We were just talking about that.  There's another model that -- you would think you would have to go to 12 terms to do that.  It turns out there is a parsimonious way to do it with just nine terms but two ways to do that.

            When you do it you find out that the extra issues brought up by the wrinkles you were just discussing, they come in in such a way that they average and it's only their average that goes into the inference so you can forget about the issue.  It's a really interesting issue.  We have two papers on it.

            But you could forget about it.  You could from right off the metro just hear about this and say, "I'm going to use the DBM software."  You could forget about the difference in the variance structure across the competing modalities and if you do, the inference is still the same inference.  It doesn't matter.  It's a really interesting point.

            DR. IBBOTT:  Dr. Solomon.

            DR. SOLOMON:  How do you -- I mean, I have a feeling this topic is going to be discussed throughout the day but how do you translate changes in ROC curves into clinical significance?  Especially since if you look at an individual's change in the ROC one person might do worse and another person might do better and then how do you make that determination?

            DR. WAGNER:  Right.  Well, you might have been a fly on the wall in many meetings.  I mean, this is a real issue.  Dr. Sacks will say something about it later on.  All I can tell you is that the most statistically powerful method to get at these differences is the one I've discussed today.

            We really would like -- well, I take you back to the Yulei Jiang stuff.  We really do want to see those action items.  You can't go from the curve easily to the action items if you haven't measured those action items.  Is that what you're getting at?  I'm not sure I see what you're getting at. 

            You want to know how we can go from this ROC summary and inference to an inference to the clinic.  Is that where you're going?  I think it's difficult.  What we're saying here is what we are doing is we are making a measurement that averages over all these variabilities that we have talked about.  It averages over all that and here's the summary.

            If you want something more clinically relevant than that, you would have to actually measure the action item, the dichotomization, if you will, and give it error bars.  When you finish the problem is here would be the action item sensitivity specificity for the one modality and here it would be for another one or this way.  Now, what do you do? 

            Suppose they go this way?  What are you going to do at this point if they don't match up sensitivity wise or specificity?  What are you going to do?  There are things you can do but you have to start getting into expected utility analysis.  I didn't mention it but I have some very strong professional opinions on this. 

            I think it's impossible to do that because to do the expected benefit analysis you need to have an idea of the prevalence of the disease and that changes from risk group to risk group so that is a big uncertainty.  You have to have a sense of something called the utility matrix, the number of false alarms that you are willing to trade for a hit, if you will, different from the positive predictive value. 

            You have to have a sense of that utility matrix and you have to actually know the ROC curve already because all these things come in.  I think this is almost impossible to do without this being taken on at a national level. 

            You can see from the data of Beam, et al. each one of these people thought that they were working out the optimal operating point and have completely different points of view.  What I'm saying is that's an important question. I think it's a societal question. 

            I think it's very complicated and it calls for a lot of wise people with a lot of data to sit down with professional societies and say, "Where are we and where do we want to be?"  This is a really big issue.  I don't have an easy answer.  I insist to my colleagues there is not an easy answer.
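            The dependence on prevalence and on the utility matrix can be made concrete with the textbook expected-utility result, under which the optimal operating point is where the ROC slope equals the odds against disease times a ratio of utility differences.  The utility values and prevalences below are purely hypothetical, and the function name is an assumption of this sketch.

```python
def optimal_roc_slope(prevalence, u_tp, u_fn, u_tn, u_fp):
    """ROC slope at the operating point that maximizes expected utility."""
    odds_against_disease = (1.0 - prevalence) / prevalence
    # Ratio of what is lost by a false alarm to what is gained by a hit.
    return odds_against_disease * (u_tn - u_fp) / (u_tp - u_fn)

# Same hypothetical utility matrix, two different risk groups.
utilities = dict(u_tp=1.0, u_fn=0.0, u_tn=1.0, u_fp=0.9)
for prevalence in (0.01, 0.10):
    print(prevalence, optimal_roc_slope(prevalence, **utilities))
```

            A lower prevalence calls for a steeper slope, that is, a stricter operating point, and every choice of utilities moves it again, which is why the answer differs from risk group to risk group and from person to person.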

            DR. IBBOTT:  Brent.

            DR. BLUMENSTEIN:  I think it is the key question.  What we are asked to do here is to basically judge whether this difference in the area of an ROC curve --

            DR. WAGNER:  That's right.

            DR. BLUMENSTEIN: -- has any translation to the clinical setting.  What we're lacking we have a measure of the significance of the difference in the area of the ROC curve.  What we don't have is a measure of uncertainty around the clinical interpretation of the ROC curve. 

            This is what is particularly bothersome to me is I don't know how to do that and I don't see any methodology that gives me that answer.  I'm concerned that we have started building a building with a foundation using subjective scales to measure things so that we can use ROC methodology and we are using resampling methodologies to do this. 

            We're not taking into account all the various sources of variability and so forth so we are way out there and our foundation may be collapsing and not giving us what we need with respect to the clinical outcomes.

            DR. WAGNER:  Well, if this was broadcast on academic TV today, apoplexy would abound in the community because we all feel we are building, as you say.  We're building on decades of people trying to measure complex perceptual phenomena.  This is where we are right now.

            It may not be the ending point to which you would like to be but this is about the best of where we are at the moment.  I tried to challenge you a moment ago if you wanted to work on any action oriented clinical endpoints, I think it's very difficult to sort that out. 

            It's very difficult because you'll get bigger error bars and it's very difficult because the expected utility problem is one that every person in this room has a different answer to that problem.  I think it's very difficult.  I agree with you that we are constantly besieged by our clinical colleagues who would like to have better answers to this problem.

            One case which is kind of unambiguous is the Yulei Jiang's data that I showed you had an ROC curve that went up.  The unaided condition was lower.  The action item, the dichotomization went from a certain sensitivity to a higher sensitivity and a lower false positive fraction. 

            I think everyone loves that scenario.  Wouldn't you say?  That's the world we want to live in.  Right?  That doesn't happen a lot.  These more ambiguous things happen more often.  So what we can do is average over the relevant parameters and say this is what we found. 

            In principle if one ROC curve is higher than the other, in principle one can operate at a given false positive in one modality and increase the sensitivity.  For every time B is higher than A, if the specificity is here and the curve is everywhere higher, in principle I can operate at a higher sensitivity.  In practice how to do that, wide open.  This is a professional society issue that is bigger than all of us.  That is a really tough question.  I agree.
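            The "in principle" statement can be illustrated with two model ROC curves.  The equal-variance binormal form and the parameter values are assumptions made for this sketch, not data from any study discussed here.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def binormal_tpf(fpf, a, b=1.0):
    """Sensitivity at a given false positive fraction on a binormal ROC curve."""
    return _STD_NORMAL.cdf(a + b * _STD_NORMAL.inv_cdf(fpf))

# Hypothetical separations: modality B's curve lies everywhere above A's.
A_SEPARATION = 1.0
B_SEPARATION = 1.5

# At each fixed false positive fraction, compare the two sensitivities.
for fpf in (0.05, 0.10, 0.20):
    tpf_a = binormal_tpf(fpf, A_SEPARATION)
    tpf_b = binormal_tpf(fpf, B_SEPARATION)
    print(f"FPF={fpf:.2f}  sensitivity A={tpf_a:.3f}  sensitivity B={tpf_b:.3f}")
```

            This only shows that a higher operating point exists on the higher curve; whether readers actually move to it in practice is the open question.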

            DR. BLUMENSTEIN:  And just to throw one more complicated issue into all this is that a lot of this stuff that you presented here assumed that the modalities were assessed independently.  In other words, modality A versus Modality B but the experiments that we are asked to look at are modality B added to modality A.

            DR. WAGNER:  Right.

            DR. BLUMENSTEIN:  Where the experiment itself has built-in constraints with respect to how one behaves in doing that.  I don't see that taken into account.

            DR. WAGNER:  No.

            DR. BLUMENSTEIN:  And I'm concerned about that.

            DR. WAGNER:  This is a point of confusion. I would disagree with you.  The modality A here is the reader unaided.  Modality B here is adjuvated, the reader aided by the computer aid.  This is a standard paradigm and it actually corresponds to an experiment in the real world that you would like to do.

            It may not line up exactly with the clinical setting but you actually would want to know something about the performance of readers unaided and then you want to know about how they would perform in the aided condition.  That is actually the comparison of interest.

            DR. BLUMENSTEIN:  I realize that but the way in which the data are recorded is such that the judgment -- as I understand it, the judgment under A is there and has never backed off.  You could only improve.

            DR. WAGNER:  Oh.

            DR. BLUMENSTEIN:  And that's not taken into account in any of these models that I see.  All the models that you presented, everything that you said, is based on having an independent assessment of the two modalities.

            DR. WAGNER:  Well, you have also touched on something that we have had a lot of discussions on.  These are real issues.  I'm not making light of anything you're talking about here.  One hopes the day will come when these modalities are really good.  These computer aids are really good and then you'll be allowed to back off.  You could depend more heavily on the modality. 

            Today people are being encouraged not to back off but the measurement doesn't require them not to back off.  They are just encouraged, "Do not back off," and there is a basic reason for that I think Dr. Sacks will explain later on so people are encouraged not to back off. 

            But when the systems are really good as they are in mammography, these computer-aided systems in mammography are almost flawless for picking up clusters of microcalcifications.  They are far from perfect for masses but they are almost flawless for microcalcification clusters so readers have thrown away their eye loupes, a lot of them that are using these systems so they are willing to depend on the computer.

            I'm just giving you the only anecdotal evidence.  You have a really good point.  I don't have a really good answer to it but in principle it doesn't have to be this way.  At the moment it is this way.

            DR. IBBOTT:  I would like to remind everyone we will have time to discuss this specific proposal in front of us later on this afternoon.

            DR. STARK:  May I ask a question exactly the point of the presentation, I believe?

            DR. IBBOTT:  Yes, please.

            DR. STARK:  Using the classic -- thank you.  That was an outstanding presentation.

            DR. WAGNER:  Thanks.

            DR. STARK:  Let me just get to the point because I know we are running short on time.  With a better test the AB test in some context in terms of clinical utility, either one that had less scatter.  You showed the Beam paper where the radiologist skills cause scatter in the distribution of the family of curves.

            It would seem to me that there would be two criteria applicable here where we have a different choice where the test with the larger Az is not the better test if that test is less flexible -- I'm sorry, has a larger scatter in terms of variability of radiology performance, radiology implementation creating a management problem, the implementation problem and then the clinical utility problem where all of the fabulously sophisticated group here are focused on.

            The other area where the larger Az -- so if there is more scatter in the test with the larger Az, it will likely be an inferior test, more cumbersome, more costly, less safe and less effective in clinical utilization.

            The other thing is that if there are two tests with comparable scatter but is easier to train with experience or inexperience, so if you have a trained panel of readers like you do under these study conditions under very circumscribed conditions where they know they are in a test and are not distracted by clinicians, by the busy realistic environment of all mammography or chest CT practices, you can have a curve that is more pliant in the direction that you want doctors to either start at with distractions or to move into with experience so it does seem to me that the scatter or the flexibility of the performance. 

            The ROC curve I think is unassailable and I have learned -- I have enjoyed a ton here learning from Dr. Blumenstein's analysis, yours, and those of you have seen whatever I wrote here.  My group had to do this 20 years ago.  We published papers on ROC analysis and I know we're on the right -- I believe we're on the right foundation.  

            I think this is the right place to start but the breadth of the challenge facing us all here today is let's not get obsessed with the ROC curves.  I know we have the whole day for this but the safety and effectiveness of this is going to be what happens when you drop into a clinical environment. 

            And we have a lot of experience with breast and this panel has a lot of people experienced on it but can you tell me if you would agree that we need to see the scatter in these Az plots and know how they respond to inexperience or training to really know if the larger Az is better.

            DR. WAGNER:  Well, I would say that I think there is a little bit of second order phenomena here that is important.  Just because something is second order doesn't mean it's not important.  For the practical inferences that have been -- the endpoints of studies we've seen to date, it has been the performance in the mean. 

            People have addressed that.  There is software.  We have several papers on how to do just what you say and how to split out every piece so we can see how much variation is from the cases, from the readers, from the various interactions.  There is actually software to do that and we are encouraging people who operate at a higher level, say NCI or some academic consortium, to address these very issues and we can see it.  We know how to peel all this stuff apart.  As far as the inference on the table today, it was not done.

            DR. STARK:  The burdens would be huge.  I mean, the sample sizes, the whole time period, the number of people that have to be involved.

            DR. WAGNER:  That's right.

            DR. STARK:  That's why you talked about the need for national studies and we would all like to do that in oncology and everything but we have to treat people and make decisions today.

            On the other hand, let me ask my final question.  Are you aware, or is anybody aware of any evidence that a p-value or some other statistical measure comparing your test A, B under whatever conditions, today's conditions or the ones I am dreaming about, we hope it has some clinical relevance but couldn't it all be counter intuitive?  I mean, this is a very subtle business and couldn't we be missing the forest for the trees here?

            DR. WAGNER:  Again, that's a very wise question and I think that is why we have several medical officers involved in our center on the panel here so I'll defer to them.

            DR. STARK:  So the p-value of .003 doesn't necessarily mean a thing.

            DR. WAGNER:  I defer to my clinical colleagues for that.

            DR. STARK:  Thank you.

            DR. IBBOTT:  I want to make sure that we give Dr. Mehta a chance to ask a question if he has one.  Dr. Mehta, do you have any questions?  He may not be able to hear me.

            DR. MEHTA:  No, I don't have any questions.

            DR. IBBOTT:  Thank you.

            All right.  We are a few minutes ahead of schedule at this point so we'll take a short break.  Let's make it 10 minutes and we'll be back at 10:50.

            (Whereupon, at 10:40 a.m. off the record until 10:55 a.m.)

            DR. IBBOTT:  Take your seats, please.  I'd like to continue the panel now if you will take your seats, please.  For those of you who, like me, are concerned, we are getting the heat turned down in this room.  At least in one sense.

            We will now proceed with the sponsor's presentation which will be introduced by Dr. Kathy O'Shaughnessy who is Vice President of R2 Technology.  Dr. O'Shaughnessy.

            DR. O'SHAUGHNESSY:  Thank you very much.  Dr. Ibbott, we are very pleased to be here today to present our ImageChecker CT CAD software.  I would like to introduce the attendees that are here from R2 and some consultants that we have come to -- we have asked to be here today to both present and answer questions from the panel. 

            Besides myself from R2 Technology there's Dr. Castellino, our Chief Medical Officer; Dr. Wood who is the head of our CT Products group; and Mr. Schneider who is the lead algorithm architect that designed the algorithm that we are reviewing today.

            In addition, we have asked the following people to join us.  Dr. Delgado was a beta user of the system so he can describe a little bit about his experience using the system at his facility.  Dr. MacMahon is a thoracic radiologist from Chicago with extensive experience in both CAD and ROC research.  Mr. Miller is a biostatistician for the study.  Dr. Stanford was one of the site investigators where we collected cases from one of the sites.

            Here is a brief overview of our agenda.  After my introduction we'll go into the current clinical practice for some background on lung CT and, in particular, the detection and management of nodules in lung CT images.  Then we'll describe the device both in terms of how it works and how the user uses it. 

            The clinical study will start first with how we collected the cases that were used and then go into detail into the methods and results from the clinical study.  After that we'll have a brief discussion, presentation about the beta test that describes a little bit about the usability of the system.  And I'll finally summarize.

            Before we move into the presentation, I wanted to put out our proposed indications for use of this device.  I thought it was important to go over this to sort of put what we are presenting today in context.  The ImageChecker CT is a computer-aided detection or CAD system designed to assist radiologists in the detection of pulmonary nodules during review of multi-detector CT scans of the chest.

            It's intended to be used as a second reader alerting the radiologist after his or her initial reading of the scan to regions of interest that might have been initially overlooked.

            I would like to ask Dr. MacMahon to come to the podium, please.

            DR. MacMAHON:  Thank you.  Again, I'm Heber MacMahon.  I should say I have a small equity in R2 Technology.  The company has also paid for my time and expenses for this meeting.

            I would just like to make some brief comments about the actual clinical practice of radiology as it relates to thoracic CT scans and the importance of detection of pulmonary nodules.

            Some of the common indications for performing thoracic CT scans would include characterization of an abnormal finding on a chest x-ray.  In this situation an abnormality may have been detected and the purpose of the CT scan would be to characterize it as possibly a lung cancer.  And in addition to detect additional abnormalities that might be relevant such as metastatic nodules.

            We also use thoracic CT scans extensively for staging and monitoring lung cancer and other kinds of tumors.  In this situation we are looking not only for pulmonary nodules, but also for enlarged mediastinal lymph nodes and upper abdominal abnormalities.

            In the case of extra-thoracic tumors we are commonly also looking for pulmonary nodules and for enlarged lymph nodes in the mediastinum.  Then there are a range of other applications of thoracic CT some of which are developing and will be used more extensively such as detection of pulmonary embolism.

            However, in all these situations, although the pulmonary nodules are not the primary focus of the examination, there is an opportunity to detect pulmonary nodules that may be present in the lungs of these patients.

            Finally, lung cancer screening which is investigational and depending on the outcome of the ongoing NLST study may be used more widely.  And, of course, in lung cancer screening pulmonary nodules are the main focus of the investigation.

            But the point I would make is that lung nodule detection is a requirement in every chest CT scan no matter what the original clinical indication.  Only when the radiologist has detected a nodule can he or she decide what course of action is then appropriate.

            There are various management strategies that can be used to manage a pulmonary nodule.  In order to determine whether it's an actionable nodule, we need to consider the size.  Generally larger nodules are more dangerous and more likely to be cancerous. 

            We consider the shape, whether it's spiculated, ground glass, and so forth, whether there's been interval change from a previous examination in the same institution and that would be part of the normal diagnostic process to make that comparison.  We would consider, of course, the clinical context, the age and gender of the patient, smoking history, and so forth.  There are a number of factors that play into that decision in addition to the image itself. 

            If the nodule is considered actionable, we can recommend a number of courses of action.  One of the most common would be to obtain outside prior imaging studies from other institutions.  If we can establish stability over a period of time, no further action may be necessary.

            Follow-up CT scan might be prudent at anything from three months to 12 months depending on the nature of the nodule and the radiologist's level of suspicion.  Other kinds of imaging studies such as a PET scan may be applicable, especially in larger nodules that are in the range of 8 to 10 millimeters.  This may distinguish cancer from a benign nodule.

            Finally, we can consider biopsy, either transthoracic needle biopsy, bronchoscopy, or thoracoscopic resection.

            Just to illustrate the clinical problem, here is an example of a very small pulmonary nodule which I think might easily be overlooked in clinical practice.  It's almost indistinguishable on the single section from surrounding blood vessels but this is, in fact, a small lung cancer which was detected one year later, as you can see, at which time it is much more advanced.

            So this is a very challenging problem for radiologists to visually detect these very small nodules on CT scans.  We are aware that we do miss nodules and I'll just cite two particular studies of interest that have addressed this issue of missed nodules on CT scans.

            Dr. Hartman and others at the Mayo Clinic looked at over 1,000 screening CT scans and compared them with prior screening CT scans one year earlier to see how many nodules may have been overlooked.  They found that as many as 24 percent of the prior prevalent scans had nodules that were not recorded at that time. 

            This might seem an astonishingly large number but this is consistent with some other studies.  Now, a large number of these nodules were relatively small but more than one-third of them were about three millimeters and in the size range where they are likely to be considered actionable. 

            And, in fact, 6 percent of them had grown which would mean that they were highly suspicious for lung cancers so there seems little doubt that nodules are being missed even in excellent centers such as the Mayo Clinic in a study that was focusing specifically on the detection of nodules. 

            One other study performed by Gruden and others at Emory University looked at 25 patients with presumed lung metastases.  These patients had soft tissue sarcomas and melanoma and they established truth by consensus which is a practical method using five readers.  These nodules were three to nine millimeters in size and they were solid nodules.  Two to nine solid nodules in each case by consensus.

            They found that the miss rate for individual readers ranged from 20 percent to 39 percent of all of the nodules in this size range.  This was in an observer test setting where the readers were focused on detecting nodules and presumably had no other task in mind so one would expect a relatively good performance in that situation.

            So between these two studies we can see that there is a considerable problem with oversight errors in reading CT scans.  Now we have a trend towards thinner CT sections with the newer multi-detector scanners.  This allows improved ability to detect and characterize lesions.  It does allow us to do high quality off-axis reconstructions. 

            On the other hand, it does present us with more image data, more opportunities for error.  In a chest CT scan performed with a multi-detector unit we may have anything from 18 to almost 300 images of the chest and the radiologist has to interpret those visually. 

            I think that the evidence that we've seen strongly suggests that traditional visual interpretation is no longer sufficiently reliable for detecting these very small and potentially dangerous common nodules.

            At this point I would like to introduce Ronald Castellino, Chief Medical Officer for R2 Technology.

            DR. CASTELLINO:  Thank you.  My name is Ron Castellino.  I'm also a diagnostic radiologist but currently I'm the Chief Medical Officer of R2 Technology.

            At the outset I'd like to particularly emphasize the definition of computer-aided detection which is also called CAD as we will be using it in the presentation today.  Computer-aided detection as we use it refers to the availability of computer algorithms that automatically identify regions of interest on a medical image for the radiologist to evaluate.

            Its purpose, of course, would be to decrease what I would term observational oversights.  That is, findings that are present on the image but, in fact, are not seen by the radiologist.  This is not a device to tease apart very unusual nodules that might not be present or barely present on the image.  These nodules are actually clearly visible on the image.

            The ImageChecker CT CAD system specifically is designed to automatically detect regions of interest with features suggestive of solid pulmonary nodules on CT exams of the chest.  It's important to remember that it is to be used as a supplemental review.  That is, after the initial assessment has been made by the radiologist.  It is not a first reader.

            The radiologist, most importantly, remains responsible for the final interpretation of the findings that the CAD marks may put on the image.  That is, to determine if the mark is actually a true mark or if it is a false mark.

            A brief review of the device description.  The CT scan is performed in the standard fashion.  The images, or the data set, are increasingly moved to types of work stations on which radiologists review the images in what we call a soft copy display.  These images may be reviewed slice by slice but increasingly they are reviewed in some type of a melt-through or a cine mode to facilitate reviewing these hundreds of images that are generated.

            By the same DICOM standard the data set can also go through a server computer.  Various image analysis algorithms can be put into place.  In this case, I point out segmentation.  This type of information can also be transmitted to the work station to help the radiologist further analyze the images and this is an ImageChecker CT work station which was cleared by the FDA in 2002.  This is an existing product that has been cleared.

            The same DICOM data set can also go through an ImageChecker CT CAD software system and provide on the work station CAD information as well.  It is this specific piece of the product that is under review today by the panel.

            I'll show you a few screen capture images of the front end of the work station on which the CAD marks are displayed.  The view port on the right is familiar to radiologists.  This is where we can see the axial images.  I guess I can't use this thing.  Thank you.  We are a high-tech business as you can see.

            There we go.  On the large view port on the right we can see the axial image displayed to the radiologist which is viewed either singularly or, like I said, in a melt-through or cine mode.  The smaller view port on the upper left is a three-dimensional reconstruction of the contents of the lung. 

            You can see the pulmonary vessels.  In fact, a few nodules perhaps you can see there.  And the horizontal line simply indicates to the radiologist what level on the image the axial image is displayed.  We see a nodule here quite clearly in the right apex.

            The radiologist then will move down the entire sequence of the lung in the lung windows looking for other abnormalities, nodules as well as a multitude of other features that the radiologist searches for sometimes seeing nodules and sometimes not seeing nodules. 

            When they completely review the entire study, which I'm giving to you in a very schematic fashion here, the radiologist then will activate with a mouse click the CAD button we call the R2 button.  At that point in time the CAD process takes over and presents the following. 

            The circles indicate candidate nodules that the CAD system has identified shown to the radiologist on the three-dimensional display of the lungs, as well as brings the radiologist automatically to that specific site where the nodule is best seen by the CAD system.

            In addition, another view port on the lower left is shown.  This is a three-dimensional reconstruction that can be rotated to separate the nodule out from adjacent vasculature.  I would like to emphasize that upon the CAD review the radiologist need not go through the entire data set once again but simply by moving and hitting one of these little buttons here with a mouse click which you can't read here.  It automatically jumps the image.  By the way, the size is automatically shown as well. 

            It automatically jumps the image to the next CAD detected nodule and the next and so forth.  For example, this nodule, as I showed you and, for example, a nodule at the right base which is clearly a nodule but, in this case, had been overlooked by the radiologist on the set of images.

            That is the CAD display on the work station.  What does the CAD search for?  It is specifically designed to search for solid lung nodules that are 4 mm. or greater in size and we define that further as follows.  They should have an approximate spherical shape. 

            The margins can be smooth, lobulated or spiculated and should have soft tissue density which we define as having average density of minus 100 Hounsfield units or greater.  Some of the typical CAD marks you've seen already.  They circle the nodule.  We consider this a true mark if it actually encompasses the size of the nodule sometimes quite small, moderate in size. 

            I would also like to emphasize that, although we look for spherical nodules, if, in fact, the nodule is adjacent to a pleural surface where a portion of the sphere is obliterated by contact with the pleural surface, the algorithm tries to find these as well.

            Secondly, this image perhaps some of you can see, although it is easier for the radiologist and the CAD system to detect a nodule that is surrounded by completely normally aerated lung, if there is adjacent modest non-aerated lung as we see here in the dependent edema, the CAD algorithm often is successful in teasing out the nodule as well.

            There are a multitude of other parenchymal abnormalities within the lung tissue that the CAD algorithm does not search for.  The radiologist must look for these but the CAD algorithm does not search for.  For example, linear strands which do not fit the criteria.  I would like to point out importantly although this fits the criteria of being a spherical nodule, we call these ground glass opacities. 

            They are increasingly noted to be of importance, particularly for lung cancer screening programs that because of the Hounsfield density cutoff that we have, this type of nodule currently is not searched for with our set of algorithms.
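            The search target described above can be illustrated with a minimal sketch.  It is not R2's algorithm.  The 4 mm. size floor and the minus 100 Hounsfield unit density floor come from the presentation; the `Candidate` fields, the `sphericity` feature, its 0.7 cutoff, and the example values are assumptions made only for illustration.

```python
from dataclasses import dataclass

# Thresholds stated in the presentation.
MIN_DIAMETER_MM = 4.0   # "4 mm. or greater in size"
MIN_MEAN_HU = -100.0    # "average density of minus 100 Hounsfield units or greater"
# Assumed cutoff: the transcript only says "approximate spherical shape".
MIN_SPHERICITY = 0.7


@dataclass
class Candidate:
    diameter_mm: float
    mean_hu: float
    sphericity: float  # hypothetical feature, 1.0 = perfect sphere


def in_search_target(c: Candidate) -> bool:
    """Return True if the candidate meets all three stated criteria."""
    return (
        c.diameter_mm >= MIN_DIAMETER_MM
        and c.mean_hu >= MIN_MEAN_HU
        and c.sphericity >= MIN_SPHERICITY
    )


# Illustrative values only, not study data.
solid = Candidate(diameter_mm=6.0, mean_hu=20.0, sphericity=0.9)
ground_glass = Candidate(diameter_mm=8.0, mean_hu=-600.0, sphericity=0.9)
linear_strand = Candidate(diameter_mm=10.0, mean_hu=0.0, sphericity=0.2)

for name, cand in [("solid", solid), ("ground glass", ground_glass),
                   ("linear strand", linear_strand)]:
    print(name, in_search_target(cand))
```

            In this sketch the ground glass opacity is spherical and large enough but fails the density floor, and the linear strand fails the shape test, which mirrors the two exclusions described in the presentation.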

            All CAD systems have false marks.  We see a few here such as this one here where a branching vessel exists.  The CAD algorithm thought this was a nodule and marked it incorrectly.  Pleural tags are at times marked incorrectly.  I can tell you that our experience internally as well as with users indicates that the vast majority of these false marks can be readily dismissed as you see here.

            As an aside, we have found in our regulatory database a median of three false marks per exam.  I would like to emphasize this is per exam.  There is a median of 160 images per exam so we're talking about approximately one false positive mark for every 50 to 55 individual images.
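            The arithmetic behind that figure can be checked directly.  This is a rough sketch that simply divides the two medians quoted above; a ratio of medians is not the same as a per-exam median of the ratio, so it should be read only as an approximation.

```python
# Figures quoted in the presentation.
MEDIAN_FALSE_MARKS_PER_EXAM = 3
MEDIAN_IMAGES_PER_EXAM = 160

# Rough estimate: images reviewed per false mark.
images_per_false_mark = MEDIAN_IMAGES_PER_EXAM / MEDIAN_FALSE_MARKS_PER_EXAM
print(f"about one false mark per {images_per_false_mark:.1f} images")
```

            The result, about 53 images per false mark, falls inside the quoted range of 50 to 55.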

            Now, the clinical study was designed around an ROC study as you've heard from Dr. Wagner.  It was done in close collaboration and support with the people from the FDA.  The ROC study to a large extent does measure -- a combined measure of efficacy and safety.  There is some discussion about that and Dave Miller will fill you in on that as we see it, at least.

            There are three parts.  We've collected cases.  I'll review that.  These cases were sent to a reference truth panel and finally to the MRMC ROC study which you'll hear about from Dave Miller.

            I would like to spend only a brief comment upon the target of nodules.  You've heard from Dr. MacMahon that we are increasingly seeing smaller nodules on our CT scans and our clinical practice.  We wanted to design a CAD system to help radiologists detect all solid nodules between 4 and 30 mm.  That was the focus of our research effort.

            And, as you are well aware, those in the clinical practice you will recognize that most lung nodules most of the time are typically sampled by biopsy or thoracic resection if they are 8 or 10 mm. or so or greater in size.  There are obviously exceptions to this but, in general, they are. 

            The availability of a biopsy proven so-called gold standard to evaluate nodules in this smaller size range was just not available to us.  We settled on a gold reference standard of a consensus on actionability as being the only practical standard that would capture all solid nodules of clinical concern in this size range.  We are really focusing and trying to help the radiologist in the 4 to 8, 10 to 12 mm. range.  The larger nodules, of course, radiologists will almost always see.

            We collected cases from five centers.  They contributed consecutive non-selected cases.  We tried to make this as representative as possible.  They were all in adults.  They were performed for a variety of clinical indications. There were no screening studies in this group. 

            Cases with greater than 10 nodules were excluded.  We felt that there were a multiplicity of nodules.  The issues of searching for nodule where the radiologist has already seen 8, 10, 12, 15 would be reported.  The images, of course, have to reach certain technical parameters.

            These cases were divided into two categories to begin with by report.  The nodule-present cases had in the report the presence of one nodule or more described by the reviewing radiologist.  These patients by definition had a history of biopsy proven documented cancer either primary to the lung or in an extra thoracic site. 

            We did this to try to increase the likelihood that nodules in this group might have clinical significance because they were in patients with cancer but I would like to point out that the specific nodules themselves were not biopsy proven.  The nodule absent cases, once again by report, no nodules were described within the context of the report.  These patients could have a history of cancer or not.

            The final truth was determined by the reference panel which you'll hear about from Mr. Miller.  Five sites contributed to the study.  Three of these are community imaging centers, two are university centers.  They were from the east coast, mid-west, west coast.  There were 63 cases that had nodule-present by report, 88 nodule absent by report.

            You can see the distribution between males and females was similar.  The age range was similar in the two groups.  There was a slight increase in median age in the nodule-present cases perhaps because they all had documented histories of cancer as compared to this group.  The type of cancer in the nodule-present case, 38 percent had a documented primary lung cancer and 62 percent had documented extra-thoracic primaries.

            Here are some of the parameters of the technical aspects of the case characteristics, the median number of slices you see here.  There is a slight predominance of thinner slice sections in the nodule absent cases mainly because one of the centers was doing much thinner slices routinely and they contributed a larger amount of nodule absent cases.

            The CT vendors used in these five sites were General Electric or Toshiba.

            I would like to ask Dave Miller to present the methods and the results of the study.

            MR. MILLER:  Thank you.  My name is Dave Miller and I am currently the Director of Statistical Analysis at Ovation Research Group.  At the time that this study was conducted I was the Director of Biostatistics at R2 Technology.  R2 is paying for my time and travel.  However, I do not have any financial interest in R2 Technology.

            Just want to quickly go through an outline of what I'm going to discuss because I'll be up here for a little while.   I'm going to go through some definitions that I'll be using during the talk.  Then I'll talk about the reference truth panel.  I'll talk about the ROC study design, our primary analysis.  Then we did a large set of robustness analyses.  Then finally the study conclusions.

            So gold standard, and these are definitions that I'm going to use.  They are not necessarily dictionary definitions of these but gold standard is something that I'll define as an objective and definite measure of truth.

            The reference truth is a truth standard for a subjective construct.  It is a term that is fairly widely used and it's a term that I'll be using here as a standard that's used in lieu of an available gold standard.  The kind of thing that reference truths are used for are things like actionability where actionability is something I'm defining as a subjective point-of-care decision which is really what we're targeting with actionable nodules. 

            Nodule also is a subjective definition.  It's a subjective characterization of a lung abnormality.  Finally, a panel is a group of radiologists with a given task.  In this case, their task was to identify and characterize actionable nodules.  Consensus is a term I'll use only for unanimous agreements.  When you hear we use consensus, that means unanimous agreement as opposed to majority agreement.

            Then, finally, a few study definitions.  I'll run through these very quickly because you've got a very nice tutorial from Bob Wagner this morning.  The ROC curve is the receiver operating characteristics curve.  Az is the area under the ROC curve, the measure of interest in the study. 

            MRMC stands for multi-reader, multi-case.  I'll use the term primary analysis for our protocol specified primary analysis and the term ANOVA-after-jackknife.  The ANOVA there is analysis of variance and you've got a nice description of both the jackknife and the bootstrap earlier.
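            As an illustration of the jackknife step only, here is a minimal sketch under stated assumptions.  It uses the empirical (Mann-Whitney) area under the ROC curve rather than a fitted Az, it handles a single reader, and it omits the analysis of variance that would follow across readers and modalities.  The function names and the ratings are hypothetical, not taken from the study.

```python
from typing import List, Sequence


def empirical_auc(pos: Sequence[float], neg: Sequence[float]) -> float:
    """Mann-Whitney estimate of P(positive score > negative score); ties count one half."""
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def jackknife_pseudovalues(pos: Sequence[float], neg: Sequence[float]) -> List[float]:
    """Leave-one-case-out pseudovalues: n * theta - (n - 1) * theta_without_case."""
    n = len(pos) + len(neg)
    full = empirical_auc(pos, neg)
    pseudo = []
    # Drop each nodule-present case in turn.
    for i in range(len(pos)):
        reduced = empirical_auc(list(pos[:i]) + list(pos[i + 1:]), neg)
        pseudo.append(n * full - (n - 1) * reduced)
    # Drop each nodule-absent case in turn.
    for j in range(len(neg)):
        reduced = empirical_auc(pos, list(neg[:j]) + list(neg[j + 1:]))
        pseudo.append(n * full - (n - 1) * reduced)
    return pseudo


# Made-up confidence ratings for one reader (not study data).
nodule_present = [0.9, 0.8, 0.6, 0.4]
nodule_absent = [0.7, 0.3, 0.2, 0.1]

auc = empirical_auc(nodule_present, nodule_absent)
pseudo = jackknife_pseudovalues(nodule_present, nodule_absent)
print(auc, len(pseudo))
```

            One pseudovalue is produced per case; in an ANOVA-after-jackknife analysis those per-case values, computed for every reader and for both reading conditions, would be the input to the analysis of variance.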

            So under the reference truth panel the goal of the reference truth panel was to fully identify all nodules in the case sets.  These are the cases that Ron described how they were collected.  We wanted them to rate the actionability of any nodules that they found.  Specifically we are defining actionable as a nodule that requires surveillance or intervention so it could be follow-up or it could be more of an intervention.

            We define the reference truth so that we could use it in the ROC study.  The method was to have a panel of three radiologists independently review the cases and we followed a two-pass process to reduce observational oversights.

            The reference truth panel qualifications were that they needed to be board certified radiologists, that they had at least six months of reading thin slice which we defined as less than or equal to 3 mm. collimation CT of the chest, and they needed to have experience with reading soft copy.

            A total of 11 panelists participated in at least one of the three-member panels that were convened.  Just to be clear, we didn't have a single three-member panel because it just would have taken weeks for three people to review the set of cases that we had.  We had a succession of panels and there were a total of 11 different panelists that participated in at least one of those panels.

            Nobody participated in more than three and obviously nobody participated in less than one.  This is how the panels worked.  We brought the radiologists in and we put them in three different rooms.  This is after a brief sort of training that we gave them prior to going to the three different rooms.  They had three different work stations set up and they each independently reviewed a set of cases.  In a typical session we had about 20 cases reviewed.

            After they had reviewed all of the cases for a given day, and this usually took maybe four to six hours or so, we took the computer files of all of their findings and these are findings of the exact locations and we brought them together to get the union of all findings so that redundant findings were captured and we knew every finding that any panelist had found.

            This is a little hard to see up there but we also at this stage excluded nodules that were less than 4 mm. in size or greater than 30 mm. in size.  Those were protocol exclusions and we had asked the radiologists not to spend too much time taking precise measurements as they were doing this.

            After this there were 95 findings where three out of three of the panelists agreed that it was a consensus actionable nodule.  I couldn't say consensus.  Three out of three agreed and, thus, there was a consensus that it was an actionable nodule.

            Now, there was also a large set where there was disagreement.  Either one out of three or two out of three of the radiologists had identified the finding and the other radiologist either had overlooked the finding or didn't feel that it was an actionable nodule.  These went to a second pass. 

            The way the second pass worked is that after about a half hour of prep or so they went back into their individual rooms so they didn't come together and talk about the cases.  They each went back to their individual rooms and they had the locations of each of these disagreement findings identified for them.  So the second pass went fairly quickly because they didn't need to go through the whole case.  They were just looking at and being directed to specific spots and being asked to rate the actionability.

            After this there were 47 additional nodules that went into our truth set of unanimous nodules.  There was also a fair number that went into what we call the majority group, that two out of three felt that it was actionable, and a minority group that one out of three felt that it was actionable.
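
            The tallying described here can be sketched roughly as follows.  This is only an illustration, assuming each panelist's findings are stored as a set of location identifiers; the function names and data layout are hypothetical and are not taken from the study software.

```python
def union_of_findings(per_panelist):
    """Merge per-panelist findings into one table of votes.

    per_panelist: dict of panelist id -> set of finding ids (hypothetical layout).
    Returns a dict of finding id -> set of panelists who marked it actionable.
    """
    votes = {}
    for panelist, findings in per_panelist.items():
        for finding in findings:
            votes.setdefault(finding, set()).add(panelist)
    return votes


def classify_findings(votes, panel_size=3):
    """Label each finding by how many of the panelists called it actionable."""
    labels = {}
    for finding, voters in votes.items():
        n = len(voters)
        if n == panel_size:
            labels[finding] = "unanimous"   # three out of three
        elif 2 * n > panel_size:
            labels[finding] = "majority"    # two out of three
        elif n >= 1:
            labels[finding] = "minority"    # one out of three
        else:
            labels[finding] = "rejected"    # nobody called it actionable
    return labels
```

            With three panelists, a finding marked by all three comes out as unanimous, by two as majority, and by one as minority, which are the three groups named above.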

            Our primary analysis focuses on consensus agreement but we did do some robustness analyses around the majority and minority.  I'll be talking about that later but for now I'm focused on the unanimous nodules.

            So as a result of this process the eight three-radiologist panels.  I told you there was a series of panels.  There were, in fact, eight of them.  They identified 142 consensus nodules in 65 nodule present cases.  You might notice that number 65 is slightly different than the 63 number that you saw earlier.  That's because now our consensus panel is the definition of truth for this study.

            You can see the size of these findings.  The median size was 7.9 mm. and there were a lot of them that were in the 5, 6, 7 millimeter range.  The remaining 86 cases were categorized as nodule absent by virtue of not having any of the unanimous nodules in them.

            So moving onto the MRMC ROC study, the objective of this study per protocol was to demonstrate that review of CAD output improves performance of radiologists reviewing MDCT with respect to their ability to accurately identify actionable nodules.

            Our outcome measures were AzB.  That is, the before CAD area under the curve, AzA, that is the after CAD, the area under the curve and, most importantly, Azdelta.  This is basically the difference between the two curves.  And the hypothesis in a formal statistical sense -- the null hypothesis was that the mean change in the area under the curve was zero and the alternative hypothesis, of course, is that Azdelta is greater than zero meaning the CAD did have a benefit.
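
            A minimal sketch of the three outcome measures, assuming an empirical (Mann-Whitney) estimate of the area under the curve.  The curve-fitting method actually used in the study is not specified here, so this estimator, the function names, and any example ratings are assumptions for illustration only.

```python
def empirical_az(pos_scores, neg_scores):
    """Empirical area under the ROC curve (Mann-Whitney statistic).

    pos_scores: ratings given to nodule-present units.
    neg_scores: ratings given to nodule-absent units.
    Ties between a positive and a negative rating count as one half.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def az_delta(pre_pos, pre_neg, post_pos, post_neg):
    """Post-CAD Az minus pre-CAD Az for one reader."""
    return empirical_az(post_pos, post_neg) - empirical_az(pre_pos, pre_neg)
```

            Under the null hypothesis the mean of az_delta over readers and cases is zero; a positive value favors reading with CAD.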

            The study was conducted in two phases.  We first did a 32-patient study and then after doing that study we had some discussions with FDA and we outlined what would be the appropriate methodology to use for a second study, what the appropriate size for the second study would be based on the type of methodology that was suggested.  So I'm going to be talking about that second 90-case study as the focus of this talk.

            The reader qualifications for the ROC study, so this is, again, a new set of readers.  Don't confuse them with the reference truth panel.  Completely different people.  It would be wrong to have the same people.  These people had reader qualifications that they be board-certified radiologists and have at least three months of reading MDCT of the chest.

            The basics of the study is that we have 15 readers read all cases.  We had 90 cases.  Of the 90 cases 48 had at least one actionable nodule and 42 did not have any actionable nodules and that was based on a stratified random sample of our complete set of cases. 

            There were, of course, four quadrants per case by definition but the important point is that these quadrants, all four of them, were rated pre-CAD and then sequentially post-CAD.  The ratings were finally evaluated against the reference truth so the ROC curves were drawn by comparing the ratings which were on a continuous scale to the reference truth established by the panel.

            I want to clarify what the unit of analysis is because I know people have a tendency to want to sort of track the numbers as they go through the slides and see where things add up so, just to be clear, nodules were the unit of analysis for the reference truth.  The reference panel was supposed to identify every nodule. 

            Quadrants -- the quadrant truth was computed from the nodule truth.  For instance, if there was a quadrant that had one actionable nodule and one non-actionable nodule, the quadrant was, nonetheless, considered nodule-present quadrant because it had at least one.

            On the other hand, if there was a quadrant that had a minority nodule in it, in other words, a nodule that at least one person on the panel thought was a nodule but not unanimous, that was considered a nodule absent quadrant.  Every quadrant counted in every analysis that we did.
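
            The quadrant rule can be sketched as below.  The labels and the function name are hypothetical; the "majority" standard follows the description given later in this presentation, and the "minority" standard is assumed to count any nodule that at least one panelist called actionable.

```python
def quadrant_truth(nodule_labels, standard="unanimous"):
    """Return True if a quadrant counts as nodule-present.

    nodule_labels: labels of the nodules located in one quadrant, each one of
    'unanimous', 'majority', or 'minority' (hypothetical encoding).
    standard: which reference truth is in force.
    """
    accepted = {
        "unanimous": {"unanimous"},
        "majority": {"unanimous", "majority"},
        # Assumption: the loosest standard accepts every panel-marked nodule.
        "minority": {"unanimous", "majority", "minority"},
    }[standard]
    # One qualifying nodule is enough; an empty quadrant is nodule-absent.
    return any(label in accepted for label in nodule_labels)
```

            Under the primary (unanimous) standard, a quadrant holding only a minority nodule returns False, so it stays in the analysis as a nodule-absent quadrant rather than being dropped.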

            Now, the reason that we went with this quadrant approach is that the LROC methods were not developed at the time that we embarked on this for multi-reader, multi-case studies.  I think they probably will be in time and they may even be right now but at the time we began the study, they were not.

            Bob Wagner described it a little bit as these being sort of competing fields that people that went with the ROI approach versus the people that go with the full localization.  I think really there are two camps that are going after the same thing of trying to get some measure of localization added to the ROC method. 

            We felt that for this particular case where you might have a nodule that was quite large in one lung and then a smaller nodule in a contralateral, that that smaller nodule in some cases might be the really important one that actually drove the care.  We felt that getting at localization in some way was important.  We went with the quadrant approach.

            The quadrants were rated by the ROC readers but then the case, not the quadrant, is the unit of analysis for the computation of the p-values and the confidence intervals based on the jackknife and the bootstrap.  You heard these references mentioned earlier but Obuchowski specifically is the reference for using this region of interest or quadrant approach.  Carolyn Rutter is the person that developed the method of using the bootstrap to sample cases.

            The reading environment for our study is that readers were trained on work station use and we really tried to create a reading environment that was as similar to their individual practices as possible.  So the usual work station controls were available to them.  If any individual reader had a particular window or leveling preferences, they were allowed to modify that.  We didn't have it in the protocol that they had to read a particular way that would take them out of their reading environment.

            They were allowed to practice on three cases with the trainer present.  The ambient lighting was adjusted to the radiologist preference.  There was no hard time limit.

            The instructions given to the readers were to only search for 4 to 30 mm. actionable solid nodules, to rate each case post-CAD immediately after the pre-CAD rating so they had to go through the entire case pre-CAD and provide the ratings before the computer would even allow them to turn on CAD and then provide the post-CAD ratings.

            They were instructed to consider age, gender, and clinical indication.  These were taken from the radiology report.  We did not provide them with the full radiology report as that obviously would have provided too much information for them to be able to make up their own decisions.

            So the basic study work flow here -- let's see which of these works.  Yeah, this one works.  When you saw the work station earlier, there was no blue line.  The blue line is separating the upper quadrant from the lower quadrant.  We didn't feel like we needed a line to separate left and right.  The yellow line is indicating where they are in the exam.

            As they were reading the case, they had the opportunity to bring up a pop-up menu to rate the quadrants at which point they would get this little cartoon of sorts with these slider bars.  They would move the slider bars either all the way over -- you can't see.  There's a little 100 there -- to indicate complete confidence that there was at least one actionable solid nodule present in the quadrant, or zero to indicate complete confidence that there were none.

            In this particular case you can see that the reader has gone through and given a pretty low confidence or, I should say, a high confidence that there are no nodules present in any of the quadrants.

            Having done that they then have the opportunity to click this button up here and turn on CAD.  It's a little bit hard to see here but there is a potential nodule.  I'm not a radiologist.  I won't tell you whether it is a nodule but it is located there in the upper right quadrant.  Then they would have the opportunity to rate the case again. 

            In this case they might have changed their rating.  In the other quadrant since there was only a mark in the upper right-hand quadrant, it's fairly unlikely that they would have changed any of their other ratings but they were allowed to.

            So after doing this with our 15 readers who each read the 90 cases, both pre-CAD and post-CAD, we were able to draw the ROC curves for each of the individual readers.  This is just an example of a single reader and so the area under the dashed line is the pre-CAD Az and the area under the blue line is the post-CAD Az and then the area in between the lines is the Azdelta.

            These are the 15 pairs of readings.  I didn't produce this plot specifically to answer some of the questions that came up earlier this morning but I think it might answer some of them a little bit.  Now, this is not the same plot that you saw earlier.  This has the pre-CAD area under the curve on the bottom and the post-CAD area under the curve going on the Y axis.  So pre-CAD the range was from about .82 up to .96.  That's the range of the 15 readers' area under the curve.  Post-CAD the low end was .86 to .96 so you can see a narrowing of the range post CAD with respect to Az.

            In particular, these three readers who had -- I'm trying to look for a different word than worst -- had the worst pre-CAD Az performance of around .82 to .84 were the ones that improved the most, or were among those who improved the most.  You might wonder what about readers that did pretty well.  Well, these two readers did very well pre-CAD, at least, measured against Az.  And post-CAD they also had some improvement.  It was a more modest improvement.  They didn't have as much to improve.

            Now, finally, there's this reader up here.  This reader had a nearly perfect pre-CAD performance.  This does just go to .96, not all the way to 1 so they weren't absolutely perfect.  What you worry about with a reader such as this is you don't want CAD to cause them to change their impressions so they get worse and they did not.

            So moving onto the primary analysis this is the average reader ROC curve.  Again, here is the pre-CAD line, the post-CAD line, and the area in between is the Azdelta.  I'm just going to focus in on this part right here because it is an important point about whether or not the curves cross. 

            The curves do not cross and so you can see that they are always apart.  Especially in this area here I think is the area where people are most likely to have their individual operating points, although, as you saw, they might go all the way out here.

            These are the same 15 dots just plotted against a different axis so this is sort of how far away they were from that line.  You can see individual reader improvements ranging from about .06 to zero to no improvement.  And then the idea behind the Dorfman-Berbaum-Metz ANOVA-after-jackknife analysis is to create a confidence interval and compute a p-value that would allow us to figure out what might happen with a new reader with a new case.

            I mean, that's really the idea of this confidence interval is what kind of performance would we expect from a new reader with a new case.  You can see that both the individual readers as well as the average delta and the confidence intervals are well on the side of CAD better as opposed to the side of CAD worse.

            Now, we went ahead and did a number of robustness analyses and these were basically about repeating the primary analysis varying different assumptions to demonstrate that the primary results are not sensitive to study design.  I think these are very, very important because there is a considerable literature that you can tweak different things and end up with different results.  If we had found that, we would have been in a difficult position because we wouldn't have known whether or not we really did have a robust result.

            I'm going to talk about this with reference to the statistical methodology, specifically the ANOVA approach versus the bootstrap approach.  There are lots and lots of different iterations on this but I'm just going to focus on these two.  I'm going to talk about the reference truth.  I'll focus on the consensus standard versus the majority standard but there are a number of other reference truths that we examined and I'll just focus on those two.

            And then panel variability.  I've talked about the confidence interval being a way of getting at what would happen with a future reader with a future case.  What you really want to know is what would happen with a future reader and a future case evaluated against a new truth, right?

            That means that you don't just have to have the random reader and the random case components of the ANOVA model.  You also have to have some way of evaluating your truth against the random panel if you are going to fully capture the variability.

            So the ANOVA-after-Jackknife compared to the bootstrap, I'll run through this quickly because you heard this earlier.  The ANOVA-after-Jackknife is based on leave one out samples.  Again, the leave one out here is cases.  A case is being left out of each sample as opposed to a quadrant.

            The Az, the area under the curve, has been computed for each reader case combination and then analysis of variance random effects model is fit.  This is the standard analysis of variance random effects model with full interactions described by Dorfman-Berbaum-Metz.
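
            The leave-one-case-out step can be sketched as follows.  This shows only the jackknife pseudovalue calculation for a single reader and a generic statistic; the random effects ANOVA that is then fit to the pseudovalues is not shown, and the function name and arguments are hypothetical.

```python
def jackknife_pseudovalues(cases, statistic):
    """Leave-one-case-out pseudovalues for one reader.

    cases: list of per-case data for this reader (any structure).
    statistic: function taking a list of cases and returning a float,
               for example an estimate of Az.
    The pseudovalue for case k is n * theta_all - (n - 1) * theta_without_k.
    """
    n = len(cases)
    theta_all = statistic(cases)
    pseudovalues = []
    for k in range(n):
        left_out = cases[:k] + cases[k + 1:]   # drop the whole case, not a quadrant
        pseudovalues.append(n * theta_all - (n - 1) * statistic(left_out))
    return pseudovalues
```

            When the statistic is a simple mean, the pseudovalues reproduce the original observations, which is a convenient check on the bookkeeping.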

            The bootstrap, I think nonstatisticians a lot of times find the bootstrap a little bit more intuitive.  The experiment is replicated in 1,000 random samples so from our sample of readers and cases, we generated random samples of readers and a random sample of cases and for each sample we matched our random readers with the random cases and repeated the entire analysis.

            It is very computationally intensive but it gives you a way of coming up with confidence intervals that allow a nonparametric -- fully nonparametric approach to evaluating what would happen with a future reader in a future case.  I do want to point out that the ANOVA-after-jackknife is semi-parametric.  The ANOVA piece is parametric but the jackknife piece is nonparametric.
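
            A rough sketch of resampling readers and cases together is given below.  It assumes a simple percentile interval and a caller-supplied statistic; the function name, the default of 1,000 replicates, and the seed are illustrative choices, and details such as keeping nodule-present and nodule-absent cases in every replicate are not handled.

```python
import random


def bootstrap_ci(readers, cases, statistic, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval, resampling readers and cases.

    readers, cases: lists of identifiers or records.
    statistic: function(reader_sample, case_sample) -> float, for example the
               mean Azdelta recomputed on the resampled readers and cases.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        # Sample with replacement, keeping the original sample sizes.
        reader_sample = [rng.choice(readers) for _ in readers]
        case_sample = [rng.choice(cases) for _ in cases]
        estimates.append(statistic(reader_sample, case_sample))
    estimates.sort()
    lower = estimates[int((alpha / 2) * n_boot)]
    upper = estimates[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper
```

            Because the interval is read off the sorted replicates, nothing forces it to be symmetric about the point estimate.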

            So these are the confidence intervals for the ANOVA versus the bootstrap.  You can see that the confidence interval for the ANOVA is a little bit tighter.  For the bootstrap it's a little bit broader.  One of the things that the bootstrap is known for is being able to come up with confidence intervals that are not actually symmetric about the mean because often there is not really any reason to believe that the confidence intervals would be symmetric about the mean.  In this case you can see it actually goes out further on the CAD better side.  Even though the confidence interval is wider, it does not in any way diminish the results.

            So returning again to the primary analysis, the primary analysis, as I showed you earlier, is based on a delta Az of .024 and a p-value of .003.  I just showed you a different methodology using the bootstrap and came up with .0246, very close, and a p-value of less than .001.

            Then we went on to a different reference truth.  The different reference truth that I'm talking about here, and I apologize that it's not on the slide.  We didn't want to make it too dense, but this different reference truth is majority so this means that a quadrant would be considered nodule present if there was at least one majority or consensus nodule and it would be considered nodule absent if it did not have any majority nodules in it.

            A really important thing to point out here is that the majority quadrants, the ones that two out of three radiologists in the panel consider to be actionable, are included in every single analysis so that means that when we're talking about the unanimous truth, they go into the false positive side of things, as somebody calls it.

            On the other hand, if we talk about this reference truth, they go into the true positive side.  We felt like we don't know if those are nodules or not and so the most conservative approach to take is to always put them in every analysis.

            The delta Az here is a little bit lower but the p-value is actually more significant, to use a loaded term.  This has to do, I think, with this sample-size paradox that Bob Wagner was describing earlier.  The final step was to do the random reference truth.

            We did the random reference -- actually, before I go to that, I want to mention on the different reference truths in addition to majority and consensus, we also looked at a minority reference truth which is sort of the loosest possible standard we could come up with.

            We also did a tighter truth based on having a second panel of five people look at the cases and define the truth more tightly.  In all four of those cases we came up with a similar statistically significant result.  So the random reference truth is based on picking two panelists at random to review each case. 

            Pretend that the three-member panels didn't exist.  Redo the truth assuming that third person just wasn't there in their room.  When you bring together the first-pass findings, their data doesn't come in.  When you go to the second-pass it's only the two out of two consensus.  This allowed us to come up with confidence bounds that captured that piece of the variance.  It ended up being fairly similar, although the delta Az is somewhat diminished from that of the primary analysis.
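
            One way to read the two-panelist procedure is sketched below.  It assumes the first-pass findings and the final actionability votes are both available per panelist; the function name and data layout are hypothetical, and the actual implementation may have differed.

```python
import random


def random_pair_consensus(first_pass, final_votes, panelists, rng):
    """Rebuild the consensus truth for one case from two random panelists.

    first_pass: dict of finding id -> set of panelists who marked it on the first pass.
    final_votes: dict of finding id -> set of panelists who rated it actionable
                 after the second pass.
    panelists: list of the three panelist ids.
    rng: a random.Random instance.
    """
    pair = set(rng.sample(panelists, 2))
    consensus = set()
    for finding, voters in final_votes.items():
        # The dropped panelist's first-pass marks never enter the union.
        seen_by_pair = bool(first_pass.get(finding, set()) & pair)
        # Two out of two must agree that the finding is actionable.
        if seen_by_pair and pair <= voters:
            consensus.add(finding)
    return pair, consensus
```

            Repeating this draw many times and rerunning the ROC analysis on each redrawn truth is what lets the panel's own variability show up in the bounds.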

            So all variations gave statistically significant results.  I'm a statistician so that's what I know best and that's what I'm best prepared to talk to you about.  I take the point of some of the panelists that -- by panelists here I'm referring to you all as opposed to any of our other panelists.

            You want some sense of what does it all mean.  What does this Azdelta of .02 mean?  For myself, I find it useful to think about individual operating points.  This is the pooled curve where we pool all of the readers together.  You can't really translate this to a new reader and a new case.

            These are analyses that you don't do to find statistical significance or to get a particular confidence interval or particular estimate.  They are analyses you do to try to understand the data.  They were analyses that we put in our protocol that we would be doing but they were secondary analyses just to try to get some sense of what's going on here.

            So this is the operating point of 20.  Recall that we have this 0 to 100 scale so 20 reflects sort of the most aggressive end of the spectrum.  We could go all the way out to 0 but 0 is just all the way at that end.  Twenty was an area where you could imagine a fairly aggressive reader would say, "Even for a 20 I might want to do some kind of follow-up."  Fifty was indeterminate on our scale so that is one operating point that is interesting to look at.  Eighty would reflect sort of the least aggressive reader.  This is by no means all readers.  If I put this plot out with all 15 of the readers, you get sort of that weird scatter plot similar to what you saw earlier, but just to get a rough sense of what kinds of improvements are maybe plausible.

            So this dotted vertical line here is the line that corresponds to having the same false positive fraction.  This is saying that if you started out at 50, your sensitivity could increase by this much without sacrificing your false positive fraction at all.  Not one iota.  If you think of the false positive fraction as your measure of safety and you think of the true positive fraction as your measure of efficacy, that is saying you can go up and get efficacy without any safety tradeoff.

            Now, it's probably more likely that people are going to go a little bit up and over so maybe they are going to call more things.  That's what we see with our individual rating.  You can go up and over and still have the same positive predictive value.  Even though you are giving up a little bit on the false positive fraction, you still have the same positive predictive value.

            This 50 here is still a little bit over from that so it's not exactly the same positive predictive value but the basic point is that you can go up and over without having a sacrifice or without having a substantial sacrifice.
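
            To make the operating-point idea concrete, here is a small sketch that turns 0 to 100 quadrant ratings into a true positive fraction, a false positive fraction, and a positive predictive value at a chosen threshold.  It assumes a rating at or above the threshold counts as a positive call; that convention and the function name are assumptions for illustration.

```python
def operating_point(ratings, truth, threshold):
    """Return (TPF, FPF, PPV) for one threshold on the 0-100 scale.

    ratings: confidence ratings, one per quadrant.
    truth: booleans, True where the quadrant is nodule-present.
    threshold: ratings >= threshold are treated as positive calls (assumption).
    """
    tp = sum(1 for r, t in zip(ratings, truth) if t and r >= threshold)
    fp = sum(1 for r, t in zip(ratings, truth) if not t and r >= threshold)
    n_pos = sum(1 for t in truth if t)
    n_neg = len(truth) - n_pos
    tpf = tp / n_pos                      # sensitivity
    fpf = fp / n_neg                      # 1 - specificity
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    return tpf, fpf, ppv
```

            Comparing the pre-CAD and post-CAD results at the same threshold shows whether a reader moved straight up (more sensitivity at the same false positive fraction) or up and over.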

            So these are the analyses that I mentioned.  They were in our protocol as analyses that we were going to do, but I really am very sympathetic to what Bob Wagner said about these numbers.  It's so hard to say what they mean.  What are these numbers.  I don't want anybody to run too far with these numbers but I do feel like it's necessary, especially for people who aren't statisticians, to want to understand what's going on with some of the raw data.

            If we take 20 as the threshold for where somebody -- pretend that all readers treat 20 as their criteria for actionability, then we would have had 16 percent of the total nodules so there were 1,125 positive quadrants that the 15 readers looked at.  Sixteen percent of those would correspond to misses.  With this very aggressive cutoff I think odds are those are, in fact, observational oversights.

            Post-CAD that goes down to 11 percent so the 16 percent versus 11 percent, that's a 30 percent reduction in misses at that threshold.  Now, that is a very aggressive threshold.  Probably most readers aren't at that threshold.  Fifty might be closer to where most people are at.  It goes from 20 percent down to 16 percent.  That's a 22 percent reduction in misses.

            Then finally if we imagine that 80 is sort of a higher-end threshold of what might be called a miss, there is still a 15 percent reduction in misses.  Now, these numbers are presented without confidence intervals, without p-values.  Take them with a grain of salt.  But in terms of understanding potentially the clinical importance, I think that maybe this may satisfy some of the desire to see a different number than just the delta Az.
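
            The reduction in misses is a relative change in the miss rate, which can be checked with a one-line calculation.  The function name is hypothetical, and because the rates quoted above are rounded, the values recomputed from them (about 31 percent and 20 percent) differ slightly from the quoted 30 and 22 percent.

```python
def relative_reduction(pre_miss_rate, post_miss_rate):
    """Fractional reduction in misses going from pre-CAD to post-CAD."""
    return (pre_miss_rate - post_miss_rate) / pre_miss_rate


# Threshold 20: 16 percent of positive quadrants missed pre-CAD, 11 percent post-CAD.
reduction_at_20 = relative_reduction(0.16, 0.11)
# Threshold 50: 20 percent missed pre-CAD, 16 percent post-CAD.
reduction_at_50 = relative_reduction(0.20, 0.16)
```

            The 1,125 positive readings correspond to 75 positive quadrants read by each of the 15 readers.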

            I also wanted to show you what happens if we look at the true positive fraction and we look at the false positive fraction in a way that is probably more similar to the way that a lot of academic studies are done where you look at the cases where you are most likely to see an effect on the true positive side and you look at the unambiguous nodule absent quadrants on the other side. 

            Here I really am throwing out quadrants.  As a statistician I hate to throw out data but I'm throwing them out just to get a clearer idea of what's going on here.  So if we are looking at the true positive fraction just for the smaller nodules, and I'm just using -- they are not really small. 

            I think a lot of people would define small as less than 4 or less than 3, but the intermediate-size nodules as a proxy for difficult to find nodules or easily overlooked nodules.  Then you can see that you get more of a rise in the curve without quite as much of a tradeoff early on in terms of the false positive fraction.  This is an analysis that was not included in our protocol.  It's just something that I added to try to get a little bit more understanding of what is taking place here.

            So the study conclusions.  Again, the study conclusions go back to the primary analyses that we did and the robustness analysis.  The study conclusions are that the ImageChecker CT improves reader performance for the detection of actionable nodules.  That was our objective and that's what we feel that we demonstrated.  And specifically the results are robust to the analytical methodology, to the choice of the reference truth.

            Again, it wasn't just looking at consensus and majority.  We looked at minority, majority, consensus, and sort of a super consensus.  Then it is also robust to the additional variation associated with selection of panelists.  I described identifying two random panelists.  We also did it with a single random panelist, with three random panelists and came up with very similar results.

            With that, I'll turn it over to Dr. Delgado. Thank you.

            DR. DELGADO:  Thank you and good morning.  I am Dr. Pablo Delgado.  I'm clinical associate professor of radiology at the University of Missouri, Kansas City.  I also practice at St. Luke's Hospital.  I'm here to describe the beta experience that we're involved with.

            First of all, I'll tell you a little bit about where I practice in the setting, where the beta site was performed.  I am at a private institution affiliated with the university.  We have a hospital setting as well as an affiliated imaging center adjacent to us.  We practice with residents available and we have an on-site residency training program of which I am the program director.

            Our patient base is quite varied and I think rather commonplace for the region.  It's a typical Midwest community base of private as well as community patients.  Our CT equipment for our radiology department, we currently have two four-channel multi-detector CT scanners which happen to be GE QX/i LightSpeed scanners, although I don't think that's of importance to this device as long as it's DICOM data and meets the collimation thickness.

            We currently perform anywhere between 20 and 30 CT studies a day of the chest for different diagnostic indications including CT pulmonary angiography, high resolution CT of the chest, detection of other lung diseases, as well as multi-organ disease workups.

            The beta study that we performed was between the times of June and August of 2003 for a total of eight weeks.  We processed numerous studies.  However, the goal of the study that we agreed upon and embarked upon was to assess the functionality of this ImageChecker CAD software and how we would work with it to answer the R2 developmental group questions about radiologist preferred reading practices as well as work flow issues of how this would be incorporated into our practice.  And to determine future applications of training needs in training radiologists in how to use this device.  It should be noted that we were not asked to assess the clinical effectiveness of the CAD system.

            The design of the system involved retrospective review of CT chest cases from our institution from previous months that have already been acquired and already been interpreted outside of the study and that met the collimation thickness which, I think, was already mentioned, 3 mm. or less and were contiguous slices of the chest. 

            The cases were read by faculty radiologists as well as residents so we got feedback from both experienced radiologists as well as radiologists presently in training.

            For the training of utilizing the device, we had an R2 application specialist on site for an entire day who got to work with most of the radiologists.  A few that were not available for that time were given the training subsequently by those who experienced the training from the application specialist.  That training process involved the description of the CAD algorithm, what indeed it does and what it doesn't with the review manual.

            We also reviewed several institutional cases.  First R2 had some cases of their own.  Then we through the DICOM hookup were able to push some of our cases to the R2 device and process them so they were our cases.  We also performed shadowing of retrospective reading sessions where the radiologists were able to work with the CAD device and subsequently ask questions if they felt that they were necessary or encountered any questions.

            Our observations from using the beta product demonstrated that most radiologists, in fact all, demonstrated a rather rapid learning curve for using the CAD device.  In a rather short period of time most people felt very comfortable in utilizing the product as is intended.

            We encountered no specific technical errors or malfunctions.  We had no difficulties.  We did, indeed, use it in the way it was intended and we asked radiologists to first look at the case in a soft copy reading mode and then subsequently push the CAD button and activate it and then review it immediately thereafter.  We found that all radiologists missed nodules that were detected by the CAD.

            There certainly are false CAD positive marks as Dr. Castellino pointed out.  However, most of these are easily dismissed by radiologists and that includes both faculty and residents.

            Of course, I would agree with the comments made by other panel -- excuse me, other presenters from R2 that we feel that radiologists definitely should review all images initially without CAD and then a subsequent read with CAD.  The reason for this is that CAD is not really made to detect every single nodule and, No. 2, the algorithm is such that it does not detect every single lung abnormality and radiologists are still responsible for detecting any lung abnormality.

            In conclusion, I think that this product is very timely in what radiologists are facing on a daily basis.  The development of multi-detector CT has led to an explosion, if you will, or significant increase in the number of images that are very detailed and radiologists are asked to interpret.

            Numerous published studies have already documented there are limitations in radiologists' ability to detect lung nodules.  I believe the detection really is the limiting factor of eventually determining actionability whether it is related to further diagnostic or therapeutic or interventional workups.  We found CAD to be an effective tool in assisting the radiologist in the detection of lung nodules with multi-detector CT.

            I will now reintroduce Dr. O'Shaughnessy of R2 Technology.

            DR. O'SHAUGHNESSY:  Thank you very much.  I just have a couple of summary slides kind of to bring it all together at the end.  I just wanted to reiterate the main conclusion from our clinical study for multi-detector CT exams of the chest, that the ImageChecker CT CAD software system significantly at a p-value of .003 improves radiologist ROC performance for detecting solid pulmonary nodules between 4 and 30 millimeters in size.

            And as both Mr. Miller and Dr. Castellino talked about and Dr. Wagner this morning, we feel that is a good measure for -- a reasonable measure for evaluating both a safety and efficacy aspect of the product.  Also from the safety aspect, the product is intended to be used as an adjunctive device and with appropriate training we don't think there are any issues there. 

            Just to summarize, I'll put up again the same slides of the proposed indications for use.  We thank you very much for your attention.

            DR. IBBOTT:  Thank you, Dr. O'Shaughnessy.

            We are going to have time this afternoon for detailed discussion of this presentation but let's take a few minutes now to see if there are any questions for the previous speakers or clarification that's needed.

            DR. STARK:  I have a few questions.  Other panelists, please jump in.  Dr. O'Shaughnessy, thank you.  By the way, it was a fabulous presentation.

            DR. O'SHAUGHNESSY:  Thank you.

            DR. STARK:  Very interesting subject and I think everyone is interested in seeing this technology succeed.  Certainly I am so forgive me.  Some of my questions are, I guess, by nature going to be -- are intended to be challenging.

            Mr. Miller talked about, as the panel did, the word significant.  Significance is a very loaded term.  Later on when we discuss the marketing materials and things like that, I'm worried about the pressures on radiologists to buy and use a technology and want to shift the significance to what really is clinically significant.

            In your presentation you pointed out -- I believe several of your experts pointed out that the real clinical problem is that we're missing about 24 percent of nodules or we are missing nodules at a significant rate.  I think it was something like 24 percent or something, perhaps you can refresh me, were seen in retrospect.

            One significant figure of merit here would be what fraction of those nodules that are missed, that 24 percent that are detectable in retrospect, are now detected with this technology given that the technology by itself has a sensitivity of about 50 percent for detecting majority and unanimous nodules and a 50 percent detection rate?  I'm just asking.  It's very, very low. 

            That would suggest to me that at best the technology is going to reduce that 24 percent missed rate to about a 12 percent missed rate at the cost of generating 100 percent false positives and then having a radiologist groom through and sort all this out by basically being told, "Do it again."

            I wonder, if we had a placebo in this FDA trial of, "Radiologist, just do it again," or, "Here is the sugar pill.  Just read it again," would we achieve the same presumptive 50 percent improvement in finding half of the lesions we know the current standard of care is to miss?
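
            The back-of-envelope bound in the question above can be written out as a short Python sketch.  It uses only the rounded figures quoted by the speaker (a 24 percent unaided miss rate and a 50 percent CAD stand-alone sensitivity) and assumes, purely for illustration, that CAD marks previously missed nodules at its stand-alone rate and that the reader accepts every true mark.  The variable names are hypothetical and none of this is a study result.

```python
# Rounded figures quoted in the discussion, not measured study results.
baseline_miss_rate = 0.24    # fraction of nodules missed by the unaided reader
cad_standalone_sens = 0.50   # CAD stand-alone sensitivity

# Best-case assumption: CAD finds missed nodules at its stand-alone rate
# and the reader accepts every one of those true marks.
best_case_miss_rate = baseline_miss_rate * (1.0 - cad_standalone_sens)
relative_reduction = 1.0 - best_case_miss_rate / baseline_miss_rate

print(f"best-case miss rate: {best_case_miss_rate:.2f}")
print(f"relative reduction:  {relative_reduction:.0%}")
```

            Under those assumptions the miss rate can at most halve, which is the 24 to 12 percent figure raised in the question.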

            DR. O'SHAUGHNESSY:  Right.  I would like to answer that sort of in two parts.  The first part I would like Mr. Miller to go over what we measured in our study and then have Dr. Castellino talk about translating that to the clinical environment if that's okay.

            MR. MILLER:  I guess there were a number of questions there.  Is there one you would like for me to start out with?

            DR. STARK:  I think you will do a great job.

            MR. MILLER:  Okay.  So the analyses that I showed at the end with the percent reduction in misses, or sort of approximated percent reduction in misses, were an attempt to get at that very issue.  I suppose that it is to some degree your job and, to some degree, our job to determine what is clinically significant.

            Now, the numbers that I showed you were sort of in the range of a percent reduction in misses of somewhere close to 20 percent.  Actually more like 20 percent on the low end.  That is similar to what the experience has been with CAD for mammography. 

            For CAD in mammography the percent reduction in misses has been in that range.  I think if you are a person that's affected -- I guess I'm drifting off from statistics here.  I should have handed it over to a clinician but, I mean, my hunch is that is a number that would be meaningful.

            As far as the stand-alone sensitivity, I do want to sort of bring us back to the fact that we evaluated two modalities here.  The two modalities that we evaluated were the readers stand-alone performance and the reader plus CAD.  The whole MRMC framework is developed around those particular modalities. 

            CAD as a stand-alone modality is not something that anybody is recommending that people use.  Therefore, those stand-alone numbers, I think, are less valuable but are more valuable if they pick up some of the more important things. 

            Also I think some of those things in the 4 to 10 millimeter range that readers react to and say, "Oh, I missed that.  I'm glad CAD pointed out."  It's more about what did CAD find than it is about exactly what the percentage is.

            DR. STARK:  Did you answer the core question of if the radiologist right now standard of care I would suggest, and clinicians can debate this, is that we miss a quarter of the lesions that are actually there in retrospect.  If we can accept that as a statement, then as you design the experiment, what data are there to suggest we would cut that miss rate and by how much?

            MR. MILLER:  Will you permit me to go back to the slide?  Sorry.  I'll get there soon.  Okay.  This, again, is presented as an analysis that was specified in the protocol that we would do, but you don't have confidence intervals there so these are numbers that you would want to put confidence intervals on if you were going to put a lot of weight behind them.

            Also, they make the presumption that readers all read with the same threshold cutoff and we know that's not the case.  At a threshold cutoff of 50, let's focus on 50 for just a second, there were 228 missed quadrants.  In other words, out of the total number of quadrants that the radiologist looked at, 75 positive quadrants times 15 so there are 1,125 times that one of the readers looked at a positive quadrant. 

            They gave a rating less than 50 about 20 percent of the time.  That is actually kind of a nice number because that number is not radically different from I think what we see in the literature.  It may be a little bit lower.  I think there's a little bit of a relaxed environment in the readings so that they may be a little bit more likely to identify things.  But in 20 percent of the quadrants something is missed.

            Post-CAD it goes to 16 percent so that's a 22 percent reduction in the misses.  That is, I think, the number that is closest to answering the question that you raised.  Is that correct?
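
            The arithmetic behind these figures can be checked with a short Python sketch.  The counts 228, 75 and 15 and the rounded 16 and 22 percent figures are taken from the testimony.  The post-CAD miss count was not stated, so the value derived below is only what the quoted 22 percent reduction would imply.

```python
# Figures quoted in the testimony.
missed_pre = 228                       # positive-quadrant readings rated below 50 pre-CAD
positive_readings = 75 * 15            # 75 positive quadrants x 15 readers = 1,125
post_rate_quoted = 0.16                # rounded post-CAD miss rate as stated
reduction_quoted = 0.22                # relative reduction in misses as stated

pre_rate = missed_pre / positive_readings

# Relative reduction if the rounded 16 percent is taken at face value.
reduction_from_rounded = (pre_rate - post_rate_quoted) / pre_rate

# Post-CAD miss count implied by the quoted 22 percent reduction (not stated in the record).
implied_post_misses = missed_pre * (1.0 - reduction_quoted)
implied_post_rate = implied_post_misses / positive_readings

print(f"pre-CAD miss rate:            {pre_rate:.3f}")
print(f"reduction from rounded rates: {reduction_from_rounded:.3f}")
print(f"implied post-CAD misses:      {implied_post_misses:.1f}")
print(f"implied post-CAD miss rate:   {implied_post_rate:.3f}")
```

            The small gap between the reduction computed from the rounded rates and the quoted 22 percent is consistent with rounding of the post-CAD rate.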

            DR. STARK:  I think so.  Let me see if I understand it and then I'll ask you about the effect on this analysis of the quadrant versus the lesion methodology.

            MR. MILLER:  Okay.

            DR. STARK:  I think that prejudices things in favor of the technology.  I'm not sure.  So you're saying if the standard of care currently is to miss a quarter of lesions, then of that 25 percent we'll miss one-fifth less so now we'll miss 20 percent of the lesions.

            MR. MILLER:  Yes.  There, a miss is defined loosely as you are not actioning a nodule that a consensus panel believes should be actioned.  I don't think that they are actually missing it in every case.  Sometimes they are giving it a low rating.

            DR. STARK:  Correct.  But as far as --

            MR. MILLER:  Yeah.

            DR. STARK:  You can debate the inference but the literature talks about a miss rate of 25 percent which we are going to equate with actionable nodules.  As we talk about the apparent efficacy of this, and I appreciate your honesty, is that we are taking a standard of care of a 25 percent miss rate that juries and patients think is horrible in retrospect and we are going to cut that to a 20 percent miss rate.  We can judge the -- that's the efficacy.

            MR. MILLER:  I should also add this is just based on jumping from one 50 to the other 50 on the curve.  We did another set of analyses based on what happens if you jump from 50 to the other point on the curve where you -- I'm sorry. 

            I should say jump from 20 from one point on the curve to the other point with the same PBD and jump from 20 to the same point without sacrificing the false positive fraction.  That also was a protocol specified analysis and the numbers go down a little bit.  I don't remember how much but it may be five or 10 percentage points.

            DR. CONANT:  May I interrupt or just jump in for a second because you are on the slide that I'm curious about.  You mentioned it's similar to mammography.  It is but it's so different.  I'm very interested in the by-case analysis of this compared to by quadrant.  The reason being I think you have a little bias in your case selection and I'm not sure if that is okay or not.

            You have the majority of your cases, 62 percent of the nodule present cases, as people with extra-thoracic disease.  I'm not sure I really care about the absolute number of quadrants you've missed because once you've got three nodules in both lung fields, who really cares?  It's metastatic disease so I would want to see these numbers by case. 

            I also think the comparison to mammography is very different because I think that, again, chest analysis is much more multi-focal and reflective of systemic disease than mammography in terms of a bilateral fairly somewhat independent process.  I would just like your comments on that if you could take this another step and then do it by case.

            MR. MILLER:  We did not do these analyses by case.  I suppose the data are there to do it.  I think the challenge with doing it by case is that the way -- I should let a physician get up here in just a second but the way that one would action a case where you had one lung where you had a very high likelihood of it being something bad, using my simple statistical language, and you had the contralateral lung where you had something that was probably bad.  That one that's probably bad may actually be the one that drives the care of the patient. 

            Figuring out how you sort of wrap this all up and do something like this at the patient level was something that was sort of beyond the scope of what I was able to imagine.  I absolutely do not disagree that it's something that would be useful to try to investigate in some way.  Having said that, I think I really need a physician to answer the question.

            DR. CONANT:  I'm not sure what the answer is, though.  However, in your cases it's very different if a person -- if you're looking for a primary lung carcinoma versus metastatic disease so they are very different clinical questions.

            MR. MILLER:  Yes.  Let me let Dr. Castellino answer that.

            DR. CASTELLINO:  I'm not going to answer any statistical questions.  I can guarantee you that.  It is hard to answer that question.  I would like to put it more in a clinical context of how we read cases every day. 

            I agree that if you have a patient with a soft-tissue sarcoma and you find three, four, five nodules, unless you are in a setting where you have surgeons who aggressively pursue that, as I was at Sloan-Kettering, at times it is important to find a sixth or seventh nodule.  There is a spectrum of surgical behavior.

            Let's assume that you find six or seven you don't have to find the last three.  We had very few cases like that.  The second thing is that we are not positioning this product as a lung cancer detection product, although it does work that way.  Patients with lung cancer who had a nodule, it was not necessarily the primary lung cancer.  They may have had lung cancer before treated post-op, post-radiation. 

            We accepted those cases and had a lung nodule in the lung for whatever reason so it wasn't really as a primary detection issue.  I'm not sure I answered that completely and I do recognize that certainly mammography is quite different, as I think we have discussed before, than chest CT.

            I would like to go back to a couple of comments you made.  If I understood you correctly, I think you said, Dr. Stark, that the issue was that we had a 50 percent sensitivity for consensus nodules.  As I recall from looking at that, I think, with consensus we were closer to 80 or 83 with the classic nodule definition.  I'm looking at the -- you'll see that later with Petrick.

            If you stratify those nodules with what would be more definition that radiologists would call classic nodule.  It ranges from 83 to 59 I think is the number.  Is that correct?

            DR. STARK:  We can study it but I'm trying to draw data from table 10.  When I suggested 50 percent, it was based on this so maybe over lunch you can --

            DR. CASTELLINO:  We can go through it.  I thought it was about 59.  But I think it's a good point.  We would love to have developed an algorithm, to be very honest, that was 100 percent sensitive but this is the best we've come up with so far.  I think the issue to me as a clinical radiologist is how would this affect me or my colleagues in practice to find more nodules that we look at a year later and say, "My goodness.  How did I miss that?  Why did I miss that?"

            The ROC study, to some extent, I think, approaches that.  I think this table here to some extent also would address that.  These are nodules potentially that could be missed or are missed that the radiologist would say, "I would have liked to have seen that nodule to make a decision as to whether or not it's actionable or not."  I don't know if I'm addressing the myriad of questions that you had but I would like to try to -- if you can rephrase some of them I would like to try to answer them.

            DR. STARK:  If the chair and the panel think we have time.

            DR. IBBOTT:  Let's wait until after lunch and we'll have that detailed discussion this afternoon.

            DR. CASTELLINO:  Can you write them out so I can think about them?

            DR. STARK:  I'm not sure of the protocol.  I'll ask for advice.

            DR. IBBOTT:  I don't think there is any reason why you shouldn't present those questions and let them think about them over lunch.

            DR. CASTELLINO:  That would be very helpful because there are a lot and I think they are important questions.  Thank you.

            DR. IBBOTT:  Again, I'll take this opportunity to ask Dr. Mehta if he has any questions that require clarification at this point.

            DR. MEHTA:  No, I don't.

            DR. IBBOTT:  All right.  Thank you.

            DR. SOLOMON:  Do we have time for any more questions?

            DR. IBBOTT:  Well, certainly.  Especially if it's appropriate now to get clarification on something before we break.

            DR. SOLOMON:  I guess I have a couple of questions for Dr. Delgado.  I guess they start off by asking you a little bit more about what your experience was with the system and then, more specifically, did you find that you as a radiologist or any of your colleagues were using the CAD system or becoming more dependent on the CAD system and not quite giving it the same kind of read that you would give ordinarily?  Also, what was the impact on the time that you spent on a case?  Did it make it longer or shorter?  Why don't you answer those.

            DR. DELGADO:  Okay.  Thank you.  I think those are good questions.  First of all, we did not do any time analysis with and without CAD or separate, just soft-copy interpretation and then soft-copy interpretation without CAD and then subsequently with CAD. 

            I think it goes to say that if you are doing the second review that there might be a time factor that would be slightly increased and that may be something to be quantified.  However, in my experience I think, first of all, the first question is people were instructed through the training phase that this device was to be utilized through a primary read in which you make decisions on whether you see or detect a lesion and then there is a way for you to mark it.  Then you activate the CAD and then you go through, as Dr. Castellino said, really not the whole entire study again but only those images that identified a lung nodule.  It might be on average three per case or so where you might click on a button and that would take you immediately to that axial's image and show you a lesion of which then the radiologist would make a decision, "Did I miss this?  Is this a significant mark that I would consider actionable?" 

            Or, if not, then easily discharge and be done with it.  If it was a mark that is considered a false positive, that would be discarded easily.  I think we did have a few of our radiologists which initially asked the question, "Well, is this benign or malignant?"

            Yet, we made sure and I as the principal doctor in charge of this made sure to remind them that this was not the purpose of this device.  It's really only to present you with a nodule that you may have missed and give you the ability to either add that to your findings or completely discard it.  Does that answer your question perhaps?

            DR. KRUPINSKI:  This will probably be more for Dave.  On point of clarification, you've got a quadrant and suppose the CAD during the initial view the reader says there's nothing there.  There really is a nodule and then the CAD comes up and points out the nodule and a false positive. 

            Now the reader increases their confidence and now do you consider that in the analysis and how can you be sure?  Do you consider that a true positive and an increase in behavior when, in fact, the radiologist was looking at the false positive?  Is there any way without localization to establish that?

            If you were then to take your cases and throw away any instances where the CAD marked a true and a false positive and the reader went from "false negative to true positive" what then happens to the ROC curves?  Admittedly, although you've got statistical significance, those curves are pretty darn close and you've got these ambiguous cases now.  How do you deal with that?

            MR. MILLER:  Well, the short answer is that we don't know precisely what happens in those instances.  It was not captured.  Bob Wagner talked about this best of both worlds scenario.  We really tried in the way that we did the study not to take the readers out of their normal reading environment. 

            We felt that was very important and so capturing additional data was something that we thought could take them outside of their reading environment and create some kind of placebo effect essentially.  We don't have that data on which one of the nodules or which one of the findings, I should say, which one of the CAD marks they are reacting to.

            Now, having said that, we did after we completed the ANOVA-after-jackknife analysis you can pull out from that analysis which cases are the ones that were most favorable in terms of producing a CAD effect and which cases are least favorable in terms of producing a CAD worse effect. 

            I sat down with a dozen or so of those cases with Ron Castellino, our chief medical officer, and went through them and said, "Is it obvious what they're reacting to here?"  In the overwhelming majority of the cases it was obvious what they were reacting to. 

            The number of marks per case is small enough that it is fairly unlikely -- I should say fairly.  The case where you have multiple close to positive findings in a quadrant is not very common.  It's common to have two in a quadrant but most of the false marks are very easily dismissable. 

            I mean, our engineers hate it when I say this but there are some vessels.  I mean, not a statistician I look at it and I say, "That's a vessel."  So the radiologist, it's really easy for them to dismiss those. 

            I guess the short answer is we did not do the analysis that you are suggesting but I completely take your point that it's important to figure out what was really going on in the ratings.  I think I have a pretty good feel for it that they were reacting to true positives.

            DR. KRUPINSKI:  So you rate them all as true positives?

            MR. MILLER:  Yeah.  I mean, the only thing that -- I mean, just from a programming perspective, the only thing that is fed into the analysis is the truth for the quadrants and the ratings.  Whether there were or were not CAD marks there is not actually in the analysis. 

            You could do an analysis that was more of a parametric model and a fixed effect model where you tried to capture whether it was the quadrants with CAD marks that were causing the increase, but I think it's reasonably obvious that they are, and in trying to model that it gets pretty messy building that on top of the models that we already did.

            Just while I'm up here, I did really quickly want to comment on the issue about the sensitivity, the back and forth about that table.  I think you were doing a weighted average of some numbers in a table and we'll come back to that later, I think. 

            The sensitivity number -- I mean, it's just incredibly variable depending on sort of which reference truth you use and so if you hear different numbers going back and forth, it's not necessarily inconsistent.  Two people may actually be both reading sort of off the same page but in a slightly different spot on the page.  Thanks.

            DR. IBBOTT:  Thank you.  At this point Dr. Stark has a couple of questions he's going to raise now to be discussed later this afternoon.

            DR. STARK:  Actually, it's a response to Dr. Castellino's question which I respect and it's fair.  I have been working very, very hard for this because, as we'll discuss later, I have spent 15 years wondering why my ROC based prediction that MRI for detection of liver cancer in 1985 was significantly better than CT.  That was wrong.  I think I know why and I think this group here, the industry group and the panel, I think, were at the nub of it.

            Dr. Castellino, rather than have us, given the formality and the importance of this, scratching on pieces of paper, I've asked the chair to allow me to read.  I've formed a question and I'm going to read it into the record and I'll give you my handwritten copy of what I'm going to read just so that we're clear on this.  Forgive me.  You've seen me scrambling over three minutes here.  If any of this is unclear, I'll rephrase it.  Thank you for offering to do this.

            Would you please calculate from the data and/or literature discussed or presented here today, and in your submission, the net decrease in false negative rate which we have here today estimated to be 24 percent for practicing radiologists working by themselves when those radiologists in the future, we're projecting, are to add this technology and these results, these data to their practice, specifically accounting for what Dr. Conant was just asking about, accounting for and not crediting as a detection or improvement with the addition of CAD those quadrants or patients as you compile the data where CAD marked a false positive lesion in a quadrant where the radiologist alone had a false negative.

            Where that radiologist, in other words, failed to recognize a true lesion false negative for the radiologist that was not subsequently marked by the CAD. 

            I have this written down.  I think that translates into English and I would be happy to clarify.  Feel free to grab me during lunch if there is some nuance of that that would make a better question.
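
            One way to read the calculation requested above is sketched below in Python.  It compares the post-CAD miss rate when every rating increase across the threshold is credited with the rate when credit is given only if CAD actually marked the true nodule.  The record layout, the field `cad_marked_true_nodule`, and the five example readings are all hypothetical.  As noted in the earlier testimony, the study did not capture which CAD mark a reader was reacting to, so this is a sketch of the bookkeeping and not of any analysis the sponsor performed.

```python
from dataclasses import dataclass


@dataclass
class PositiveQuadrantReading:
    """One reader's ratings of one truly positive quadrant (hypothetical layout)."""
    pre_rating: int               # 0-100 rating before CAD
    post_rating: int              # 0-100 rating after CAD
    cad_marked_true_nodule: bool  # hypothetical field, not captured in the study as described


def miss_rates(readings, threshold=50):
    """Return (pre, post_unadjusted, post_adjusted) miss rates at the given threshold."""
    n = len(readings)
    pre_missed = [r for r in readings if r.pre_rating < threshold]
    # Unadjusted: any pre-CAD miss rated at or above threshold post-CAD counts as recovered.
    recovered_all = [r for r in pre_missed if r.post_rating >= threshold]
    # Adjusted: credit a recovery only when CAD marked the true nodule in that quadrant.
    recovered_credited = [r for r in recovered_all if r.cad_marked_true_nodule]
    pre = len(pre_missed) / n
    post_unadjusted = (len(pre_missed) - len(recovered_all)) / n
    post_adjusted = (len(pre_missed) - len(recovered_credited)) / n
    return pre, post_unadjusted, post_adjusted


# Made-up readings for illustration only.
readings = [
    PositiveQuadrantReading(20, 70, True),    # missed, recovered on a true CAD mark
    PositiveQuadrantReading(30, 60, False),   # missed, rating rose without a true CAD mark
    PositiveQuadrantReading(10, 10, False),   # missed before and after
    PositiveQuadrantReading(80, 85, True),    # detected unaided
    PositiveQuadrantReading(90, 90, False),   # detected unaided
]
pre, post_unadjusted, post_adjusted = miss_rates(readings)
print(pre, post_unadjusted, post_adjusted)
```

            The difference between the unadjusted and adjusted post-CAD rates is the portion of the apparent improvement that the question asks not to be credited.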

            DR. IBBOTT:  All right.  Thank you.  At this point then, we'll call this session to a close and break for lunch and we will reconvene at 1:15, just a little less than an hour.  Thank you.

            (Whereupon, at 12:21 p.m. off the record until 1:18 p.m.)

         A-F-T-E-R-N-O-O-N  S-E-S-S-I-O-N

                                         1:18 p.m.

            DR. IBBOTT:  Could I get you to take your seats, please, and we'll continue.  Thank you.  I would like now to call the meeting back to order and I would like to remind public observers of the meeting that while this portion of the meeting is open to public observation, public attendees may not participate unless specifically requested to do so by the chair.  At this point Mr. Doyle has a statement to make.

            MR. DOYLE:  Yes.  R2 has approached me and indicated that they have developed answers to the questions that Dr. Stark proposed at the end of the morning session.  In an effort to keep the meeting moving with the schedule we have, I have asked them to present those answers at the beginning of the discussion section this afternoon.  They have the answers ready and I would just ask for the flow of the meeting to present those at that time.  Thank you.

            DR. IBBOTT:  Thank you.  We will now continue with the FDA's presentation on this PMA which will be introduced by Dr. Phillips.

            Dr. Phillips.

            DR. PHILLIPS:  Well, in case you forgot what we're doing over lunch, we are discussing the ImageChecker CT CAD by R2 Technology.  It is a system that analyzes and displays to assist radiologists in review of multi-slice CT exams of the chest and in the detection of solid pulmonary tumors.

            It is composed of several items.  It's a combination of software and a computer.  The system is a work station which is the ImageChecker CT Model LN-500.  This was approved for marketing under a 510(k) K023003, the software which is the operating system for the product that we are looking at today.

            Again, the indications for use, and I don't need to read those.  Then this was reviewed within FDA by a rather extensive team.  Michael Kuchinski was the team leader; William Sacks was the clinical reviewer; Teng Weng was the statistics reviewer; Robert Wagner and Nicholas Petrick reviewed the analysis methodology; Joseph Jorgens reviewed the software; Larry Stevens did bioresearch monitoring; Fleadia Farrah did the manufacturing, that's the quality systems regulation; and Ronald Kaczmarek reviewed it from an epidemiological basis.

            Two people will present to you today, Bill Sacks and Nicholas Petrick, discussing the PMA.  The other reviews were all found to be satisfactory and we are moving on from there.

            With that, Bill Sacks.

            DR. SACKS:  I apologize for the jaundiced look of that.  It wasn't so bad in the rooms we were testing this in.  Okay.  I'm going to just give some background.  Then Nick Petrick will present the data from the clinical study and then I'll come back and draw some conclusions.

            The outline of my introductory comments:  I'll say something about the character of the device, for those of you who did, in fact, forget over lunch; something about the clinical utility; a point about the instructions for use; and some issues that are new to this particular PMA.

            First on the character of the device.  Just to remind you, this is for chest CT scans and for CTs that are done for any indication.  The algorithm is trained to detect solid lung nodules, not, for example, ground glass opacities.  It is trained to detect nodules between 4 and 30 mm.

            Also there was a Hounsfield unit cutoff which is just CT numbers, the amount of radiographic attenuation that needs to be above -100.  In particular, this is a computer-aided detector.  Just to say a word about the difference between computer-aided detection and computer-aided diagnosis, a point I made earlier.

            The difference between detection and discrimination lies not in the instrument but in the clinical use to which it's being put.  The detector system, which is what we're talking about today, this left-hand column, scans entire images whereas a discriminator only scans portions that are selected by the user.  The detector marks the images where a discriminator will give a level of suspicion that is just a number.  As I say, the same device will do both but it is thresholded to give you marks when it's acting as a detector.
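
            The distinction just described can be illustrated with a small Python sketch.  It treats the device as producing a suspicion score per candidate, applies the stated target range (solid nodules of 4 to 30 mm with attenuation above -100 Hounsfield units), and thresholds the score to produce marks when acting as a detector.  The `Candidate` fields, the 0-to-1 score scale, and the threshold value are assumptions made for illustration and do not describe the actual algorithm.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    location: tuple      # (x, y, slice) position, illustrative only
    diameter_mm: float
    mean_hu: float       # mean attenuation in Hounsfield units
    suspicion: float     # hypothetical score in [0, 1]


def in_scope(c):
    """Target range stated in the presentation: 4-30 mm and above -100 HU."""
    return 4.0 <= c.diameter_mm <= 30.0 and c.mean_hu > -100.0


def detector_marks(candidates, threshold):
    """Detector mode: scan every candidate and emit a mark where the score meets the threshold."""
    return [c.location for c in candidates if in_scope(c) and c.suspicion >= threshold]


def discriminator_score(candidates, location):
    """Discriminator mode: report the score only for a region the user selected."""
    for c in candidates:
        if c.location == location:
            return c.suspicion
    return None


candidates = [
    Candidate((120, 88, 14), 6.0, 20.0, 0.91),
    Candidate((200, 150, 30), 2.5, 35.0, 0.80),    # smaller than 4 mm
    Candidate((90, 210, 41), 12.0, -450.0, 0.75),  # below the HU cutoff
    Candidate((64, 77, 52), 8.0, 10.0, 0.30),      # in scope but low suspicion
]
marks = detector_marks(candidates, threshold=0.5)
print(marks)
print(discriminator_score(candidates, (64, 77, 52)))
```

            The same scores feed both functions; only the thresholding step turns them into marks.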

            On clinical utility, as we've heard, many nodules are missed in clinical practice for two major reasons.  One, other pathology distracts and hundreds of images are present in one CT of the chest.  Indeed, you may start out as a board certified radiologist and after reading 500 images you are certified bored.

            A CAD is intended to reduce the missed nodules, this CAD.  That is, it is intended to increase the user's sensitivity to detecting lung nodules.  We will come back to this point.

            Instructions for use.  The important points are that the reader should review the films unaided first.  Then the CAD marks the candidate nodules.  Then the reader looks again in the vicinity of those marks. 

            If the CAD fails to mark a nodule that was judged actionable on the initial unaided review, the instruction in the labeling reads that the reader should retain that initial judgment, not back off just because the CAD failed to mark it.  We will come back to this in my closing comments.
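
            The reading sequence in these instructions amounts to a simple rule, sketched here in Python: every finding from the unaided read is retained whether or not CAD marked it, and CAD marks are added only when the reader reviews and accepts them.  The function and argument names are hypothetical and the labels are placeholders.

```python
def final_findings(unaided_findings, cad_marks, accepted_by_reader):
    """Combine the unaided read with CAD marks as the labeling describes."""
    # Keep every finding judged actionable on the unaided read,
    # even if CAD failed to mark it.
    final = set(unaided_findings)
    # Add only those CAD marks the reader looked at again and accepted.
    final |= {m for m in cad_marks if m in accepted_by_reader}
    return final


unaided = {"A", "B"}          # found by the reader alone
cad_marks = ["B", "C", "D"]   # CAD did not mark A
accepted = {"C"}              # reader dismisses D as a false mark
print(sorted(final_findings(unaided, cad_marks, accepted)))
```

            Under this rule CAD can only add findings, so the reader's sensitivity cannot fall below the unaided level as long as the instruction is followed.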

            Issues that are new to this PMA are the particular choice of target for the CAD algorithm, the definition of truth, the unit of analysis and endpoints.  I'll say something about each of those.

            First, on the CAD target, the target is not malignant nodules but actionable nodules as we've heard which, among other things, means that the definition of truth is not based on biopsy or tissue histology which would be an external standard, but rather based on the judgment of an expert panel that is an internal standard based on the very images that are being evaluated here.

            The unit of analysis, as we've seen, at one level the statistical unit is the person but it's further broken down into lung quadrants and Nick Petrick will say more about that.

            Finally, the end points.  One could do entire ROC curves as was done and one could, as Bob Wagner explained this morning, in addition, or instead of, do the sensitivity and specificity of a particular action recommendation which was not, in fact, done in this particular study.

            In summary, again, just to remind you, the clinical study consisted of three expert radiologists drawn from a group of 11 but three at a time on a panel to determine what was called by the company reference truth for each nodule.  Then there were 15 completely different radiologists with a range of experience, not necessarily experts, that were called the readers and they all 15 read all 90 cases and the 90 subjects were divided into 360 lung quadrants.  Those 15 readers used a 100 point scale for a confidence and actionability rating for each case.
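
            The size of the reading study follows directly from these numbers, as the Python sketch below shows.  The figure of 75 positive quadrants comes from the morning discussion.  The count of ratings per reading condition assumes one rating per reader per quadrant, which is an assumption about how the ratings were recorded.

```python
n_readers = 15
n_cases = 90
quadrants_per_case = 4

n_quadrants = n_cases * quadrants_per_case        # 360 lung quadrants
positive_quadrants = 75                           # quoted in the morning discussion
negative_quadrants = n_quadrants - positive_quadrants

# Assumes one rating per reader per quadrant in each condition (pre-CAD, post-CAD).
ratings_per_condition = n_readers * n_quadrants
positive_readings = n_readers * positive_quadrants

print(n_quadrants, negative_quadrants, ratings_per_condition, positive_readings)
```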

            Now I'll introduce Nick Petrick who will give you the clinical data.

            DR. PETRICK:  Okay.  So my name is Nick Petrick and I will go through -- let me see which one of these work.  I'll go through the clinical results that were done by the sponsor and some of our perspective.  The outline of my talk will be first to talk about the applicability of Az in the analysis.  Here I'm using the term Az which is somewhat more of a technical term but this is the same as the area under the curve or AUC.  Other people may call it area under the curve or AUC but I'm going to use that as meaning the same thing here.
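
            For readers less familiar with the term, the area under the ROC curve can be estimated directly from ratings as the probability that a randomly chosen positive unit is rated higher than a randomly chosen negative one, with ties counted as one half.  The Python sketch below uses that nonparametric estimate on made-up ratings.  It is a simplification swapped in for illustration and is not the sponsor's multi-reader analysis, which may have used a fitted curve.

```python
def empirical_auc(positive_ratings, negative_ratings):
    """Nonparametric area under the ROC curve (Mann-Whitney form, ties count one half)."""
    wins = 0.0
    for p in positive_ratings:
        for n in negative_ratings:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_ratings) * len(negative_ratings))


# Made-up 0-100 ratings for illustration only.
pos = [80, 65, 40, 90]     # ratings given to truly positive quadrants
neg = [10, 40, 30, 55, 5]  # ratings given to truly negative quadrants
auc = empirical_auc(pos, neg)
print(auc)
```

            A value of 0.5 corresponds to chance and 1.0 to perfect separation of positive and negative quadrants.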

            I will also talk about and somewhat review what the sponsor presented on the pool of cases used for the clinical study.  I'll talk about the definition of actionable nodules by the panel of experts.  Then I'll go into the particulars of the clinical study. 

            In particular, I'll talk about the primary analysis which was analysis using a fixed panel of experts and then what is somewhat of importance here, the secondary analysis which was the analysis using random panels of experts. 

            Then I'll finish up my presentation by talking about the measurement of CAD stand-alone performance.  When I'm talking about stand-alone performances this is the algorithm performance with no reader involvement.

            Okay.  So for the applicability of the Az here, I show one of the sponsor's curves for the average reader ROC from pre to post-CAD, and this had a change in the area under the curve of .024 and a p-value, as shown there, of .003.

            What's important to note about the applicability of the Az is that the green curve here is the pre-CAD and the reddish curve is the post-CAD.  And what we're looking for is that the two curves don't cross.  That is an important measure if we are going to use Az as an overall performance measure for ROC analysis.  What we find from this average curve is that generally the post-CAD curve is higher or on the same order as the pre-CAD curve.

            So just to summarize this, the pre and post-CAD curves did not cross in the average performance I showed before.  I think, more importantly, there was no substantial pre or post-CAD crossing in either the average or individual ROC curves.  This is important.  That makes the Az a statistically appropriate performance measure for this type of analysis.  If they had a significant crossing, we would have had to look at some sort of partial area or some other measure of performance in that situation.  Because of this conclusion the sponsor had used Az as a figure of merit in all their analysis that follows.
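
            As an illustrative sketch only (not the sponsor's or FDA's software), the two ideas above, Az as an area under the ROC curve and the check that the pre and post-CAD curves do not cross, could be expressed as follows.  The function names, the use of the empirical (Mann-Whitney) area in place of a fitted Az, and the toy ratings are all assumptions made for illustration.

```python
def empirical_auc(neg_scores, pos_scores):
    """Empirical area under the ROC curve (Mann-Whitney form).

    Equals the fraction of (positive, negative) pairs in which the positive
    item received the higher rating, counting ties as one half.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def curves_cross(tpf_pre, tpf_post, tol=0.0):
    """Coarse crossing screen (hypothetical helper).

    Both lists hold true positive fractions sampled at the SAME false
    positive fractions. The curves are flagged as crossing when post-CAD
    is above pre-CAD at some points and below it at others.
    """
    diffs = [post - pre for pre, post in zip(tpf_pre, tpf_post)]
    return any(d > tol for d in diffs) and any(d < -tol for d in diffs)


# Made-up 100-point ratings for three negative and three positive quadrants.
toy_auc = empirical_auc([10, 20, 30], [25, 40, 50])
```

            With these made-up ratings eight of the nine positive/negative pairs are ordered correctly, so the area is 8/9.  The crossing check compares the curves only at the sampled points, so it is a screen rather than a formal test.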

            Okay.  Now to talk about the pool of cases.  Again, just sort of a summary of what the sponsor had talked about before.  There is a pool of cases.  There was a subset of that which was made of nodule cases.  These were documented cancer cases so the primary neoplasm or extra-thoracic neoplasm with presumptive spread to the lungs.  That is the set of nodule cases.  The cases were allowed to contain non-nodule pathologic processes, things like pneumonia or emphysema and so forth were allowed to be part of that subgroup.

            They took another set of cases.  These were considered the non-nodule cases and what they term or what can be termed as normal cases where there was no nodule deemed present by the site PI and that site PI primarily relied upon original radiology reports in coming to that determination.

            These cases could include a history of cancer, radiation therapy, or even previous thoracotomy were allowed to be in this data set.  This is a pool of cases that now the sponsor will pull out cases to run their ROC reading studies from.

            At this point we're not going to talk about -- we are going to talk about actionable nodules or the object of interest in this application.  In particular, there is a panel of expert radiologists that identified the actionable nodules.  This was done in a two-stage process, again, just as a review as before.

            In the first reading the cases were read independently and blinded by three expert radiologists.  The information provided to the radiologists was the subject's age, gender, and indication for the exam, obviously along with the exam as well.

            Each individual radiologist marked all findings deemed to be lung nodules.  Then the radiologist provided ratings for each of those nodules so there is a detection test and then there's a rating of the actionability of that nodule.  It could have fallen into an interventional category.  That is an actionable finding where further workup was advised.

            A surveillance which is, again, considered an actionable finding which was monitored with follow-up studies and this would probably be more typically additional CTs.  Also, they could have rated as probably benign calcified.  Again, no action required here, or probably benign noncalcified, no action required.

            After the first pass was done, findings that lacked 100 percent consensus after that first pass were reviewed unblinded by all three radiologists and basically they are going to reevaluate locations where either two out of three of the panel or one out of three of the panel call the location a nodule.  Then the radiologists would rate or rerate these on the actionability of the nodule candidates.

            Along with this, thresholding was applied to match the general area where the algorithm should be performing, and so the thresholds were greater than 4 mm. in diameter for each nodule candidate and a peak density of greater than -100 Hounsfield units.  This is considered a CT number and is related to the attenuation coefficient in grayscales in the CT exam.

            Then after each nodule was identified, each lung quadrant was categorized based on the highest actionable finding within that quadrant.  Then subsequently the quadrants will be used in the observer studies.
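
            A minimal sketch of the quadrant labeling just described might look like the following.  The record layout, the category names, the numeric ranks, and the choice to drop findings that miss either threshold are assumptions for illustration; the transcript gives only the thresholds (greater than 4 mm, greater than -100 Hounsfield units) and the rule that a quadrant takes its highest actionable finding.

```python
# Higher rank = more actionable. The two "probably benign" categories are
# given the same rank here; that tie is an assumption, not from the study.
RANK = {
    "none": 0,
    "benign_calcified": 1,
    "benign_noncalcified": 1,
    "surveillance": 2,
    "interventional": 3,
}
ACTIONABLE = {"surveillance", "interventional"}


def passes_thresholds(finding):
    """Size and density thresholds quoted in the presentation."""
    return finding["diameter_mm"] > 4.0 and finding["peak_hu"] > -100.0


def categorize_quadrant(findings):
    """Label a lung quadrant by its highest-ranked eligible finding."""
    eligible = [f for f in findings if passes_thresholds(f)]
    if not eligible:
        return "none"
    return max(eligible, key=lambda f: RANK[f["rating"]])["rating"]


def is_positive_quadrant(findings):
    """A quadrant counts as positive when its label is actionable."""
    return categorize_quadrant(findings) in ACTIONABLE
```

            Under these assumptions a quadrant holding a 3 mm interventional finding and a 6 mm surveillance finding would be labeled surveillance, because the smaller finding falls below the size threshold.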

            Now, just to summarize what was found in that initial pass, again, this is three experts per panel.  I'll show in this column the unanimous actionable.  That's three out of three finding.  Majority actionable two out of three.  Minority actionable one out of three.  You can see that for unanimous actionable there was 142 findings.  For majority there were 168.  For minority there were 149 findings.

            This gives you somewhat of an indication that panel variability is an important component here.  There's a lot of cases, almost a third -- only about a third of the cases were unanimously actionable and another third or so were two out of three, and another third were one out of three.  This gave the FDA an indication that panel variability was an important component and probably should be taken into account in the clinical study.
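
            The "about a third each" remark can be checked directly from the three counts quoted above.  The snippet below is only that arithmetic; it treats the three groups as disjoint, which is how they are described here.

```python
# Counts of findings by level of panel agreement, as quoted in the talk.
counts = {
    "unanimous (3 of 3)": 142,
    "majority (2 of 3)": 168,
    "minority (1 of 3)": 149,
}

total = sum(counts.values())
fractions = {label: n / total for label, n in counts.items()}

for label, frac in fractions.items():
    print(f"{label}: {frac:.1%}")
```

            The three shares come out at roughly 31, 37, and 32 percent of 459 findings.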

            Now to go into the clinical study, there were multi-reader, multi-case ROC observer studies.  Again, the test statistic was the Az or area under the curve.  I'll present results based on analysis of the 90-case data set, 360 quadrants.  The sponsor also performed a 32-case study and also presented pooled results of the 32 and 90 cases.  I'll just limit myself to the 90-case study.

            What's important is that the MRMC allows us to look at the variability, confidence intervals, and significance testing and we can take those into account.  That is important obviously in this case to determine significance and then to try to get an idea of what the separation is between the reading without CAD and reading with the CAD device.

            In order to analyze the variability confidence intervals and significance two approaches were used, ANOVA-after-jackknife and bootstrap analysis.  So here is just the general flow chart to the clinical study and this will be followed for all the clinical studies.  The study starts out with a pool of readers.  These are going to be the group of radiologists that are going to actually read the cases and give rankings for each quadrant.

            There's a pool of cases and there's a pool of experts and the experts will be used to define truth.  There will be a sample pulled out of cases.  It will be used by the pool of experts to define nodules.  There will be a set of readers picked out.  Those cases will then be read using multi-reader multi-case ROC observer study and an estimate of the Az will be calculated.  This could then be redone for different case sets, different reader sets, and potentially different experts on a panel.

            So the important components here are how to measure the variability confidence intervals and do significance testing.  Again, two approaches were taken, ANOVA-after-jackknife analysis.  This is a parametric type of analysis and jackknife is a leave-one-case-out type of analysis. 

            Again, we're talking about leaving out a whole case so you're leaving out all four quadrants together and then performing a quadrant-based analysis on that.  So just as a quick example, if we had a case set of case one, two, and three, when jackknifing is performed or leave one case out, the first partition is going to be one and two.  We've left out case three.  The second partition may be set case one and three, case two has been left out. 

            The final partition would be two and three, leaving case one out.  Then using those partitions and looking at the pseudo values that come out of that you can use ANOVA to estimate the variability confidence intervals and significance.  The analysis assumes modality as a fixed effect and readers, cases, and all interactions as random effects in the ANOVA. 
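
            The leave-one-case-out partitions and the pseudo values can be sketched as below.  This is an assumption-laden illustration, not the sponsor's code: `statistic` stands in for whatever figure of merit is computed on a case set (Az in the study), and the ANOVA step that follows is not shown.

```python
def leave_one_out_partitions(cases):
    """Return every case set with exactly one case removed.

    The last case is dropped first, so for [1, 2, 3] the order is
    [1, 2], [1, 3], [2, 3], matching the spoken example.
    """
    return [cases[:i] + cases[i + 1:] for i in reversed(range(len(cases)))]


def jackknife_pseudovalues(cases, statistic):
    """Standard jackknife pseudo values: n * full - (n - 1) * leave_one_out."""
    n = len(cases)
    full = statistic(cases)
    return [
        n * full - (n - 1) * statistic(part)
        for part in leave_one_out_partitions(cases)
    ]


partitions = leave_one_out_partitions([1, 2, 3])
```

            In the study a "case" would carry all four of its quadrants, so removing a case removes four quadrant ratings at once.  When the statistic is a simple mean, each pseudo value reduces to the left-out observation itself, which is a convenient sanity check.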

            A second approach to doing this is bootstrap analysis and this becomes important to look at variability of the truth panel.  This, again, just to repeat, is a nonparametric analysis.  What happens is randomly generated data sets are created based on the original data using replacement.  Just as another quick example, with a case set of one, two, and three again, when you run bootstrap you use replacement, so the first partition may randomly pick case three, case two, and case three. 

            When you do the analysis you assume that case three and case three are really separate events and we bootstrap across those to get those potential partitions.  The second partition you may pick case three, case one and case two.  Here all the cases have shown up equally.  Then a third partition may be case one, case one, and case two and so forth.
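
            Sampling with replacement, as in the example above, can be sketched in a few lines.  The function name and the fixed seed are assumptions for illustration; which cases appear in each set depends on the random draw.

```python
import random


def bootstrap_case_sets(cases, n_sets, seed=0):
    """Draw n_sets resampled case sets, each the same size as the original.

    Cases are drawn with replacement, so a case may appear more than once
    in a set (and is then treated as separate events) or not at all.
    """
    rng = random.Random(seed)
    return [[rng.choice(cases) for _ in cases] for _ in range(n_sets)]


example_sets = bootstrap_case_sets([1, 2, 3], n_sets=3)
```

            Computing the figure of merit on each resampled set and looking at the spread of those values is what gives the nonparametric variability estimate.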

            So the primary analysis, again, the same basic diagram as before but now there's a resampling scheme introduced into the analysis.  The resampling is used for the pool of readers, again, the people that are going to -- the radiologists that are going to rank the quadrants and the pool of cases.

            The truth is based on a fixed three-member nodule definition panel, again, based on unanimous consensus.  The analysis will be based on ANOVA-after-jackknife.  Bootstrap analysis was also performed.  What happens here is a pool of readers goes in.  It's resampled so it picks out a subset of readers.  Likewise a subset of cases is selected using a resampling scheme.  The cases go into the definition panel where the panel is fixed and define the actual nodules of interest or the quadrants that are positive or those that are negative. 

            The set of readers are then randomly selected and go in and perform the ROC experiment.  That gives one estimate of Az.  This process is repeated either through jackknife or bootstrapping in order to get estimates for the variability and allow for confidence intervals and significance testing.

            So just the result of the clinical study.  Again, this is for a fixed three-member nodule definition panel.  In the first column I show the pre-CAD Az for both jackknife and bootstrap.  The second column is post-CAD, the change in the Az, the p-value for that particular test, and the lower and upper confidence intervals.

            You can see that the results are fairly consistent between both jackknife and bootstrap with a pre-CAD Az of .881 or .879, post-CAD increasing to .905 or .903.  With change on the order of .024 we see fairly small p-values for both the jackknife and bootstrapping.  Then the confidence intervals also fairly consistent.

            We wouldn't necessarily expect the bootstrap and the ANOVA to give us the same values but it's nice actually to see that there is consistency here between the two analyses.

            So just some conclusions on the primary analysis.  The sponsor has shown a statistically significant improvement in Az from pre to post-CAD and that is on the order of .024 or change in area under the curve.

            The ANOVA-after-jackknife and bootstrap analysis showed consistent performance in both significance and confidence intervals.  The analysis, however, was limited because it did not take into account any variation in the expert panel.   Variability of the panel would add uncertainty to the performance estimates, or we anticipate that variability in the panel would add uncertainty to the performance estimates. 

            This is, I think, an important factor because we don't have this gold standard of truth.  We are dealing with a panel truth.  We expect if we sampled a new panel, they may come up with a different set of cases.  They certainly would come up with some different nodules there. 

            One of the important questions is how would performance change with a different panel makeup.  That is one of the questions that we had talked to the sponsor about addressing.  In particular, looking at a different number of panel members so if you have a different panel makeup or a different definition of truth potentially and different sets.  What happens if another set of experts was used.

            So a secondary analysis was conducted here.  I'll show there are many different types of analysis done by the sponsor.  I'll concentrate on one set of random panel makeup.  This will be based on a random three, two, or one-member panel, nodule definition panels and assuming the definition for truth is unanimous consensus. 

            Because of this type of analysis the ANOVA-after-jackknife isn't applicable at this point so only bootstrap analysis is possible.  It follows a similar scheme as before.  We, again, start with a pool of readers, pool of cases, pool of experts.  Here, however, bootstrapping is applied to the pool of experts as well so that we have a different panel makeup for defining truth.  That adds variability into that definition of truth and we can use our MRMC ROC observer study to take into account that variability.

            So we use bootstrapping to select a group of readers, a group of cases, and a group of experts.  Again, with that particular combination we get an estimate for Az.  That study is repeated a number of times to allow again to look at variability where we have included variability of the truth.
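
            One replicate of such a random-panel resampling could be sketched as follows.  Everything here is a simplified assumption and not the sponsor's procedure: units are treated as independent items and not as quadrants nested in cases, the area is the empirical Mann-Whitney area, the panel is drawn as distinct experts, truth is unanimous agreement of the drawn panel, and all names and data layouts are hypothetical.

```python
import random


def auc(neg, pos):
    """Empirical ROC area: share of (pos, neg) pairs ranked correctly."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def one_replicate(rng, readers, cases, experts, panel_size, calls, pre, post):
    """Return the mean post-minus-pre area for one resampled data set.

    calls[e][c] : True if expert e called case c actionable
    pre[r][c]   : reader r's rating of case c without CAD
    post[r][c]  : reader r's rating of case c with CAD
    """
    rs = [rng.choice(readers) for _ in readers]   # readers, with replacement
    cs = [rng.choice(cases) for _ in cases]       # cases, with replacement
    panel = rng.sample(experts, panel_size)       # assumed: distinct experts

    # Truth for this replicate: unanimous consensus of the drawn panel.
    truth = {c: all(calls[e][c] for e in panel) for c in set(cs)}
    pos = [c for c in cs if truth[c]]
    neg = [c for c in cs if not truth[c]]
    if not pos or not neg:
        return None  # area undefined without both classes

    deltas = []
    for r in rs:
        a_pre = auc([pre[r][c] for c in neg], [pre[r][c] for c in pos])
        a_post = auc([post[r][c] for c in neg], [post[r][c] for c in pos])
        deltas.append(a_post - a_pre)
    return sum(deltas) / len(deltas)
```

            Repeating the replicate many times and examining the spread of the returned differences is what lets the variability of the truth panel enter the confidence interval alongside reader and case variability.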

            So, again, these are random three, two, and one member nodule definition panels.  When I'm talking three-member panels I'm saying unanimous consensus.  Three out of three have to agree.  When I get results for two members that means two members. 

            They both have to agree.  Obviously for one-member panel it is the opinion of one of the members.  The sponsors randomly sampled that panel so that we get the added variability from having many different experts involved.

            Again, the same layout here.  The pre-CAD Az, the post-CAD, the change, the p-value, and the lower and upper confidence intervals.  We can see from pre-CAD this measurement of performance was .845 increasing to .868. 

            For the three-member random panel a change of .022.  For a two-member panel it was .832 increasing to .854, again a change of about .022.  One-member panel .817 increasing to .838.  Again, a change of about .02.  This is .021 but very similar, .022 on average.

            We also see fairly consistent upper and lower confidence intervals for all different definitions of the truth.  Then we see the significance values which are fairly small as well.  That's sort of interesting because what I talked about before was that we expected when we incorporate randomness of the panel in here, we would see an increase or a decrease in the statistical significance that this would be a harder -- that it would be harder to show statistical significance. 

            Really we see similar p-values to what we saw when we had a fixed-member panel.  One of the possibilities or one of the trade-offs that may have occurred was something that Dr. Wagner talked about this morning where when the definition of truth is varied, we have also varied the case mix or the differentiation between negative and positive findings so we have now moved ourselves potentially more off the curve where we have a more closely balanced study which gives us effectively a larger number of cases or a larger number of effective cases. 

            That was traded off against the variation in the truth.  Those seem to potentially have traded each other off where we don't see a big difference in the performance.  This is one possibility.  It's certainly not conclusive in any way but it is somewhat surprising that we didn't see a larger variation in the truth when we randomize it.

            So just some conclusions on the secondary analysis.  This analysis took into account the random nature of the expert panel for defining actual nodules.  In particular, it took into account different numbers of panel members and different panel makeup using a bootstrap selection of the panel.

            All variations of the panel makeup confirmed a statistically significant improvement in the Az from pre to post-CAD and this change was on the order of .02.  And just a more general conclusion, this type of analysis where we actually tried to randomize the panel makeup is likely to be a more appropriate type of analysis for assessment of devices when panel truth -- when only panel truth is available.  That's obviously the case here but we can anticipate other devices potentially coming in where this will again be an issue.

            Finally, I would like to talk about CAD stand-alone performance.  In particular, this is a performance of the CAD algorithm alone and it's the algorithm's sensitivity and specificity with no reader involvement so we are just going to measure the performance of the algorithm on some set of cases or defined nodules.

            Why may this be important?  Well, it's generally important because the radiologist can use this information to appropriately weigh their confidence in the CAD marking so this is a measure.  If you are a reader or a radiologist trying to purchase this device, you generally like to know how it would work.  Or if you have the device to use, to get a feel for how it's performing and what it might be marking.

            Likewise, it potentially can be used as a benchmark for future revisions of the algorithm so as an FDA perspective knowing some benchmark of performance may help us to determine how to evaluate new revisions of this particular algorithm when it comes in.

            The question becomes what's an appropriate performance measure for this particular device and this isn't necessarily an easy question to answer.  Anecdotally the sponsor went back and looked at the unanimous three out of three fixed-member panel and looked at the appearance of the nodules that the radiologists marked.

            What they found was that many of those 142 findings did not meet the criteria of solid discrete spherical density.  They subsequently went back and reconvened a second panel to reevaluate the nodule but only based on appearance.  Not to find new nodules but just look at the appearance of those nodules defined.

            They put together a set of five independent radiologists and they were asked to categorize the nodules into two categories, either what they define as classic nodules.  These are discrete, solid, spherical ovoid nodules, or as nonclassic nodules.  These would be nodules that may not be discrete.  They may be hyperdense, irregular in shape.  They may be potentially normal structures that for whatever reason may not be considered nodules at all.  This new panel is only going to look at the appearance of the nodules and determine whether they are classic or nonclassic in appearance.

            This is the performance.  In the first column I'll show the number of panelists defining the nodule as classic.  Again, there was a total of five.  I'll just group together zero, one, and two out of five.  I'll give the number of findings.  The true positive fraction, the sensitivity of the CAD algorithm to that particular subset of cases.

            In general I'll just summarize the CAD false marker rate.  Then I'll give a final column to the median diameter of the true positives detected.  This is just to give an idea if there is any bias on the size of the nodule based on how many panelists defined it as classic.

            So in the first category less than three out of five there was a total of about 65 findings.  The sensitivity was on the order of about 32 percent.  For three out of five there was a total of 13 findings, sensitivity of approximately 70 percent.  Four out of five of the panelists saying this is classic in appearance the performance jumps up to about 82 percent.  All five the performance is about 83 percent.

            If you just combine all these findings together, there is a total, again, of 142 based on the definition of truth.  The sensitivity is on the order of about 59 percent.  The CAD false marker rate varied between two and three depending on whether or not the sponsor incorporated the equivocal nodules.  If you had a five-out-of-five rating, what you did with the zero, one, two, three, four out of fives, whether you included those or not as false positives, would change the median false marker rate but it's on the order of two or three per case.

            In the final column we see that this is a range of the diameter to those true positives.  You can see that it ranges from about eight to nine.  For the less than three out of five it was 7.4.  For three out of five it jumped up to about 11 and fell down to seven again.  The idea of this column is just to show there doesn't really seem to be a bias associated with how large the lesion was based on how they rated it as classic or not.

            Just as a final summary, if there were fewer than three out of five panelists, there were approximately 65 findings and the sensitivity was about 32 percent.  If it was three or more out of five, there were about 77 findings.  This is about half and half -- relatively close to half and half for the data set.  The sensitivity jumped up to about 81 percent.
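
            As a consistency check on the figures quoted, the overall sensitivity of about 59 percent follows from the two group sensitivities weighted by their numbers of findings.  The inputs below are the approximate values stated in the talk, so the result is itself approximate.

```python
# (description, number of findings, approximate CAD sensitivity)
groups = [
    ("about 65 findings, mostly rated nonclassic", 65, 0.32),
    ("about 77 findings, mostly rated classic", 77, 0.81),
]

total_findings = sum(n for _, n, _ in groups)
expected_detected = sum(n * sens for _, n, sens in groups)
pooled_sensitivity = expected_detected / total_findings

print(f"{total_findings} findings, pooled sensitivity {pooled_sensitivity:.1%}")
```

            The two groups add up to the 142 unanimous findings, and the weighted average lands near 59 percent, in line with the combined figure given above.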

            So just in summary for the CAD stand-alone performance, what was found by the sponsor was there was a large variation in performance of the CAD based on the physician's assessment of the nodule's appearance as classic.  Whether it was classic or not would make a big difference on how well the CAD performed.

            Just a note, generally the CAD -- the sponsors talked about the CAD being associated with these discrete spherical types of lesions and not necessarily some of the other types of lesions that were potentially marked.

            So just in summary for this part of the presentation, what the sponsor found was that the -- what we found was that the Az was an appropriate test statistic for the clinical analysis and this was based on the fact that there was no substantial crossing of the pre and post-CAD ROC curves.

            The primary analysis, this was based on a fixed three-member expert panel.  It showed a statistically significant Az improvement in the detection with the CAD.  What was also found was the ANOVA-after-jackknife and bootstrap showed comparable significance testing and confidence intervals.

            The secondary analysis, this was with a variable number of panel members where the sponsor varied the number of panel members.  They also varied the panel makeup using a bootstrap selection of the panel members so this is a random panel mix now.  This confirms statistically significant Az improvement in the detection with CAD. 

            Then, finally, for this CAD stand-alone performance what was found was that there was a large variation in CAD performance based on the reassessment of the nodule's appearance.  A more general conclusion from stand-alone performances is that this type of analysis is necessary for appropriate utilization of the device by the clinicians in the field and for potentially reassessment of future algorithm revisions.

            Now I'll turn it over to Dr. Sacks again to make some conclusions.

            DR. SACKS:  Okay.  I want to then draw some clinical conclusions about this statistically significant gain.  Granting the statistical significance of a gain in Az of .02, what is the clinical significance and this is a point that was discussed somewhat this morning.

            Let me recall for you an earlier slide that I have excerpted this from.  That is, that the clinical utility of this device is that the CAD is intended to reduce the number of missed nodules.  That is, it is intended to increase the user's sensitivity, not increase the area under the curve, although that is related.

            A gain of .02 in Az understates the relative gain in sensitivity.  Why is that?  When the CAD is used according to instructions to retain all judgments of actionability, even if unmarked by the CAD, the user always necessarily maintains or increases his or her sensitivity and, indeed, always maintains or increases the false positive fraction as well.  They both have to go up.  They could stay the same but that would be an extreme case that wouldn't likely happen, but they cannot go down either one.

            What that means in ROC space is that -- let me walk you through this slide -- the blue curve is intended to be a representation of the unaided initial reading.  The red curve is the aided reading.  We've been talking about the difference in area between under the red curve and under the blue curve.

            But if you talk about a particular operating point on the blue curve unaided and ask what happens when you use the CAD, you move to some point on the red curve and if you obey those instructions not to back off when the CAD fails to mark something that you thought was actionable, you necessarily move up and to the right somewhere in that quadrant such as this arrow here so you move to some point here.
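
            The "up and to the right" argument can be illustrated with a toy example.  The sketch assumes the reader keeps every unaided call and only adds calls prompted by CAD marks, which is the instruction described above; the function names and the seven made-up quadrants are hypothetical.

```python
def operating_point(decisions, truth):
    """Return (false positive fraction, true positive fraction)."""
    tp = sum(d and t for d, t in zip(decisions, truth))
    fp = sum(d and not t for d, t in zip(decisions, truth))
    n_pos = sum(truth)
    n_neg = len(truth) - n_pos
    return fp / n_neg, tp / n_pos


def aided_decisions(unaided, accepted_cad_prompts):
    """Aided call = unaided call OR a CAD prompt the reader accepts.

    Because unaided calls are never withdrawn, neither fraction can drop.
    """
    return [u or a for u, a in zip(unaided, accepted_cad_prompts)]


truth = [True, True, True, False, False, False, False]
unaided = [True, False, False, True, False, False, False]
accepted = [False, True, False, False, True, False, False]

pre_point = operating_point(unaided, truth)
post_point = operating_point(aided_decisions(unaided, accepted), truth)
```

            In this toy case the reader moves from (0.25, 0.33) to (0.50, 0.67): one extra true nodule is caught and one extra false positive is accepted, so both coordinates rise.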

            Now, Dave Miller showed you a number of representative arrows if you were to use a particular point on the rating scale on the blue curve and keep that same point on the rating curve -- on the red curve, the same rating, 80 or 50 or 20. 

            The problem is that radiologists while they could read by assigning a number to a study and always obeying a preset range for themselves saying, "If I assign any case 70 or more, then I am always going to act on it the same way. 

            If I assign between 40 and 70, I'm always going to act on it the same way.  If I assign under 40, I'm always going to act on it in the same way," then those points might be relevant.  Radiologists could do that but I'm a radiologist and I can tell you radiologists don't do that. 

            What they do do is they look at a case and they decide, "Do I act on this or do I not?"  Or if there is a trichotomy such as in mammography where there is biopsy or short-term follow-up or return in a year for screening, that is the decision you make.  That gives you an