DEPARTMENT OF HEALTH AND HUMAN
SERVICES
FOOD AND DRUG
ADMINISTRATION
CENTER FOR DRUG EVALUATION AND
RESEARCH
JOINT MEETING OF
THE ARTHRITIS ADVISORY
COMMITTEE AND
THE DRUG SAFETY AND RISK
MANAGEMENT
ADVISORY COMMITTEE
VOLUME III
Hilton
P A R T I C I P A N T S
Alastair J. Wood, M.D., Chair
Arthritis Advisory Committee:
Allan Gibofsky, M.D., J.D.
Joan M. Bathon, M.D.
Dennis W. Boulware, M.D.
John J. Cush, M.D.
Gary Stuart Hoffman, M.D.
Norman T. Ilowite, M.D.
Susan M. Manzi, M.D., M.P.H.
Drug Safety and Risk Management Advisory
Committee:
Peter A. Gross, M.D.
Stephanie Y. Crawford, Ph.D., M.P.H.
Ruth S. Day, Ph.D.
Curt D. Furberg, M.D., Ph.D.
Jacqueline S. Gardner, Ph.D., M.P.H.
Eric S. Holmboe, M.D.
Arthur A. Levin, M.P.H., Consumer
Representative
Louis A. Morris, Ph.D.
Richard Platt, M.D., M.Sc.
Robyn S. Shapiro, J.D.
Annette Stemhagen, Dr.P.H., Industry
Representative
FDA Consultants:
Steven Abramson, M.D.
Ralph B. D'Agostino, Ph.D.
Robert H. Dworkin, Ph.D.
John T. Farrar, M.D.
Leona M. Malone, L.C.S.W., Patient
Representative
Thomas Fleming, Ph.D.
Charles H. Hennekens, M.D.
Steven Nissen, M.D.
Emil Paganini, M.D., FACP, FRCP
Steven L. Shafer, M.D.
National Institutes of Health Participants (Voting):
Richard O. Cannon, III, M.D.
Michael J. Domanski, M.D.
P A R T I C I P A N T S (Continued)
Guest Speakers (Non-Voting):
Garret A. FitzGerald, M.D.
Ernest Hawk, M.D., M.P.H.
Bernard Levin, M.D.
FDA Participants:
Jonca Bull, M.D.
David Graham, M.D., M.P.H.
Brian Harvey, M.D.
John Jenkins, M.D., F.C.C.P.
Sandy Kweder, M.D.
Robert O'Neill, Ph.D.
Joel Schiffenbauer, M.D.
Paul Seligman, M.D.
Robert Temple, M.D.
Anne Trontell, M.D., M.P.H.
Lourdes Villalba, M.D.
James Witter, M.D., Ph.D.
Steve Galson, M.D.
Kimberly Littleton Topper, M.S., Executive Secretary
C O N T E N T S
Call to Order:
Alastair J. Wood, M.D. 5
Conflict of Interest Statement:
Kimberly Littleton Topper, M.S. 5
Naproxen
Investigator Presentation
Alzheimer Prevention Study: ADAPT
(Alzheimer's Disease
Anti-Inflammatory
Prevention Trial):
Constantine Lyketsos, M.D. 14
Additional Background Presentations
Interpretation of Observed
Differences
in the Frequency of Events When the
Number of Events is Small:
Milton Packer, M.D. 42
Clinical Trial Design and Patient Safety:
Future Directions for COX-2 Selective
NSAIDS
Robert Temple, M.D. 95
Issues in Projecting Increased Risk of
Cardiovascular Events to the Exposed Population
Robert O'Neill, Ph.D. 109
Summary of Meeting Presentations:
Sharon Hertz, M.D. 132
Sponsor Responses 140
Advisory Committee Discussion of
Questions 147
Question 1: 165
Question 2: 284
Question 3: 320
Question 4: 356
Question 5: 367
Question 6: 391
Question 8: 418
Question 7: 432
Meeting Wrap-up 438
P R O C E E D I N G S
Call to Order
DR. WOOD: Let's get started. This is our
third day and thanks to everybody for
coming back.
We have obviously entertained you
sufficiently.
Kimberly has a statement to
read.
Conflict of Interest Statement
MS. TOPPER: The following announcement
addresses the issue of conflict of
interest with
respect to this meeting and is made a
part of the
record to preclude even the appearance of
such.
Based on the agenda, it has been
determined that
the topics of today's meeting are issues
of broad
applicability and there are no products
being
approved.
Unlike issues before a
committee in which
a particular product is discussed, issues
of
broader applicability involve many
industry
sponsors and academic institutions. All special
government employees have been screened
for their
financial interests as they may apply to
the
general topics at hand.
To determine if any conflict of
interest
existed, the agency has reviewed the
agenda and all
relevant financial interests reported by
the
meeting participants. The Food and Drug
Administration has granted general-matter
waivers
to the special government employees
participating
in this meeting who require a waiver
under Title
18,
waiver statements may be obtained by
submitting a
written request to the agency's Freedom
of
Information Office, Room 12A-30, of the
Parklawn
Building.
Because general topics impact
so many
entities, it is not practical to recite
all
potential conflicts of interest as they
apply to
each member, consultant and guest
speaker. FDA
acknowledges that there may be some
potential
conflicts of interest but, because of the
general
nature of the discussions before the
committee,
these potential conflicts are mitigated.
With respect to the FDA's
invited industry
representatives, we would like to
disclose that Dr.
Annette Stemhagen is participating in
this meeting
as a non-voting industry representative
on behalf
of regulated industry. Dr. Stemhagen's role on
this committee is to represent industry
interests
in general and not any one particular
company. Dr.
Stemhagen is Vice President of Strategic
Development Services for Covance Periapproval Services,
Inc.
In the event that the
discussions involve
any other products or firms not already
on the
agenda for which an FDA participant has a
financial
interest, the participants' involvement
and their
exclusion will be noted for the record.
With respect to all other
participants, we
ask, in the interest of fairness, that
they address
any current or previous financial
involvement with
any
firm whose product they may wish to
comment
upon.
There is one administrative
announcement.
Would you please make sure that you take
your phone
calls outside. It is messing up with our audio and
we would really appreciate it. Thank you.
DR. WOOD: The other administrative thing
that the sound person has asked me to say
is, to
the committee, try and remember to switch
off your
microphones when you are not using them.
Apparently, it messes it up.
MR. LEVIN: Mr. Chairman?
DR. WOOD: Yes, Arthur?
MR. LEVIN: I wanted to express a concern
I have in terms of the agenda for today's
meeting.
For those of us who have been at advisory
committee
meetings before, we know that there is
often a
tendency to sort of squeeze the most
important part
of these advisory committee meetings
which is the
discussion and answers to the questions
and giving
directions to FDA.
My concern is that, given the
lengthy
discussions we have had over the past two
days and,
given the fact that this is the last day,
that we will
not have enough time to fully explore all
of the
questions that have been raised over the
last two
days and to give some definite direction
to the FDA
as to how to pursue these issues.
So I would like to suggest to
the group
that we might shorten the presentations,
or
eliminate them entirely, in order to have
adequate
time to fully discuss all of our concerns
and
different points of view around the
table. I think
it would be really unacceptable to leave
here today
unable, because of a time constraint, to
give
direction to the FDA on this issue.
DR. WOOD: Did you have any particular
people you wanted to eliminate? Or do you want to
pass me a note, privately?
MR. LEVIN: It may be something the
committee as a whole should decide.
DR. WOOD: Let me make a suggestion. I
think that is a reasonable approach. I am sure the
committee will want to hear the data from
the ADAPT
study and we should hear that in its
totality.
Milt Packer has come a long way so we
should hear
from him, I think. Milt is always entertaining,
anyway.
Do we really need to hear from
the two
Bobs?
DR. TEMPLE: I don't have any ego involved
in this.
A fair amount of--some of what I am
talking about is about the adverse
consequences of
blood-pressure elevation which I think I
could
skip.
So I could shorten it considerably.
But you
guys decide. It is there for you to read if you
want.
DR. WOOD: Why don't you do this. Why
don't you distribute your talk to us.
DR. TEMPLE: I think it has been.
DR. WOOD: Right; I understand that. I
will take that as a given. And both of you make
whatever remarks you would like to make
from your
seats there at the times that you are
allotted, but
brief and pointed. And let's not revisit all the
things we have visited before.
DR. TEMPLE: That's fine.
DR. WOOD: Does that sound fair? Dr.
O'Neill?
DR. O'NEILL: Yes; that is fine.
DR. WOOD: That will save us some time.
So that is a good thought. In addition, we have
got Sharon Hertz's talk which, I notice,
has
40-something slides here--45
slides--which is a lot
to get through in a few minutes. So I think, while
we are sort of working up to that, she
may want to
look at that and decide what she really
needs to
say.
I mean, after all, it is very unusual for the
FDA to summarize the meeting for the
committee,
which is partly what the committee is
here to do, I
guess.
So let's make sure that she can
finish
that taking the time she has been allotted for it
which is 30 minutes. She would be better to remove
some slides rather than rush through it,
I think.
Having said all that, let's get
to the
first presentation. Does anyone else have any
thoughts on that? Yes, Annette?
DR. STEMHAGEN: I would like to ask
whether the manufacturers could have just
one or
two minutes to make some summary comments
before we
start our deliberations after lunch.
DR. WOOD: Do they want to do that now?
Is that what you are asking?
DR. STEMHAGEN: No; I think after these
presentations.
DR. WOOD: Okay.
DR. STEMHAGEN: Thank you.
I appreciate
it.
DR. WOOD: Let's have some discussion
amongst the committee.
DR.
their having--they have had lots of time
already to
present their data and had lots of mike
time in the
back already.
DR. STEMHAGEN: Just in terms of the
deliberations that have gone on, there
might be
some clarifying comments.
DR.
we can ask for clarifying comments. I think that
is
what we--I would suggest--and I agree with
Arthur Levin in that we should get on to
discussion
as quickly as possible.
DR. STEMHAGEN: I realize this is sort of
in contrast to trying to shorten it. But I would like
to ask that that time be awarded.
DR. WOOD: Any other thoughts on that?
Let me get a sense of the committee. What is the
committee's pleasure about that? Yes?
DR. BOULWARE: I actually support that
recommendation, too, and would suggest
you give
them a limited time, like you did with
the public
comment where you will cut them off at
two minutes,
so we know it will be limited. I would be
interested in the direction they plan to
take. We
heard some startling news yesterday about
the
possible remarketing of a product that
they have
withdrawn.
DR. WOOD: Does anyone object to them
getting two minutes apart from Dr.
think, the answer on that is that that is
fine.
Remind them that, in contrast to most of
their
experiences in the past for senior
managers, the
microphone will be cut off.
DR. STEMHAGEN: Thank you very much. I
think we saw evidence of that yesterday.
DR. WOOD: Right.
So they got the
message; right? Okay.
Let's move along to the
first speaker, Dr. Lyketsos.
Investigator Presentation
Alzheimer's Prevention Study: ADAPT
DR. LYKETSOS: Good morning, everyone. I
do not have slides. My name is
Lyketsos.
I am a professor at
presenting here today on behalf of the
ADAPT study,
Alzheimer's Disease Anti-inflammatory
Prevention
Trial.
I would like to thank the committee for
inviting us to present. I am here today with my
colleague, Steve Piantadosi, who is also
on the
steering committee and will be available
to answer
any questions that might come up later on
as well.
I have a prepared statement
that will be
distributed to the committee later on
today. I
delivered it to the staff this morning as
I was
arriving.
Before I get into the
statement, I just
wanted to take a few moments to remind us
of the
public-health importance of Alzheimer's
disease to
somewhat set the context about how the
ADAPT trial
has started specifically. Alzheimer's, as we all
know, is a major public-health
problem. It is a
devastating disease, typically runs a
ten-year
course of neurodegeneration affecting
probably
close to 4 or 4-and-a-half million of our
citizens
at present and the number is expected to
rise given
the aging of the population over the next
several
decades to approach, perhaps, 12 to 15
million,
based on current projections.
Because of these
public-health
numbers, there has been a very
significant effort
in our field for the last several years
to develop
preventive strategies for Alzheimer's
disease
because, once neuronal degeneration has
started,
the evidence that treatments work, so
far, is very
weak.
These preventive strategies
have centered
on several possible treatments but the
most
supported by the observational literature
have been
nonsteroidals with over 24 studies right
now
including four prospective population
studies
suggesting substantial reductions of risk
of
Alzheimer's disease perhaps with risk
ratios, in
some cases, as much as 0.4 or 0.5. So it is within
that context that ADAPT was started with
the
support of the National Institute on
Aging.
I will move now to reading the
prepared
statement.
The steering committee of the
ADAPT study
welcomes the opportunity to present the
rationale
for its decision, on
the NSAID treatments in ADAPT. This presentation
is important because there is much public
misunderstanding about our decisions and
their
rationale.
The ADAPT Steering Committee is
deeply
committed to the safety of human
subjects, even
more so in the context of prevention
trials where
risks are typically not balanced by any
promise of
tangible near-term benefit. In this notable way,
prevention trials differ from treatment
trials
whose participants may hope for relief of
symptoms
or improved outcomes in a condition
already
diagnosed.
The risk:benefit balance in prevention
trials is even further removed from a
comparison of
the benefits of a proven treatment with
its
acknowledged risks. Because ADAPT has not quite
completed the process of auditing and
tabulating
the trial's cardiovascular safety data as of the
date of
suspension, we cannot, today, present the
trial
safety results at the time of the
decision to
suspend.
We defer that presentation to a
peer-reviewed publication planned for the
near
future.
For today, we note that, even with the
risk:benefit calculus of a prevention
trial, these
data would not, in themselves, have led
to our
decision to suspend either
treatment. In reality,
those decisions were made in very unusual
circumstances. They reflected events external to
ADAPT that raised strong concerns about
the
practicalities of continuing the
treatments.
As the advisory committee
probably knows,
ADAPT is a randomized, double-masked,
multicenter
trial of celecoxib, 200 milligrams twice
daily, or
naproxen sodium 220 milligrams twice
daily versus
placebo for the primary prevention of
Alzheimer's
dementia and for the prevention of
age-related
cognitive decline which is, in many
instances, a
prodrome of Alzheimer's disease.
ADAPT also provides an
opportunity to
study the long-term safety of its
treatments in a
healthy elderly population. Eligibility criteria
include an age of 70 years or older at
enrollment
and a health history that excludes many
of the
known risk factors for adverse events with
NSAID
treatments; for example, we exclude those
with
preexisting uncontrolled hypertension,
anemia or a
history of gastrointestinal bleeding,
perforation
or obstruction.
To provide independent
recommendations
regarding continuation of the trial, the
ADAPT
Treatment Effects Monitoring Committee,
or TEMC,
which, I suppose, is our term for a DSMB,
meets
twice a year. In response to emerging concerns
about cardiovascular risks with NSAIDs,
membership
of the TEMC was recently expanded to
include Dr.
Bruce Psaty, a physician with expertise
in
evaluation of cardiovascular risks in
clinical
trials.
As an additional safeguard for
participant
safety, the ADAPT study officers and
consultants
also conduct reviews of safety data at
intervals
between TEMC meetings. Amid the emerging
controversy about the cardiovascular
safety of
selective COX-2 inhibitors, the ADAPT
study officers
had been relatively reassured by their
periodic
reviews of the celecoxib safety data. The
study
chair communicated this information in a
telephone
conversation on
Hertz at FDA.
As of
suspension of treatments and enrollment
in ADAPT,
we had enrolled 2,528 participants. Of these,
2,463 had been randomized before October
1 of '04
with some 20 months average duration of
observation. These participants contributed a
total of 3,888 person years of follow up
to
analyses that were presented to the TEMC
on
Those analyses suggested a weak
signal of increased risks of
cardiovascular and
cerebrovascular events with
naproxen. Reviewing
the data, however, we understood well the
TEMC's
evident conclusion that this signal was
not
sufficiently compelling or definitive to
warrant a
recommendation to suspend the treatment
or to
otherwise alter the protocol. This was on December
10, 2004.
Thus, the study officers were
surprised on
December 17 by announcements that two
trials of
celecoxib for the prevention of recurrent
adenomatous colon polyps had been
suspended citing
increased cardiovascular risks with
treatment in
one of these studies, the Adenoma
Prevention with
Celecoxib trial, or APC. This news led to
extensive discussion among the steering
committee
on that day centering on the following
considerations.
Number one; one arm of the APC
trial had
used the same celecoxib dosing as ADAPT,
200
milligrams twice daily, but over a longer
period of
time.
News reports cited a relative risk of 2.5
for cardiac events in this arm of
APC. Although
this risk was reported as only
"marginally
significant," a greater cardiac-risk
signal was
reported with the higher APC dosage of
400
milligrams twice daily.
Thus, we took seriously the
possibility of
harm over time to ADAPT participants
receiving
celecoxib. Especially in a prevention trial with
no strong prospects of immediate benefit,
we had
strong misgivings about continuing
celecoxib
treatments.
Number two; knowing almost nothing at the
time about the particulars of the APC trial and, in light of
the apparent lack of risk with celecoxib
in the
other prevention trial, we might have
discounted
the APC data and continued
celecoxib. To do so,
however, we would clearly have needed the
concurrence of the seven IRBs that
oversee ADAPT.
These IRBs began almost immediately to
question us
about implications of the APC results and
seemed
likely to question a decision to
continue.
Even if we had persuaded them
to permit
continuation of celecoxib using a revised
consent
process, we would surely be involved in
lengthy
discussions with these IRBs. In the meantime, we
would be unable to offer much explanation
to our
participants, thereby endangering the
relationship
of trust that is vital to the success of
long-term
trials.
Number three; as is common in
long-term
trials, ADAPT was experiencing some
difficulty with
adherence to treatments. This difficulty grew
following the withdrawal of rofecoxib and
we
expected the announcement of the APC
results to
exacerbate the problem further with
scores of
participants stopping treatment, in
effect, "voting
with their feet." This would erode statistical
power and increase the potential for bias
in ADAPT.
Thus, even though the ADAPT
safety data
did not, themselves, warrant suspension
of
celecoxib treatments, there seemed little
practical choice but to do so.
We next confronted the dilemma
of what to
do about naproxen and its placebo. As suggested
above, we regarded the accumulated
naproxen safety
data as being somewhat more concerning
than the
celecoxib safety data. Yet, they, also, were not
compelling. Although some post hoc data composites--these are
post hoc data composites--barely reached statistical
significance for naproxen versus placebo, no
single vascular event was clearly more frequent
with naproxen versus placebo.
Furthermore, vascular risks
were not
expected with naproxen treatment. In fact, a
substantial body of prior data at the
time had
suggested that naproxen offers some
cardiovascular
protection. This lack of prior expectation cast
further doubt on the meaning of the
naproxen data
in ADAPT which were vulnerable, in any
case, to the
problem of multiple comparisons.
We could, therefore, have
attempted to
have revised ADAPT to a two-armed trial
of naproxen
versus placebo, instructing our
participants to stop
taking their "white pills," as they are known in
the study, which are celecoxib and its
placebo, but
continue to take their "blue
pills," which contain
naproxen and its placebo.
However, the dangers were
several.
Participants might end up getting confused
and
taking the wrong pills and many would
stop taking
their treatments altogether. We faced an ethical
dilemma.
The suspension of celecoxib and
continuation of naproxen would have
created the
impression among participants and among
the general
public that celecoxib was risky but
naproxen was
"safe." At least based on the signals from the
ADAPT data, this impression would have
been
misleading.
What would we then tell participants about
the risks with naproxen as we led them through the
inevitable process of revised consent necessitated
by the protocol revision? Would the multiplicity
of IRBs even allow us to follow this
course?
Finally, there was another risk
to
consider.
We began ADAPT expecting to see some
increase with naproxen in
gastrointestinal bleeding
and other events. Even though we attempted to
reduce these excess G.I. risks by
excluding
participants with prominent risk factors
other than
age, the ADAPT data showed a notable
increase in
G.I. bleeding with naproxen versus
placebo.
Especially amid concerns that
ADAPT was
exposing its participants to potential
risks that
were immediate, while the trial's
hoped-for
benefits lay in the future, the totality
of the
above arguments led the steering
committee to
suspend both treatments and to also
suspend
enrollment into ADAPT.
As noted above, we expect,
within a few
weeks, to submit a scientific paper for
peer review
and publication. The paper's focus will be on the
process and rationale underlying the
decision to
suspend treatments and enrollment in
ADAPT.
Because these decisions did rely, in some
measure,
on the ADAPT safety data as of 10
December, the
paper will, also, disclose some of these
data.
We are also cooperating with
ongoing
efforts at the NIH to investigate the
cardiovascular and cerebrovascular risks
of NSAIDs.
In addition, the NIA and the ADAPT
Steering
Committee are committed to a further two
years of
additional safety monitoring of our
participants.
In preparation for a later,
more
definitive discussion of the ADAPT safety
data, we
plan to revisit a number of the adverse events to
collect additional information and then
to submit
all information available now or later to
a process
of expert adjudication. Depending on particulars,
the latter process will take months. In the nearer
term, we concur with the expert opinion
that,
having taken these widely publicized
decisions, the
steering committee must fulfill its
obligation to
disclose its reasons for doing so based
upon the
data available.
At the same time, we are intent
that our
public presentation even of the current
"working"
data must be at the highest attainable
standards of
accuracy.
Thank you.
DR. WOOD: Thank you very much. Are there
questions directed to the speaker? Dr. Nissen?
DR. NISSEN: I fully understand your
rationale and I understand that the trial
was
fundamentally stopped because of an issue
of
futility.
You didn't think that you could keep
people in the celecoxib arm. That is all well and
good.
The problem that occurred here is that a
warning was issued on naproxen which had
the effect
of being the medical equivalent of
screaming "fire"
in a crowded auditorium.
All over the country, many of
us got calls
from patients saying, "I want to
stop my naproxen
because it causes a cardiovascular risk." I think,
just a comment here, that it would have
been far
better to have announced that the trial
was
suspended for futility rather than for
hazard when
there was a non-statistically significant
hazard.
So, one man's comment.
DR. WOOD: I agree with that. Any other
comments?
Yes?
DR. FARRAR: I wonder if you could comment
on the G.I. bleed component since,
obviously, one
of
the deliberations we have to undertake is the
relative problems with G.I. bleed versus
cardiovascular risk. Certainly, that was known a
priori before starting the study.
As you commented very
carefully, that
wasn't the only consideration. But, in a drug
trial where the outcome is unknown and
the risk is
really fairly well known, I wondered how
you
thought about that in terms of putting
patients at
risk of something on the order of a few
percent
over the course of a five-year trial who
might have
serious complications from the G.I.
bleeding.
DR. LYKETSOS: I guess you are asking me a
human-subjects question.
DR. FARRAR: I am asking how, in the
design of the study, obviously the choice
was made
to accept that risk for the unknown
potential
benefit of reduction in Alzheimer's
disease over
the course of the same trial. I am wondering if
you have any insights into how that
decision was
made because, clearly, there are issues
there about
the use of these drugs and their risks.
DR. LYKETSOS: Well, I am glad you are
asking the question. It certainly is an issue that
we have spent a lot of time discussing
and which we
discussed with study sections, IRBs, at quite
some
length and continue to discuss.
I think the fundamental point
that I would
start with is where I started my
presentation which
is the devastation that Alzheimer's
disease brings
and the fact that all the study
participants were
individuals who had a first-degree
relative with
the disease and had, therefore, personal
experience.
In that context, we were very
careful and
very clear with them about what we thought
at the
time the known G.I. risks were so that,
in the
process of consent, and that was revealed
through
careful discussions in the consent
process as well
as the consent form, the risk of G.I.
bleed was
stated very clearly and that that, in
some cases,
might lead to death.
So I think we felt that this
was a
decision that our participants could
make, given
that the risks were relatively small, and
the risk
that they would develop Alzheimer's
disease was
higher and that we felt they could make
the
decision for themselves if they were
willing to
take the risk:benefit calculus as we saw
it.
DR. WOOD: Dr. Gibofsky?
DR. GIBOFSKY: I share Dr. Nissen's
concern about this effect of crying fire
in a
crowded theater. Many of our patients called and
suggested that they were going to stop
their
celecoxib because of the concerns that
were raised
from ADAPT as well. But you raised a very
interesting concern that I confess I
hadn't given
enough thought to and that is the
difference
between a prevention trial and an outcome
trial.
Much of our discussion here
later today, I
suspect, is going to focus on what action
should be
taken, if any, to restrict drugs used for
treatment based on data from prevention trials. I would be very
curious to hear you expound on that a bit
more.
DR. LYKETSOS: That is an interesting
question.
Let me just, if I could, because there
have been three comments now--I just
would like to
refer you to the early part of my
statement where I
said the presentation is important
because there is
much public misunderstanding about our
decisions
and their rationale.
Several of you pointed out that
there was
a cry of fire. I don't believe that that came from
the study.
DR. WOOD: We won't ask you to speculate
where it came from. There is certainly a view on
that.
DR. LYKETSOS: I am not sure where it came
from.
But, to address the other issue, I must say
I have not given it much thought as to
whether
prevention-trial safety data would
generalize in
the way that you are thinking about
it. So I will
defer on that because I think it would
need a fair
bit more thought by people who are more
expert in
that.
DR. WOOD: Dr. Fleming.
DR. FLEMING: It is my understanding, from
what you are saying, that the steering
committee
was particularly influenced by the APC
prior data
not by the internal data from ADAPT;
i.e., there
were, from what you were describing, some
emerging
trends that, in my words, were in the
unfavorable
direction but in the context of
monitoring trials,
we know that one has to be extremely
cautious, when
you are looking at data continually over
time, not
to overinterpret emerging trends that can
easily
ebb and flow.
So my understanding, from what
you are
saying, is it wasn't that there were, at
this
point, some emerging trends that happen
to be in
the unfavorable direction on
naproxen. Rather, it
was the external data on the APC trial
for Celebrex
that was the driving issue behind the
recommendation.
DR. WOOD: Just to develop that question,
what I understood you to say was you
hadn't passed
some stopping boundary; is that correct?
DR. LYKETSOS: I'm sorry?
I didn't hear
the first--
DR. WOOD: You hadn't violated your
stopping rule, or whatever stopping
rules, you had
for safety.
DR. LYKETSOS: I think that our TEMC, our
DSMB, had opined the week before with the
same data
from within the trial that they felt that
we should
continue.
So it was interesting how the two events
were back-to-back.
DR. FLEMING: I would like to come to that
second.
I am leading to that. But first I
wanted
to make sure that I understood what was
the nature
of the concern. Is my interpretation correct?
DR. LYKETSOS: I think so.
Back to how I
put it, the issue really was one of
practicalities
more than our internal data, is that we
felt we
would have to talk to IRBs and
participants and
tell them something about--
DR. FLEMING: Could I first understand
what your sense of the evidence was. I want to
discuss that first, versus the
practicality.
DR. LYKETSOS: The sense of the study
evidence.
DR. FLEMING: The sense of the evidence
that was the basis for the decision in
terms of
adverse effects. I have heard two things. One is
the naproxen, but that was not compelling
evidence.
That was within the framework of emerging
results
that could be by chance alone when you
are
monitoring data frequently. But external APC data
was very influential to you. That is what I am
hearing.
Is that correct?
DR. LYKETSOS: Well, in fact, we didn't
know all the details of the APC data, as
I pointed
out.
I think it was that plus the climate that had
been created by rofecoxib coming off the
market,
the influence that that had to some
extent on our
participants, then the widely publicized
APC
results and the sense that, even though
the data we
were seeing and that our TEMC the week before had
seen, did not compel us to stop treatment
based on
our own data, that there was now a
climate created
where, practically speaking, we had to
stop and
take stock and get more information, et
cetera.
So it was that sort of
decision. It
was a complicated decision and that is
why it takes
a three-page statement to try and explain
what went
through our minds.
DR. FLEMING: There may not have been, to
the steering committee at this time,
access to data
on PRECEPT for celecoxib or to the
etoricoxib, the
lumiracoxib, data on naproxen that were
very
favorable, but you did have access to the
VIGOR
data which was very reassuring for
naproxen and you
had evidence from the CLASS trial and
some other
data from Celebrex.
I am perplexed that you would look
at the
totality of these data and say that the
results
were conclusive in terms of at least not
being able
to provide information to the IRBs and to
the
patients and caregivers in the trial
representing
the totality of the data when your
data-monitoring
committee had looked at the totality of
the
evidence for benefit to risk.
On a data-monitoring committee,
I have
always argued, don't just show me the
safety data,
even if we are just looking at early
assessments
for safety. It always has to be benefit to risk.
Even though, as you are pointing out,
this wasn't a
therapeutic setting, prevention trials
also provide
major opportunity for benefit. Preventing major
diseases is also a very significant
benefit.
My understanding is your
data-monitoring
committee, in looking at the data,
looking at the
benefit as well as the risk, indicated the
study
should continue. How did the steering committee
judge, without access to ongoing data,
that benefit
to risk couldn't be sufficiently
favorable and that
a notification to the investigators, to
the
patients and to the IRBs, that the
monitoring
committee has carefully looked at benefit
and risk
and that the totality of the data is
beyond the APC
trial when you are looking at Celebrex
and
naproxen?
Why wasn't that strategy pursued?
DR. LYKETSOS: First, as I pointed out in
my statement, some members of the
steering
committee did have access to the data
that the DSMB
had seen.
That is the first point. The second
point is, as you point out and as I think
this
whole discussion points out, is these are
very
difficult judgment calls. They have to take into
account evidence but also practical
aspects of
continuing to conduct this sort of a
prevention
trial in this sort of a population.
I think it was the judgment
call, and I
can tell you, there was substantial
discussion
around this when we had the steering
committee
meeting, about these very issues. It was the
collective judgement at the time that
this was the
right thing to do, given the various
issues that I
have articulated in my statement.
DR. FLEMING: I will just pursue
one more.
I am dismayed to hear the steering
committee, some
steering committee members, had access to
the data.
That is also a violation of the
principles of
monitoring trials. It should have been in the sole
possession of the data-monitoring
committee.
I am also distressed because I
am not
hearing that the monitoring committee was
front and
center in terms of having these issues
brought back
to
it for reassessment. So, to me, what I
am
hearing raises very significant concerns
about
putting at risk the integrity of studies
with
prejudgments using only access to partial
external
information.
DR. WOOD: There was one other thing,
though, at least the word on the street
was, and
you sort of mentioned that as well, I
understood
there was a very large number of dropouts
from the
trial after the Vioxx withdrawal and
others and
that one of the perceptions was it was no
longer
possible to continue the trial. Is that true?
DR. LYKETSOS: Let me clarify that. The
adherence had been declining on an annual
basis
even before rofecoxib was withdrawn from
the
market.
So adherence was perceived as an issue in
that we felt that now there were data
about one of
the study drugs and that that would
further erode
adherence. We did not see a huge erosion in
adherence with rofecoxib, specifically,
but there
had already been an erosion that was
concerning and
we anticipated a further erosion.
DR. WOOD: Right.
But the question for
this committee that Dr. Fleming is
pursuing
vigorously, and I agree with him, is that
the
announcement that you all made--the
announcement,
as it was picked up--maybe I should put
it like
that--was that this trial was being
stopped for a
safety signal.
What I heard in your statement
and what I
hear from you now is that the trial was
being
stopped for operational problems in the
trial and
the safety signal was a convenient moment
at which
to do that. But you had operational difficulties.
That is a very different interpretation
and a very
different interpretation for the public
and
patients.
Is that what you are hearing,
Tom?
DR. FLEMING: It certainly appears to be.
It is part of what is concerning to me.
DR. LYKETSOS: I think my statement should
speak for itself. In terms of what the data were,
as I have pointed out, they will be
submitted very
soon so that you can judge for
yourselves.
DR. WOOD: Okay.
Any other questions?
Sorry; Dr. Farrar. I beg your pardon. Dr. Farrar,
go ahead.
DR. FARRAR: I think, actually, that this
study provides some vitally important
information
with regards to our consideration of the
entire
class of drugs; namely, the NSAIDs. I would like
to just read one sentence from the
statement.
It said, "Although some
post hoc data
composites barely reached statistical
significance
for naproxen versus placebo." Now, clearly, this
discussion would be much clearer after
the
presentation of the data, a careful
review of the
data.
But Dr. Fleming noted that, in the VIGOR
study, there was some reassurance about
naproxen.
I would like to just question that.
What is very clear in the VIGOR
study is
that naproxen was safer than
rofecoxib. But it
does not comment at all with regards to
the
potential risk compared to placebo. In fact, I was
surprised when I heard the statement by
Dr. Fleming
because, in fact, I have assumed, based
on all the
data that we have, that every NSAID will
not fare
well against a placebo.
I think that this data, and
probably will
be supported by the publication although
I don't
want to try and foresee the future, but
my guess is
that naproxen will not fare particularly
well
against placebo in terms of its
cardiovascular
safety.
I think we need to be able to accept the
fact that all of them have some risk with
regards
to cerebrovascular disease and this study
is likely
to provide the data to support that.
DR. WOOD: Dr. Nissen?
DR. NISSEN: I don't want to belabor this
because we have got a lot more to discuss
today,
but I think it is extremely important
that, as a
medical community, we learn from this
episode. In
the kind of media frenzy that was going
on during
that period of time, this announcement,
this
warning that was issued on a national
basis about
naproxen, was inappropriate, led to some
panic
amongst the public and we simply can't do
business
this way.
We can't operate in this kind
of a
fashion.
I would urge any of the individuals who
were involved in the decision to issue a
warning to
go back and look at what happened and try
to ensure
that we don't do this sort of thing
again, because
once this gets picked up by the media, it
passes
through generations of people and becomes
the topic
of extensive discussion and may lead
patients who
don't have the ability that we have
around this
table to filter data--they don't
understand
data-safety and monitoring boards. They don't
understand stopping rules. And it caused a panic
that was unnecessary and it shouldn't
have
happened, and I hope it doesn't happen
again.
DR. WOOD: Thanks very much. Let's move
on to the next speaker, Dr. Packer.
Additional Background Presentations
Interpretation of Observed Differences in the Frequency of Events
When the Number of Events is Small
DR. PACKER: Thank you, Alastair, members
of the advisory committee, FDA, ladies
and
gentlemen. Today I have been invited by FDA to
address a specific question, which is how
should we
interpret differences in the observed
frequency of
events in a clinical trial when the
number of
events is small.
Let me just say arbitrarily
that I will
define, for purposes of today, what I
mean by a
small number of events and that would
have provided
less than 70 percent power to have
detected a true
treatment difference assuming an effect
size
similar to that generally encountered in
clinical
research.
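
[Editorial illustration, not part of the presentation.] To make the 70 percent threshold concrete, the following minimal sketch uses the Schoenfeld approximation for a two-arm comparison with equal allocation. The hazard ratio of 0.80 is a hypothetical stand-in for "an effect size similar to that generally encountered"; the function name is illustrative only.

```python
import math
from statistics import NormalDist

def power_for_events(total_events, hazard_ratio, alpha=0.05):
    """Approximate power of a two-arm, equal-allocation comparison.

    Uses the Schoenfeld approximation, in which the standard error of the
    log hazard ratio is about 2 / sqrt(total_events).
    """
    normal = NormalDist()
    z_critical = normal.inv_cdf(1 - alpha / 2)
    signal = math.sqrt(total_events) / 2 * abs(math.log(hazard_ratio))
    return normal.cdf(signal - z_critical)

# Hypothetical true effect: a 20 percent reduction in hazard.
ASSUMED_HAZARD_RATIO = 0.80

for events in (46, 200, 500, 1000):
    power = power_for_events(events, ASSUMED_HAZARD_RATIO)
    label = "small" if power < 0.70 else "adequate"
    print(f"{events:5d} events: power = {power:.2f} ({label})")
```

Under these assumptions, roughly 500 events are needed before the analysis stops counting as "small" in the speaker's sense.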
This is just a thought. Just suppose you
do a trial for a noncardiovascular
indication and
you note that there are 13 major adverse
cardiovascular events in the placebo
group and 33
such events in the drug-treatment
group. How
should this difference be interpreted?
Many would simply perform a
statistical
test, derive the p-value, and get excited
if the
p-value were less than some arbitrary
value such as
0.05.
In this example, the p-value of 0.002 would
suggest, to some, that this difference
between 13
and 33 in a trial of about 3,000
patients, would
have been observed only two times out of
1,000, an
effect unlikely to have been due to the
play of
chance.
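
[Editorial illustration, not part of the presentation.] The conventional calculation described above can be sketched as a pooled two-proportion z-test. The split of 1,500 patients per arm is an assumption (the talk says only "about 3,000 patients"), and the exact p-value depends on which test is used, so this sketch should be read as reproducing the order of magnitude of the quoted 0.002 rather than the figure itself.

```python
from statistics import NormalDist

def two_proportion_p_value(events_a, n_a, events_b, n_b):
    """Return (z, two-sided p) from a pooled two-proportion z-test."""
    rate_a = events_a / n_a
    rate_b = events_b / n_b
    pooled = (events_a + events_b) / (n_a + n_b)
    std_err = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (rate_b - rate_a) / std_err
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

# Assumption: equal arms of 1,500 patients each.
z_score, p_value = two_proportion_p_value(13, 1500, 33, 1500)
print(f"z = {z_score:.2f}, two-sided p = {p_value:.4f}")
```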
However, before getting
excited, we should
remember that p-values must be
interpreted in some
context.
P-values are most easily interpreted when
they refer to predefined primary
endpoints in
trials adequately powered, more than 80,
90 percent
power, to detect differences between
treatments.
However, even under such circumstances,
p-values
are not necessarily reproducible.
Bob O'Neill and others have made the point
that, if a p-value in the trial is 0.05,
the
likelihood of seeing 0.05 in a second
identical
trial is only about 50 percent. It is only when
the p-value in the first study is 0.001 that
the
likelihood of seeing 0.05 or less in the
second
identical trial is at least 90 percent.
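
[Editorial illustration, not part of the presentation.] One common way to arrive at figures like these is to assume that the true effect equals the effect observed in the first trial and ask how often an identical second trial would reach p of 0.05 or less. The sketch below makes that assumption explicitly; it is not necessarily the calculation Dr. O'Neill used.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def replication_probability(first_p, alpha=0.05):
    """Chance that an identical second trial reaches two-sided p <= alpha.

    Assumes the true effect equals the effect observed in the first trial,
    and ignores the negligible chance of significance in the wrong direction.
    """
    z_observed = STD_NORMAL.inv_cdf(1 - first_p / 2)
    z_critical = STD_NORMAL.inv_cdf(1 - alpha / 2)
    # The second trial's z-score is centred on z_observed with unit variance.
    return 1 - STD_NORMAL.cdf(z_critical - z_observed)

for first_p in (0.05, 0.01, 0.001):
    chance = replication_probability(first_p)
    print(f"first-trial p = {first_p}: replication chance = {chance:.2f}")
```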
These calculations are the
basis of the
frequent FDA guidance that, to
demonstrate
persuasive evidence for efficacy, a
sponsor needs
to provide two trials with 0.05 or less
or one
trial with a very, very small p-value.
But what if the event was not
the primary
endpoint in the study? What, in fact, if the event
was not even precisely defined before the
start of
the trial? What if the trial was not adequately
powered to detect a treatment difference
for the
endpoint?
What does a p-value mean under these
circumstances?
Unfortunately, this happens
quite
frequently in clinical trials under a
variety of
circumstances. But it is particularly true in the
analysis of adverse events. So let's make a list of
things to worry about when using p-values
to
compare the frequency of adverse events
in a
clinical trial.
First, there are literally
hundreds of
adverse events in a clinical trial and,
therefore,
there are hundreds of possible comparisons
that can
be made.
Now, this is classically referred to as
the multiple comparisons problem. For example, if
a typical large-scale clinical trial
yields as many
as 500 individual terms describing
adverse events
and if a p-value were calculated for each
pairwise
comparison, one would, of course, by
chance alone,
expect about 5 percent of the terms, or
about 25
events, to have a p-value of 0.05 or less and
1 percent
of the terms, or about 5 events, to have a
p-value
of 0.01 or less.
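
[Editorial illustration, not part of the presentation.] The arithmetic behind the multiple comparisons problem can be sketched in a few lines. The second function additionally assumes that the comparisons are independent, which is a simplification: adverse-event terms in a real trial are correlated.

```python
def expected_false_positives(n_comparisons, alpha):
    """Expected number of nominally significant terms when no term has a true effect."""
    return n_comparisons * alpha

def chance_of_at_least_one(n_comparisons, alpha):
    """Chance of one or more false positives, assuming independent comparisons."""
    return 1 - (1 - alpha) ** n_comparisons

N_TERMS = 500  # the number of adverse-event terms used in the talk

for alpha in (0.05, 0.01):
    expected = expected_false_positives(N_TERMS, alpha)
    any_hit = chance_of_at_least_one(N_TERMS, alpha)
    print(f"alpha = {alpha}: about {expected:.0f} terms expected by chance, "
          f"P(at least one) = {any_hit:.3f}")
```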
The second issue in
interpreting
comparison of frequency of adverse events
is the
fact that adverse events are spontaneous
nonadjudicated reports. Now, adverse events are
reported at the discretion of the
investigator and
then translated into standardized
terms. There is
little uniformity on how an event is
identified,
defined or reported and this uncertainty
increases
when the event is in a field remote from
the
investigator's focus.
Now, some of you may believe
that you can
fix this problem by carrying out blinded
adjudication of events after the fact.
Unfortunately, the rules guiding post hoc
adjudication are inevitably influenced by
the
knowledge that a treatment difference has
been
seen.
In fact, any bar set by a post hoc process,
is capable of magnifying or diluting an
effect.
For example, if you set very
strict
criteria, a committee could reduce the
number of
events and, therefore, reduce statistical
power.
By setting very loose criteria, the
committee can
include many questionable events and
reduce the
magnitude of a treatment difference.
To make things more
complicated,
adjudication committees do not generally
examine
individuals who did not report an event
to make
sure they didn't have an event.
The third issue in interpreting
comparisons of frequencies is that some
signals are
apparent only if adverse events are
grouped
together.
Now, that is not much of a problem if
the difference is fairly straightforward
and
focuses on one single event. But things can become
a little bit more complicated if the
analysis
requires combining events and combining
trends
across two or more events in order to
reach some
magical level of statistical
significance.
Now, the problem is that these
groupings
are frequently constructed after the
fact, making
it possible to include only events that
showed the
trend the investigator is interested
in. For
example, if an investigator believed the
drug
increased the risk of a major
cardiovascular event,
he or she might first look at myocardial
infarction
and stroke, but, finding little
difference here, he
or she might be tempted to look at other
related
events; for example, not seeing a difference
in
myocardial infarction, an investigator
might be
tempted to broaden the definition of a
myocardial
ischemic event to include sudden death or
unstable
angina if the differences between the
groups
supported some predetermined judgment.
Similarly, not seeing a
difference in
stroke, an investigator might be tempted
to broaden
the definition to include a TIA. But the
possibilities for grouping are very, very
large and
the possibilities of finding something,
if you want
to be creative, are also quite large,
even though
these differences may be related to the
play of
chance.
As a result, the definition of
grouping
may vary from study to study. Now, some
investigators try to fix this problem by
setting up
a uniform definition to be used across
all studies.
But when the definition is developed
after a
concern has been raised, those creating
the
definition have frequently already looked
at the
data or have communicated with those who
have
looked at the data, and know either
consciously or
subconsciously what kind of definition is
required
to capture the events of interest.
The fourth, and what I want to
focus on
the most in my presentation, is the issue
of
interpreting comparisons of frequency of
adverse
events because the number of adverse
events is
small and, because they are small, they
result in
extremely imprecise estimates.
Now, you may think that
investigators
generally understand the difficulties of
analyzing
small numbers of events. For example, most
investigators know that, when the number
of events
is small, the lack of an observed
difference does
not rule out the existence of a true
difference.
We have been taught that this should be
apparent by
looking at the confidence interval and,
as you can
see here, the confidence interval is very
wide and
includes the possibility of benefit and
harm.
So investigators, basically,
consider
these kind of data to be
inconclusive. But what is
generally not appreciated is that, when
the number
of events is small, the confidence
interval is
necessarily so wide that it may not truly
represent
the range of values that would include
the true
effect of the drug. As a result, even the finding
of an observed difference does not
necessarily
prove the existence of a true difference.
To illustrate this point, this
slide shows
the effect size and confidence intervals
required
to reach statistical significance in a
hypothetical
trial of 3,000 patients assuming a range
from a
very small to a very large number of events.
Now, assuming the trial shows a
statistically significant effect--that
means that
we are only going to look at this if a
p-value,
let's say, is less than 0.05--the smaller
the
number of events, the larger must be the
treatment
effect in order for this effect to be
statistically
significant and the wider the confidence
intervals
have to be.
Put it another way, if the
number of
events is small, the trial will show a
significant
difference only if the treatment effect
is very
large and the estimate of the effect is
very
imprecise.
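
[Editorial illustration, not part of the presentation.] The relationship between the number of events and the size of effect needed for significance can be sketched as follows. The sketch assumes two equal-sized arms and uses the approximation that the standard error of the log event-rate ratio is about 2 divided by the square root of the total number of events; the function name is illustrative only.

```python
import math

def minimum_significant_ratio(total_events, z_critical=1.96):
    """Smallest event-rate ratio that just reaches significance.

    Assumes two equal-sized arms, so SE(log ratio) is roughly
    2 / sqrt(total_events). Returns the ratio and the confidence
    interval observed exactly at the significance threshold.
    """
    std_err = 2 / math.sqrt(total_events)
    ratio = math.exp(z_critical * std_err)
    # At the threshold the interval runs from 1.0 up to the ratio squared.
    return ratio, (1.0, ratio ** 2)

for events in (20, 46, 200, 1000):
    ratio, (low, high) = minimum_significant_ratio(events)
    print(f"{events:5d} events: ratio must exceed {ratio:.2f}, "
          f"interval at threshold = ({low:.2f}, {high:.2f})")
```

With few events, only a very large ratio can be significant, and the interval around it is correspondingly wide, which is the point made on the slide.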
Unfortunately, when you look at
adverse
events in a trial, the number of events
will always
be small.
This is because the trial, as you know,
was designed to provide enough data to
examine the
primary endpoint, which the trial produces a
very precise
estimate of, but it is not powered to
look at any
other analyses and, therefore, at the end
of the
trial, you get generally a less precise
estimate of
the secondary endpoint and an extremely
imprecise
estimate of any specific adverse event.
Now, you may ask, what is wrong
with an
imprecise estimate? Well, imprecise estimates are
fine if the intent is to withhold
judgement until
more data are collected to make the
estimates more
precise.
But imprecise estimates are problematic
if the intent is to stop and reach a
conclusion.
That is because, when
calculated in the
usual manner, p-values and 95 percent
confidence
intervals are most easily interpreted in
the
context of a completed experiment. Unfortunately,
the adverse-event data generated in a
typical trial
is not the result of a completed
experiment. In
fact, viewed from the amount of data
needed for a
precise estimate, the adverse-event data
in a
single study only represents a snapshot
of an
ongoing experiment to characterize the
safety of
the
drug.
As a result, performing an
analysis of
adverse-event data is akin to performing
an interim
analysis of primary endpoint data in an
ongoing
clinical trial. Now, this is important because we
know a fair amount of how to interpret
interim
analyses in a clinical trial and here I
really must
apologize to Tom Fleming because what I
am going to
review here very quickly is borrowed
heavily from
his extensive work in this area.
But it is really important to
think about
small numbers of adverse events as an
interim look
on a global effort to characterize the
safety of a
drug.
Now, as you know, when you look at interim
analyses in a clinical trial, one plots
the
treatment difference represented by a
z-score
against the amount of information that we
have, and
that is generally represented by the
fraction of
expected events.
We start the trial at zero
effect and zero
information. At the end of each interim analysis,
we add a point until we get to the
end of
the study. Now, if we have assigned an alpha of
0.05 to the endpoint, we want to make
sure that we
evaluate the treatment difference seen at
the end
of the trial against an alpha of about
0.05 which
generally corresponds to a z-score of
about 2.0.
Now, some might think, naively, that,
during the course of a study, the
observed
difference between treatments will be so
predictable that we would observe a
linear march
between the start of the study and the
end of the
trial.
But we know that when the amount of data is
small, things tend to bounce around a
lot, so much
so that early results can be very
misleading.
It is sort of like the
situation of trying
to predict the results of an election
when only 1
percent of the precincts have been
reported and
they are not even representative. So, as a result,
if we got excited about any difference in
z-score
more than 2.0 early in the trial, we would be
getting
excited about effects that were not
likely to be
seen or sustained if we had more data
even though a
z-score of 2.0 would normally correspond
to a
p-value of less than 0.05.
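
[Editorial illustration, not part of the presentation.] How much early results "bounce around" can be seen in a small simulation. The sketch below generates trials with no true treatment effect, looks at the z-score at several equally spaced points, and counts how often the z-score exceeds the nominal threshold at any look. The number of looks, the number of simulated trials, and the seed are arbitrary choices.

```python
import random

def crossing_rate(n_looks=5, threshold=1.96, n_trials=20000, seed=0):
    """Fraction of simulated null trials that look 'significant' at any look.

    Each trial accumulates independent standard normal increments; the
    z-score at look k is the running sum divided by sqrt(k).
    """
    rng = random.Random(seed)
    crossings = 0
    for _ in range(n_trials):
        running_sum = 0.0
        for look in range(1, n_looks + 1):
            running_sum += rng.gauss(0.0, 1.0)
            z = running_sum / look ** 0.5
            if abs(z) > threshold:
                crossings += 1
                break
    return crossings / n_trials

rate_one_look = crossing_rate(n_looks=1)
rate_five_looks = crossing_rate(n_looks=5)
print(f"one look at the end: {rate_one_look:.3f}")
print(f"five looks:          {rate_five_looks:.3f}")
```

Even though every simulated trial has no true effect, repeated looks against a fixed threshold flag well over 5 percent of them.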
In fact, the smaller the amount
of data,
the more things can bounce around a lot,
the more
it is likely that what we will be seeing
will be
due to the play of chance. Therefore, to prevent
investigators from reaching a conclusion
when the
estimates are imprecise, statisticians,
particularly Tom, have recommended that
investigators refrain from getting excited
about
nominally significant z-scores when the
amount of
data is scarce.
Specifically, they have
proposed that
boundaries must be crossed before we can
feel
comfortable that an effect seen early is
likely to
be present at the end of an experiment.
Now, Tom, in particular, has
proposed a
curvilinear boundary like this. There are many
other boundaries that have been proposed
by
others.
But this is very, very commonly used
with
an alpha of 0.05 for a primary
endpoint. It sort
of looks like this. Because it is curvilinear, to
be significant at the 0.05 level, the
treatment
difference must be extreme when the
amount of
information is small as would be the case
early in
the study.
However, as the trial proceeds,
treatment
differences required to conclude that
there is an
effect at the 0.05 level decrease and
become
closer and closer to a z-score of about
2.0 at the
end of the study.
Now, this is a very different
thought
process and a very different approach
than getting
excited about a p-value less than 0.05 no
matter
when you observed it during the
study. For
example, a z-score of 2.5--that is right
here--would be meaningful if seen at the
end of the
study but it wouldn't be considered
significant if
seen early in the study even though the
nominal
p-value at this time is less than 0.05.
Now, if the number of events is
small, the
difference would need to be far more
extreme--say,
a z-score up here--to be meaningful at
the 0.05
level.
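
[Editorial illustration, not part of the presentation.] The shape of a curvilinear boundary of this kind can be approximated by dividing the end-of-study critical value by the square root of the information fraction. This is a simplification of an O'Brien-Fleming-type boundary: the exact constant is slightly larger and depends on the number of planned looks, which the sketch ignores.

```python
from statistics import NormalDist

def approx_boundary_z(information_fraction, alpha=0.05):
    """Approximate critical z-score at a given fraction of expected events.

    Simplified O'Brien-Fleming-type shape: z_final / sqrt(fraction).
    """
    z_final = NormalDist().inv_cdf(1 - alpha / 2)
    return z_final / information_fraction ** 0.5

def nominal_p(z):
    """Two-sided nominal p-value for a z-score."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

OBSERVED_Z = 2.5  # the z-score used as the example in the talk

print(f"nominal p for z = {OBSERVED_Z}: {nominal_p(OBSERVED_Z):.3f}")
for fraction in (0.10, 0.25, 0.50, 0.75, 1.00):
    needed = approx_boundary_z(fraction)
    verdict = "crosses" if OBSERVED_Z >= needed else "does not cross"
    print(f"information {fraction:.2f}: boundary z = {needed:.2f}, "
          f"z = {OBSERVED_Z} {verdict}")
```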
Here is a specific
example. This is an
old cardiovascular trial. This is the Coronary
Drug Project. It was carried out more than 30
years ago.
It included a comparison of clofibrate,
a lipid-lowering drug, and placebo on
coronary
events.
At four separate times during the study,
the difference in favor of clofibrate was
statistically significant at a nominal p
of 0.05 or
less.
But, at the end of the trial, there was no
difference between placebo and
clofibrate. The
difference seen early in the trial was
related to
the imprecision inherent when analyzing
small
numbers of events.
In fact, if a boundary had been
used in
this study, at no time during the trial
would the
treatment effect have crossed the
boundary and led
to
the conclusion that clofibrate was better than
placebo.
Now, let me say this kind of
fluctuation
early in a study is very, very
common. There are
even examples in which a treatment has been
associated
with a nominally significant adverse
effect which
later was reversed during the course of
the trial
and became statistically significant at
the end of
the study.
Now, I should mention that the
boundary
that I have shown you is a boundary with
an alpha
of 0.05.
This means, when the boundary is crossed,
the p-value for the treatment effect is
less than
0.05 not less than the nominal p-value
that
corresponds to the z-score that
allowed the
boundary to be crossed.
Now, for each p-value or each
alpha, there
is a separate boundary. As the requirement for
strength of evidence becomes more
stringent,
the boundary is shifted upward and to the
right.
You might ask why am I going
through all
this.
Because analyzing data derived in an
underpowered trial raises the same
concerns as
analyzing data derived from an
underpowered interim
analysis in an adequately powered study.
The cardiovascular field is
replete with
examples of how misleading small numbers
of events
can be.
Let me give you a few examples.
For
example, in an early pilot trial, the
ACE/NEP
inhibitor, Omapatrilat, reduced the risk
of a major
cardiovascular event by 47 percent when
compared
with an ACE inhibitor. As you can see, the
confidence intervals are extremely wide
because the
analysis here was based on only 39
events.
Later, a definitive trial was
carried out
that recorded nearly 1900 events. There was no
difference between Omapatrilat and the
comparator
ACE inhibitor on the same endpoint in the
same
population.
Here is another example. In an early
pilot trial, amlodipine reduced the risk
of a major
cardiovascular event by 45 percent, small
p-value
but wide confidence intervals. Later, in a
definitive trial which recorded four
times as many
events, there was no effect of amlodipine
on the
same endpoint in the same population using
the same
investigators.
There are even examples when
the effect
seen in a pilot trial was reversed when
the
definitive study was carried out. Two examples.
In two pilot trials, both in heart
failure, one
with the drug Vesnarinone, one with the
drug
Losartan, both drugs significantly
reduced the risk
of death--not a minor endpoint; death--by
50 to 60
percent.
But these benefits were seen in trials
that each recorded fewer than 50
events and
thus produced treatment estimates with
extremely
wide confidence intervals.
When both drugs were
reevaluated in
definitive trials that recorded ten times
as many
events, both drugs were associated with
increased
risks of death, in one case, significant
at the
less than 0.05 level.
Now, notice that the confidence
intervals
of the treatment effect in the definitive
trials do
not overlap with the confidence intervals
of the
treatment effect in the early pilot
studies. So
here we have an effect, two examples, of
an
underpowered trial that showed a significant
benefit whereas the definitively powered
study
showed significant harm.
Here is another example. This is a
meta-analysis of a small number of trials
looking
at the effect of magnesium in acute
myocardial
infarction. A meta-analysis of a number of studies
showed intravenous magnesium associated
with the
striking reduction in mortality, a 55
percent
reduction in risk of death, but wide
confidence
intervals, a very small p-value, in a
fairly large
study.
This effect appeared to be
reinforced, with a
smaller treatment effect but wide
confidence
intervals and then, subsequently, in a definitive
trial that recorded 4,000 deaths, there
was a
nearly significant adverse effect of
magnesium on
the same endpoint in the same population.
Now, again, please note that
the
confidence intervals of the treatment
estimate in
this definitive study do not overlap at
all, with
the confidence intervals of the estimates
in the
earlier moderately sized study, and not
at all in
the meta-analysis. Again, this is really a
reflection of the imprecision inherent in
looking
at small numbers of events.
Let me give you one final
example because
it actually deals with an adverse
effect. In an
early pilot trial with extended-release
metoprolol--this is a study that looked
at a very
small number of events, about 20 events,
showed a
three-fold increase in the risk of
hospitalization
of heart failure in the metoprolol group
compared
with the placebo group. Look at the confidence
intervals here. They go from about Washington to
California, very, however, nominally
significant
treatment effect.
When this trial was replicated
in a
similar population with exactly the same
drug,
exactly the same formulation, exactly the
same
dose, there was now a reduction in the
frequency of
hospitalization for heart failure. Let me just
emphasize, this was recorded as an
adverse event in
this earlier trial.
So what have we learned from
all this?
Well, a couple of thoughts. To achieve statistical
significance in an underpowered analysis,
the
effect size must be extreme and the
estimate must
be imprecise. Yet the more extreme the effect, the
more imprecise the estimate, the less
likely it
will be reproduced in a definitive
trial. That is
why I think, of all the things that we can
worry
about in looking at adverse events, the
most
worrisome is the imprecision inherent in
the
analysis of small numbers of events.
Let me just close with a few
final
thoughts.
You might ask, based on all of this,
what should we do. Well, I think the first step,
perhaps the most important first step, is
to
develop an approach to analyzing data in
trials
with small numbers of events which
actually
accurately reflects the true imprecision
of the
treatment effect estimate and its
statistical
significance.
Let me just emphasize one
thing, and I
just want to put this as a proposal. In no way,
would I propose this as a definitive
solution but,
to get the discussion going, this might
be an
interesting first way of thinking about
this.
The conventional way of
comparing small
numbers of events is to calculate 95
percent
confidence intervals followed by the
derivation of
the p-value. However, the conventional calculation
of the confidence intervals incorporates
into it a
z-score that the investigator designates
as the
target value for statistical
significance. For
example, most statisticians, in
calculating a
confidence interval, would simply use a
z-score of
about 2.0.
And they would do that because
that is the
critical value for the z-score at the end
of an
adequately powered trial with an alpha of
0.05. So
what they would do is they would take
this z-score
and they will use it to calculate the
confidence
interval.
What a lot of people, I think, fail to
realize is that this z-score is not the
critical
value for decision making if one looks
early in the
same experiment.
Early in that experiment, the
critical
value for a z-score should be determined
by the
interim monitoring boundary appropriate
for the
information content, not the z-score at
end of the
study.
Now, if one uses the boundary
z-score in
the calculation of the 95 percent
confidence
intervals, the confidence intervals here
will be
much, much wider resulting in a p-value
that will
no longer be statistically
significant. Now this
is important because everyone talks about
p-values
at these meetings. I showed you these data before.
Conventionally calculated, the p-value
would be
0.002 meaning the likelihood of chance
alone being
2 in 1000.
Well, if, in fact, if one
recognized that
the data here really result in a very
imprecise
estimate and one incorporates the
thinking process
of an O'Brien-Fleming boundary into this,
as a
reflection of this imprecision, then the
confidence
intervals now truly reflect the
imprecision in the
estimate and now the p-value is a lot less
interesting
than it was before.
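
[Editorial illustration, not part of the presentation.] The proposal can be sketched by computing the same risk-ratio interval twice, once with the usual critical value and once with a boundary value. Several inputs are assumptions, not figures from the talk: equal arms of 1,500 patients, an information fraction of 0.25, and the simplified boundary shape z_final / sqrt(fraction).

```python
import math
from statistics import NormalDist

def risk_ratio_interval(events_drug, n_drug, events_placebo, n_placebo, z_critical):
    """Risk ratio with a log-scale (Wald) interval for a chosen critical z."""
    ratio = (events_drug / n_drug) / (events_placebo / n_placebo)
    std_err = math.sqrt(1 / events_drug - 1 / n_drug
                        + 1 / events_placebo - 1 / n_placebo)
    half_width = z_critical * std_err
    log_ratio = math.log(ratio)
    return ratio, math.exp(log_ratio - half_width), math.exp(log_ratio + half_width)

z_final = NormalDist().inv_cdf(0.975)

# Hypothetical: the 46 events are treated as one quarter of the information
# a definitive safety assessment would need.
ASSUMED_INFORMATION_FRACTION = 0.25
z_boundary = z_final / math.sqrt(ASSUMED_INFORMATION_FRACTION)

conventional = risk_ratio_interval(33, 1500, 13, 1500, z_final)
adjusted = risk_ratio_interval(33, 1500, 13, 1500, z_boundary)

for name, (ratio, low, high) in (("conventional", conventional),
                                 ("boundary-adjusted", adjusted)):
    print(f"{name:18s} ratio = {ratio:.2f}, interval = ({low:.2f}, {high:.2f})")
```

Under these assumptions the conventional interval excludes a ratio of 1 while the boundary-adjusted interval does not; a different assumed information fraction would change how wide the adjusted interval becomes.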
Now, the use of boundary-adjusted
confidence intervals would, I think,
appropriately
describe the great uncertainty inherent
in the
analysis of small numbers of events,
hopefully
markedly reducing the false-positive
error rate.
In spite of using a
boundary-adjusted
confidence interval, adverse effects that
are known
to be characteristic of specific drugs
would
generally remain statistically
significant.
However, this approach, and it is just a
thought
experiment, would not provide a way to
interpret
trends observed in imprecise data.
So, lastly, let me just
conclude with some
thoughts about what we should do with
worrisome
trends in imprecise data. The first thing we could
do is believe in those that are
biologically
plausible. However, we need to be very careful
here.
Everyone knows physicians can always be
relied on to propose a biological
mechanism to
explain the validity of an unexpected and
potentially preposterous finding simply
because it
happens to have an interesting
p-value. Anyone who
doesn't believe this, you know, I would
be happy to
show you overwhelming evidence that this
is the
case.
Second, we could look for
confirmatory
evidence in other studies, remembering that
we
shouldn't be selective. But, even if every study
showed the same trend, how would you know
that you
had enough evidence to reach a
conclusion? Some
have proposed doing a cumulative
meta-analysis in
which each trial is considered to
represent an
interim analysis on the way to a final
judgement.
Indeed, Salim Yusuf has
proposed that, as
each trial is added to the meta-analysis,
that one
use interim monitoring boundaries to
interpret this
cumulative meta-analysis. This has, certainly, a
considerable amount of appeal.
Let me just emphasize. Salim has, in
fact, underscored the fact that the
conditions here
are not identical to those that exist for a
true
interim analysis. In the case of a true interim
analysis, we generally know that the
types of
patients in studies are similar at all
observation
points.
Here it is different.
In the case of a cumulative
meta-analysis,
the types of patients in studies differ
across the
various trials. So, as a result, Salim has
proposed that, when reaching a conclusion
based on
data that has been combined across
trials, that a
boundary more strict than 0.05 be used.
Now, he has specifically
outlined the
importance of this using the example of
intravenous
magnesium. I showed you the data on intravenous
magnesium in myocardial infarction. When the early
trials with magnesium were carried out,
the z-score
of greater than 2.0 was crossed
early. As the
cumulative evidence occurred, the initial
boundary
of 0.05 was crossed.
But then a large study, when
added to the
other cumulative analyses, brought this
treatment
effect down to a 0 level. So Salim, and others, in
fact, have emphasized that, when you are
using a
meta-analysis approach and using
interim monitoring
boundaries, that maybe one should require
a p-value
of less than 0.05 or even, perhaps, a
small
p-value.
Let me say that most of the
effects the
committee has seen over the past two days
would not
come even close to meeting these
criteria.
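The cumulative approach described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three trials are made-up numbers (two small early trials followed by one large one), the pooling is a simple fixed-effect inverse-variance estimate, and the thresholds 1.96 and 2.576 are the conventional two-sided 0.05 and 0.01 cutoffs, used here as stand-ins for a stricter boundary rather than the specific boundaries Dr. Yusuf proposed.

```python
import math

# Conventional two-sided z thresholds (illustrative stand-ins for a boundary).
Z_05 = 1.96    # p < 0.05
Z_01 = 2.576   # p < 0.01, one possible "stricter" boundary


def cumulative_z(trials):
    """Return the pooled z-score after each trial is added.

    trials: list of (log_relative_risk, standard_error) tuples,
    in the order the trials were reported.
    """
    z_scores = []
    sum_w = 0.0    # running sum of inverse-variance weights
    sum_wx = 0.0   # running weighted sum of log relative risks
    for log_rr, se in trials:
        w = 1.0 / se ** 2
        sum_w += w
        sum_wx += w * log_rr
        pooled = sum_wx / sum_w
        pooled_se = math.sqrt(1.0 / sum_w)
        z_scores.append(pooled / pooled_se)
    return z_scores


# Hypothetical data: two small favorable trials, then one large neutral trial.
trials = [(-0.60, 0.30), (-0.50, 0.35), (0.02, 0.05)]
zs = cumulative_z(trials)

for i, z in enumerate(zs, start=1):
    print(
        f"after trial {i}: z = {z:.2f}, "
        f"crosses 0.05: {abs(z) >= Z_05}, crosses 0.01: {abs(z) >= Z_01}"
    )
```

With these hypothetical inputs, the pooled z-score passes the 0.05 threshold after the first two trials but not the stricter one, and the large third trial pulls the pooled estimate back toward zero.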
Now, some of you may say, why
not avoid
all of this uncertainty and simply carry
out an
adequately powered definitive trial with
the
adverse event as the primary endpoint? Is this
crazy?
No; it is not crazy at all.
Sponsors
pursue encouraging trends. Most are disappointed,
but they will pursue them. Sponsors, therefore,
should have an obligation to pursue
discouraging
trends realizing that most of them
probably won't
be confirmed either.
Only a definitive trial can
address
ascertainment and classification biases
as well as
concerns about multiplicity of
comparisons and
imprecision of the data. However, can we really
expect sponsors to pursue every adverse
trend?
There are some obvious limitations to
doing this.
Furthermore, if you could decide which
adverse
trend you wanted to pursue, how easy
would it be to
carry out the trial intended to
definitively
evaluate an increased risk of an adverse
effect?
Can you imagine the consent
forms for the
IRBs for such a study? Some may say that we are
being too stringent here, that the
criteria for
raising a safety concern need not be as
stringent
as the criteria for establishing
efficacy. But I
am not so sure that the criteria for
establishing
efficacy and safety should be that
different.
As a rule, we are very strict
in reaching
conclusions about efficacy because saying
that
there is a benefit when there is none
means that
millions will be treated unnecessarily
and subject
to side effects and cost. Now, although some might
advocate being less strict in reaching
conclusions
about safety, please remember; saying that
there is
an adverse effect when there is none
means that
millions will be deprived of an effective
treatment.
In conclusion, the findings of
controlled
trials are most easily interpreted when
they
represent the principal intent of the
study. A
non-principal finding is subject to many
interpretive difficulties, many of which
we have
reviewed: ascertainment biases, inflated
false-positive rates due to multiplicity
of
comparisons and, the one I have
emphasized the
most, the imprecision of estimates
inherent in the
analysis of small numbers.
I think FDA, industry and
academia remain
in a quandary as to how to respond in a
responsible
fashion to observed differences in the
reported
frequency of adverse events. Let me just
emphasize, my presentation shouldn't be
construed
as favoring one particular side in all
the
discussions that have occurred. In my view,
regardless of one's position, it is
critical to
understand the limitations of what we
know and to
resist the temptation to reach
conclusions before
we are justified to do so.
I think only by recognizing our
ignorance
will we be able to take the first step
towards
developing a rational approach that is in
the
interest of all patients.
Thank you. I will be happy to answer any
questions.
DR. WOOD: Dr. D'Agostino?
DR. D'AGOSTINO: Thank you very much,
Milt.
I have a couple of questions that I think, I
hope, are relevant to our
deliberations. In terms
of your sense of large and the idea of
chasing
after a safety event and making more out
of it than
one should, we have a study, APPROVe,
where there
was a serious up-front prestated deliberation
to
make sure they had good ascertainment and
adjudication of cardiovascular events,
and they
come up with 45 versus 25 events,
carefully
collected.
I am struck by that being
small, but I
am also struck by the carefulness in
which it was
done, say, as opposed to the APD where
they did an
interim analysis that has those
problems. Could
you comment on, say, the APPROVe study?
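To put a number on how imprecise a 45-versus-25 split is, the following rough Python sketch computes a relative risk and an approximate 95 percent confidence interval from the two event counts alone. It assumes, purely for illustration, that the two arms were the same size with the same follow-up and that events were rare, so that the ratio of counts approximates the relative risk; the function name is hypothetical.

```python
import math


def rr_ci_from_counts(events_a, events_b, z=1.96):
    """Approximate relative risk and CI from event counts in two equal arms.

    Assumes equal arm sizes and follow-up and rare events, so that
    SE(log RR) is roughly sqrt(1/events_a + 1/events_b).
    """
    rr = events_a / events_b
    se_log_rr = math.sqrt(1.0 / events_a + 1.0 / events_b)
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper


rr, lo, hi = rr_ci_from_counts(45, 25)
print(f"RR = {rr:.2f}, approximate 95% CI {lo:.2f} to {hi:.2f}")
```

Under these assumptions the interval runs from just above 1.0 to nearly 3.0, which is the kind of width the discussion of small numbers of events is concerned with.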
DR. PACKER: I think that, when you have
incomplete data, as you would if you have
small-numbers events, you need to be a
lot more
careful about the thinking process. That doesn't
mean you can't make judgments. It doesn't mean you
can't incorporate a set of principles
that would
guide decision making by looking at the
totality of
the evidence and bringing to the process
what you
inherently believe. I think that is what the
committee needs to do today.
What I really wanted to
address, however,
is how hard this is and that the normal
reliance--as you know, clinical
investigators,
because they don't understand p-values,
rely on
them.
What I am trying to do is to explain that,
in fact, we are less certain about what
we know
here than we, perhaps, should be.
DR. D'AGOSTINO: But that is on the
APPROVe study; it was reasonable,
too.
DR. PACKER: I think you need to take that
in the totality of the carefulness in
which it was
done, the prospective nature of it. But, remember,
in
all the examples that I showed you, there were trends,
sometimes very striking trends, in
early
pilot trials that had prespecified,
adjudicated
endpoints but, because they were
small-number
events with very imprecise estimates, the
definitive trial was non-confirmatory.
So just because it is up-front
and
predefined--
DR. D'AGOSTINO: That is my question, yes.
That is my question. You still end up with small
numbers.
Let me have just a couple of other
questions. The second question is really bothering
me very much in terms of how we would
recommend
trials.
If you decide--if the group decides and
suggests to the FDA that there should be
more
trials, more randomized clinical trials,
the
sponsors are, then, going to have to go
back and
say, well, they are going to set up a
trial saying
the null hypothesis that the relative
risk is 1.0
versus the relative risk is not 1.0.
Now, the best thing a sponsor
can do is to
run a very sloppy study and they will
accept that
null hypothesis because the confidence
intervals
will be so wide and they will contain
1.0. The
alternative is to sort of do a
noninferiority type
idea that you end up the study, you end
up with the
confidence interval, and that confidence
interval
has to be below something like 1.3.
Do you have advice for us if
you did this
sort of second approach? We are dealing with rates
like 1 percent. Could we live with a 1.3 relative
risk that you rule out, a 1.3 relative
risk?
People may be dying if you do that. So how do you
respond to that?
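The size of trial implied by this second approach can be estimated with a standard approximation. The Python sketch below is a back-of-the-envelope calculation, not a trial design: it assumes 1:1 randomization, a true relative risk of 1.0, a one-sided alpha of 0.025, 80 percent power, and the usual approximation that the variance of the log relative risk is about 4 divided by the total number of events. The function name and the default values are assumptions made for illustration.

```python
import math


def events_to_exclude_margin(margin, z_alpha=1.96, z_beta=0.8416):
    """Approximate total events needed so the upper confidence limit
    falls below `margin` when the true relative risk is 1.0.

    Uses Var(log RR) ~ 4 / total_events for 1:1 allocation.
    """
    return 4.0 * (z_alpha + z_beta) ** 2 / math.log(margin) ** 2


events = events_to_exclude_margin(1.3)
event_rate = 0.01  # the roughly 1 percent rate mentioned in the question
patients = math.ceil(events / event_rate)
print(f"about {events:.0f} events, about {patients} patients at a 1% event rate")
```

Under these assumptions, ruling out a relative risk of 1.3 takes several hundred events, which at a 1 percent event rate means tens of thousands of patients.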
DR. PACKER: I wish I knew the answer to
that.
I think that it depends on the type of
adverse reaction. It depends on the particular
drug.
It depends on the vulnerability of the
patient population. All of these need to be
factored together with the actual
feasibility of
doing the study.
The one thing I would say is
that one
learns very little by doing a lousy
trial. So,
doing a good trial is the only way to get
a
reasonable answer or reasonable estimate
of the
answer.
DR. D'AGOSTINO: Just one more. I will
make it quick. In these trials, in many of these
trials, people just won't stay in the
trial. Can
you give us some advice on how to deal
with the
drop-out--now, there are rules that you
could say,
the individual wants to leave, has
decided to leave
because the blood pressure is building up
or
because of G.I. problems building up.
To say, we are only going to
look at that
individual for 14 more days after they
leave, to
me, is a problem because if the blood
pressure is
building up, they may be on their way and
it may
take two or three months before they get
an M.I.
and so forth. So you have got the sort of
dropouts, terminations, that are part of
the
protocol but you also have the
individuals who just
stop coming. And they could be substantial. So,
any advice to us?
DR. PACKER: Gee, as you know, when we do
trials for superiority, the effort that
we put into
adherence is extreme. We really want people to
stay on treatment and we organize the
trials to do
everything we can to ethically and
reasonably
maintain adherence.
I take your point that, if the
trial were
a noninferiority trial, it is possible
that the
investigators and sponsor might be less
motivated
recognizing that poor adherence works in
their
favor.
I think that there needs to be a reasonable
effort--I mean, you can maintain
adherence in most
trials if you really, really want to.
DR. D'AGOSTINO: Thank you.
DR. WOOD: I suspect we are not going to
solve that problem today. Dr. Shapiro?
MS. SHAPIRO: Just a comment on your
comment.
We all know, of course, that the Federal
Regulations require that participants be
allowed to
withdraw and not be badgered into
staying. But
what I really wanted to talk about was
your
observations about how it is wrong to
suggest that
we should not chase safety quite as
rigorously
because we will, then, deprive ourselves and
others
of information and access to effective
treatment.
I don't think it is as
simplistic as that,
in that, when we are looking at potential
harm or
safety problems, we have to look not only
at
likelihood that it exists but prevalence
and
severity.
So I think that your response
to that
approach has to take account of those
factors as
well.
DR. PACKER: Let me try to reframe my
response.
You can't isolate benefit from risk.
The judgment as to whether a drug should
be used on
an individual basis or on a population
basis has to
be the relative value of benefit to
risk. You may
decide that you don't even want to pursue
a safety
trend in a non-fatal event when you know
the drug
prolongs life. That would be a very reasonable
judgment.
On the other hand, you might want to
vigorously pursue a very serious safety
issue in a
drug for a symptomatic or cosmetic
condition. So
the risk-to-benefit relationship is the
one that
has to be vigorously defined.
MS. SHAPIRO: Right.
I am sure you will
agree with this; you also have to factor
in
prevalence of the condition and likely
use of that
drug in the population.
DR. PACKER: That's right.
But it is
always--it is risk to benefit. The goal here is
not to say that the risk-to-benefit
relationship
can be altered, simply because you want
to
emphasize one part or another; it has to be
in the
context of the clinical problem and looked
at from
the patient point of view.
DR. WOOD: Dr. Cush?
DR. CUSH: I have two questions. One, I
need some education. You were frequently referring
to very wide confidence intervals where
it didn't
seem so wide. It was only, like, 0.3 and 0.4
where, obviously, when it ranged from 1.0
to 8.0,
that is very wide. But you used those terms in
both situations. Could you explain the differences
there?
DR. PACKER: Actually, I have used "wide"
to refer to extremely wide, moderately
wide and
wide.
DR. CUSH: And narrow would be--
DR. PACKER: Narrow is less than wide.
DR. CUSH: Okay.
DR. PACKER: Let me try.
All the examples
that I showed you that I characterized as
wide
truly reflected estimates that had a high
degree of
uncertainty associated with them. On the benefit
side, benefits that range from an 80
percent
reduction in risk on the high side to a
20 percent
reduction in risk--remember, and I guess
I should
emphasize this and I guess Tom would
reinforce this
dramatically, the concept of how these
curves
looked like in terms of the width is not
symmetrical on both sides of 1.0. The lowest you
can go below 1.0 is 0. So wide confidence
intervals below 1.0 can be 0.2 to
0.8. Those would
be wide confidence intervals. There is no limit
for estimates greater than 1.0, so you
can have 1.0
to 24 on the adverse side of this. So you have to
sort of think about what is wide
differently when
you are looking at estimates below 1.0
than when
you are looking at estimates above
1.0. Maybe that
would be helpful.
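The asymmetry described here comes from the ratio scale, and it largely disappears if interval width is measured on the logarithmic scale. The short Python sketch below uses the two intervals mentioned above (0.2 to 0.8, and 1.0 to 24); the helper names are hypothetical and the comparison is only an illustration.

```python
import math


def log_width(lower, upper):
    """Width of a ratio-scale interval measured on the log scale."""
    return math.log(upper / lower)


def mirror_about_one(lower, upper):
    """Reflect an interval around 1.0 (for example, 0.5 maps to 2.0)."""
    return 1.0 / upper, 1.0 / lower


benefit = (0.2, 0.8)   # interval below 1.0 from the discussion
harm = (1.0, 24.0)     # interval above 1.0 from the discussion

print(f"log-width of 0.2 to 0.8: {log_width(*benefit):.2f}")
print(f"log-width of 1.0 to 24:  {log_width(*harm):.2f}")

mirrored = mirror_about_one(*benefit)
print(f"0.2 to 0.8 mirrored above 1.0: {mirrored[0]:.2f} to {mirrored[1]:.2f}")
```

On the log scale, 0.2 to 0.8 is as wide as 1.25 to 5.0, so an interval that spans only 0.6 units below 1.0 corresponds to one spanning several units above it.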
DR. CUSH: That does help. Secondly, you
have told us that when we are dealing
with
low-numbers adverse events and that being
very
imprecise and hard to make conclusions
from, is it
even less valid or even greater error to,
then,
take that data derived in one situation,
like in an
Alzheimer's trial, and then try to
generalize that
to the general population?
DR. PACKER: But we do that all the time.
There is a general sense that efficacy is
not
extrapolatable across diseases but safety
that is
not disease-specific is extrapolatable.
Let me put it this way. If we didn't do
that, the problem that I put forward
would be
really impossible, really
impossible. So I
actually feel comfortable extrapolating
safety data
across indications as long as the safety
item is
not disease-specific.
DR. WOOD: Dr. Shafer?
DR. SHAFER: Thanks.
That was actually a
very informative presentation and I can
confirm the
distance from Washington to California.
There are really two questions
here that I
think we need to bifurcate. One of them involves
the scientific question of getting at the
truth,
whatever that is. I appreciate everything you say
and, prior to a drug being approved, at
least
ideally, there would be adequate time and
resources
to do exactly what you are proposing.
But there is a second question
which is
how to inform clinical and regulatory
decision
making based on imprecise information
following
approval because, in that setting, a
daily decision
is being made by patients and their
physicians as
to whether or not they need to take the
drug.
One question about how to
approach these
sorts of imprecise data when, in fact, a
daily
decision is occurring, is can you take
the
confidence bounds for both the risk and
the benefit
and integrate those over the
public-health hazard
and the public-health benefit to try to
incorporate
the
entire--both the point estimates but also the
uncertainty about them into the
regulatory
decision-making process?
DR. PACKER: Oh, wow.
Just a couple of
comments.
One, the estimates on
efficacy are almost always more precise,
much more
precise, than the estimates on
safety. So you have
this very precise estimate on
efficacy. You have
this very imprecise estimate, in general,
on
safety.
And you try to sort of integrate them and
you have to now weigh them because it
could be that
the efficacy thing you are looking at is
really
important and the safety is sort of not
very
important. Or it could be the other way around, the
efficacy is sort of very small--the
efficacy is
small, but the safety is a big risk.
DR. SHAFER: That is exactly the question.
DR. PACKER: You might think that someone
in the world might be clever enough to create a
statistical model that would allow that
to take
place.
I am actually much more comfortable with
people doing that than statistical models
doing
that.
Somehow, people have the ability to
integrate all of this, especially a group
of people
have an ability to integrate this, much
better than
any mathematical model.
I would be very uncomfortable
if someone
were actually to propose a mathematical
model that
replaced the human, very important human,
element
here.
DR. WOOD: Dr. Farrar.
DR. FARRAR: Every example that I have
seen to date in looking at the risks in
overinterpreting data seem to go from
being a
positive study to a negative study. I wonder about
the other way around and whether there
are any
inherent differences in thinking about it
the other
way around, the bottom line being that if
you have
ten studies that show no safety issue
with a
well-measured process, whether you can
then say,
well, maybe the 11th study is going to
show it
somehow.
DR. PACKER: I think you need to find out
how much information there is in each
study, how
easily or how appropriate it is to
combine the data
across the studies to determine how
precise the
estimates are, after you have collected and
integrated
all of the data, and put that into a
judgement as
to how much data you actually need to be
confident
about the precision of the estimate.
So there isn't a uniform way of
thinking
about it.
It is not like you will know it when you
see it.
There is some guidance, some
mathematical
guidance, that needs to be incorporated
into the
thinking process.
DR. WOOD: Dr. Domanski.
DR. DOMANSKI:
You know, I am not nearly
as sophisticated, really, Milton, as you
are about
this sort of thing nor about some of the
people in
the room, but I am a little bit concerned
about
some of the examples. I will give you one. I
don't think ISIS 4 was a definitive trial
of
magnesium, because I know something about
that. We
did the MAGIC study which was a very
large study.
Like ISIS 4, it was negative, but
ISIS 4
was substantially different
methodologically in
terms of when that was given. I think that example
actually, to be honest, is fairly
misleading as a
result.
I think it is an example of a stopped
clock is right twice a day. But, yeah; it came out
right.
But I am worried if that is the
basis for
this--that kind of thing is the basis for
this
discussion across more of the landscape.
DR. PACKER: Let me emphasize,
Mike, that
I knew that if I picked one study and
gave you an
example of one study that I would be at
great risk
because everyone knows something about
these
studies more than what I know about these
studies
although some of the studies I actually
mentioned
were studies I was personally involved
with and
think that I know a little more about
them.
So I just wanted to--I would
not
overemphasize--and, in fact, one might
appropriately underemphasize--the
magnesium
example.
But the other examples, time and time and
time and time again. It is just like reaching
conclusions during a very early part of a
study
based on interim monitoring. When you have small
numbers of events, the estimates are very
imprecise
and may not reflect what happens at the
end of a
complete experiment. That is just a general
principle.
I take your point about ISIS 4
but the
number of examples here is just
overwhelming.
DR. WOOD: It is important, Milton, to
remember, we have replication for two of
these
drugs and these safety signals here. So it is not
just single studies.
Dr. Furberg.
DR. FURBERG: Milton, I think that was a
great presentation. I think, for balance, it would
be nice if you can have examples showing
the other
side, how trends in smaller studies were
confirmed
in definitive trials. And I know plenty of those.
DR. PACKER: Oh, yes.
DR. FURBERG: That was never discussed.
You are painting a dark picture saying
you can't
trust smaller studies. You are right. You never
know where you are going to end up and
you need to
be careful. But don't say that you can't rely on
those.
DR. WOOD: I was actually on the advisory
committee that turned down Vesnarinone,
that looked
at that study. There were lots of issues that came
up at that time that led us to do
that. So it
wasn't just that there was a study that
was
compelling and that people went with
that.
Dr. Nissen?
DR. PACKER: Curt, let me just say that--I
think your point is very, very
important. What I
have not done is shown many, many
examples of
interim monitoring in trials where the
early
results were reflective of the
endpoint. I have
not
shown a whole host, probably more than I could
think of, of all of the pilot trials
where the
initial trends encouraged someone to
pursue it and
that the second study was, in fact, very
confirmatory.
Let me just make my point
clear. It is
just not as reliable as we think it
is. It is not
that it is worthless. I do not want to say that.
If I have implied that, then I do not
want to imply
that.
I just want to say that the risk of error
early when you have small-number events
is much,
much greater than when you have a much
more precise
estimate at the end of the trial.
My plea here is that when you
don't know,
the best thing you can do is say, "I
don't know."
And that is my only plea.
DR. WOOD: Milt, when you have two trials
that replicate one another, with a
p-value of less
than 0.05, if that was an efficacy
endpoint, we
would approve on the basis of that;
correct?
DR. PACKER: That's right.
DR. WOOD: But you are telling us that,
when it is a safety endpoint, we should
not act on
that.
I think it is counterintuitive.
DR. PACKER: No, no, no.
DR. WOOD: Hang on.
That seems to me
counterintuitive. We have, for two of these drugs,
two randomized trials that replicate the
outcome.
In three of the four trials, the outcome
was
predefined, adjudicated and so on. That is about
as good as any drug that has been
approved on the
U.S. market that I can think of.
DR. PACKER: Let me just add one
dimension, Alastair, to the thinking
process and
that is that when you have a p less than
0.05 on
two trials, on the primary endpoint
because it is
efficacy, you have two trials that were
designed
for the endpoint and have fairly narrow
confidence
intervals and precise estimates.
That is not the same concept as
having a p
less than 0.05 on two imprecise estimates
which are
combined together.
DR. WOOD: No; I understand that very
well.
I think we all do. The issue here
is both
of the second trials--both of the second
trials--were designed to test the safety
issue that
was in the first trial even though they
were
efficacy studies. So it is not like they were just
two trials that fell on the ground from
Mars that
arrived with something. These were designed, at
least according to the sponsors, to check
for that
outcome.
So I think you are overselling
the point a
bit.
Let's move on. Dr. Jenkins?
DR. JENKINS: I found the presentation
very interesting and I wanted to probe a
little bit
further on the APPROVe study because that
is the
one that I think we were feeling very
comfortable
with the finding in APPROVe. Yet, I went back to
Merck's presentation, and their
prospective plan
was actually to combine three studies
that were
going to be placebo versus rofecoxib in
three
different populations.
Their plan was to have 25,000
patients to
evaluate the cardiovascular signal. Now, in
APPROVe, presumably, they had stopping
rules that
the Data Safety Monitoring Committee saw
an extreme
effect that met those criteria so they
stopped the
study.
But I am just interested in hearing your
thoughts about how should we interpret
APPROVe
where the stopping rule is met for an
individual
study when the prespecified plan was to
have three
studies combined for 25,000 patients.
DR. PACKER: Gee, I must say that I am
delighted to have everyone ask me the
hard
questions for this afternoon. I sort of think that
this is what this committee has to
do. I only
wanted to add a dimension to the thinking
process
here. I
don't come with any answers on how to put
all of the data together. All of the points on how
to synthesize these data, I am very
comfortable
with the human process of doing so as
long as the
human process incorporates an
understanding of how
difficult and imprecise this is and the
fact that,
in the past, although it has led to
predictions
that came true, it also led to
predictions that did
not come true.
DR. JENKINS: I think, more specifically,
the point I was trying to get you to
comment on is
not the overall interpretation of the
rofecoxib
data but the fact that there was a plan
for 25,000
patients in three studies. What I am trying to
understand is how should we, then,
interpret a
finding from one of those three studies
where an
interim analysis crossed the stopping
boundary and
met the criteria for stopping the
study. What
weight should we give to that finding in
that
single study?
DR. PACKER: I don't think there is a
precise answer to that. Any time you deviate from
your preplanned attack on the conduct of
analysis
of a trial, you weaken, to varying
degrees, the
precision of the estimate and the
confidence you
have in the data that you are looking at.
DR. WOOD: Dr. Nissen?
DR. NISSEN: Milt, there is an additional
subtlety here. Let me see if I can drill down with
you on it. What we have here is a class of drugs
where we have multiple trials within the
class. So
what we are asked to do is not
necessarily, in some
respects, for each individual drug, say,
well, do
we have replication or not.
But if we take the position
that this is a
class effect, then we have got four, or
perhaps,
five trials. This came up once before. It was
kind of controversial. I think you may have been
on the committee at the time when we had
the
angiotensin-receptor blockers for renal
protection.
What the two companies did with two
different drugs
is they stipulated that the other could
use the
data from the other company's trials as
supportive.
So the reason that this is
really much
harder is that we have a lot of trials
here. We
may not have reached all the evidence in
an
individual drug, but we have trials
across the
class of drugs. I wonder if you have any thoughts
about this because it is obviously a
difference
between studying a single agent and
studying a
class of agents.
DR. PACKER: I think that, Steve--I mean,
that is why the process works best when
there are
human beings involved in the thinking
process.
There is no predetermined sense that one
should
bring to the process--that you confine
the analysis
only to one drug. What you should allow yourself
to do is look at the data with one drug,
look at
the data with drugs that you think are
related.
If there are data that you
think are in a
drug that really isn't related, you might
want to
analyze that separately or do it both
ways to see
if it is consistent. There is no statistical
formula that can guide the very important
human
process here.
My major point is that the
precision that
most clinical investigators think exists
here isn't
as precise as we think it is. But that doesn't
mean that you--and Curt would emphasize
this--that
doesn't mean that you can't put together
your own
picture of the totality of the data and
bring to it
a sense of whether it reaches some
critical level
of concern.
In the absence of precision,
you have got
to do that. But don't forget inherently that the
data are imprecise.
DR. WOOD: Curt, do you want to say
something else? No.
Then let's move on. The next
speaker is Bob Temple who we are going to
confine
to his seat.
DR. TEMPLE: Alastair, I have a question.
What am I supposed to do about my
slides? Can
someone show them for me? I will delete many of
them.
DR. WOOD: Okay.
You can come up here if
you do it quickly.
DR. TEMPLE: I don't care where I'm from.
I really don't.
DR. WOOD: Then Kimberly will work the
slides for you.
DR. TEMPLE: Okay; if Kimberly will do
that.
Issues in Projecting Increased Risk of
Cardiovascular Events to the Exposed Population
DR. TEMPLE: I was not in any way trying
to address the main issues the committee
is
grappling with, which is about what to do about
these
drugs.
But it seems to me you can't help noticing
that there is some data we would all like
to have
that we don't have and that is what I was
trying to
address.
Obviously, the main thing we
are worried
about is the effect of the
COX-2-selective NSAIDs
on cardiovascular outcomes, notably
death, stroke
and heart attack. But we are particularly interested
in the single drug effects, whether they
are all
the same.
We are interested in whether we are
looking at true class effects of
differences.
We also can't help noticing
there is not a
lot of long-term data on the nonselective
NSAIDs
and, of course, as has been pointed out
repeatedly, some
of them are sort of selective anyway.
There is major interest in
possible
differences in the subpopulations that
might be at
different risks. I think there are mechanistic
considerations, how much of this is
really likely
to be platelets and could there be a
blood-pressure
effect.
The importance of that, to me, is that it
is not quite clear what to do about
platelet
effects, but, conceivably, you could
manage a
blood-pressure effect if that was a
problem.
There is a lot of importance
and interest
in the dose and dose interval. And it is important
to think about how long studies have to
be to
detect these things. Obviously, some of the trials
seem to have shown things in a matter of
seven or
eight months. There is some suggestion that some
of the effects need much longer to
detect.
Skip the next one.
With respect to cardiovascular
effects,
the main question is whether everything
is really
answered.
You know, there are lots of studies, as
Alastair was pointing out. They are not perfectly
consistent, maybe, but there are a number
of
studies with a number of drugs that seem
to be
showing the same thing.
I guess, to me, they don't seem
entirely
consistent. There are a number of possible reasons
for that.
One is that there really are differences
between drugs, or at least between
doses. Another
is that even the best controlled studies
sometimes
give different answers. Another is that small
effects are difficult to evaluate in
epidemiologic
and even controlled studies. Then the last is that
the effects may be
population-dependent. That has
been discussed.
So it does seem to me there is
more to
learn.
Skip the next. We all know
that. Platelet
effects.
One of the things that seems
important to
pin down and I don't think it has been
pinned down
yet is the possibility that blood
pressure is a
significant part of all this, that there is some
impression that Vioxx has bigger
blood-pressure
effects than the other drugs, but I don't
think
there is what we would call adequate data
on the
effects of all these.
By adequate data, I mean data that
gives
you information about the effect of drug
over the
entire dosing interval, that has pinned
down dose
response and that has pinned down the
effect of
different dosing intervals. There is an
impression, though, that these drugs can
reverse
the effect of other anti-hypertensives,
perhaps,
especially, ones that work through the
renal and
angiotensin system. They seem to have, at least
some of them, an effect on blood pressure
generally
and then there are isolated reports of
hypertension
in trials reported as adverse reactions,
clearly
more common in the treated groups.
I have a bunch of slides
showing that
elevated blood pressure is bad for
you. You can
deduce that from epidemiologic effects,
from a
mountain of clinical studies. The most recent
study that is of interest, which I will not
describe--keep going--in detail is a
study that
Steve Nissen knows about called CAMELOT
which you
can read as saying that a change in blood
pressure
of even 5 millimeters of mercury systolic
and 3
diastolic might have a reduction of about
33 percent in the kinds of events we are
talking
about in people whose diastolic pressure
is only
about 100.
That is not definitive. This is a subset
of the data and you can look at my slide
to see
what I did.
As I said, we don't know as
much about the
blood pressure as we should.
So a crucial question is in the
larger
assessment of cardiovascular effects;
what can we
really study more. My own view is that, given
VIGOR and fairly consistent epidemiologic
findings,
it would be difficult to study 50
milligrams of
rofecoxib. I doubt you could write a proper
informed consent.
I take Milton's concern to
heart but I
guess my own view is there is probably
enough
information about that. But what you could do with
respect to other things depends on what
you
believe.
Suppose you believe that the
cardiovascular risk of 200, 400, of
celecoxib is
not entirely clear. One polyp study says yes and
other studies are not so clear. And you believe,
also, that a class effect is uncertain or,
more
particularly, that the effect might not
apply to
certain doses and certain dose intervals
even if
you are inclined to believe that the
class does
have a problem.
If you also believe that more
needs to be
known about the long-term use of all
NSAIDs,
including those that are nominally
COX-2-selective
and those that are not, if you believe
that new
COX-2-selective agents conceivably could
be
developed with appropriate information,
and if you
believe the pharmacology gives hypotheses
that need
to be tested, not necessarily just
believed--sorry
Garret--then here is what you might be
able to do.
Again, I am not, in any way,
saying who
should do this. This will be a massive
undertaking. But it does seem to me that there is
information we all collectively need as a
community. So I am calling it an ALLHAT study for
anti-inflammatory drugs.
This is just one of what people
could
dream up as what might be compared. The drugs, it
seems to me, one might think about
putting in it
include ibuprofen, which we think
probably ought to
be neutral, not bad. It may not have the platelet
effects you want. Naproxen--I am embarrassed to
say
this but I am letting myself be affected by the
epidemiology studies. Naproxen sort of looks good.
You might even say it is at least a
placebo, but I
am not quite ready to say that.
Diclofenac seems a good model of
a regular
NSAID that is really COX-2-selective, at
least to a
degree.
Celecoxib possibly at more than one dose,
although, maybe for caution, one would
want to
think about the lower dose first. Then I have two
other groups that I will be interested in
people's
comments on, and I am not totally sure
you could
bring these off.
But could one include a full-dose aspirin group (we know it is an effective agent in
arthritis) accompanied by a proton pump inhibitor?
Now, you would have to first show that
proton pump
inhibitors really do block the
ulcerogenic effects
of aspirin. That is a short-term study and maybe
one could do that. So I will be interested in
whether people think you can bring that
off.
The reason for doing it is we
know the
effects of aspirin are not unfavorable
and we think
they are probably favorable in at least
many
populations, in populations at high risk
and
probably not unfavorable in people at low
risk.
The last one that seems worth
considering,
and my understanding is that, in many
parts of the
world, at least osteoarthritis is treated
this way,
to use acetaminophen plus codeine added
as needed
and try to do something about the
constipation.
That would be as close to a
true placebo
group as I think you can get in a setting
like
this.
So it seems quite interesting.
It is worth saying if one had a
new single
agent, my suggestion, and one still
thought that
drugs like this should be developed, that
the
single agent might be compared to naproxen and
I
would still hope for one of the other
last two
comparisons as a true placebo.
Obviously, these are all people
who need
chronic pain medications. You would want O.A. and
R.A. stratified. I don't believe you could use the
APAP group for rheumatoid arthritis but
others may
not agree with that. You probably want to study a
range of cardiovascular risks but you
probably
would want to study the lower-risk people
first.
The reason I say that is anyone
with known
coronary-artery disease really has to be
given
aspirin just because that is part of
treatment and
it isn't clear yet, to me, how aspirin
interacts
with the COX-2-selective drugs. You would think it
would make them unselective but the data
don't seem
to necessarily say that.
A good question is how big the
sample
would have to be and that depends on what
you want
to find out. If you are really trying to compare
the drugs with a true placebo, they
wouldn't have
to be that large to rule out, say, a
two-fold risk
or
something like that. We have seen
studies with
about 1,000 per group that have
distinguished
between drugs. So that is not so huge.
But if you really wanted to get
at whether
one drug is a little bit different from
another,
you are talking about studies of massive
kind. I
have asked various numerically qualified
people and
the general impression is that if you
wanted to
rule out a 20 or 30 percent difference,
you are
talking about 50,000 per group. That is beyond my
hopes even for ALLHAT 2.
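Illustrative sketch (not part of the spoken record): the contrast between roughly 1,000 and roughly 50,000 patients per group follows from how the required number of events grows as the margin to be ruled out shrinks. The Python below uses Schoenfeld's approximation for a two-arm trial with 1:1 allocation. The one-sided alpha of 0.025, the 90 percent power, a true hazard ratio of 1, and the 2 percent cumulative event probability are assumptions chosen for illustration, not figures from the meeting.

```python
import math
from statistics import NormalDist


def events_needed(margin, alpha=0.025, power=0.90):
    """Total events (both arms) needed to rule out a hazard ratio of `margin`.

    Schoenfeld approximation, 1:1 allocation, assuming the true hazard
    ratio is 1. alpha is one-sided; both defaults are assumptions.
    """
    z = NormalDist().inv_cdf(1 - alpha) + NormalDist().inv_cdf(power)
    return 4 * z ** 2 / math.log(margin) ** 2


def patients_per_group(margin, event_prob, alpha=0.025, power=0.90):
    """Patients per arm if each has probability `event_prob` of an event."""
    return math.ceil(events_needed(margin, alpha, power) / (2 * event_prob))


if __name__ == "__main__":
    # Hypothetical 2 percent cumulative event probability over the trial.
    for margin in (2.0, 1.3, 1.2):
        print(margin, round(events_needed(margin)),
              patients_per_group(margin, event_prob=0.02))
```

Because the event count scales with 1/(ln margin)^2, tightening the margin from 2.0 to 1.2 multiplies the required events by about 14 under these assumptions. The absolute sample sizes depend entirely on the assumed event probability.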
Obviously, the outcomes of
major interest
are cardiovascular death, stroke, AMI and
bleeding.
I have heard some thoughts that maybe
heart failure
should be looked at in addition but I
wouldn't make
that the primary endpoint. I think you can look at
that separately.
A big problem is what to do
about blood
pressure.
My first thought was that you would
monitor it and treat anything over 120
over 80, but
that really isn't standard practice. So a question
I would raise is whether one could leave people to
go to 130 over 90. Would that be acceptable?
A question one could raise is
why do this
at all?
Do you really need these drugs?
We have
heard fairly strong feelings that G.I.
intolerance
is not trivial. But my answer is more that we
really don't know enough about the whole
range of
these drugs. There is no question that people are
going to get something for their
arthritis. I am
not entirely comfortable with looking at
the data
and saying we know what we need to.
You could sort of deduce that
naproxen
usually looks pretty good. It usually beats what
is there except we just heard about a
study where
it was a little worse. But it is not clear where
ibuprofen comes. It doesn't show the same thing.
It seems to me there is a serious
population need
to find out about these things and to
understand
more whether all selectivity is the same.
We have been through diclofenac
at length
and it is not clear what one needs. So I think the
idea of doing a large study has weight.
If you believe that it is
really all
settled, that cardiovascular risk is
clearly
increased with all of the COX-2-selective
agents,
ignoring for now which ones are actually
selective,
there still are things one might want to
know.
It might be of interest to do a
study that
still would have the ibuprofen and
naproxen groups
and might still have my aspirin or APAP
groups.
One might consider trying a celecoxib
with the
addition of aspirin. I know the results of that
have not shown that any adverse effect
seems to be
mitigated, but that still doesn't make
much sense
and it might be something one could still
want to
test.
It would seem that if you added aspirin to a
selective agent, you ought to have a de
facto
unselective agent. Of course, that presumes
mechanism and you shouldn't presume
mechanism. You
should test it.
Anyway, those are my thoughts. I think my
main point is that there is really a very
important
need for better information on the whole
array of
these drugs and the kind of study needed
to do that
is mind-bogglingly large. However, people are
already undertaking studies with 25,000 and 30,000
patients. So it is not as outlandish as I
would have said it was before we started
this
process.
Thank you.
DR. WOOD: Okay.
I am just interested,
why didn't you suggest a PPI with
naproxen? For
your ALLHAT study, why didn't you suggest
a PPI
with naproxen?
DR. TEMPLE: That is a fair question. I
think the answer on--what did I suggest
it with?
DR. WOOD: With aspirin.
It doesn't
matter.
DR. TEMPLE: I will tell you the reason.
Full-dose aspirin is just plainly
impossible to use
because of massive G.I. intolerance. I believe,
historically based, it is worse than we
expect with
naproxen.
So I thought you had to do it there
urgently.
You could do it with naproxen, too.
That would be okay.
I have to point out that we do not have
definitive labeling or evidence that
those drugs
really do prevent this but we have heard
about some
studies that suggest it. I do think that is an
early thing to discover.
DR. WOOD: Okay.
Understood. Let's move
straight on to Bob O'Neill's presentation
who also
is going to do it from his seat.
Issues in Projecting Increased Risk of Cardiovascular Events
to the Exposed Population
DR. O'NEILL: I won't go through the
slides.
I might point your attention to a few of
them.
I will try and do this in five or ten
minutes.
DR. WOOD: Do you want us to have the
slides up, Bob?
DR. O'NEILL: What I was asked to do is
essentially provide a framework. This is a very
difficult problem of projecting risk to
the
population. Very little has been published about
how to do this appropriately, so I was
intending to go
through sort of the logic and the
framework of how
you might think about this.
It requires the integration of
exposure
data at the national population level and
it needs
information relative to how long people
are on
drugs and it uses information from the
clinical
trials as well as from the epidemiology
studies to
the extent that they are relevant to the
question
that is being asked.
This is a very difficult
problem. It was
not intended to give any estimate, any
single
number.
It was intended to show how hard it is to
get there and, at the end of the day, how
variable
and sensitive the estimate might be to
all the
assumptions you have to make.
So I used the Vioxx VIGOR and
APPROVe
studies as an example of the process that
one might
go through. I made the point that event
definitions and many things matter. But I guess if
there is anything that I would like
people to take
home is that time matters. The hazard rate
matters.
And the hazard ratio matters as a
function of time when you do any of these
projections.
I would just recall two
slides. One would
be the VIGOR study which is Slide 12 so
that
everybody could remind themselves and
Slide 16.
The VIGOR study shows a separation of
curves.
Behind that is what is called a hazard
rate. I
believe the data supports that the
escalation of
the risk increases with duration of
exposure.
Merck and we have talked about this in
the past and
sort of have different views of this, but
we seem
to feel that that risk does escalate.
That does not mean that there
is no risk
in that picture early on. I think David Graham has
made this point that it may be a power
issue but,
nonetheless, it is what it is and I am
not
convinced that the epidemiological
studies at this
stage add anything to our knowledge about early
risk for the points I made yesterday
because I
think time zero matters in terms of
looking at the
risk, in terms of how long you are on.
The next slide is Slide 16 which
is the
APPROVe study. Similar pattern, only delayed a
year.
So instead of the curve separating at
approximately six months, four months,
they
separate a little later on. The idea here is that
the summary relative risk for each of these trials
(for VIGOR, for thrombotic events, approximately 2.28 and,
for APPROVe, approximately 1.92 for confirmed thrombotic
events) is an average relative risk, averaged over
all the time points, so that the relative risk at
different times is a function of time.
That is an important concept
when, then,
you go and you look at the national
projection of
how many people are exposed for how long
a period
of time.
I won't go through that because they are
in the slides. But we have no data in the United
States to do this. So we did a projection based
upon the IMS National Prescription data,
another
separate database that allowed us to look
at how
long exposure, successive exposures,
might be to
get an idea of how long individuals may
stay on the
drug.
Surprisingly enough, a very small
percentage of the millions of people that
are
prescribed the drug are on the drug for
more than a
year.
That is in one of the slides on the
Caremark.
So what this meant is you multiply all
these estimates which, essentially, are
time. We
calculated a time-specific difference in
absolute
incidence rates for the different trials,
made a
projection and essentially used in that
projection
a number of assumptions many of which are
not
verifiable, and then came up with some
crude
estimate of what might even be an upper
bound on a
confidence interval for any estimate.
We probably don't believe it
because there
is no real methodology to support that
estimate but
nonetheless to say that an estimate is
very
variable.
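Illustrative sketch (not part of the spoken record): the framework described above amounts to multiplying, for each duration-of-exposure interval, the person-time the exposed population spends in that interval by the time-specific difference in incidence rates, then summing. The Python below is a minimal version of that arithmetic. Every number in it is hypothetical, chosen only to show how strongly the total depends on the assumed exposure pattern, and none comes from VIGOR, APPROVe, or the prescription databases.

```python
def projected_excess_events(person_years, rate_differences):
    """Sum of person-time times excess incidence rate over exposure intervals.

    person_years[i]     : person-years the population spends in interval i
    rate_differences[i] : drug-minus-comparator event rate per person-year
                          in interval i (time-specific, not an overall average)
    """
    if len(person_years) != len(rate_differences):
        raise ValueError("one rate difference is needed per interval")
    return sum(py * rd for py, rd in zip(person_years, rate_differences))


# Hypothetical intervals: 0-6 months, 6-12 months, 12-24 months on drug.
rate_diff = [0.000, 0.004, 0.010]   # assumed excess events per person-year

# Two hypothetical exposure patterns with the same total person-time.
mostly_short_use = [800_000, 150_000, 50_000]
more_long_use = [400_000, 300_000, 300_000]

print(projected_excess_events(mostly_short_use, rate_diff))
print(projected_excess_events(more_long_use, rate_diff))
```

Under these made-up inputs the two patterns have identical total person-time yet differ almost four-fold in projected excess events, which is the kind of sensitivity to assumptions being described.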
So the bottom line, and the
conclusions
here, given the time frame, is that
the purpose of the
projection effort was essentially just to
provide--this is the last slide; it is
Slide 47--it
is essentially to provide a framework for
considering how you would think about
developing an
estimate and to provide a range of
estimates and,
also, essentially, to point out that
there are many
limitations to any estimate that you
would provide.
We are not supporting any, or
putting
forward any, one estimate but I do believe that we
need to understand this problem by moving
away from
summarizing nonproportional hazards in
person
years.
It is not a good idea. It begs
the
question as to whether the risk is
constant or
whether the risk is dependent on time.
If there is one problem with
the
epidemiological literature, it constantly
reports
person-year risk as opposed to every one
of the
clinical trials we have seen presents a
Kaplan-Meier curve that looks at the
time-dependent
risk.
Unless you understand that, you can't come
to grips with comparing one drug to
another.
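Illustrative sketch (not part of the spoken record): when the hazard changes with time on drug, a single events-per-person-year figure depends on how long people were followed. The Python below computes the crude person-year rate ratio for one hypothetical drug under two follow-up lengths, using piecewise-constant hazards. The hazard values and durations are assumptions for illustration only.

```python
import math


def crude_rate(pieces, follow_up):
    """Expected events per person-year under piecewise-constant hazards.

    pieces    : list of (duration_years, hazard_per_year), in time order
    follow_up : years each person is followed (capped at the total duration)
    """
    if follow_up <= 0:
        raise ValueError("follow_up must be positive")
    surviving = 1.0      # fraction still event-free at the start of a piece
    events = 0.0         # expected first events per person
    person_time = 0.0    # expected event-free years per person
    remaining = follow_up
    for duration, hazard in pieces:
        d = min(duration, remaining)
        if d <= 0:
            break
        p_event = 1.0 - math.exp(-hazard * d)
        events += surviving * p_event
        person_time += surviving * (p_event / hazard if hazard > 0 else d)
        surviving *= 1.0 - p_event
        remaining -= d
    return events / person_time


# Hypothetical hazards: comparator constant; drug equal in year 1, tripled after.
comparator = [(3.0, 0.01)]
drug = [(1.0, 0.01), (2.0, 0.03)]

for years in (0.5, 3.0):
    ratio = crude_rate(drug, years) / crude_rate(comparator, years)
    print(years, round(ratio, 2))
```

With these assumed hazards the same drug shows a person-year rate ratio of about 1.0 when everyone is followed for six months and about 2.3 when everyone is followed for three years, so the summary figure alone cannot be compared across studies of different duration.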
You can't come to grips with
comparing a
drug to itself. If you look at the VIGOR study
relative to the APPROVe study, they are
in
different populations. One is in a population of
R.A.
The other is in a polyp prevention trial.
One
is at 50 milligrams. The other is at 25
milligrams.
There are many things that need
to be
sorted out. So the point here is that this is a
very difficult exercise to project. This was just
a framework to say, here is how you might
think
about it.
Most of the estimates are fraught with a
lot of danger and have to have many
caveats placed
on them were you to bank on any one
estimate alone.
That is pretty much my bottom
line.
DR. WOOD: Bob, just to make sure
everybody in the audience understands
what you are
talking about with estimates, what you
are talking
about are absolute numbers of people--
DR. O'NEILL: An estimate of the absolute
numbers of individuals that might have
been at risk
and had these events if they were
exposed--if they
were exposed. This is a model projection.
DR. WOOD: Right.
I just wanted to
clarify that. So it is not the relative risk. It
is not the same as what Milt was talking
about.
DR. O'NEILL: Right.
Exactly. This is a
long discussion to get into the concept
of
attributable risk in its own right. Given the
time, I wouldn't be able to do that.
DR. WOOD: So you are talking about the
number of people, these sort of numbers
that are
out there.
DR. O'NEILL: Right; to go through
that
exercise.
It is hard enough to interpret a single
study or a collection of studies. To go to an
estimate of what the increased number of
events
might be at the exposed level is what
this effort
was about, all the different, five
different
separate interlinked but disparate
databases that
you would need to get there to make this
kind of an
estimate.
DR. WOOD: Okay.
Good. Thanks.
DR. WOOD: We will take a few minutes, a
very few minutes, for questions to the
last two
speakers and then we will take a break
and be back.
So the panel needs to remember that they
are eating
into their break.
Dr. Nissen?
DR. NISSEN: Quickly, Bob, Bob Temple.
The difficulty, of course, in the ALLHAT
study is
that it is very--it seems unlikely that
it will get
done.
So the question is, putting some constraints
on this, and I thought about this last
night in
some detail into the wee hours of the
morning, it
seems to me that what we really need for
this class
of drugs is a reference standard. That reference
standard, unlike many studies, can't be
placebo
because you can't treat arthritis
patients with
placebo.
So I would submit to you that, if you are
going to do comparisons, that the
reference
standard, the best reference standard we
have, is
naproxen because we know as much about it
as
anything else. We think it is, at worst, neutral
and maybe a little better than neutral.
So I would argue that, if you
want to do
ALLHAT light, then what you do is you
test every
agent both that stay on the market and
that are
proposed to bring onto the market against
naproxen
with an adequately sized trial and you
set an upper
bound, which we have to talk about, about
what the
upper bound of hazard you are willing to
accept is,
and the test that you run is on efficacy
and on
cardiovascular hazard.
If your drug is beaten by
naproxen, you
don't make it. If you can show equivalence within
a reasonable upper bound of naproxen,
then we would
be pretty comfortable--I think I would be
pretty
comfortable that the drug is not going to
create a
hazard.
What do you think about that
strategy?
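Illustrative sketch (not part of the spoken record): the strategy proposed here is a noninferiority comparison against naproxen, judged by where the confidence interval for the cardiovascular event rate ratio falls relative to a pre-set upper bound. The Python below shows one common way such a rule could be written. The margin of 1.3, the 95 percent interval, the large-sample approximation for the log rate ratio, and the event counts in the example calls are all assumptions, since the discussion leaves the bound open.

```python
import math
from statistics import NormalDist


def compare_to_reference(events_new, years_new, events_ref, years_ref,
                         margin=1.3, level=0.95):
    """Classify a new agent against a reference by its rate-ratio interval."""
    ratio = (events_new / years_new) / (events_ref / years_ref)
    se = math.sqrt(1 / events_new + 1 / events_ref)   # SE of log rate ratio
    z = NormalDist().inv_cdf(0.5 + level / 2)
    lower = ratio * math.exp(-z * se)
    upper = ratio * math.exp(z * se)
    if lower > 1.0:
        verdict = "worse than reference"   # interval excludes equality
    elif upper < margin:
        verdict = "within margin"          # excess hazard ruled out up to margin
    else:
        verdict = "inconclusive"           # too few events to decide
    return verdict, lower, upper


# Hypothetical trials, 10,000 person-years per arm in each.
print(compare_to_reference(200, 10_000, 200, 10_000))
print(compare_to_reference(100, 10_000, 100, 10_000))
print(compare_to_reference(150, 10_000, 100, 10_000))
```

In these made-up examples, identical observed rates give "within margin" only when there are enough events (200 per arm but not 100 per arm), which is why the choice of upper bound drives the size of the trial.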
DR. TEMPLE: That is actually--I went
through it very fast, but that is
actually what I
said at the bottom of one slide. I still would
like to know better whether the naproxen
is less
bad or is really good. Therefore, as I said on the
slide, in my heart, I would like to see
somebody
try to give full-dose aspirin for a while
because
we are really pretty sure that won't be
bad.
I think the community, in the
long run,
needs that. Who is going to do it? That is a
perfectly good question. I do want to point out,
though, that the way some of the trials
were done,
like TARGET, they could have given
answers on some
of this, or at least closer. But, because they did
separate trials, instead of randomizing
to each of
the treatments, that was obscured.
You could have had a very
substantial
naproxen-ibuprofen comparison, but you
didn't get
it because of the structure of the
trials. So I
think it is very important to randomize
to each of
the treatments, obviously, whatever it
is. But
that would be my best guess at the
moment. But, in
line with what Alastair asked before,
when you do
naproxen and you are looking at G.I.
effects, do
you add a proton pump inhibitor? I think you need
a little more information before you do
that, but
you might say that, which then raises the
fundamental question of how much help you
get from
being COX-2-selective.
DR. WOOD: Dr. Cryer?
DR. CRYER: I wanted to comment on several
of the questions, Dr. Temple, that you
raised as
well as to ask a question. I guess I will just ask
the question first. When you say "full-dose
aspirin," are you referring to full
anti-inflammatory doses of aspirin, 3.9
grams a day
or--okay.
DR. TEMPLE: Which I assume most people
will not tolerate and there will be huge
bleeding.
So you have got to do something.
DR. CRYER: Right.
See, I think that is a
non-practical experiment design and I
think we have
come a long way from 3.9 grams of aspirin
per day,
particularly because of the concerns of
the adverse
events, the salicylism, the G.I.
events. Clearly,
100 percent of those people are going to have
gastric ulcerations assessed
endoscopically.
So I also would prefer one of
the newer
NSAIDs, traditional NSAIDs, in that
comparison.
With regard to--
DR. TEMPLE: Actually, before you leave
that, do you know what would happen if
you added a
proton pump inhibitor to aspirin?
DR. CRYER: Not at 3.9 grams a day. I
don't think anybody thought that would be
a
feasible design.
DR. TEMPLE: Short term, then, just to
look at endoscopic ulcers.
DR. CRYER: I don't know and I don't think
that it will ever be known.
DR. TEMPLE: Then I won't get the answer.
DR. CRYER: What I do know is that, if you
give 3.9 grams of aspirin per day in the
short-term, greater than 90 percent of
your
patients who take aspirin will have
endoscopic
ulceration. I don't know what the effect of the
PPI would be.
I wanted to address your last
kind of
question that you threw out there of
whether or not
a short-term study would show that
celecoxib plus
80 milligrams of aspirin would have a
favorable
effect, a G.I. effect, compared to a
non-selective
NSAID.
Those experiments have been done.
With respect to endoscopic
ulcer, COX-2
plus aspirin equals traditional
NSAID. With regard
to hospitalizations, having said that,
there is a
recent study not yet published,
epidemiologic study
from Canada, indicating that COX-2 plus
aspirin,
hospitalizations for that are less than
hospitalizations for non-selective NSAIDs
plus
aspirin.
Then we have outcome studies not yet
fully published in the abstract form
which indicate
that events on COX-2 plus aspirin are
similar to
events on non-selective NSAID plus
aspirin--G.I.
events.
DR. TEMPLE: It is possible that if you
add aspirin--I mean, it is sort of what I
would
expect--is that you would get something
that is a
lot closer to being--in a cardiovascular
sense, a
lot closer to being just a regular NSAID
and maybe
you would still have some residual
advantage in a
G.I. sense.
But, I must say, the data so
far don't
show that. But they didn't seem definitive to me.
It raises the question of--you
know, the
idea of COX-2 selectivity is, at least,
in part, a
conceptual and promotional idea. As Garret pointed
out the first day, five or six of those
old drugs
that aren't coxibs are
COX-2-selective. So there
is a whole range. My feeling is we need to
understand the consequences of what all
that means
and there is a somewhat artificial
separation
between the coxibs and the others because
those old
drugs at least are partially selective and
may have
some of the same properties.
So one of my hopes is that we
could look at a
range of these.
DR. CRYER: With respect to your last
comment, I am entirely in agreement with
that.
DR. WOOD: Let's move on. Dr. Cush?
DR. CUSH: ALLHAT, I like the intention of
it.
I would suggest, though, that if you are going
to have a study long enough to pick up
some of
these events, a year or two, it is going
to be
very, very hard to keep O.A. patients on
one of
those drugs.
So maybe actually stratifying
according to
pure COX-2-specific drugs to
COX-2-selective drugs
to the non-selective drugs that are more
predominantly COX-1 and then having a
totally
nonsteroidal, non-nonsteroidal group,
which would
be the Tylenol group you talked to or
other
analgesic agents might work over the long
term.
DR. TEMPLE: That would answer a lot of
the questions. My real hope--you have a better
idea whether it is possible than I do--is
that you
could actually find a population that
could be
given what we are pretty sure is a
cardiovascular-neutral treatment. That is really
the only way to pin this down and it does
seem
worth pinning down.
DR. WOOD: Dr. Hennekens?
DR. HENNEKENS: I think I gleaned from Dr.
O'Neill that if we determine there is a
class
effect that it varies not just by drug
and dose but
by duration of therapy. From Dr. Temple, the
comment that--I am very attracted to the
concept of
what I would call a large simple trial
rather than
an ALLHAT trial. I think there is merit in seeing
aspirin studied in therapeutic doses and I think
there is evidence that anti-inflammatory
effects
are seen at doses far lower than the 3.9
grams.
But the question I have for Bob
is there
are three currently marketed FDA-approved
coxibs.
So would you include valdecoxib and 25
milligrams
of rofecoxib in your design?
DR. TEMPLE: Part of the reason I didn't
address that is I figured that is what
the
committee is going to talk about. I was willing to
say that the celecoxib data look funny
enough so
that you might consider it.
DR. WOOD: That is part of what we are
going to discuss.
DR. TEMPLE: That is what you are going to
discuss so I didn't address it.
DR. WOOD: Let's move that to later. Dr.
Domanski?
DR. DOMANSKI: I will pass.
DR. WOOD: Dr. Abramson?
DR. ABRAMSON: Thank you.
I want to
probably say something rather naive in
support of
the study, Bob, and that is that we are
at a moment
where we can do a paradigm shift, meaning
that
study that you propose is an important
one but it
is very large and it is going to be very
hard to
get any resources to do that.
I think we are at a moment
where for the
companies and the FDA and the government
to think
about a collaborative study where, if you
have a
drug that has some--this information is
important,
that we put together a collaboration
among industry
to do a multi-arm study of multiple
drugs. It is
something, you know, in the
osteoarthritis field,
the companies have supported largely this
osteoarthritis initiative through the NIH
to look
at outcomes in large numbers of patients.
I think what we need is a similar
COX-2
initiative where either with the FDA or
the NIH
participating, with collaboration among
industry,
we are doing a multi-armed large study
with
biomarkers, with pharmacogenomics
studies, with
genetics and other blood pressure, but
try and do
it in a utopian way.
I think everyone here wants to
get the
right answer, whether it is in industry
or here at
the table. This could be a good opportunity to do
something very differently than we have
done before
in a large trial.
DR. TEMPLE: I don't disagree at all. I
mean, some of the drugs are generic. They don't
have any company that is massively interested
in
them.
So it is going to be a mixture of
government, generosity and a wide variety
of other
things that are scarce. So I don't know how
to--you noticed I didn't have a slide on
how to do
this.
DR. WOOD: Dr. Ilowite?
DR. ILOWITE: Just a minor point. I
understand the need for a
cardiovascular-neutral
anti-inflammatory drug in an ALLHAT study. But I
was a little confused because I am aware
of some
literature directed at people who are
interested in
Kawasaki disease suggesting that
high-dose
anti-inflammatory aspirin is actually
prothrombotic
because of differential effects on
prostacyclin and thromboxane.
DR. TEMPLE: There are aspirin studies
going back to at least moderate doses
that show
beneficial effects. It is not just 80 milligrams.
It is certainly at least a gram a
day. Some of the
early ones were more than that. That is worth
thinking about. I am encouraged by the thought
that you might be able to get away with
doses less
than 3 grams. So I didn't know that it was
considered prothrombotic. I thought aspirin always
looked good. But that is not up to grams. I don't
think any of the studies have done
anything like
that.
DR. WOOD: We will give Dr. Fleming the
last word.
DR. FLEMING: I am just debating whether
to do it now or after the break.
DR. WOOD: Let me help you. Go ahead.
DR. FLEMING: Now?
DR. WOOD: After the break will be great.
DR. FLEMING: All right.
I will wait.
DR. WOOD: We will take a break and then
we will be back here in ten minutes.
(Break.)
DR. WOOD: Okay, folks. Let's get started.
The next presentation will be given by
Sharon Hertz who is Deputy Director of
the
Division.
DR. HERTZ: Thank you.
I am just going to
spend a very few minutes summarizing some
of our--
DR. WOOD: Let me, in fact, just before
Sharon begins--Sharon Hertz has passed
out a
handout that includes a lot of her
slides. In the
interest of time, she has graciously
agreed to
delete some of these slides and just
focus on a
smaller subset of what is in the handout.
However, the committee does
have the
handout and the committee may find that
handout
useful for referring to some of the data.
DR. HENNEKENS: Alastair, a quick comment.
I want to make a quick clarification on
the earlier
comment about prothrombotic effects of
high
doses of aspirin.
DR. WOOD: Sorry; I missed that. About
what?
DR. HENNEKENS: In the randomized trials,
135 randomized trials with over 212,000
randomized
subjects, whether the doses of aspirin
are 75
milligrams or up to 2 grams a day, there
are
significant cardiovascular benefits to
aspirin even
at high doses. The issue, as Bob pointed out, at
the high doses, is not that there is a
reversal of
the benefit but that the side effects are
increased.
So I think that is an important
point to
make.
DR. ILOWITE: I just wanted to say that in
pediatrics, we think of anti-inflammatory
doses as
100 milligrams per kilogram. So those are the
doses I was speaking of.
DR. GIBOFSKY: Finally, the high-dose
aspirin that would be necessary to treat
patients
with rheumatoid arthritis of 3.9 grams or
greater
would have significant problems on the
stomach, as
Dr. Cryer said, significant problems on
the hearing
of the patient and significant problems,
perhaps,
on other organ systems as well. It is not a study
that could be easily undertaken.
DR. HENNEKENS: I won't debate the value
of the study of 3.9 grams of aspirin but,
from the
perspective of anti-inflammatory effects,
they have
been observed at doses of 2 grams of
aspirin a day
and, in fact, there are randomized
studies going on
directly comparing that somewhat higher
doses of
maybe 1 to 1-and-a-half grams a day might
have
significant anti-inflammatory as well as
anti-atherogenic effects as measured by
endothelial
function, nitric oxide formation and
other
parameters.
So I don't think that the
traditionally
high doses are the ones that necessarily
would need
to be done. But I don't want to debate whether we
should be studying doses of 4 grams of
aspirin.
DR. WOOD: What you are telling us,
Charlie, is that you are comfortable that
there is
an antithrombotic effect at the high
doses of
aspirin.
Is that right? Okay. Good.
Dr. Cush wants to say
something.
DR. CUSH: Again, you need not
anti-inflammatory doses but analgesic doses
which
can be substantially lower. I do want to make a
statement with regard to a study that
wasn't
presented here that I think is germane
and we
should know about it, and this is
quick. There is
a very large trial that is NIH supported
that is
called the GAIT study, glucosamine in
osteoarthritis of the knee.
This is a 1,588-patient study that is completed and
is currently being analyzed. The Data Safety
Monitoring Board of the study has
analyzed it for
cardiovascular risk because there is a
Celebrex
arm.
There are five arms in this 1500-patient
study; placebo, Celebrex 200 milligrams
once a day,
glucosamine only, chondroitin sulfate
only, and
glucosamine and chondroitin sulfate.
The outcome here, in a
six-month trial, is
pain reduction in osteoarthritis in the
knee.
Because of all this press and what not,
they have
looked at the safety outcomes and they
have not
shown any increase in cardiovascular
events
including M.I., any difference between
the Celebrex
group and the other four control groups.
DR. WOOD: Let's move on to the program.
Dr. Hertz?
Summary of Meeting Presentations
DR. HERTZ: There are now several versions
of my slides around and you are free to
look at
whichever interests you. There is one correction
on the lumiracoxib slides from the
original set
where I substituted the word diclofenac
for
ibuprofen. So those of you looking at those slides
just be aware of that, please.
What I am really just going to
do now is
just focus down again on some of the reasons
why we
are here.
This would not be the current slide set.
Any help here?
Looking at the most recent set
that were
handed out, and we will just work from
there
because there is not a lot of data
anymore to
present, but, basically, I want to just
point out
that we are here because we do recognize
that pain
drugs are critically important, that the
COX-2-selective NSAIDs have been
extensively
studied and there are, over time, studies
that
revealed new potential uses as well as
new risks.
We need to determine how we feel
about
these risks. Are they limited to individual
products?
Are they applicable across the group of
COX-2 selectives and how far does this
extend to
the nonselective anti-inflammatories?
There is a slide that
describes--
DR. WOOD: Sharon, apparently everybody
has hard copies of your slides.
DR. HERTZ: Right.
DR. WOOD: So if you want to just go
through them and refer to the slide
number, that
would probably be helpful to people.
DR. HERTZ: Okay.
If we go to the third
slide, you can get a sense of the sizes
of the
databases that were presented in the
individual
reviewer descriptions of FDA reviews.
A couple of points. The numbers there
reflect predominantly patients on the
drug of
interest as opposed to the entire
database. The
outcome studies are more reflective of
the entire
populations including comparators. These drugs
were assessed and have been assessed over
time in
fairly large numbers of patients.
I think it is useful to note
that we have
not
approved, in this country, all of the
COX-2-selective NSAIDs that have come to
us in
applications for a variety of
reasons. Some of
these may be related to
cardiovascular-risk
assessment. Some may be related to
non-cardiovascular-risk assessment which
we really
haven't gotten into in this setting.
In addition, you may also note
that
parecoxib has not yet been approved in
this country
although it has been approved
elsewhere. So I
think that we have a lot of issues to
consider with
these products.
When we reviewed the studies
that have
been presented, we see that there is some
increased
risk for cardiovascular events but one of
the key
issues here is that the results are not
consistent
across studies and across
situations. We also have
seen that there is risk that is being
associated
with some of the nonselective products.
So we have a story of
conflicting data. I
am up to Slide 5. We have data that has been
present across short- and long-term
studies, the
epidemiologic studies. The challenge is to compare
across populations, across
comparators. It is
striking that sometimes very similar
study designs
have very different results.
It is possible there is more
than one
mechanism. Again, the data has been inconsistent
with the NSAIDs. We also have conflicting
information coming back on what occurs in
the
context of concurrent aspirin use. It is really
unclear if aspirin use has a truly
meaningful
effect on whether there is any G.I.
benefit of the
COX-2-selective products. That has not been clear
either.
I have been asked to point out
that, in
addition, time to onset of risk is
something that
we need to consider very importantly,
too, which,
again, is something that is evident when
we look at
the study data and important in our
deliberations
for this.
So, in spite of this conflicting
data and
the many questions, we have to move
forward. We
have to determine what the role of
approved
products are on the market today, what
additional
studies are necessary, what studies would
be most
helpful.
I am going to summarize and
combine some
of the questions that we have posed. These are
questions we dearly would like input from
the
committee. To start, if we think about the first
three questions, does the available data
support a
conclusion that celecoxib, rofecoxib and
valdecoxib
significantly increase the risk of
cardiovascular
events.
Does the overall risk-versus-benefit
profile for each of these support
marketing in the
U.S.
If yes, in whom? And which of the
potential
benefits of celecoxib or the others
outweigh the
potential risks and what actions would
you
recommend that we consider implementing
to ensure
safe use?
I think it is also important to
understand
that some of these answers are going to
depend on
if we think that this is a fairly uniform
class
effect and, if not, we are going to have to
weigh the
amount of information available for each
of the
products.
It is not the same. We don't have
the
longer outcome studies, for instance,
with
valdecoxib at this point.
Question 4 asks if the
available data
support a conclusion that one or more of
the
COX-2-selective agents increase the risk
of
cardiovascular events and what is the
role of
concomitant aspirin in attempting to
mitigate that
risk.
What additional clinical trials or
observational studies, if any, would you
recommend
as essential for us to further evaluate
celecoxib,
rofecoxib and valdecoxib?
What about to further evaluate the
potential G.I. benefits for these same
products?
Would you recommend that the labeling for
these
products include information regarding
the absence
of long-term controlled clinical-trial data
assessing potential cardiovascular
effects and if
you have a recommendation for how that
should be
conveyed in terms of warnings, boxes and
such.
What additional trials would be
essential
to evaluate the nonselective nonsteroidal
anti-inflammatory drugs particularly with
respect
to cardiovascular risk? Similarly, what will now
become essential for products under
development
prior to approval to help gain approval?
We have to determine what
studies would be
necessary to evaluate the cardiovascular
risk of
these products and how much information
do we need
to know about the gastrointestinal
risk? If
preapproval studies recommended as
essential do not
demonstrate an increased risk for a
cardiovascular
event, how would you propose the FDA
handle that
information in the labeling? Would the absence of
a
cardiovascular-risk signal preclude the need for
any warnings or precautions in the
labeling of a
new product or should we rely more on a
class
warning or precaution in the absence of a
signal of
increased risk in the preapproval
databases?
If you think a class warning is
appropriate, please advise with
particular
attention to whether you recommend it
apply to all
NSAIDs or only COX-2-selective NSAIDs.
So I want to thank everybody
here for
their time and their commitment to
helping us
through this extremely challenging
program and we
really look forward to hearing your
deliberations
and your recommendations.
Thank you.
DR. WOOD: Thank you very much.
The companies have also asked
for two
minutes to respond. We all heard the rules
yesterday so it is two minutes. Microphone gets
turned off two minutes later and just
keep moving.
Sponsor Responses
DR. HARRIGAN: Could I have Slide No. 1.
This is Harrigan from Pfizer. What I would like to
do is first to summarize what we know
about
celecoxib and what we think that tells us
about the
benefit:risk equation for that drug.
I make the point in this slide
about
Celebrex being extensively studied and to
remind
the committee of the contrast of the very
widely
used nonspecific NSAIDs. On the next point, we see
that efficacy has been demonstrated in
arthritis
pain and familial adenomatous
polyposis. Our
prescription data and observational study
data tell
us that approximately three-quarters of
patients
who are taking celecoxib are receiving
daily doses
of 200 milligrams or less.
Celebrex does have a favorable
G.I. safety
profile, a point emphasized by the very
relevant
G.I. safety findings that we heard about
this
morning from ADAPT compared to
over-the-counter
doses of naproxen.
Cardiovascular risk was not
detected in
the setting of treating arthritis
patients
understanding all the caveats about that
data that
we have heard over the past two
days. In APC, an
increase in cardiovascular risk was
reported
apparently in a dose-related
pattern. In contrast,
two additional long-term
placebo-controlled trials
did not find evidence of increased
cardiovascular
risk at daily doses of 400 milligrams.
The comment about the ADAPT
findings is
supported by the initial announcements
from
National Institute on Aging. We await that data
with great interest, particularly given
the size,
the duration in the elderly population
study which
would lead us to believe, expect, that
the number
of events in that trial will exceed the
number of
events in either or both of the other two
trials
combined.
The final ADAPT data and the
polyp
efficacy data will make significant
contributions