The rate of late-stage clinical trial failures is the single biggest determinant of returns on pharmaceutical R&D. The lion’s share of discovery and development costs comes at the end of the process, and if those trials fail (whether for safety or lack of efficacy), all the capital invested up to that point is lost.
The entire early development process, therefore, is designed to de-risk those large and expensive pivotal trials that can lead to approval and sales. Smaller and cheaper clinical studies are meant to predict the outcome of the larger, more costly ones to come.
But the track record of the industry at achieving this ideal is patchy. Precisely what the success rate of late-stage trials is remains something of a debate, not least because such trials are a heterogeneous bunch. Many such trials are performed with a drug known to work safely in one indication, to support label expansion. Others are conducted with drugs that replicate the mechanism of action of another agent already proven in the chosen indication. Such trials, one might imagine, have a disproportionately higher chance of success than the first late-stage trials of an agent with an untried mechanism of action.
Hence, while the oft-quoted figure for late-stage trial success is around 50%, the track record for novel, first-in-class agents is considerably lower: DrugBaron previously estimated it to be below 25%. Whatever the precise figure, the current failure rate is unexpectedly, and unwelcomely, high.
I say “unexpectedly” because the statistical framework used in clinical research might be expected to yield a higher predicted success rate. Early-stage (phase 2) clinical trials typically adopt a success criterion of p<0.05 on the primary end-point. Most people understand the concept of p values: superficially at least, a p value below 0.05 suggests there was less than a 5% chance that the effect seen in the trial was due to chance alone (and a whopping 95% or greater chance that the drug had been effective).
To the uninitiated, at least, that might suggest that when the successful drugs in Phase 2 are moved into Phase 3, fewer than 5% of them should fail (the unlucky few for whom the apparently significant effects seen in Phase 2 were, actually, due to chance alone).
But there are other factors – many of them well known – that decrease the chances for the pivotal trials, even for a drug with robust efficacy in Phase 2. For a start, many Phase 3 trials are only powered to detect a clinically-relevant effect 80 or 90% of the time, even if it really is there. So that will yield, at least, some additional failures, although the parallel program of Phase 3 studies typical in the industry reduces this risk considerably. More insidiously, the pivotal trials often adopt a different end-point, agreed with the regulators, to the Phase 2 trials (where a “surrogate” end-point was used to predict whether the regulatory end-point is likely to be met). Unless this surrogate is perfect (and few are), some agents that are positive against the surrogate will be ineffective against the regulatory end-point.
Similarly, the pivotal trials need to be performed in less selected, and hence less homogeneous, patient populations (to better reflect the use of the drug in the real-world after approval). If the drug is more effective in the defined subset studied in Phase 2 than in the broader population, unexpected failure will again result.
It’s harder to quantify the impact of these factors on overall Phase 3 success rates – not least because the degree to which they apply varies with the indication, and with the expertise of the strategists that design the clinical development plans. Clever design can mitigate, but not eliminate, the problems that stem from the expanded generalization needed in Phase 3 compared to Phase 2.
And, of course, add to all that the entirely avoidable, but nonetheless remarkably prevalent, tendency to progress agents into Phase 3 that did not actually achieve positive Phase 2 findings (at least without the help of unjustifiable post hoc analyses).
These problems are well known, broadly understood and an accepted risk associated with drug discovery and development.
But arguably the biggest reason for “unexpected” clinical trial failure remains, inexplicably, outside of the usual debate on R&D success rates. Its lack of prominence is all the more surprising, given the furore its impact on other scientific endeavours has elicited. A major reason for unexpected late-stage clinical trial failure is a fundamental misunderstanding of the humble p-value.
After a study reads out ‘positive’ (that is, with a p value below 0.05), the chance of a further study now failing is not less than 5% as suggested above. It might even be as high as 70% depending on the overall experimental design. This has nothing to do with changing end-points or populations. In fact, it has nothing to do with clinical trials or biology – it is down to an inherent, and frequently ignored, property of statistics itself.
Meet Professor David Colquhoun from University College London, who has spent twenty years pointing out this problem to a mostly deaf world. His narrative, most recently in an excellent review published in Royal Society Open Science, explains very simply why so many trials fail “unexpectedly” – and what simple steps can be taken to dramatically improve the situation. Simple steps that could materially increase the success rate of late-stage trials, and hence the return on pharma investment. An insight that could, and should, be worth billions of dollars.
The foundation of the problem lies in the framework of hypothesis testing: we test a string of hypotheses (in life generally, as well as in pharma R&D), and it matters what fraction of those hypotheses were actually correct.
Prof Colquhoun illustrates the problem using the example of a screening test for cancer. Imagine performing 100,000 tests for a rare cancer (only 1 in 1000 of those tested actually have it). Even if the test has excellent (95%) sensitivity and specificity – better than almost all real-world tests – then most of the “diagnoses” still turn out to be wrong. Here’s the math: with a specificity of 95%, 5 out of every hundred people without cancer yield a false positive – so with 99,900 tests on people without cancer we will collect 4,995 false positives. This contrasts with the 95 real positives (5 of the 100 with cancer get missed due to the 95% sensitivity). Only about 2% of the positive diagnoses of cancer turn out to be correct.
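The screening arithmetic is easy to check for yourself. What follows is a minimal sketch in Python, not anything taken from Prof Colquhoun's paper; the function name `screening_counts` is purely illustrative, and the inputs are the ones stated in the example (100,000 tests, 1-in-1,000 prevalence, 95% sensitivity and specificity).

```python
def screening_counts(n_tested, prevalence, sensitivity, specificity):
    """Return (true_positives, false_positives, share_of_positives_that_are_real).

    Hypothetical helper for illustration only.
    """
    with_disease = n_tested * prevalence
    without_disease = n_tested - with_disease
    true_pos = with_disease * sensitivity            # real cases the test catches
    false_pos = without_disease * (1 - specificity)  # healthy people flagged anyway
    real_share = true_pos / (true_pos + false_pos)
    return true_pos, false_pos, real_share


# Inputs taken from the example in the text.
tp, fp, real_share = screening_counts(100_000, 0.001, 0.95, 0.95)
print(f"true positives:  {tp:.0f}")
print(f"false positives: {fp:.0f}")
print(f"share of positive diagnoses that are correct: {real_share:.1%}")
```

With these inputs the 95 real positives are swamped by 4,995 false positives, so roughly 2% of positive diagnoses are correct. Raising the prevalence argument shows how quickly the picture improves when the condition being tested for is less rare.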
This phenomenon, often termed the False Discovery Rate (FDR), is (relatively) well-known and understood, and has led to calls (including, here and here, from DrugBaron) for cancer screening tests to be abandoned. Not only do they cost a lot for very little useful information, but they lead to considerable patient harm – both through anxiety caused by all those false positives, and also through over-treatment of non-existent disease.
But exactly the same maths applies to any string of repeated tests – including clinical trials.
If you run lots of Phase 2 trials with different drug candidates where only a minority (let’s say 10%) actually work, then with standard trial statistics (80% power and 5% false positive rate) 4.5% of all the trials will be false positives and 8% will be true positives – so fewer than 2 out of every 3 positive trial results are real. That is a much higher error rate than the 5% commonly assumed.
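The same sum for a portfolio of trials can be sketched in a few lines of Python. The function name `positive_trial_breakdown` is an illustrative assumption; the inputs (10% of candidates truly effective, 80% power, 5% false positive rate) are those given in the text.

```python
def positive_trial_breakdown(prior_true, power, alpha):
    """Split positive read-outs into true and false positives.

    prior_true: fraction of candidates that really work
    power:      chance a trial detects a real effect
    alpha:      chance a trial of an ineffective drug reads out positive
    Returns (true_pos_rate, false_pos_rate, false_discovery_rate),
    with the first two expressed as fractions of ALL trials run.
    """
    true_pos_rate = prior_true * power
    false_pos_rate = (1 - prior_true) * alpha
    fdr = false_pos_rate / (true_pos_rate + false_pos_rate)
    return true_pos_rate, false_pos_rate, fdr


tp_rate, fp_rate, fdr = positive_trial_breakdown(prior_true=0.10, power=0.80, alpha=0.05)
print(f"true positives:  {tp_rate:.1%} of all trials")
print(f"false positives: {fp_rate:.1%} of all trials")
print(f"false discovery rate among positive trials: {fdr:.0%}")
```

On these assumptions 36% of "positive" trials are false discoveries, leaving 64% that are real.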
If only the real-world situation were that good – sadly, the tendency is to lower the power of early-stage trials on the grounds “we are looking for big effect sizes, and so we can save money by keeping trials small and focused”. But lowering the power lowers still further the proportion of positive outcomes that are “true positives”. Prof Colquhoun estimates that under real world conditions, as few as 30% of all statistically-positive clinical trials represent true positives. The rest are statistical flukes.
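To see how quickly falling power erodes the value of a positive result, here is a rough sketch that keeps the illustrative 10% fraction of effective candidates and the 5% false positive rate fixed while the power drops. The power values chosen and the name `true_positive_share` are assumptions for illustration, not figures from Prof Colquhoun's analysis.

```python
def true_positive_share(prior_true, power, alpha=0.05):
    """Fraction of statistically positive trials that reflect a real effect."""
    true_pos = prior_true * power
    false_pos = (1 - prior_true) * alpha
    return true_pos / (true_pos + false_pos)


# Assumed scenario: 10% of candidates work; power varies from 80% down to 20%.
shares = {power: true_positive_share(0.10, power) for power in (0.8, 0.6, 0.4, 0.2)}
for power, share in shares.items():
    print(f"power {power:.0%}: {share:.0%} of positive trials are true positives")
```

In this simple model, a trial run at 20% power leaves only about 31% of positive results as true positives, which is in the same territory as the real-world estimate quoted above.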
Little wonder, then, that Phase 3 trials fail as often as they do. Not only do we have the generalization problem to contend with; not only do we have agents deliberately progressed when the statistics suggested failure – but perhaps half or more of the trials that WERE robustly positive were actually false positives!
And this fundamental, but pervasive, misunderstanding of what p values really mean doesn’t just apply to diagnostics and clinical trials. It applies very visibly in high throughput screening, which is a “perfect storm” for a high False Discovery Rate: millions of separate hypotheses tested in parallel (that each compound in the library has the particular effect sought), in an assay with very low power, and where the vast majority of the hypotheses were wrong. Armed with this knowledge, the fact that most hits from such screens don’t validate on replication is exactly what you would expect.
It goes even broader: the False Discovery Rate underpins a large part of the incessant “irreproducibility of science” debate. With so many hypotheses being tested across the scientific literature, many of them in under-powered experiments, the number of false positives (which, of course, cannot be replicated) will be material. This point was made most dramatically by John Ioannidis, now at Stanford University, in his 2005 paper entitled “Why Most Published Research Findings Are False”, which has been read almost a million times (something of a record for on-line research papers).
Of course, the precise magnitude of the effect has come in for some discussion and both Colquhoun and Ioannidis have faced criticism of their claims – although most, if not all, of the criticisms have been quantitative rather than qualitative (that is, arguing that these authors’ estimates of the False Discovery Rate are too high, rather than disproving the proposition altogether).
If the False Discovery Rate is causing so much grief, what are the available counter-measures?
The first and most obvious is understanding the problem. Once you learn to interpret p values correctly, you are well armed to protect yourself. But beyond that, some simple changes to your experimental framework (that is, your whole development workflow rather than the individual experiments) can help. For a start, Prof Colquhoun recommends adopting a rather more stringent cut-off for statistical significance than p<0.05. Once your findings reach ‘triple sigma’ levels (that is, p<0.001) you are more or less invulnerable to false discoveries, at least in clinical trials.
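The effect of a tighter significance threshold can be illustrated with the same simple model. This is a sketch under loudly stated assumptions: 10% of candidates truly work, and power is held at 80% for every threshold (which in practice would mean running larger trials). The function name `false_discovery_rate` is illustrative, and the calculation is the simplest version of the argument, not a reproduction of Prof Colquhoun's own analysis.

```python
def false_discovery_rate(prior_true, power, alpha):
    """Fraction of positive read-outs that are false positives."""
    true_pos = prior_true * power
    false_pos = (1 - prior_true) * alpha
    return false_pos / (true_pos + false_pos)


# Assumptions: 10% of hypotheses are true and power stays at 80% throughout.
fdr_by_alpha = {alpha: false_discovery_rate(0.10, 0.80, alpha)
                for alpha in (0.05, 0.01, 0.001)}
for alpha, fdr in fdr_by_alpha.items():
    print(f"threshold p<{alpha}: {fdr:.1%} of positive results are false")
```

Under these assumptions the false discovery rate drops from 36% at p<0.05 to about 10% at p<0.01 and about 1% at p<0.001.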
Equally, don’t underestimate the importance of Bayesian Priors (that is, other knowledge that supports a particular conclusion). If, for example, your primary end-point reaches statistical significance but every secondary end-point suggests no effect, it’s time to suspect the False Discovery Rate. Put another way, don’t let the data from one single experiment (however important) dominate the weight-of-evidence. The attitude “well, the trial was positive so it must work – so let’s plough ahead” may well be tempting, but unless the broader picture supports such a move (or the p value was vanishingly small) you are running a high risk of marching on to grander failure.
Lastly, promoting “unlikely” hypotheses for testing aggravates the False Discovery Rate (since the lower the fraction of true-positive hypotheses in the pool, the higher the fraction of positive read-outs that will be false-positives). Boosting the average quality of your hypotheses before testing them (by thinking carefully, searching the literature and conducting a proper assessment of the likelihood of success BEFORE committing to test a hypothesis) will pay dividends too.
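How much hypothesis quality matters can be seen by varying the fraction of true hypotheses in the pool while leaving the trial statistics untouched. In this sketch the prior fractions are hypothetical values chosen for illustration, and the 80% power and 5% false positive rate are simply the standard figures used earlier in the article.

```python
def false_share_of_positives(prior_true, power=0.80, alpha=0.05):
    """Fraction of positive read-outs that are false, for a given hypothesis pool."""
    true_pos = prior_true * power
    false_pos = (1 - prior_true) * alpha
    return false_pos / (true_pos + false_pos)


# Hypothetical pools, from well-supported hypotheses (50% true) to long shots (1% true).
false_share_by_prior = {prior: false_share_of_positives(prior)
                        for prior in (0.50, 0.30, 0.10, 0.01)}
for prior, share in false_share_by_prior.items():
    print(f"{prior:.0%} of hypotheses true: {share:.0%} of positive results are false")
```

With half the hypotheses true, only about 6% of positive results are false; with one in a hundred true, about 86% are. The trial design is identical in each case, and only the quality of the hypotheses going in has changed.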
Failures in large, expensive Phase 3 trials are the principal cause of poor capital productivity in pharmaceutical R&D. Some of the reasons for failure are unavoidable (such as, for example, the generalization problem). But the False Discovery Rate is most definitely avoidable – and avoiding it could halve the risk of late-stage trial failures for first-in-class candidates. That translates into savings of billions of dollars. Not bad for a revised understanding of the meaning of the humble p value.
This article by DrugBaron was originally published by Forbes, and can be found here