We need to industrialize virtual synthetic repurposing trials
A call for a Focused Research Organization
We need to industrialize virtual synthetic repurposing trials (or, as they’re more commonly known, emulated trials)1.
By virtual synthetic repurposing trials, I mean using healthcare data, like the healthcare records of national medical services, to examine whether drugs that patients happen to be taking for one condition can slow or stop other diseases those patients have. The details of how this is done are spelled out in the name:
1. Virtual, because they are based on data that already exists in the cloud of health datasets
2. Synthetic, because they require combining together a bunch of matched cases, including controls, rather than actually running a trial where everyone gets the same treatment
3. Repurposing, because these are drugs that already exist and already have coincidentally been given to the patients in question
4. Trials, because we’re going to try to do everything else the same, including randomization and controls. This will be tricky, given the synthetic part, but we can approximate.
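To make the "synthetic" step concrete, here is a minimal sketch of how matched controls might be assembled: greedy nearest-neighbour matching on baseline covariates, a toy stand-in for the propensity-score methods real emulated trials use. All names and numbers are illustrative, not from any actual study.

```python
import numpy as np

def match_controls(treated, controls):
    """Pair each treated patient with its closest unused control,
    by Euclidean distance on (standardized) baseline covariates.
    A toy stand-in for propensity-score matching."""
    available = set(range(len(controls)))
    pairs = []
    for i, t in enumerate(treated):
        j = min(available, key=lambda k: float(np.linalg.norm(t - controls[k])))
        pairs.append((i, j))
        available.remove(j)
    return pairs

# Tiny example: 2 treated patients, 3 candidate controls,
# covariates here are (age, diabetic yes/no)
treated = np.array([[60.0, 1.0], [70.0, 0.0]])
controls = np.array([[71.0, 0.0], [59.0, 1.0], [40.0, 0.0]])
print(match_controls(treated, controls))  # [(0, 1), (1, 0)]
```

Real pipelines would match on propensity scores with calipers rather than raw covariates, but the shape of the problem is the same: build a control arm out of data that was never collected as a trial.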
These trials already exist, and they’ve had some successes. There have been virtual synthetic repurposing trials that have found:
1. A 51% reduction in esophageal cancer among Taiwanese patients given metformin
Both of these have sparked additional analyses, follow-on trials, and, for at least the SGLT2 inhibitor trial, changes in clinical practice.
The issue with these trials, though, is that they’ve all been bespoke. They’ve all required 10+ academics to get regulatory clearance, clean the data, then comb through the massive dataset and slowly test their one or two hypotheses. Then, once the results are released, they’re released as a PDF, and there’s literally nobody who can or will replicate the work or even see its provenance.
It doesn’t need to be like this. The tech industry has gotten really good at analyzing massive datasets. If we set up a non-profit, like an FRO, that took this seriously as a tech challenge, we could be releasing a new one of these studies every week, along with “reverse repurposing” trials that see which diseases are causally connected with which other diseases, like the awesome EBV causes MS study I’ve discussed before. These could all be released as interactive dashboards with technical info about each virtual trial and what assumptions were made.
There have already been some efforts in this direction, like OHDSI and EHDEN, but these are volunteer-based efforts that focus predominantly on software and data science tools that researchers can use. This is important, but there’s only so much they can do as volunteer organizations to enforce quality and cadence, and there are no efforts towards making this info more accessible than the standard paywalled PDFs. And, besides, the biggest problem isn’t on the software side: it’s on the regulatory side.
The best longitudinal healthcare datasets are locked behind regulatory bottlenecks. Not only are there privacy and data protection requirements to access the datasets, but they tend to only be accessible by a select group of researchers. The Taiwanese national health dataset can only be accessed by a National Taiwan University researcher or their affiliate. Likewise, the US Department of Defense military health dataset, which includes the frozen serum samples that were so crucial to the EBV/MS study, can only be accessed by DoD researchers and their affiliates.
So, every single study ends up being bottlenecked not only by the resources of the scientists involved, and not only by the regulators, but also by the time of the few key people who are allowed to actually run the studies. There’s a better alternative: industrialize it.
By that, I mean have full-time teams for the software development, the data cleaning, and the regulatory work. The software development team should work solely on a regulation-compliant pipeline that auto-runs on the datasets, taking in suggestions from outside scientists for specific things to check. The data cleaning team should work solely on adding new health records and making sure they are up to snuff.
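One way the intake of outside suggestions could work is for each proposed virtual trial to be a small declarative spec that the pipeline validates before queueing. This is a hypothetical design, not an existing system; the field names and thresholds below are invented for illustration (the codes shown are the ATC code for metformin and the ICD-10 code for esophageal cancer, echoing the Taiwanese result above).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrialSpec:
    """One externally submitted hypothesis for the auto-run pipeline."""
    drug: str                  # exposure, e.g. an ATC code
    indication: str            # condition the drug was originally given for
    outcome: str               # disease to test, e.g. an ICD-10 code
    follow_up_years: int = 5
    min_cohort_size: int = 1000

def validate(spec: TrialSpec) -> list[str]:
    """Return a list of problems; an empty list means 'queue it'."""
    problems = []
    if spec.follow_up_years <= 0:
        problems.append("follow-up must be positive")
    if spec.min_cohort_size < 100:
        problems.append("cohort too small for a stable estimate")
    return problems

# Metformin vs. esophageal cancer, as a spec
spec = TrialSpec(drug="A10BA02", indication="type 2 diabetes", outcome="C15")
print(validate(spec))  # [] means it can be queued
```

The point of a declarative spec is provenance: every dashboard result can link back to the exact spec that generated it, which is what makes replication possible.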
And last, the regulatory team needs to have pre-existing relationships with every single agency that controls access to health records. They need to communicate with the software team, making sure their work is following the appropriate regulations (e.g. GDPR). They also need to communicate with the data cleaning team, making sure the data that’s exposed to the public abides by privacy requirements. But, they also need to be proactive, trying to push agencies to allow more access and more thorough access. This should be a combination of lawyers, lobbyists, and scientific liaisons.
The output of all of this should be an interactive dashboard with APIs. Ideally, anyone will be able to see all of the correlations and trials that were run, how they were run, and perform their own meta-analyses. The metric of success will be engagement with the dashboard and APIs, including publications, changes in clinical practice, and follow-on trials.
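For the "perform their own meta-analyses" part, the simplest building block such an API could expose is a fixed-effect, inverse-variance pool over per-dataset effect estimates. A minimal sketch, with made-up numbers standing in for log hazard ratios from three national datasets:

```python
import math

def fixed_effect_pool(estimates, std_errors):
    """Inverse-variance-weighted fixed-effect meta-analysis.
    Takes per-trial effect estimates (e.g. log hazard ratios) and
    their standard errors; returns the pooled estimate and its SE."""
    weights = [1.0 / se**2 for se in std_errors]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled, pooled_se

# Hypothetical log-HRs for the same drug/disease pair in three datasets
log_hrs = [-0.71, -0.35, -0.50]
ses = [0.20, 0.15, 0.25]
pooled, se = fixed_effect_pool(log_hrs, ses)
print(round(math.exp(pooled), 2))  # pooled hazard ratio, ~0.62 here
```

A real dashboard would also want a random-effects option and heterogeneity statistics, but even this much, exposed over an API, lets outsiders recombine results instead of re-reading PDFs.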
Will there be questions about validity? Of course. These sorts of correlational trials are never perfect. Any time you’re blindly running a bunch of statistics on a huge dataset you will get spurious correlations, as well as issues with selection biases, measurement biases, and everything else that comes with relying on data input by harried primary care doctors who never knew their work was going to be used like this.
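One standard mitigation for the blind-statistics problem is to control the false-discovery rate across all the trials run on a dataset, rather than judging each p-value in isolation. A self-contained sketch of the Benjamini-Hochberg procedure (the p-values are invented):

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return indices of hypotheses rejected under the Benjamini-Hochberg
    procedure, which controls the expected fraction of false discoveries
    at `fdr` across the m virtual trials run on one dataset."""
    m = len(p_values)
    ranked = sorted(range(m), key=lambda i: p_values[i])
    cutoff = -1
    for rank, i in enumerate(ranked, start=1):
        # largest rank whose p-value clears the stepped threshold
        if p_values[i] <= rank * fdr / m:
            cutoff = rank
    return sorted(ranked[:cutoff]) if cutoff > 0 else []

# Five virtual trials; only the first two survive FDR control
print(benjamini_hochberg([0.001, 0.008, 0.04, 0.3, 0.9]))  # [0, 1]
```

Run at the scale of a study a week, something like this (or stricter) would need to be baked into the pipeline itself, not left to readers.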
But the potential payoffs are still immense. Any strong signals, especially if they can be mechanistically backed2, backed by Mendelian randomization, or, ideally, both, will be an immediate cause for confirmatory trials. It could also change how we prioritize biomedical development, like how the EBV/MS study made an EBV vaccine much more important.
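In its simplest single-instrument form, the Mendelian randomization check mentioned above reduces to a Wald ratio: the genetic variant's effect on the outcome divided by its effect on the exposure. A toy sketch with made-up summary statistics:

```python
def wald_ratio(beta_outcome, beta_exposure, se_outcome):
    """Single-instrument Mendelian randomization estimate: the causal
    effect of the exposure on the outcome, with a first-order standard
    error that (as the simplest form does) ignores uncertainty in
    beta_exposure."""
    effect = beta_outcome / beta_exposure
    se = se_outcome / abs(beta_exposure)
    return effect, se

# Made-up GWAS summary stats for one genetic variant
effect, se = wald_ratio(beta_outcome=0.04, beta_exposure=0.10, se_outcome=0.01)
print(effect, se)  # ~0.4 per unit of exposure, SE ~0.1
```

Agreement between an emulated-trial signal and an estimate like this, which is immune to most of the confounding that plagues observational data, is exactly the kind of convergence that should trigger a confirmatory trial.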
I can write a full budget out for this, including some chances for revenue (e.g. priority analysis slots and API keys), but I’ll stop here. Hopefully, someone can take this ball and run with it.
Note: click the link for a good discussion on the technicalities of emulated trials and how best to run them.
Like by the EvE FRO.
Have a pretty good sense of how you'd do this. Let's talk!
To wit, I think there's a very clear 1st trial you could run that would get funders excited.
I agree 100% that RCT emulation is the future of drug repurposing, but there's no business model for off-patent therapies, unless we can get payers involved. I wrote my Master's thesis on this topic over 10 years ago and established Public Good Pharma (PGP), as a social enterprise, owned by my NZ charity Crowd Funded Cures, to solve this problem.
We run Interventional Pharmacoeconomics (IVPE) trials, where forward-thinking payers (especially self-insured employers) fund trials for low-cost therapies (including repurposed generics) using their immediate and projected drug savings from reduced utilization of expensive alternatives. Unlike PBM-aligned insurers that earn hundreds of billions of dollars from rebates / admin fees and spread pricing charged for expensive therapies, these payers have flexibility and financial incentive to validate cheaper alternatives.
The Dutch Treatmeds.nl initiative has already shown this works - 12+ IVPE trials with a projected 22x ROI and €418M in net savings by 2030 (see https://treatmeds.nl/studies).
Examples:
- Rituximab vs. ocrelizumab for MS (NOISY REBELS trial - see https://clinicaltrials.gov/study/NCT05834855)
- IV ketamine vs. esketamine for depression, where ketamine is cheaper, has faster onset, and esketamine costs ~$198K/QALY (see https://pubmed.ncbi.nlm.nih.gov/33022440/ and https://www.osmind.org/blog/esketamine-and-iv-ketamine-for-major-depression and https://www.valueinhealthjournal.com/article/S1098-3015(22)00506-X/fulltext?).
- Dose de-escalation of oncology drugs such as pembrolizumab, which were approved on the basis of maximum tolerated dose and where patients are being exposed to unnecessary side effects and risks of secondary cancers through overtreatment (see https://ascopubs.org/doi/full/10.1200/JCO.22.01711).
RCT emulation helps de-risk IVPE trials and identify biomarkers for which patients benefit most, which is key for ethics approval when comparing a low-cost or off-label drug against the standard of care (SoC). Our pipeline includes dozens of similar opportunities: publicgoodpharma.com/pipeline
Payer-funded trials also offer pharma/biotech a low-risk route to fund Phase 2/3 studies so they can generate more ROI - payers spend trillions on drugs that may be significantly less effective than low-cost alternatives (e.g. your PrEP for MS example). There is also unlimited demand for QALYs if we factor in healthspan / lifespan extension. It would also disincentivise the kinds of low-innovation product hopping / evergreening strategies we see in pharma (e.g., Merck's new subcutaneous dose of pembrolizumab). Capitalism working properly should push us in the direction of lower costs per QALY, but we see the opposite happening - it is not sustainable.
Lots of money on the table, if we can get payers to come to the party.
Savva Kerdemelidis