Researchers from the Department of Statistics and the Nuffield Department of Medicine are among the senior principal investigators on an £8 million government-backed consortium creating the world's largest dataset for training machine learning models in drug discovery.

The Radcliffe Camera building, Oxford

OpenBind, a new £8 million consortium, will create the world's largest open dataset of experimentally validated drug-protein interactions. Over the next five years, the project will generate more than 500,000 protein-ligand complex structures and affinity measurements – a 20-fold increase over all public data produced in the last half-century.

Most medicines work by binding to specific proteins, but researchers have historically lacked sufficient high-quality data about these interactions to train AI systems effectively. This data shortage has been a barrier to using artificial intelligence to predict which new compounds might work as drugs, leaving pharmaceutical companies reliant on testing methods that can take decades and cost billions.

'OpenBind realises a major gear-shift for AI in drug discovery by investing in the data that powers it,' said Professor Charlotte Deane from the Department of Statistics, one of eight senior consortium principal investigators. 'This funding will mean we can begin generating a catalogue that not only dwarfs in quantity everything messily accumulated over half a century, but transcends it in quality and is geared towards powering the AI algorithms.'

Professor Deane is working alongside Oxford colleagues from the Nuffield Department of Medicine including Professor Frank von Delft (who also holds a position at Diamond Light Source) and Professor Paul Brennan, as well as an international team including Nobel Prize winner Professor David Baker from the University of Washington, Dr John Chodera from Memorial Sloan Kettering Cancer Center, Professor Mohammed AlQuraishi from Columbia University, Dr Mark Murcko from Relay Therapeutics, and Dr Ed Griffen from MedChemica Limited.

'This exciting project will harness the disruptive power of AI and physical science to change how we find medicines for many of the diseases that afflict humanity. Its success will help everyone,' said Professor Jim Naismith, Head of the MPLS Division.

The consortium will deploy automated chemistry and high-throughput X-ray crystallography at Diamond Light Source, the UK's national synchrotron facility in Oxfordshire, to generate precise molecular interaction data structured for AI training.

The foundational dataset from OpenBind will underpin progress across multiple areas of technology, including structure prediction, generative molecular design, docking, and active learning workflows. The scale and quality of data – 500,000+ structures compared to the 25,000 currently available – will enable training of statistical models that were previously impossible due to data limitations. The dataset is designed to work in synergy with other emerging approaches to help reduce trial-and-error experimentation, inform candidate selection, and support more systematic exploration of chemical space.

Backed by the UK government's newly established Sovereign AI Unit, OpenBind positions the Department of Statistics and the wider MPLS Division at the forefront of AI-driven scientific discovery. The project will help train the next generation of AI models for drug discovery while establishing new standards for open scientific data sharing, as part of the government's broader Plan for Change.

The project also has potential applications beyond healthcare, supporting research into engineering biology solutions for challenges such as developing new enzymes to tackle plastic waste.