Hunting metabolomic biomarkers of acute pancreatitis by machine learning

Case Study: GSK


Eagle Genomics helped GSK to curate and catalog clinical/molecular datasets to identify biomarkers for acute pancreatitis.





Specialist Area

Smart Data Management



Acute pancreatitis[1] (AP) is a disease of extremes. It strikes suddenly and then, for three quarters of patients, resolves within days. For the remaining quarter the disease rapidly escalates, often resulting in multiple organ failure and even death. There are currently no reliable clinical tests to predict which course a patient is likely to follow. This is unfortunate in two regards. Firstly, prediction would allow a clinician to choose between a “wait and see” approach and aggressive treatment with its attendant risks. Secondly, prediction would promote investment into drugs tailored for the severe form of the disease.


In the hunt for better molecular biomarkers for acute pancreatitis, GSK Discovery Partnerships with Academia (DPAc) collaborated with the University of Edinburgh Department of Clinical Surgery on a prospective study. As is typical of molecular biomarker discovery, the patient cohort was small (under 100) while the number of measurements per patient was large (over 1,000), spanning detailed clinical, protein and metabolomic data, some of it longitudinal. The challenge was twofold: to identify features in this valuable and unique dataset that correlate with disease progression, and to build models to discriminate between severe and mild disease. High-dimensional data of this nature, with its mix of continuous and discrete response variables, presents significant problems for traditional machine learning analysis, often leading to false positives and correlations that cannot be reproduced later on.
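The false-positive risk described above can be illustrated with a small simulation. The sketch below is not the analysis performed in this project: it uses purely random data, with assumed sizes (80 patients, 1,000 features, a quarter of patients labelled severe) chosen only to echo the figures quoted in the text. All function names are hypothetical. It counts how many noise features appear "significant" at an uncorrected p ≤ 0.05, and how many survive a Benjamini-Hochberg correction, which is one common way (among several) to control false discoveries.

```python
import math
import random
from statistics import NormalDist


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def p_value(r, n):
    """Approximate two-sided p-value for r via the Fisher z-transform."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def benjamini_hochberg(pvals, alpha=0.05):
    """Return indices of hypotheses rejected at false discovery rate alpha."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0  # largest rank whose p-value falls under its BH threshold
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= alpha * rank / m:
            k = rank
    return sorted(order[:k])


def simulate(n_patients=80, n_features=1000, alpha=0.05, seed=0):
    """Correlate pure-noise features with a severity label (assumed sizes)."""
    rng = random.Random(seed)
    n_severe = n_patients // 4  # assumption: a quarter of patients are severe
    severity = [1.0] * n_severe + [0.0] * (n_patients - n_severe)
    pvals = []
    for _ in range(n_features):
        feature = [rng.gauss(0.0, 1.0) for _ in range(n_patients)]
        pvals.append(p_value(pearson(feature, severity), n_patients))
    naive = [i for i, p in enumerate(pvals) if p <= alpha]
    corrected = benjamini_hochberg(pvals, alpha)
    return naive, corrected


if __name__ == "__main__":
    naive, corrected = simulate()
    print(f"noise features passing uncorrected p <= 0.05: {len(naive)}")
    print(f"noise features passing Benjamini-Hochberg:    {len(corrected)}")
```

Because every feature here is noise, any feature that passes the uncorrected threshold is a false positive by construction. With roughly a thousand tests at the 5% level, some are expected to pass by chance alone, which is why a correction step or independent replication matters for datasets of this shape.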

Our Solution

The GSK project team knew of Eagle through existing industry-academic collaborations, and was aware of Eagle’s track record finding disease biomarkers in complex biological data. Initial discussions suggested that Eagle’s systematic and technology-agnostic approach to machine learning would be ideal for the complex multi-dimensional task in hand. Eagle’s professional services organisation was also able to flexibly fit with GSK timelines and provide the analysis for a reasonable up-front cost.

In brief, the Eagle platform for biomarker discovery involved:

  • Curation of the clinical/molecular dataset against biomedical standards, including ISA[2] and EFO[3], and loading into the e[catalog] data catalogue.
  • Systematic application of multiple machine learning methods (support vector machine, penalized ordinal regression, logistic regression, etc.) to the catalogued data to maximise the chance that biological signals, if present in the dataset, would be discovered.
  • Feature selection to identify and annotate biomarkers with discriminative accuracy indicative of clinical utility.
  • Comprehensive reporting of results and identification of next actions.

[1] https://www.uptodate.com/contents/etiology-of-acute-pancreatitis

[2] http://dx.doi.org/10.1038/ng.1054

[3] http://dx.doi.org/10.1093/bioinformatics/btq099

Acute pancreatitis is the leading gastrointestinal cause of hospitalization in the US.
Causes include gallstones and alcohol consumption, although the exact cause is not always clear.
Annual incidence is 35 per 100,000 population, increasing by 30% per decade.
Mortality rate is 3% overall.

The Benefits

Within weeks of receiving the dataset, results were being returned from Eagle to GSK. The dataset had been prepared, its limitations evaluated, and appropriate methods selected. Regular meetings between the Eagle, GSK and Edinburgh teams ensured the validity of the biological context of the analyses. Statistically significant correlations between pancreatitis severity and metabolite abundance were uncovered, including signals that could be reproduced across different measures of disease severity using independent machine learning analyses. By using the Eagle platform, the curated input datasets could be linked to the machine learning analyses, thus improving data governance and facilitating collaboration. Interesting disease/metabolite correlations, both expected and novel, have been uncovered by the machine learning analyses and are being followed up by the project partners.


This study has provided the partners with biologically relevant insight that improves their molecular understanding of AP progression. The results are being used to justify investment into, and guide the design of, ambitious follow-up studies with much greater statistical power. With the Eagle data platform adopted for these studies, the data is by design findable, accessible, interoperable and reusable (FAIR).

This study represents significant progress towards a prognostic test for acute pancreatitis: a test that will enable better treatment and better drugs for this common and often fatal disease.