Selecting valuable patient cohorts for cancer biomarker discovery

Case Study : International Cancer Genome Consortium






Specialist Area

Translational Research


Providing users with datasets prioritised by scientific value improves data selection, encourages data reuse and hence makes the datasets more valuable.

Systematic data prioritisation is at the heart of Eagle’s translational medicine platform. In this case study we show how our platform was used to prioritise data in the context of a specific customer project, namely the identification of genetic (haplotype) associations with skin cancer prognosis from publicly available information.

Our starting point for this project was the International Cancer Genome Consortium (ICGC) dataset, with over 20,000 patient donors. ICGC is unique in providing links to primary sequence data across many contributing projects. This allowed our association analysis to include more samples than any single project, such as The Cancer Genome Atlas (TCGA), could provide.


Figure 1: stepwise process from data modeling to usage and exploitation

The general process for the translational medicine platform is shown in Figure 1. Several software components are used: e[catalog] for cataloguing the datasets, e[discover] for valuing and prioritising the data, and e[hive] for running the association analysis. This case study focuses on e[discover].

Step 1: Data catalogue

A prerequisite for e[discover] is a dataset of curated metadata managed in e[catalog], which collates and harmonises the associated metadata using biomedical ontologies. A software connector was configured to automatically transform ICGC metadata and link entries to primary sequence data in resources such as TCGA and the Cancer Genome Project (CGP).
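To illustrate the kind of transformation such a connector performs, here is a minimal sketch. It is not the e[catalog] connector itself: the field names, the synonym table and the placeholder link are all assumptions made for illustration, and a real connector would resolve labels against a biomedical ontology rather than a hand-written dictionary.

```python
# Hypothetical synonym table: source label -> harmonised term.
# A real connector would look these up in a biomedical ontology.
HARMONISED_TERMS = {
    "melanoma": "skin cancer",
    "skin melanoma": "skin cancer",
    "cutaneous melanoma": "skin cancer",
}


def harmonise_record(record, terms=HARMONISED_TERMS):
    """Return a catalogue entry with a harmonised diagnosis and preserved links."""
    raw = record.get("diagnosis", "").strip().lower()
    return {
        "donor_id": record["donor_id"],
        # Labels missing from the table pass through unchanged (lower-cased).
        "diagnosis": terms.get(raw, raw),
        # Links to primary sequence data are carried over as-is.
        "sequence_links": list(record.get("sequence_links", [])),
    }


# Hypothetical source record; the link is a placeholder, not a real URI.
entry = harmonise_record(
    {
        "donor_id": "D1",
        "diagnosis": " Cutaneous Melanoma ",
        "sequence_links": ["placeholder://primary-sequence/D1"],
    }
)
```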

Step 2: Model definition

Once the catalogue was in place, the next step was model definition: an expert-driven process that assigns scores across various dimensions (value components) according to multiple stakeholder perspectives. The resulting model can systematically compute the value of data entries, i.e. patient donors. We wanted to prioritise ICGC patient donors by their usefulness and relevance to our association study, which we did as follows:

  1. Identify the criteria contributing to scientific value (Figure 2); these criteria are hierarchical
  2. Map the value criteria to scaled data attributes. Scaling converts nominal attributes to numerical values for quantification

Figure 2: Identification of dataset features contributing to scientific value (ICGC) – questions
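The two steps above can be sketched in a few lines. This is an illustration only, not the e[discover] model: the attribute names, the scale values and the two-level hierarchy are assumptions, and averaging the leaf scores under each parent criterion is just one simple way to roll scores up a hierarchy.

```python
# Hypothetical scales: each nominal attribute value maps to a score in [0, 1].
SCALES = {
    "vital_status": {"unknown": 0.0, "alive": 1.0, "deceased": 1.0},
    "sequence_access": {"none": 0.0, "restricted": 0.5, "open": 1.0},
}

# Hypothetical two-level hierarchy: parent criterion -> leaf attributes.
CRITERIA = {
    "clinical_data": ["vital_status"],
    "molecular_data": ["sequence_access"],
}


def scale(attribute, value):
    """Convert one nominal value to a number; unmapped or missing values score 0."""
    return SCALES[attribute].get(value, 0.0)


def criterion_scores(donor):
    """Average the scaled leaf attributes under each parent criterion."""
    return {
        parent: sum(scale(leaf, donor.get(leaf)) for leaf in leaves) / len(leaves)
        for parent, leaves in CRITERIA.items()
    }


scores = criterion_scores({"vital_status": "alive", "sequence_access": "restricted"})
```

Here a donor with a known vital status but only restricted sequence access scores fully on the clinical criterion and half on the molecular one.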

Step 3: Valuation

Once the model is complete it can be run on the entries in the catalogue (Figure 3); at this point the model criteria are weighted according to the intended scientific questions. Examples of scientific questions for which we would like ICGC data prioritised include:

  • Which genetic markers are associated with prognosis of a disease? (the question addressed in this case study)
  • Which are the driver mutations of a cancer?
  • Which genes are differentially expressed in a disease?

Figure 3: Final valuation model for the ICGC dataset
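To make question-dependent weighting concrete, the sketch below values the same donor under two different questions. The criterion names and all weights are made up for illustration, and a weighted sum is an assumed aggregation rule rather than a description of how e[discover] combines criteria.

```python
# Hypothetical weights per scientific question; each set sums to 1.
QUESTION_WEIGHTS = {
    "prognosis_markers": {"clinical_data": 0.6, "molecular_data": 0.4},
    "driver_mutations": {"clinical_data": 0.2, "molecular_data": 0.8},
}


def donor_value(criterion_scores, question):
    """Weighted sum of a donor's criterion scores for one scientific question."""
    weights = QUESTION_WEIGHTS[question]
    return sum(w * criterion_scores.get(name, 0.0) for name, w in weights.items())


# Hypothetical criterion scores for a single donor, each in [0, 1].
donor_scores = {"clinical_data": 1.0, "molecular_data": 0.5}

prognosis_value = donor_value(donor_scores, "prognosis_markers")  # 0.6*1.0 + 0.4*0.5
driver_value = donor_value(donor_scores, "driver_mutations")      # 0.2*1.0 + 0.8*0.5
```

The same donor ranks higher for the prognosis question than for the driver-mutation question because the prognosis weights favour the criterion on which this donor is strongest.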

Step 4: Valuation results and exploitation

The results of running our valuation model in the context of “identify genetic markers predictive of cancer prognosis” on the metadata from 734 ICGC skin cancer patient donors are shown in Figure 4. For our final association analysis we wanted to include the 200 most valuable patient donors. Because these came from multiple projects, some with restricted access, permission was granted for data access within ICGC. Additional useful information can be discovered from the valuation data; we return to this in Step 6.


Figure 4: Valuation of data in e[discover] from skin cancer donors according to the context “identify genetic markers predictive of cancer prognosis”
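Selecting the top-valued donors and working out which projects need an access request is a simple ranking exercise. The sketch below assumes each donor record already carries a computed value; the record fields, project names and values are hypothetical.

```python
def select_top_donors(donors, n):
    """Return the n highest-valued donors, best first (ties broken by donor id)."""
    ranked = sorted(donors, key=lambda d: (-d["value"], d["donor_id"]))
    return ranked[:n]


def projects_needing_access(selected):
    """List the projects among the selected donors whose data are access-restricted."""
    return sorted({d["project"] for d in selected if d["restricted"]})


# Hypothetical valued donors; the case study selected 200 out of 734.
donors = [
    {"donor_id": "D1", "project": "PROJ-A", "restricted": False, "value": 0.91},
    {"donor_id": "D2", "project": "PROJ-B", "restricted": True, "value": 0.88},
    {"donor_id": "D3", "project": "PROJ-A", "restricted": False, "value": 0.40},
    {"donor_id": "D4", "project": "PROJ-C", "restricted": True, "value": 0.35},
]

selected = select_top_donors(donors, n=2)
to_request = projects_needing_access(selected)
```

Only restricted projects that actually contribute a selected donor end up in the access request list, which keeps the permission step as small as possible.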

Step 5: Data analysis

Data analysis on the translational medicine platform uses the e[hive] workflow engine.

Once the 200 patient donors had been identified, the e[hive] Haplotype Association Analysis workflow was applied to assist with target and indication prioritisation. The workflow selected these donors from e[catalog], followed the links to the primary sequence data, retrieved the sequences and performed the analysis. The analysis identified an example gene (Gene-A) showing putative associations between haplotypes and skin cancer prognosis (Figure 5). Such haplotypic analyses can be used to generate biomarkers, assist with stratification of patients and perform biological analysis of targets.


Figure 5: ‘Gene-A’ haplotypes split by melanoma prognosis (200 patients: 100 alive, 100 deceased) as generated by the e[hive] Haplotype Association Analysis workflow
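The statistical method used by the workflow is not described here. As a generic illustration of how an association between one haplotype and prognosis can be tested, the sketch below applies a Pearson chi-square test to a 2x2 table of carriers and non-carriers against alive and deceased donors. The counts are made up and are not results from this study.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (1 degree of freedom) on the table [[a, b], [c, d]].

    Rows: carriers / non-carriers of a haplotype. Columns: alive / deceased.
    Returns (statistic, p_value). No continuity correction is applied.
    """
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column total must be non-zero")
    statistic = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom the chi-square tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Made-up counts: 100 alive (60 carriers) and 100 deceased (40 carriers).
statistic, p_value = chi_square_2x2(60, 40, 40, 60)
```

In a real analysis many haplotypes are tested across many genes, so p-values would also need correcting for multiple testing before any association is reported.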


Figure 6: The significance of availability of omics data types; completeness vs. importance

Step 6: Value-guided curation

The data journey within the platform does not end once the analysis is complete. Our data valuation model allows various aspects of the dataset to be compared, e.g. the availability of a range of -omics data types (Figure 6). Comparing data completeness (a measure of data quality) with criteria weights from the model (a measure of importance) shows where further investment would pay off (points to the left in Figure 6); in the case of ICGC, investment should be made in the generation of Structural Somatic Mutation (STSM) data (green triangle). These results indicate that completeness of clinical data plays a major role in improving the value of some donors, relative to the contribution of other criteria.
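The completeness-versus-importance comparison can be sketched as follows. The donor records and weights are invented for illustration, and scoring each gap as weight times missing fraction is an assumed heuristic for ranking, not the method behind Figure 6.

```python
def completeness(donors, data_type):
    """Fraction of donors for which the given data type is available."""
    return sum(1 for d in donors if data_type in d["available"]) / len(donors)


def curation_priorities(donors, weights):
    """Rank data types by weight * (1 - completeness): important but sparse first."""
    gaps = {t: w * (1 - completeness(donors, t)) for t, w in weights.items()}
    return sorted(gaps.items(), key=lambda item: item[1], reverse=True)


# Hypothetical availability per donor and hypothetical criteria weights.
donors = [
    {"available": {"clinical", "expression"}},
    {"available": {"clinical"}},
    {"available": {"clinical", "STSM"}},
    {"available": {"clinical"}},
]
weights = {"clinical": 0.5, "STSM": 0.3, "expression": 0.2}

priorities = curation_priorities(donors, weights)
```

With these made-up numbers the sparse but heavily weighted STSM data come out on top, while the fully populated clinical data score zero because there is no gap left to close.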

Furthermore, data harmonisation using ontologies enables a unique question-driven approach to exploring scientific value across diverse datasets that was not previously possible, and supports the definition and calibration of the valuation model. Once modelling is complete, valuation can be applied to the catalogue and relationships between components discovered, contributing to further iterative model refinement and to conversation around scientific relevance.


The systematic data organisation and valuation model provided by Eagle’s translational medicine platform allows for fast and effective patient selection for cohort building, followed by robust and reproducible correlation and association analysis.

We demonstrated the benefits of our prioritisation approach: we selected and prioritised the most relevant patients using explicit, well-understood criteria, and accessed their associated datasets to run complex comparison analyses between groups of patients. Such analyses can identify biomarkers, assist with stratification of patients and support biological analysis of targets.