
Selecting valuable patient cohorts for cancer biomarker discovery

Industry Biopharma | Product e[datascientist]

International Cancer Genome Consortium

Providing users with datasets prioritized by scientific value improves data selection, encourages data reuse and hence makes datasets more valuable.

Systematic data prioritization is at the heart of Eagle Genomics’ e[datascientist] platform. In this case study, the platform was used to prioritize data in the context of a specific customer project, namely the identification of genetic (haplotype) associations with skin cancer prognosis from publicly available information.


The starting point for this project was the International Cancer Genome Consortium (ICGC) dataset, with over 20,000 patient donors. ICGC is unique in providing links to primary sequence data across many contributing projects. This allowed the association analysis to include a greater number of samples than any single project, such as The Cancer Genome Atlas (TCGA), could provide.


Figure 1: Stepwise process from data modeling to usage and exploitation


The general process for the translational medicine platform is shown in Figure 1. Several software components are used: Catalog for cataloguing the datasets, Valuation & Decision Engine for valuing and prioritizing the data, and Analysis Hub for running the association analysis. This case study focuses on the Valuation & Decision Engine.


A prerequisite for the Valuation & Decision Engine is a dataset of curated metadata managed in the Catalog, which collates and harmonizes associated metadata (using biomedical ontologies). A software connector was configured to automatically transform ICGC metadata and link entries to primary sequence data in resources such as TCGA and the Cancer Genome Project (CGP).

Once the catalogue was in place, the next step was model definition, an expert-driven process that assigns scores across various dimensions (value components) according to multiple stakeholder perspectives. The resulting model can systematically compute the value of data entries, in this case patient donors. Eagle Genomics wanted to prioritize ICGC patient donors by their usefulness and relevance to the association study. This was done as follows.



Identification of criteria contributing to scientific value (Figure 2); these criteria are hierarchical.

Map value criteria to scaled data attributes. Scaling is used to convert nominal attributes to numerical values for quantification.

Figure 2: Identification of dataset features contributing to scientific value (ICGC) – questions
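
The two steps above can be illustrated with a minimal sketch. It is not the Valuation & Decision Engine itself: the attribute names, scaling tables, criteria hierarchy and weights below are all hypothetical and were chosen only to show how nominal metadata might be scaled to numbers and rolled up through a two-level hierarchy into a single value per patient donor.

```python
# Hypothetical scaling tables: nominal attribute value -> numeric score in [0, 1].
SCALES = {
    "sequencing_strategy": {"WGS": 1.0, "WXS": 0.7, "none": 0.0},
    "survival_data": {"complete": 1.0, "partial": 0.5, "missing": 0.0},
    "tumour_type": {"skin": 1.0, "other": 0.1},
}

# Hypothetical hierarchy: top-level criterion -> (weight, {attribute: weight}).
# Weights at each level sum to 1, so the final value also lies in [0, 1].
CRITERIA = {
    "data_quality": (0.4, {"sequencing_strategy": 1.0}),
    "relevance": (0.6, {"tumour_type": 0.5, "survival_data": 0.5}),
}


def scale(attribute, value):
    """Convert a nominal attribute value to a number; unknown values score 0."""
    return SCALES[attribute].get(value, 0.0)


def donor_value(donor):
    """Weighted sum over the criteria hierarchy for one donor record (a dict)."""
    total = 0.0
    for weight, attributes in CRITERIA.values():
        sub_score = sum(w * scale(name, donor.get(name))
                        for name, w in attributes.items())
        total += weight * sub_score
    return total


def prioritize(donors):
    """Return donor records ordered from most to least valuable."""
    return sorted(donors, key=donor_value, reverse=True)
```

Calling `prioritize` on a list of donor dictionaries returns them ranked by `donor_value`. In practice the scales and weights would come from the expert-driven model definition described above and would differ between stakeholder perspectives.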

Topics: Data valuation, data catalogue, cancer biomarker discovery
