Case Study

International Cancer Genome Consortium

Selecting valuable patient cohorts for cancer biomarker discovery

Industry

Pharmaceutical

Product

e[discover]

Specialist Area

Translational Research

Providing users with datasets prioritised by scientific value improves data selection, encourages data reuse and hence makes datasets more valuable.

Systematic data prioritisation is at the heart of Eagle’s translational medicine platform. In this case study we show how our platform was used to prioritise data in the context of a specific customer project, namely the identification of genetic (haplotype) associations with skin cancer prognosis from publicly available information.

Our starting point for this project was the International Cancer Genome Consortium (ICGC) dataset, with over 20,000 patient donors. ICGC is unique in providing links to primary sequence data across many contributing projects. This allowed our association analysis to include a greater number of samples than any single project, such as The Cancer Genome Atlas (TCGA), could provide.

Figure 1: Stepwise process from data modelling to usage and exploitation

Figure 1 shows the general process for the translational medicine platform. Several software components are used: e[catalog] for cataloguing the datasets, e[discover] for valuing and prioritising the data, and e[hive] for running the association analysis. This case study focuses on e[discover].

Step 1: Data catalogue

A prerequisite for e[discover] is a dataset of curated metadata managed in e[catalog], which collates and harmonises the associated metadata using biomedical ontologies. A software connector was configured to automatically transform ICGC metadata and link entries to primary sequence data in resources such as TCGA and the Cancer Genome Project (CGP).

Step 2: Model definition

Once the catalogue was in place, the next step was model definition, an expert-driven process that assigns scores across various dimensions (value components) according to multiple stakeholder perspectives. The resulting model can systematically compute the value of data entries, i.e. patient donors. We wanted to prioritise ICGC patient donors by their usefulness and relevance to our association study. The model was defined as follows:

Identification of criteria contributing to scientific value (Figure 2); these criteria are hierarchical.

Mapping of value criteria to scaled data attributes. Scaling is used to convert nominal attributes to numerical values for quantification.

Figure 2: Identification of dataset features contributing to scientific value (ICGC) – questions
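
The model definition described in Step 2 can be illustrated with a minimal Python sketch. It is not the e[discover] implementation: the attribute names, scaling tables, criteria groups and weights below are all hypothetical, chosen only to show how nominal attributes might be scaled to numbers and combined through a two-level criteria hierarchy into a single value per patient donor.

```python
# Hypothetical scaling tables: each nominal attribute value maps to a score in [0, 1].
SCALES = {
    "sequencing_strategy": {"WGS": 1.0, "WXS": 0.7, "none": 0.0},
    "survival_data": {"complete": 1.0, "partial": 0.5, "missing": 0.0},
    "tumour_site": {"skin": 1.0, "other": 0.2},
}

# Hypothetical two-level hierarchy: criteria group -> (group weight, {attribute: weight}).
CRITERIA = {
    "data_quality": (0.6, {"sequencing_strategy": 0.5, "survival_data": 0.5}),
    "relevance": (0.4, {"tumour_site": 1.0}),
}


def scale(attribute, value):
    """Convert a nominal attribute value to a number; unknown values score 0."""
    return SCALES[attribute].get(value, 0.0)


def donor_value(donor):
    """Weighted sum over criteria groups of the weighted, scaled attribute scores."""
    total = 0.0
    for group_weight, attributes in CRITERIA.values():
        group_score = sum(
            weight * scale(attribute, donor.get(attribute))
            for attribute, weight in attributes.items()
        )
        total += group_weight * group_score
    return total


def prioritise(donors):
    """Return donors ordered from highest to lowest computed value."""
    return sorted(donors, key=donor_value, reverse=True)


if __name__ == "__main__":
    # Illustrative records only; these are not real ICGC donors.
    donors = [
        {"id": "D2", "sequencing_strategy": "WXS", "survival_data": "missing", "tumour_site": "other"},
        {"id": "D1", "sequencing_strategy": "WGS", "survival_data": "complete", "tumour_site": "skin"},
    ]
    for donor in prioritise(donors):
        print(donor["id"], round(donor_value(donor), 2))
```

Keeping the scaling tables and weights as plain data, separate from the scoring function, reflects the expert-driven nature of the step: under this assumption, stakeholders could adjust scores and weights without touching the computation.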

Step 3: Valuation
