Open-data integration and the road to novel scientific discovery

Yasmin Alam-Faruque (left) is lead biocurator at Eagle Genomics and Denise Carvalho-Silva (right) is Scientific Outreach Lead at Open Targets. We spoke to them about the importance of successful data integration on the journey to novel scientific discovery.



As the volume of life sciences data continues to rapidly increase, open-source datasets are becoming increasingly important in order to contextualise and validate privately held institutional data. But attaining the widest, richest and most informed view of a research area can prove challenging when a lack of standardisation makes it difficult to locate and curate relevant data.

Eagle Genomics’ knowledge discovery platform, the e[datascientist], is enabling researchers and enterprise organisations to overcome the challenges of isolated, siloed data by integrating the world's multidimensional open-source datasets with internal institutional data.

Open Targets provides one of these datasets. As a public-private partnership it uses human genetics and genomics data for systematic drug target identification and prioritisation, hosting an open-source database which has been integrated into the e[datascientist] since 2016.

Formulating hypotheses

“The Open Targets resources start with public data which we integrate into our system and add value to by scoring the supporting evidence behind target/disease associations,” says Denise. “This is publicly available so anyone can use it to make therapeutic hypotheses.”  

Data from Open Targets provides strongly evidenced associations between diseases and genes which enables well-supported investigation into these relationships. By integrating Open Targets data, alongside other open-source datasets, the e[datascientist] platform empowers users to access a wider and more contextually informed view of a research area of interest.

“Thanks to open-source databases, scientists using our platform can be assured that the integrated data is of the highest quality,” explains Yasmin. “And therefore the insight they gain, using their own data unified with the public data, represents either a true novel insight or an accurate verification of existing knowledge.”

Integrating public datasets helps to validate a private institution’s own data by providing additional context, evidence and confidence to support novel discoveries. This is especially important when researchers are trying to improve public health and alleviate specific conditions.

Integration enables discovery

“In order for researchers to understand the underlying biological mechanisms they need resources like those offered by Open Targets to provide supporting evidence and verify the particular gene/drug or gene/disease association,” adds Yasmin.

“Similarly for the formulations of consumer goods like anti-aging and acne creams, researchers need to know which genes and pathways are involved in those conditions so they can develop product formulations containing active compounds targeting the relevant molecular mechanisms.”

The skin condition acne is commonly associated with increased sebum production.

A common skin condition, acne, is usually associated with increased production of sebum (an oily substance released by skin glands to help keep skin and hair moisturised). Researchers may then want to investigate which chemicals can be used to modify sebum production and will need to identify the relevant gene targets. Open Targets data resources link genes to conditions via drugs and genetic evidence, providing researchers with the information they need to help develop product formulations containing an active agent which could be used to treat acne.

“The effective integration of data allows scientists to focus on what is most important to them,” adds Denise. “Whether that’s carrying out lab work based upon discoveries made with the data, or double checking and verifying supporting evidence.

“Effective integration of multiple datasets will expedite research developments. Individual labs, individual users, will not have to carry out the integration themselves and won’t be required to have data science expertise.”

Setting standards

But effective integration isn’t without its challenges. While data integration and collaboration are essential for enabling researchers to establish the most informed view of an area of interest, a lack of standardisation between datasets can prove problematic.

“Data integration is challenging for us all, including Open Targets,” says Denise. “One of the ways we try to tackle this challenge is by relying on standards. One of the things we rely on a lot is ontology tags. There is an ontology for everything, from disease to sequence variants and gene function. Luckily for Open Targets there were some great foundations, including ontologies, which we were able to build on when we began the project. But some researchers, especially those from the wet lab, are not yet clued-up on those ontologies, so we often need to translate and map their data into ontological terms.”

The use of standardised ontologies is still to be fully integrated into wet-lab research.

Although public data resources are highly curated, each is curated to the specification required for the data type it holds. For example, chemical data will be curated differently to gene or protein data. Some resources also rely on submissions from researchers and have built comprehensive, easy-to-use submission tools to enable a certain level of standardisation. However, even with these tools in place, some data may be missing or may not comply with existing ontologies. It then falls to database curators to enhance and harmonise that data to ensure consistency. This can be an expensive, time-consuming and labour-intensive process which not all institutions have the resources to carry out.
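The mapping step Denise describes, translating free-text labels from the lab into ontology terms, can be pictured with a small sketch. This is only an illustration under stated assumptions: the synonym table, the `DEMO:` identifiers and the function names are hypothetical placeholders, not real ontology IDs or any Open Targets or Eagle Genomics tooling. Labels that cannot be mapped are returned separately, which is where a curator would step in.

```python
# Hypothetical synonym table. The DEMO: identifiers are placeholders,
# not real ontology term IDs.
SYNONYMS = {
    "acne": "DEMO:0001",
    "acne vulgaris": "DEMO:0001",
    "pimples": "DEMO:0001",
    "psoriasis": "DEMO:0002",
}


def normalise(label):
    """Lower-case a label and collapse surrounding/internal whitespace."""
    return " ".join(label.lower().split())


def map_labels(labels, synonyms=SYNONYMS):
    """Map free-text labels to ontology terms.

    Returns (mapped, unmapped): a dict of original label -> term ID,
    and a list of labels that need manual curation.
    """
    mapped, unmapped = {}, []
    for label in labels:
        term = synonyms.get(normalise(label))
        if term is None:
            unmapped.append(label)
        else:
            mapped[label] = term
    return mapped, unmapped


mapped, unmapped = map_labels(["Acne Vulgaris", "  pimples ", "dry skin"])
```

Here the two spellings of the same condition resolve to one term, while "dry skin" is flagged for review rather than silently dropped.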

“It’s also important to cross reference between databases,” explains Yasmin, “so that it’s easy for researchers to be made aware of and find additional data which is relevant to their area of interest within other open-source databases.”

Eagle's e[datascientist] brings together and integrates a number of open-source datasets, giving researchers the most informed view of an area of interest by placing institutional data in a wider contextual landscape.

Open-data opens doors

To enable scalable and meaningful integration, data curators need to be able to rely on stable identifiers and standards, such as ontologies, that will not alter over time.

“Ontologies and mapping are what save us from the mess of inconsistencies,” adds Denise.

Open data is fundamental for providing researchers and organisations with an established starting point and for contextualising institutional data. Without it the process of identifying any novel treatments or products, from new drug targets to alternative compounds for use in commercial products, would take much longer.

“Without open data you couldn’t even start formulating your hypothesis!” says Denise. “It provides a baseline and a headstart. For example, understanding healthy individuals outside of the disease context is just as important to understanding how a disease works as looking at data from individuals with a particular condition.”


The Eagle Genomics platform addresses the disparate data challenge by integrating institutional data with a range of public datasets and data types. This data can then be accessed via the platform's user interface without requiring data science expertise.

“A researcher can integrate their own experimental data in order to verify experimental outcomes or pave the way for new discoveries by identifying entity relationships which were not previously visible within institutional data alone,” adds Yasmin.

“Vitally, the integration of public datasets can reveal completely new avenues for exploration that have previously been overlooked.”
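As a rough illustration of what this kind of integration means in practice, the sketch below joins in-house results to public associations on a shared, stable gene identifier. Everything in it is assumed for the example: the gene names, the `DEMO:` condition identifiers, the numeric values and the `integrate` function are hypothetical and do not reflect how the e[datascientist] works internally.

```python
# Hypothetical in-house results: stable gene identifier -> measured effect.
INTERNAL_HITS = {"GENE_A": 2.1, "GENE_D": 1.7}

# Hypothetical public associations: gene identifier -> condition term IDs.
PUBLIC_ASSOCIATIONS = {
    "GENE_A": ["DEMO:0001"],
    "GENE_B": ["DEMO:0001", "DEMO:0002"],
}


def integrate(internal, public):
    """Split in-house hits by whether public data already supports them.

    Returns (supported, candidates): hits with public associations, and
    hits with none, which are candidates for follow-up rather than
    proven novel findings.
    """
    supported = {gene: public[gene] for gene in internal if gene in public}
    candidates = sorted(gene for gene in internal if gene not in public)
    return supported, candidates


supported, candidates = integrate(INTERNAL_HITS, PUBLIC_ASSOCIATIONS)
```

The join only works because both sides use the same identifiers, which is why stable identifiers and ontologies matter so much to curators.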


More about Open Targets resources

The Open Targets Platform integrates evidence from genetics, genomics, transcriptomics, drugs, animal models and scientific literature to score and rank gene-disease associations for drug target identification and prioritisation.

Open Targets Genetics integrates variant-disease associations from the GWAS Catalog, UK Biobank and functional genomics datasets for the prioritisation of target genes at disease-associated loci.
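To give a feel for what scoring and ranking gene-disease associations can look like, here is a minimal sketch that combines per-evidence-type scores with a harmonic-sum style aggregation, in which the strongest evidence counts most and each further piece contributes progressively less. It is an illustrative assumption rather than a description of the Platform's actual scoring; the gene names, evidence scores and function names are made up for the example.

```python
def harmonic_sum(scores):
    """Aggregate scores so the i-th strongest contributes score / i**2."""
    ordered = sorted(scores, reverse=True)
    return sum(s / (i ** 2) for i, s in enumerate(ordered, start=1))


def rank_targets(evidence):
    """Rank genes by aggregated evidence score, highest first."""
    totals = {
        gene: harmonic_sum(by_type.values())
        for gene, by_type in evidence.items()
    }
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical evidence scores (0 to 1) for one disease, by evidence type.
EVIDENCE = {
    "GENE_A": {"genetics": 0.8, "drugs": 0.4},
    "GENE_B": {"genetics": 0.5, "literature": 0.5, "animal_models": 0.5},
    "GENE_C": {"literature": 0.2},
}

ranking = rank_targets(EVIDENCE)
```

With these made-up numbers, one strong line of genetic evidence plus drug evidence places GENE_A above GENE_B, even though GENE_B has more evidence types.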