Posts Tagged : AWS

Chef and DevOps at Eagle

In my day-to-day work at Eagle, all of the projects I work on need tools installed. Bioinformatics tools rarely come in one simple bundle, and sometimes the dependencies can be vast and platform-dependent. Here at Eagle we embrace the culture of DevOps to retain our sanity!


We use Chef to manage our tools deployment, whether it be onto an AWS instance, a VM or a Docker container. If we are going to use a tool on a project then it’s pretty likely we’ve used it before. This week, however, I needed to install REAPR to evaluate a genome assembly I’ve been playing around with. Normally I like to email the author and let them know that we’ve written a cookbook that deploys their hard work, but this week I decided to tell the world via our blog ;-). I’ll tell the author when I see them around the Genome Campus now that we have moved (apparently one or two bioinformatics tools have been developed here over the years, not to mention formats, compression algorithms, assemblies, the list goes on!).

There are a LOT of bioinformatics tools out there and we only support a couple of flavours of Linux at the moment, but it’s a work in progress; currently we have 35 publicly available NGS tools cookbooks in our GitHub repository.

All of the code is tested using RSpec, Serverspec and, more recently, InSpec. The projects are continuously integrated using Jenkins, KitchenCI, Foodcritic and RuboCop.


It’s good to be able to give something back to the community by open sourcing our Chef cookbooks!


FierceCIO’s Fierce 15 Cloud-First Companies for 2015

Go Eagle Genomics! FierceCIO announced their “Fierce 15” Cloud-First Companies for 2015. Having demonstrated innovative cloud adoption strategies, Eagle proved itself a forward-thinking life science company that uses cloud-based technologies for its business systems and product offerings. Read FierceCIO’s full list of 2015 winners.

Using the Cloud for Global Collaboration in Life Sciences Research

In the Fierce 15 winner’s profile, FierceCIO described Eagle in the following terms: “U.K.-based Eagle Genomics has always been a fan of cloud services and its operational and business systems are built entirely around a cloud-first strategy. The company provides software and services for big data scientific datasets, including genomics and clinical data, for customers in the life sciences sector. It makes sense that Eagle Genomics uses the cloud to deliver its goods and services since so much life sciences research is done globally and collaboration is the rule of the day.”

In his interview with FierceCIO, Eagle’s CTO Glenn Proctor explained that Eagle Genomics has always had an effective business and operational culture built on cloud-first strategies, particularly as a member of the AWS Partner Network. Glenn stated: “We were able to take advantage of the elasticity, scalability and flexibility that they offered and use them to rapidly grow our business.”

Read Glenn’s full interview with FierceCIO.

Life Sciences in the Cloud: Everything Changes, All The Time

Earlier this month I gave a talk at the Amazon Web Services’ (AWS’) “Cloud Computing and Life Sciences” event in Cambridge, UK. Eagle Genomics is an AWS Consulting Partner, and it’s great to be invited to speak at events like this because it helps keep us in touch with the trends in cloud computing. This blog post will summarise the talk, in particular the trends towards cloud adoption that we’ve seen over the seven years since Eagle was founded.

Note: These views are from the perspective of a UK SME; we work with lots of clients on a variety of projects so I have a pretty good overview of the industry. There’s only one caveat I’d add: Eagle is well known for our work with cloud solutions, particularly AWS, so there’s probably some selection bias towards cloud in our customer base.

From my vantage point, I see that science, attitudes, strategy, language, and skills are all changing in the world of genomics IT. I’ll take each of these areas in turn.

SCIENCE changing

The science that companies and academic institutions are doing is changing far faster than traditional IT infrastructure can keep up with. Years ago it was sufficient – and fun – to build your own Beowulf cluster in a spare office; however, that just doesn’t cut it any more. The demands of modern science, particularly genomics, mean that in-house computing budgets, resources and expertise have struggled to keep up. I’d argue that the cloud is the best solution for the majority of use-cases. The exceptions are the few institutions with very large-scale, nationally or internationally funded infrastructure, for example the European Bioinformatics Institute and Wellcome Trust Sanger Institute in the UK, and the National Center for Biotechnology Information in the US.

ATTITUDES changing

Eagle has long-established relationships with a number of customers, and we see attitudes changing over the course of those relationships. For example, I recall a meeting in 2012 with a particular client, where they said (I’m paraphrasing slightly) “It has to be internal unless there is a good reason to put it on the cloud”. Fast forward to 2015, and the same client is now saying  “It has to be on the cloud unless there is a good reason for it not to be”. So there’s a spectrum of cloud adoption. Companies are starting from different places and moving along the spectrum at different rates, but the direction of movement is almost universally towards the cloud, not away from it.

That said, Eagle is a business and we need to work within any constraints that our customers require, and some of those customers want an on-premises solution only. For that reason all Eagle solutions are deployable both on AWS and on-premises – we make use of technologies like Chef, Ansible and Docker to help with this, but that’s another blog post! This requirement does limit our use of AWS-specific technologies, but the economics of the situation mean that, for the moment at least, the additional overhead of building the ability to deploy anywhere is justified for us.

STRATEGY changing

When cloud solutions first came along, they were seen as a version of your own IT infrastructure, but residing elsewhere. Now, it’s the case that cloud concepts and practices are influencing and shaping in-house computing resources; this is exemplified by the language people use.

LANGUAGE changing

A few years ago, we talked about data centres, grids, network fabric, job schedulers and so on. Of course, those terms haven’t gone away,  but we’re finding that terms and technologies like virtualisation, job clustering, Docker and infrastructure automation which came of age on the cloud are increasingly being used by in-house IT teams, as evidenced by the rise of the slightly misleading term “private cloud”.

SKILLS changing

The cloud has led to the democratisation of (virtual) hardware availability; everyone’s a sysadmin now! We’ve seen the rise of the devops role. Using AWS as an example, there is still a low barrier to entry – a computer and a credit card are all you need – but the number and complexity of the various services which AWS offers are increasing rapidly. This means that we can no longer expect our bioinformatics consultants and software developers to keep up with all of the details of AWS services.

So we’re starting to hire people to do devops only, and in some cases looking at Platform as a Service (PaaS) solutions like Heroku as a more appropriate fit for some of our deployments, particularly eaglecore, our flagship metadata management product.

To summarise, we’re seeing that the pace of change is accelerating, the overwhelming momentum is towards the cloud, and this is enabling both new types of business and new companies. It’s been an exciting few years, and the next few promise to be even more so!

What do you see as the biggest changes in life science IT systems and architecture?

Eagle CTO, speaking at AWS Cloud Computing And Life Sciences Event

Cloud Computing, Life Sciences and Genomics

Glenn Proctor, Eagle’s CTO, will be speaking at Amazon Web Services’ (AWS) conference on Cloud Computing and Life Sciences scheduled for 1st July 2015 at the Homerton Conference Centre, Cambridge, UK.

Glenn will be speaking about: Genomics in the Cloud: How Eagle Built a Business on AWS

“Science is unlocking the complexity of biological systems at an increasing rate. In almost every biological field, large-scale compute resources are required. Previously these were the exclusive preserve of large companies and universities; however, the advent of cloud computing, and AWS in particular, has enabled companies like Eagle to emerge and become leaders in the field, despite having no in-house compute resource.

In this talk, I will describe how Eagle has done this over the seven years since we were founded, and describe the scientific objectives, implementation and results of our work with clients in industries as diverse as healthcare, research, consumer goods and agricultural technology.”

As Chief Technical Officer, Glenn is responsible for managing and improving Eagle’s systems to best support customers. During Glenn’s tenure at Eagle, he has set the technological direction of the company and spearheaded the implementation of the company’s flagship software product: eaglecore, as well as building and leading the teams which deliver the company’s products and services. Glenn oversees Eagle’s software development and works with leading customers to define, implement and deliver solutions. Glenn previously led the software development of Ensembl at the European Bioinformatics Institute (EBI). Prior to that, Glenn explored business applications in areas as diverse as artificial life technologies, mobile phone capacity planning and games.

Registration for the AWS Cloud Computing and Life Sciences event is free.

ENCODE: A beachcomber’s guide to the genome

(CC BY 2.0)

ENCODE press coverage focused on their ‘de-junking’ of the genome. But semantic wrangling apart, what of the ENCODE legacy?

The public release of ENCODE (ENCyclopedia of DNA Elements) last week provides the likes of Eagle and our customers with a veritable cornucopia of new toys to play with. Of particular interest is the ENCODE Virtual Machine which contains much of the software used in the analysis. Although supplied as a VirtualBox VM, migration to AWS is reasonable. Whether ENCODE follows modENCODE and 1000 Genomes to EC2 remains to be seen, but is clearly to be encouraged.

Much of the immediate scientific reaction to ENCODE (see the roundup from OpenHelix) concerns their ‘de-junking’ of DNA, claiming that “80% of the genome is functional”. Really? Much of the argument is caused by ambiguities in the term ‘functional’. Following ENCODE it’s now clear that most genomic DNA ‘functions’ (verb) in a biochemical sense (i.e. sticks to the cellular machinery), but it does not follow that all these interactions have a ‘function’ (noun) in a biological sense (implying some wider purpose). For a scientific context see Sean Eddy’s excellent post on the subject. This distinction between ‘functional (v)’ and ‘functional (n)’ has proved problematic in the past. For example, the EBI Functional Genomics Group annotates the biological function of genes, whereas Ensembl FuncGen (also at the EBI) focuses on DNA binding. The latter is now referred to as “Ensembl Regulation” to avoid confusion.

Such nuances in definition are not new to genomics; the term ‘gene’ for example is variously used to refer to a ‘unit of heredity’ vs. a ‘genomic region producing an mRNA’. The distinct advantage of the latter is that it’s much easier to comprehensively identify genes in genomic DNA using mRNA evidence than it is using the classical genetics definition. Which brings me to the reason I’m excited by ENCODE: much as RNA-Seq data is used to predict genes, the ENCODE data will be used to build a comprehensive catalogue of ‘regulants’ (genomic regions putatively regulating mRNA transcription). The Ensembl Regulatory Build is, of course, the epitome of this approach. Such catalogues will form invaluable frameworks for the systematic annotation of epigenetic processes in genomic context. We derive, therefore, a neat solution to satisfy both ends of the functional vs. functional debate.

To end with the thoughts of Chief ENCODE-ian Ewan Birney: “The real measure of a foundational resource such as ENCODE is not the press reaction, nor the papers, but the use of its data by many scientists in the future.”

Storage at a glacial pace

Amazon’s announcement of its new Glacier storage service last week was a great example of a company listening to the needs of its customers then acting accordingly. I should have blogged about it already, but was waiting to read up on the details a bit first to see if it could actually be applicable to any real-life use cases. Unsurprisingly, it can.

Existing storage from Amazon, whether S3 or EBS, can cost around 10 US cents per gigabyte per month, or roughly US$1,200 per year per terabyte. There is no such thing as a typical pharma company when it comes to NGS, but two examples I have come across recently suggest annual data production of 300 terabytes is not unusual. As most life sciences researchers never like to throw anything away when it comes to research data, over five years this adds up to 1.5 petabytes at a whopping annual cost of $1.8m, or thereabouts. Even at the upper end of estimates of internal storage costs at big pharma, this is still a significant markup on what they’d pay to provision the same storage themselves.

S3 and EBS provide an instant-access service. Data is available on demand, is always online, and is stored in a highly redundant manner to reduce the risk of loss through hardware failure. But most people dealing with NGS data don’t need this. The data needs to be instantly available for the duration of its initial analysis so that it can be accessed by the various pipelines and applications in use by the researchers, but once analysed the raw data is hardly touched again. It needs to be kept for a variety of reasons – regulatory compliance, cost of replacement, and peace of mind are just a few – but it doesn’t need to be instantly accessible, as long as it can be accessed on request.

In-house solutions solve this by archiving older data onto tape or offline disks which are then disconnected from the online systems and stored securely elsewhere. If the data is ever needed again (which, in most cases, it never is), the correct archived tape or disk has to be identified and reconnected to the system so that the data can be transferred back into live, online storage. There is a delay associated with this, but the cost of waiting is far lower than the cost of keeping the data online at all times forever.

Glacier replicates this approach within the cloud. It provides a virtual offline storage method for archiving away old or infrequently used data. In return for the slower access time (3-5 hours per request) when the data is retrieved, the storage cost is slashed to around 1/10th of the comparable S3 or EBS costs. The terabyte-year cost comes down to just $120, or for the rapidly-growing pharma example above, an annual cost of $180k after five years. This is a massive saving of $1.6m, and severely undercuts the cost of keeping the data in-house too.
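The arithmetic behind these figures can be checked with a few lines of Python. This is only a back-of-envelope sketch using the round numbers quoted above; the decimal terabyte (1,000 GB) and all of the constant and function names are assumptions made for illustration, and real AWS pricing is tiered and changes over time.

```python
# Back-of-envelope storage costs using the round figures quoted in the post.
# Prices are held in whole US cents so that the arithmetic stays exact.
GB_PER_TB = 1000                   # assumption: decimal terabytes
MONTHS_PER_YEAR = 12

S3_CENTS_PER_GB_MONTH = 10         # "around 10 US cents per gigabyte per month"
GLACIER_CENTS_PER_GB_MONTH = S3_CENTS_PER_GB_MONTH // 10   # "around 1/10th"

TB_PRODUCED_PER_YEAR = 300         # the pharma example
YEARS_RETAINED = 5


def annual_cost_usd(total_tb, cents_per_gb_month):
    """Yearly cost in US dollars of keeping total_tb terabytes stored."""
    cents = total_tb * GB_PER_TB * MONTHS_PER_YEAR * cents_per_gb_month
    return cents // 100


total_tb = TB_PRODUCED_PER_YEAR * YEARS_RETAINED            # 1,500 TB = 1.5 PB
s3_annual = annual_cost_usd(total_tb, S3_CENTS_PER_GB_MONTH)
glacier_annual = annual_cost_usd(total_tb, GLACIER_CENTS_PER_GB_MONTH)

print(f"S3/EBS:  ${s3_annual:,} per year")
print(f"Glacier: ${glacier_annual:,} per year")
print(f"Saving:  ${s3_annual - glacier_annual:,} per year")
```

With these inputs the S3 bill comes to $1,800,000 per year and the Glacier bill to $180,000, a difference of $1,620,000, which is the "$1.6m" quoted above after rounding.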

Of course there are catches. Retrieving data from Glacier is free up to certain limits, but thereafter the requests are charged at the same per-terabyte fee as the monthly storage charge. Therefore once the initial limit is consumed, it will cost about $10 to recover a terabyte of data from Glacier. This is still a very small price to pay for securely archiving data that, in general, will never need to be accessed again but must be kept available just in case.

We’re already looking at ways that we can integrate Glacier into Eagle’s own services. Its future promise of direct automated archival from S3 is also very exciting and we can’t wait to try that out when it becomes available (Amazon suggest there might be a 12 week wait on this feature – so maybe in time for Christmas?).

A comparison of traditional vs. cloud HPC

Strait of Magellan Oil Platforms. SpecMode CC BY-SA-3.0

One area where the cloud is receiving a lot of attention is High Performance Compute (HPC). From a sheer number-of-cores perspective, the scalability of some platforms, such as Amazon Web Services (AWS), has been proven. See Cycle Computing’s 50,000 node virtual supercomputer for example. But HPC is not simply about CPU cycles. This blog takes a look at some of the arguments for and against cloud HPC from the perspective of a specialist company like Eagle. And from our perspective, without a few $M or so to purchase an on-premises solution, we can only consider hosted provision.

A detailed study comparing trad-HPC with cloud-HPC was published this February by the US DoE. See Scientific Computing’s synopsis, and Eagle’s earlier response. Although enthusiastically accepting the benefits of cloud computing, the report was generally critical of the technical suitability of cloud-HPC when compared to trad-HPC for scientific workflows. Some unbiased comments, given that Eagle has experience with both trad-HPC and cloud-HPC, are as follows:

Per-component performance of trad-HPC is undoubtedly better. Most significantly, network latency is typically an order of magnitude lower for trad-HPC than cloud-HPC. The reason for this is that compute nodes in trad-HPC are physically close, and linked by high performance (e.g. InfiniBand) connections, whereas cloud machines are generally connected by plain ol’ Ethernet, and may well be spread across multiple data centres.

Where latency really matters is for workloads that are memory bound and whose optimisation relies on Message Passing Interface (MPI) libraries. A great example from bioinformatics is genome assembly, which typically requires access to hundreds of GB of contiguous memory. I have yet to see a quality assembly of a large genome performed on any public cloud, although I have several ideas I’d like to try myself!

Latency is also a significant factor in the performance of network filesystems. Existing workflows that depend on writing data to shared storage typically need some modification to run efficiently on cloud-HPC. This is not generally a show stopper, but something to bear in mind nonetheless. On the flip side, distributed storage approaches such as Hadoop that are becoming increasingly used in HPC are perfectly suited to cloud infrastructure.

As demonstrated by Cycle Computing, cloud-HPC clearly trumps trad-HPC in scalability (expansion to meet demand). Another advantage of cloud-HPC’s scalability and reduced provisioning time is the ability to dynamically adjust the size of the cluster pool to the size of the job queue (see StarCluster, for example). This makes cloud-HPC ideal for the CPU-bound “embarrassingly parallel” workloads that are very common in bioinformatics and, by some accounts, dominate HPC workloads generally.
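The idea of matching the cluster pool to the job queue can be sketched in a few lines of Python. This is not StarCluster’s actual algorithm; the function name, the jobs-per-node figure and the node limits are all hypothetical, chosen only to illustrate the sizing rule.

```python
import math


def target_cluster_size(queued_jobs, jobs_per_node, min_nodes=0, max_nodes=100):
    """Return how many nodes to run for the current queue length.

    Hypothetical sizing rule: enough nodes to start every queued job at
    once, clamped to the range [min_nodes, max_nodes].
    """
    if jobs_per_node < 1:
        raise ValueError("jobs_per_node must be at least 1")
    needed = math.ceil(queued_jobs / jobs_per_node)
    return max(min_nodes, min(max_nodes, needed))


# Example: 8 job slots per node (an assumed figure), pool capped at 100 nodes.
for queued in (0, 7, 64, 5000):
    print(queued, "queued jobs ->", target_cluster_size(queued, jobs_per_node=8), "nodes")
```

A real load balancer would also consider how long jobs have been waiting and how close each node is to the end of its billing period before adding or removing capacity, but the core of the approach is this simple: the pool grows with the queue and shrinks to the minimum when the queue is empty.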

What really sets cloud-HPC apart, from a workflow developer’s standpoint, is self-service management and administration. This can result in huge productivity gains from individual developers, especially when using configuration management software such as Opscode Chef.

One should also bear in mind that the world does not stand still. Cloud providers are enhancing their HPC offerings all the time, through the addition of GPU instances and low latency network options for example. At the same time, academic communities are embracing virtualisation and cloud technologies for their next generation HPC infrastructures.

To a large extent the choice of trad-HPC vs. cloud-HPC will depend on the exact nature of the workload. However, for the majority of CPU-bound workflows at play in bioinformatics today, the superior scalability and sheer convenience/ease of use together make cloud-HPC a compelling choice.

Eagle presenting at Introduction to AWS

On Wednesday 30th May in Cambridge, at the Introduction to Amazon Web Services (AWS) event, Will Spooner will be presenting a case study on how Eagle used AWS to transform Unilever informatics and speed up their research by a factor of 20.

The event will take half a day, beginning with an introduction of AWS and how to develop software on it. Another AWS customer showcase will follow after Will's and then Amazon experts will demonstrate best practices on building software solutions that use their resources. A beer session will follow where attendees will have the chance to meet one-on-one with AWS gurus who will be happy to share useful and efficient technical ideas.

Registration for the event is easy – less hassle than taking a walk on the Cambridge DNA cycle path! As a bonus it's also free and includes a free beer.

We hope to see you there!

Image © Copyright Keith Edkins and licensed for reuse under the Creative Commons Licence.

Ensembl and AWS: an Industrial Perspective

The excellent AWS Case Study on Ensembl is of particular interest to us at Eagle, not only because Glenn Proctor has since joined us as Principal Consultant, but also because we have extensive experience in deploying Ensembl on EC2.

Our “Elastic Ensembl” was deployed in its most sophisticated guise for Pistoia Sequence Services Phase 1, in collaboration with Cognizant. The brief was to develop a secure, scalable, hosted instance of Ensembl for pharma users. Single sign-on (SSO) user authentication was achieved by lodging Ensembl behind an OpenAM user agent, and effortless auto-scaling and load balancing came courtesy of the Zeus Traffic Manager. Apart from the ‘industrial’ platform on which it was deployed, the browser itself was vanilla Ensembl code, with databases loaded from the Public Data Sets on AWS.

This project ably demonstrates the use of AWS for deploying complex web resources, and integrating them with best-of-breed commercial and open source offerings from the likes of Zeus and ForgeRock. It also shows how fantastic it is to have vital public resources like Ensembl available ‘locally’ (in AWS terms) via the Public Data Sets.

– Will Spooner

Genetics ‘cloud’ to create new opportunities for researchers and clinicians

Gene sequencing and analysis could be dramatically speeded up, leading to patients receiving a quicker and more accurate diagnosis, thanks to research led by Eagle Genomics Ltd.

Using cloud computing technology, the researchers have found they can slash the amount of time it takes to store the huge amounts of information produced when individual genes are sequenced and analysed.

Whereas at the moment this process can take up to three months, the scientists believe their new technique could mean results are produced in about a week.

Eagle Genomics, a leading open-source bioinformatics service provider, is carrying out the research in collaboration with The University of Manchester, and Cytocell Ltd., with assistance from NGRL, based at the Central Manchester University Hospitals NHS Foundation Trust, and the NIHR Manchester Biomedical Research Centre. The £500,000 project is part-funded by the UK’s national innovation agency, the Technology Strategy Board.

Access to the information analysed and stored by the cloud will enable medical researchers who are developing and testing new treatments to compare large amounts of information and find common genetic links.

The technology will also help clinicians to look at an individual patient’s genetic make-up to aid diagnosis and ongoing treatment.

Rather than simply testing a patient for one suspected condition, using the cloud technology could allow clinicians to test for a much wider range of complaints.

Currently, the NHS IT systems do not have the resources to cope with the huge demands required. The cloud system can be accessed from a separate site, away from hospitals, freeing up space.

The project will build upon the success of the Taverna Workflow Management System software developed by Professor Carole Goble’s myGrid team at The University of Manchester. Eagle Genomics will work with the University to adapt Taverna to allow non-IT experts to easily add and extract information and share it with their colleagues.

“Taverna is ideal for this project because it allows you to systematically automate the analysis processes of expert geneticists and make them easily available for others to use at the press of a button,” said Professor Andy Brass of The University of Manchester.

Example applications identified and described by NGRL and Cytocell will provide a significant and valuable resource to help develop and demonstrate the efficacy of the resulting system.

“Genetic sequencing is an increasingly important diagnostic tool as well as being fundamental to many areas of research,” said Professor Graeme Black, Director of the NIHR Manchester Biomedical Research Centre and a consultant at Manchester Royal Eye Hospital.  “By storing genetic data in the ‘cloud’ indefinitely, we can use it for research studies and also to help clinicians to decide if medical conditions, that patients develop at any stage, may be linked to their genes.”

Abel Ureta-Vidal, CEO of Eagle Genomics Ltd., added: “Thanks to funding from the Technology Strategy Board, this project is looking at ways in which genetic data can be securely and confidentially stored, accessed and analysed only by approved users.”

The project, which started in July 2011, is on target for completion of a fully functional system with an initial selection of analyses available by December 2012. 

Notes to editors:

Eagle Genomics Ltd. is an outsourced bioinformatics services and software company specialising in genome content management and the provision of open-source solutions. Eagle consistently delivers quality and value-for-money for customers across the biotech sector, combining cloud and NGS expertise with a track record in building scalable, efficient genomics analysis workflows.

The University of Manchester, a member of the Russell Group, is one of the most popular universities in the UK. According to the results of the 2008 Research Assessment Exercise, The University of Manchester is now one of the country’s major research universities, rated third in the UK in terms of ‘research power’.

The myGrid team produce and use a suite of tools designed to “help e-Scientists get on with science and get on with scientists”. The tools support the creation of e-laboratories and have been used in domains as diverse as systems biology, social science, music, astronomy, multimedia and chemistry.

Cytocell Ltd. is a leading European developer and manufacturer of FISH probes for use in both routine cytogenetics and in the analysis and classification of cancers. The Company’s products are well established in cytogenetics as the Company is celebrating its 20th year of supplying them to this market.

The NIHR Manchester Biomedical Research Centre was created by the National Institute for Health Research in 2008 to effectively move scientific breakthroughs from the laboratory. As a partnership between Central Manchester University Hospitals NHS Foundation Trust and The University of Manchester, the Biomedical Research Centre is designated as a specialist centre of excellence in genetics and developmental medicine.

Central Manchester University Hospitals NHS Foundation Trust is a leading provider of specialist healthcare services in Manchester, treating more than a million patients every year. Its five specialist hospitals (Manchester Royal Infirmary, Saint Mary’s Hospital, Royal Manchester Children’s Hospital, Manchester Royal Eye Hospital and the University Dental Hospital of Manchester) are home to hundreds of world class clinicians and academic staff committed to finding patients the best care and treatments.

NGRL provides dedicated support to UK genetic testing centres, focusing on health and bioinformatics, with the aim of bringing new technologies into diagnostic genetics services to the benefit of NHS patients.

The Technology Strategy Board is a business-led government body which works to create economic growth by ensuring that the UK is a global leader in innovation. Sponsored by the Department for Business, Innovation and Skills (BIS), the Technology Strategy Board brings together business, research and the public sector, supporting and accelerating the development of innovative products and services to meet market needs, tackle major societal challenges and help build the future economy.

For further information please contact:

Richard Holland, Operations and Delivery Director
Eagle Genomics Ltd., Babraham Research Campus, Cambridge
+44 (0)1223 654481 x3