M1. Big Data and Visual Analytics
by Tomasz Bednarz, Piotr Szul
This workshop introduces various Big Data frameworks. It starts with a general introduction to Big Data concepts and applications. It showcases the Hadoop Distributed File System (HDFS) and Hadoop MapReduce as solutions widely adopted by industry for big data analytics. You will see how to upload and operate on files in HDFS, and develop, step by step, an understanding of how to write MapReduce code. Various design patterns will be presented, and other frameworks introduced: Apache Pig, Apache Spark, Mahout, and Sqoop. An example of running a recommender system on a Big Data framework will be demonstrated.
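The MapReduce model at the heart of Hadoop can be illustrated without a cluster. The sketch below is a hypothetical plain-Python simulation of the classic word-count job (not Hadoop API code): a map phase emits (word, 1) pairs, a shuffle groups the pairs by key, and a reduce phase sums the counts per key.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(line):
    # Mapper: emit a (word, 1) pair for every word in the input line.
    for word in line.lower().split():
        yield (word, 1)

def reduce_phase(word, counts):
    # Reducer: sum all counts observed for one key.
    return (word, sum(counts))

def word_count(lines):
    # Map every input line, then "shuffle": sort pairs so equal keys are adjacent.
    pairs = sorted(pair for line in lines for pair in map_phase(line))
    # Reduce each group of pairs sharing the same key.
    return dict(
        reduce_phase(word, (c for _, c in group))
        for word, group in groupby(pairs, key=itemgetter(0))
    )

counts = word_count(["big data big compute", "big data"])
print(counts)  # {'big': 3, 'compute': 1, 'data': 2}
```

On a real cluster, the map and reduce functions keep this shape, but the framework distributes the shuffle and runs mappers and reducers in parallel across HDFS blocks.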
M2. Bioinformatics and Big Data
by Annette McGrath, CSIRO
Driven by advances in measurement and data acquisition technologies that allow very substantial amounts of data to be produced daily, and aided by precipitous drops in the price of producing this data, biology has become a data science. As these technologies improve, the volume of data being generated is exceeding the capacity of conventional computational methods and hardware to analyse it, creating a ‘Big Data’ problem in Bioinformatics. Nonetheless, the availability of this data provides bioinformaticians with an unprecedented opportunity, as effective analysis and interpretation of Big Data offers new ways to uncover hidden patterns and to build better predictive models.
Bioinformatics faces challenges in managing, storing, processing, analysing, and integrating different types of molecular biological information to gain insights that will lead to new discoveries in a number of fields and to new therapies in human health.
This workshop will provide an overview of the Big Data problem in Bioinformatics and will present some of the ways in which bioinformatics researchers are using alternative approaches and methodologies to deal with this problem.
Speaker: Annette McGrath, CSIRO
Title: Accelerating alignment: can we gain from using GPU-based tools in bioinformatics?
Abstract: Sequence alignment of very large numbers of generally short sequence reads generated by high-throughput, low-cost next generation sequencing (NGS) platforms is the first step in many NGS analysis workflows. GPU (Graphics Processing Unit)-based short-read aligners have recently been developed with the aim of accelerating the alignment process relative to CPU-based tools, without loss of alignment accuracy. GPU-based aligners take advantage of the massively parallel nature of GPU processors, which can execute hundreds of tasks with the same instruction simultaneously. We have carried out a comparative study to investigate the capabilities of GPUs and GPU-aware short-read aligners against CPU-based ones, to understand in what circumstances it is beneficial to use GPUs and what the limitations are. This presentation outlines the datasets and tools we chose, the experimental workflows, the outcomes, and the lessons we have learnt.
Annette McGrath has almost 20 years’ experience as a bioinformatician. She graduated from the National University of Ireland with a PhD in molecular biology and from the University of Queensland with a graduate diploma in statistics. Following postdoctoral work in bioinformatics on multiple sequence alignment, she worked for 3 years as a staff scientist and team leader in a biotech company in Auckland, New Zealand. She then spent 8 years as Head of Bioinformatics at the Australian Genome Research Facility, followed by Head of Bioinformatics at Queensland Facility for Advanced Bioinformatics in 2010. In 2011 she was recruited to establish and lead the CSIRO Bioinformatics Core, dedicated to enhancing capability in bioinformatics across CSIRO. She is a Principal Research Scientist and Team Leader in life science informatics with interests in the application of ‘omics technologies and big data and with a passion for bioinformatics education and training.
Speaker: Andrew George, CSIRO
Title: R with big (genomic) data: friend or foe
Abstract: R is an amazingly useful language for managing, analysing, and visualising data. R's strengths are its popularity, its abundance of add-on packages, and the fact that it is free. One of its weaknesses, though, is that R is memory bound: it does not cope well with data that cannot fit into a machine's RAM. All is not lost, however. I will go through the five steps you, as a devoted R user, can take when faced with big data. I will also talk about why you might want to use R as a bioinformatician and what resources are available to you. I will conclude with how I solved the big data problem that arose when I was developing software for genome-wide association mapping.
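The talk concerns R, but the core idea behind several remedies for memory-bound analysis, streaming the data in chunks so only a small piece is ever resident in RAM, is language-agnostic. The sketch below illustrates it in Python with a hypothetical running-mean computation (the chunk size and in-memory stand-in file are illustrative only):

```python
import io
from itertools import islice

def chunked_mean(lines, chunk_size=1024):
    # Keep only a running sum and count, so at most one chunk is in memory.
    total, n = 0.0, 0
    while True:
        chunk = list(islice(lines, chunk_size))
        if not chunk:
            break
        total += sum(float(x) for x in chunk)
        n += len(chunk)
    return total / n

# An in-memory file-like object stands in for a file too large to fit in RAM;
# in practice `lines` would be an open file handle read lazily.
data = io.StringIO("1\n2\n3\n4\n5\n")
print(chunked_mean(data, chunk_size=2))  # 3.0
```

The same pattern underlies out-of-core approaches in R (e.g. reading a file connection in blocks) as well as distributed frameworks, which push chunked computation across many machines.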
Andrew George has over 15 years’ experience as a statistician. His career has been focused on the development of new statistical methods for uncovering the genetic basis of heritable traits in animals, plants, and humans. He obtained his PhD from Queensland University of Technology in Brisbane, Australia, and completed postdoctoral fellowships at Roslin Institute, Scotland and at the University of Washington, USA. He was a lecturer in the Department of Biostatistics at the University of Iowa for four years before taking up an appointment with the CSIRO as a principal research scientist. Andrew is now the group leader of the statistical learning group at the CSIRO with interests in the statistical challenges of analysing big data.
Speaker: Aidan O’Brien, CSIRO
Title: VariantSpark: applying Spark-based machine learning methods to genomic information
Genomic information is increasingly being used for medical research, giving rise to the need for efficient and scalable analysis methodology able to cope with thousands of individuals and millions of variants. Catering for this need, we developed VariantSpark, a Hadoop/Spark framework, to provide a means of parallelisation for population-scale bioinformatics tasks. VariantSpark offers an interface between Variant Call Format (VCF) and Mutation Annotation Format (MAF) files, and Spark ML’s machine learning pipelines, providing seamless genome-wide sampling of variants and also a pipeline for visualising results.
To demonstrate the capabilities of VariantSpark, we cluster more than 3,000 individuals with 80 million variants each to determine the population structure in the dataset. VariantSpark is 80% faster than ADAM, the Spark-based genome clustering approach developed by the Global Alliance for Genomics and Health; a comparable implementation using Hadoop/Mahout; and Admixture, a commonly used tool for determining individual ancestries. It is over 90% faster than traditional implementations using R and Python. These benefits in speed, resource consumption, and scalability enable VariantSpark to open up the use of advanced, efficient machine learning algorithms on genomic data.
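The clustering step itself can be illustrated on toy data. The sketch below is a minimal plain-Python k-means, not VariantSpark's implementation (a real run would use Spark ML's distributed KMeans over VCF-derived feature vectors); the genotype vectors and deterministic initialisation are illustrative assumptions.

```python
def kmeans(points, k, iterations=10):
    # Toy deterministic init: take the first k points as centroids.
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:
                centroids[i] = [sum(dim) / len(cluster) for dim in zip(*cluster)]
    return clusters

# Hypothetical genotype vectors: allele counts (0/1/2) at three variant sites.
individuals = [(0, 0, 1), (0, 1, 1), (2, 2, 1), (2, 2, 2)]
groups = kmeans(individuals, k=2)
print(groups)  # [[(0, 0, 1), (0, 1, 1)], [(2, 2, 1), (2, 2, 2)]]
```

At population scale the same assignment/update iteration is parallelised: each Spark partition assigns its individuals locally, and only per-cluster sums are aggregated to update the centroids.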
Aidan O’Brien graduated from the University of Queensland with a Bachelor of Biotechnology (1st class honours) in 2013. With Dr. Timothy Bailey as his honours supervisor, he developed GT-Scan, a CRISPR target predictor, which is currently freely available through EMBL Australia. Aidan is now based at CSIRO with the transformational bioinformatics team, where he developed VariantSpark, which applies Big Data machine learning algorithms to genomic data. Aidan has 4 journal publications (3 first-author) with 20 citations (h-index 2). He received the “Best student and postdoc” award at CSIRO in 2015 and has attracted $180K in funding to date as AI. With an increasing interest in machine learning and its applications to genomics, he is currently interested in pursuing a PhD in this field.
Speaker: Kim-Anh Lê Cao (University of Queensland Diamantina Institute)
Title: Multi- ‘omics data integration and biomarker discovery: a multivariate statistical perspective
The advent of high-throughput technologies has led to a wealth of publicly available biological data coming from different sources, the so-called ‘omics data (transcriptomics for the study of transcripts, proteomics for proteins, metabolomics for metabolites, etc). Combining such large-scale biological data sets can lead to the discovery of important biological insights, provided that relevant information can be extracted in a holistic manner. Recent advances in multivariate statistics now enable us to integrate data and select highly correlated molecular features within and between ‘omics data sets, while achieving dimension reduction and producing attractive visualisations. During this presentation I will discuss the benefits of using multivariate integrative techniques for such complex problems arising from high-throughput molecular biology, and outline the current and future analytical challenges lying ahead.
Dr Kim-Anh Lê Cao was awarded her PhD in 2008 at Université de Toulouse, France. She then moved to Australia as a postdoctoral fellow at the University of Queensland.
She is now working at the University of Queensland Diamantina Institute as a National Health and Medical Research Council (NHMRC) Career Development Fellow. Her team focuses on the development of statistical approaches for the analysis and the integration of large biological data sets for studies in several types of cancer, and diseases involving the immune system, including arthritis, chronic infections, and diabetes.
Since 2009, her team has been developing statistical software dedicated to the integrative analysis of ‘omics data, to help researchers make sense of biological big data (http://www.mixOmics.org).
Speaker: John Pearson, QIMR Berghofer
Title: Managing Genome-Scale Data
As the cost of genome sequencing decreases, the amount of research sequencing increases and the size of the resultant datasets grows. Managing, analysing and protecting the data are non-trivial problems and there are pitfalls aplenty. John has been part of the largest human genome sequencing projects in Australia and the world and will use the Australian International Cancer Genome Consortium (ICGC) projects as case studies – what was easy, what was hard and what he would never do again.
John Pearson is a trained computer scientist and has spent over 20 years creating software for medical researchers. John is the head of the genome informatics group at QIMR Berghofer. He has spent the past 8 years leading genome informatics teams working with next-generation sequencing in the USA and Australia. John led the QIMR Genetic Epidemiology software team prior to moving to the NIH in Bethesda, MD where he was the lead programmer in the NHGRI Bioinformatics and Scientific Programming Core. In 2003, John left the NIH to become a founding Faculty member at the Translational Genomics Research Institute (TGen) and in 2010 he returned to Australia as the Senior Bioinformatics Manager for Australia’s International Cancer Genome Consortium Project. John has held software development grants from the American Cancer Society, the National Institutes of Health and Microsoft and in 2009 was the Principal Investigator on a successful bid for one of 5 In Silico Research Centers of Excellence (ISCRE) awarded by the National Cancer Institute. He has published in Science (1), New England Journal of Medicine (1), Nature Methods (3), Nature Communications (1), Nature Genetics (1), and Nature (6).
A1. Multi-Dimensional Data Visualisation with Parallel Coordinates
by Julian Heinrich | CSIRO Data61
Today, multidimensional data is ubiquitous. Purchasing a new camera, analysing the stock market, or extracting patterns from gene expression data requires both non-experts and data scientists to make sense of data with multiple dimensions or variables.
In this workshop, I will give an introduction to parallel coordinates – a visualisation technique for multidimensional data. In the first part of the workshop, I will provide an overview of the basic geometry and the patterns that emerge in parallel coordinates, and I will discuss the good and the bad parts of this visualisation technique.
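The basic geometry is compact enough to sketch: each d-dimensional data point becomes a polyline whose vertex on axis i is the point's value in dimension i, min-max normalised to that axis, with the axes placed side by side at x = 0, 1, …, d-1. The function below is a hypothetical illustration of that mapping, not workshop material:

```python
def parallel_coordinates(data):
    # One vertical axis per dimension; transpose to get per-axis value ranges.
    dims = list(zip(*data))
    polylines = []
    for point in data:
        vertices = []
        for axis, value in enumerate(point):
            lo, hi = min(dims[axis]), max(dims[axis])
            # Min-max normalise onto [0, 1]; centre the value on constant axes.
            y = (value - lo) / (hi - lo) if hi > lo else 0.5
            vertices.append((axis, y))
        polylines.append(vertices)
    return polylines

# Two negatively correlated dimensions: all polyline segments cross between
# the axes, one of the characteristic patterns parallel coordinates reveal.
lines = parallel_coordinates([(1, 10), (2, 5), (3, 0)])
print(lines)
# [[(0, 0.0), (1, 1.0)], [(0, 0.5), (1, 0.5)], [(0, 1.0), (1, 0.0)]]
```

Rendering is then just drawing each vertex list as a connected line, which is why the technique scales naturally to many dimensions: adding a dimension adds one axis, not one plot.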
A2. OpenGL for Big Data Visualisation
by Tomasz Bednarz, Daniel Filonik, Xavier Ho, and Steve Psaltis
In modern data science, accelerated visualisation techniques are required to explore data sets in a robust way. This workshop will provide an introduction to the Open Graphics Library (OpenGL), the most widely adopted 2D and 3D graphics API in the industry, which brings thousands of applications to a wide variety of computer platforms and can be used effectively to visualise big data.
OpenGL is essential for effective use of big data, due to its real-time interactive capabilities and accelerated rendering. It allows more value to be unlocked from big data by making it possible to create modern analytics tools and user interfaces, and to link models interactively to users’ inputs. OpenGL is also a key enabler for the effective use of Virtual Reality and Augmented Reality.
In addition, OpenGL interoperates with compute platforms such as OpenCL and CUDA, enabling developers and scientists to create high-performance, visually compelling graphics applications. The workshop will include introductions and hands-on exercises covering the basic concepts of hardware-accelerated rendering, shader languages, and WebGL, the web-based version of OpenGL.