Oral Presentation GENEMAPPERS 2024

Fine mapping known coronary artery disease loci in UK Biobank’s whole genome sequencing data using research analysis platform parallelisation orchestration engine template (RAPpoet) (#27)

Mitchell J O'Brien 1 , Letitia M.F Sng 1 , Anubhav Kaphle 1 , Brendan Hosking 1 , Roc Reguant 1 , Yatish Jain 1 , Johan Verjans 2 , Natalie A Twine 1 , Denis C Bauer 1
  1. Commonwealth Scientific and Industrial Research Organisation (CSIRO), Sydney, NEW SOUTH WALES, Australia
  2. South Australian Health and Medical Research Institute, Adelaide, South Australia, Australia

The UK Biobank (UKB) has recently made whole genome sequencing (WGS) data from half a million individuals, totalling 27.5 petabytes, available through its Research Analysis Platform (RAP). The cloud-based platform securely stores genomic files alongside other health data, streamlining access and eliminating the need for data transfer. The RAP marks a shift towards conducting genomic analyses in the cloud and requires workflows to be adapted to this new approach to data sharing.

Using the RAP and the UKB WGS dataset, we conducted an association analysis of a coronary artery disease (CAD) cohort. Our workflow validated and fine-mapped the 9p21.3 CAD risk locus, using PolyFun to pinpoint rs10757274 as the primary causal SNP within this locus. We compared machine-learning (ML) association analysis approaches, REGENIE and VariantSpark, with traditional logistic regression; the ML approaches showed heightened sensitivity, including the identification of the known CAD SNP rs28451064 in the 21q22.11 risk locus. These findings underscore the efficacy of advanced computational techniques and cloud-based resources in mega-biobank analyses.

To make our workflow efficient, we designed a scalable orchestration engine, the RAP parallelisation orchestration engine template (RAPpoet). By optimising the compute architecture on the UKB RAP, we achieved a 44% cost reduction and a 40% speedup. We also identified three crucial considerations for researchers adopting cloud-based workflows: parallelisation, architecture tuning, and privacy.
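
As a rough illustration of the parallelisation idea only, the sketch below splits a genomic region into fixed-size windows and processes them concurrently. It is not RAPpoet's implementation: the function names (make_windows, analyse_window, run_parallel), the window size, the worker count, and the coordinates are all hypothetical, and analyse_window is a stand-in for whatever per-region job a real workflow would submit.

```python
from concurrent.futures import ThreadPoolExecutor


def make_windows(chrom, start, end, size):
    """Split the half-open interval [start, end) into windows of at most `size` bases."""
    return [(chrom, s, min(s + size, end)) for s in range(start, end, size)]


def analyse_window(window):
    """Hypothetical per-window job; here it only reports the window length."""
    chrom, start, end = window
    return {"chrom": chrom, "start": start, "end": end, "n_bases": end - start}


def run_parallel(windows, max_workers=4):
    """Run the per-window job concurrently; pool.map preserves input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(analyse_window, windows))


# Arbitrary illustrative coordinates, not the region analysed in the study.
windows = make_windows("chr9", 21_000_000, 23_000_000, 500_000)
results = run_parallel(windows)
```

In a cloud setting, the window size and worker count would be among the parameters to tune against instance cost, which is the kind of trade-off the considerations above refer to.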

This research represents the first association study conducted on the world's largest public WGS cohort from the UKB and lays the groundwork for utilising mega-biobank-sized data with scalable, cost-effective cloud computing solutions.