Background
Integrating genetic data from diverse populations is crucial for understanding how genetics contributes to disease and for avoiding health disparities when genetic findings are implemented in healthcare. However, current reference panels either represent a limited number of populations or have small sample sizes. We used the diverse population backgrounds in the UK Biobank to build a reference dataset for grouping genetically similar individuals worldwide, with the aim of improving population representation and mitigating the reference population bias that arises when existing panels lack populations closely related to the individuals of interest.
Methods
Information on countries of birth and ethnic backgrounds from the UK Biobank dataset was combined with findings from previous genetic structure studies to infer population labels for groups of genetically similar individuals. We then trained a random forest model on the principal components of the genetic data to identify the population to which each individual is most genetically similar. The model's performance was validated using data from the 1000 Genomes Project and the CARTaGENE biobank.
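The classification step can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes scikit-learn and NumPy are available, replaces real principal components with synthetic data, and uses hypothetical population names (`pop_A`, `pop_B`, `pop_C`) and arbitrary hyperparameters.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for principal components (assumption): three
# hypothetical populations, 200 individuals each, 10 PCs per individual.
n_per_pop, n_pcs = 200, 10
labels = np.repeat(["pop_A", "pop_B", "pop_C"], n_per_pop)
centres = rng.normal(scale=5.0, size=(3, n_pcs))
pcs = np.vstack([c + rng.normal(size=(n_per_pop, n_pcs)) for c in centres])

# Hold out a quarter of the labeled individuals for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    pcs, labels, test_size=0.25, stratify=labels, random_state=0
)

# Train a random forest on the PCs; hyperparameters here are illustrative.
model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# The predicted class is the population each individual is most similar to.
predicted = model.predict(X_test)
accuracy = float((predicted == y_test).mean())
print(f"held-out accuracy: {accuracy:.3f}")
```

In practice the held-out set would be an external cohort rather than a random split, and the number of principal components would need to be chosen for the data at hand.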
Results
We identified good-quality labels for a wide range of populations represented in the UK. Our approach distinguishes 19 populations worldwide, allowing finer-grained grouping of genetically similar individuals than is currently possible with resources such as 1000 Genomes. The model achieved medium to high precision and recall for most labeled populations, although both were lower when distinguishing closely related groups. For example, 519 people in CARTaGENE were most genetically similar to the Middle Eastern reference set that we derived from the UK Biobank (1000 Genomes contains no Middle Eastern samples), with a precision of 81.1% and a recall of 97.0% when compared with demographic data.
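For readers unfamiliar with the two metrics, the sketch below shows how per-population precision and recall can be computed by comparing predicted labels with demographic labels. The function name `precision_recall` and the six-person toy dataset are hypothetical and purely illustrative; they are not data from the study.

```python
def precision_recall(true_labels, predicted_labels, target):
    """Return (precision, recall) for one target population."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == target and p == target for t, p in pairs)  # correctly assigned
    fp = sum(t != target and p == target for t, p in pairs)  # wrongly assigned
    fn = sum(t == target and p != target for t, p in pairs)  # missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy labels (hypothetical): demographic label vs. model prediction.
reported = ["Middle Eastern", "Middle Eastern", "European",
            "European", "Middle Eastern", "South Asian"]
predicted = ["Middle Eastern", "Middle Eastern", "Middle Eastern",
             "European", "European", "South Asian"]

precision, recall = precision_recall(reported, predicted, "Middle Eastern")
print(f"precision={precision:.3f}, recall={recall:.3f}")
```

Precision is the fraction of individuals assigned to a population whose demographic label agrees, while recall is the fraction of individuals with that demographic label whom the model assigns to it.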
Conclusions
Our method can facilitate downstream genetic analyses, such as genome-wide association studies or polygenic risk scores, in underrepresented populations.