Machine Learning Strategies for Improved Phenotype Prediction in Underrepresented Populations

David Bonet; May Levin; Daniel Mas Montserrat; Alexander G Ioannidis

Pac Symp Biocomput. 2024:29:404-418.

Authors

David Bonet¹, May Levin, Daniel Mas Montserrat, Alexander G Ioannidis

Affiliation

¹ Stanford University, Stanford, CA, US
² Universitat Politècnica de Catalunya, Barcelona, Spain.

PMID: 38160295
PMCID: PMC10799683

Abstract

Precision medicine models often perform better for populations of European ancestry due to the over-representation of this group in the genomic datasets and large-scale biobanks from which the models are constructed. As a result, prediction models may misrepresent or provide less accurate treatment recommendations for underrepresented populations, contributing to health disparities. This study introduces an adaptable machine learning toolkit that integrates multiple existing methodologies and novel techniques to enhance the prediction accuracy for underrepresented populations in genomic datasets. By leveraging machine learning techniques, including gradient boosting and automated methods, coupled with novel population-conditional re-sampling techniques, our method significantly improves the phenotypic prediction from single nucleotide polymorphism (SNP) data for diverse populations. We evaluate our approach using the UK Biobank, which is composed primarily of British individuals with European ancestry, and a minority representation of groups with Asian and African ancestry. Performance metrics demonstrate substantial improvements in phenotype prediction for underrepresented groups, achieving prediction accuracy comparable to that of the majority group. This approach represents a significant step towards improving prediction accuracy amidst current dataset diversity challenges. By integrating a tailored pipeline, our approach fosters more equitable validity and utility of statistical genetics methods, paving the way for more inclusive models and outcomes.
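
The abstract does not specify how the population-conditional re-sampling works, so the following is only a minimal sketch of one plausible reading: oversampling each population group with replacement until every group has as many training samples as the largest one. The function name `resample_by_population`, the population labels, and the toy group sizes are hypothetical and are not taken from the paper.

```python
import random
from collections import defaultdict


def resample_by_population(labels, seed=0):
    """Return row indices in which every population label appears equally often.

    Hypothetical illustration, not the authors' implementation: groups smaller
    than the largest group are topped up by sampling with replacement.
    """
    rng = random.Random(seed)

    # Collect the row indices that belong to each population label.
    by_group = defaultdict(list)
    for idx, label in enumerate(labels):
        by_group[label].append(idx)

    # Every group is brought up to the size of the largest group.
    target = max(len(rows) for rows in by_group.values())

    balanced = []
    for label in sorted(by_group):
        rows = by_group[label]
        balanced.extend(rows)  # keep every original sample once
        balanced.extend(rng.choices(rows, k=target - len(rows)))  # oversample the rest

    rng.shuffle(balanced)
    return balanced


# Made-up labels: 8 majority-group samples and 2 + 1 minority-group samples.
labels = ["EUR"] * 8 + ["AFR"] * 2 + ["ASN"] * 1
indices = resample_by_population(labels)
counts = {g: sum(labels[i] == g for i in indices) for g in set(labels)}
```

The returned indices could then be used to select rows of a SNP matrix and phenotype vector before fitting a model such as a gradient-boosted predictor. Validation and test samples should be left out of the re-sampling so that duplicated individuals do not leak across splits.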

MeSH terms

Computational Biology*
Humans
Machine Learning*
Minority Groups
Phenotype
UK Biobank
White People

Grants and funding

R01 HG010140/HG/NHGRI NIH HHS/United States