Identifying and handling data bias within primary healthcare data using synthetic data generators

Heliyon. 2024 Jan 10;10(2):e24164. doi: 10.1016/j.heliyon.2024.e24164. eCollection 2024 Jan 30.

Abstract

Advanced synthetic data generators can simulate data samples that closely resemble sensitive personal datasets while significantly reducing the risk of individual identification. These generators hold enormous potential in the medical field, as they allow realistic stand-ins for sensitive patient data to be simulated and shared, enabling the development and rigorous validation of novel AI technologies for accurate diagnosis and efficient disease management. Despite the availability of massive ground-truth datasets (such as UK NHS databases containing millions of patient records), the risk remains that biases are carried over into the data generators. These biases may arise from the under-representation of specific patient cohorts, whether due to cultural sensitivities within certain communities or to standardised data collection procedures. Machine learning models can exhibit bias in various forms, including the under-representation of certain groups in the data, which can lead to missing data and inaccurate correlations and distributions that are then reflected in the synthetic data. Our paper aims to improve synthetic data generators by introducing probabilistic approaches that first detect difficult-to-predict data samples in the ground-truth data and then boost them when applying the generator. In addition, we explore strategies for generating synthetic data that reduce bias while improving the performance of predictive models.

Keywords: Bayesian networks; Data bias; Machine learning; Over-sampling; Synthetic data generators.
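
The abstract describes a detect-then-boost pipeline: score how difficult each ground-truth record is to predict, over-sample the difficult records, and only then fit the synthetic data generator. The following is a minimal sketch of that general idea, not the paper's actual method: it assumes a random-forest scorer with out-of-fold probabilities to measure difficulty, uses a Gaussian mixture as a simple stand-in for the paper's Bayesian-network generators, and the `difficulty_threshold` and `boost_factor` values are illustrative choices rather than published parameters.

```python
# Hedged sketch: flag hard-to-predict records via out-of-fold classifier
# confidence, over-sample them, then fit a simple generator on the boosted data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import cross_val_predict

# Toy data standing in for a primary-care dataset with an under-represented cohort.
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.9, 0.1], random_state=0)

# Step 1: estimate prediction difficulty per record, using out-of-fold
# probabilities so each sample is scored by a model that never saw it.
proba = cross_val_predict(RandomForestClassifier(random_state=0), X, y,
                          cv=5, method="predict_proba")
difficulty = 1.0 - proba[np.arange(len(y)), y]   # low confidence = difficult

# Step 2: boost (over-sample) the difficult records before fitting the generator.
difficulty_threshold = 0.5      # assumed cutoff, not from the paper
boost_factor = 3                # assumed replication factor
hard_idx = np.where(difficulty > difficulty_threshold)[0]
X_boosted = np.vstack([X] + [X[hard_idx]] * (boost_factor - 1))

# Step 3: fit a generator on the boosted data and draw synthetic samples
# (Gaussian mixture here as a stand-in for a Bayesian-network generator).
generator = GaussianMixture(n_components=8, random_state=0).fit(X_boosted)
X_synthetic, _ = generator.sample(n_samples=2000)
print(X_synthetic.shape)
```

In this sketch, over-sampling difficult records shifts the generator's fitted distribution toward otherwise under-represented regions of the data, which is the effect the paper targets; the paper itself pursues this probabilistically with Bayesian-network generators, as reflected in the keywords.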