FLIGHTED: Inferring Fitness Landscapes from Noisy High-Throughput Experimental Data

bioRxiv [Preprint]. 2024 Mar 27:2024.03.26.586797. doi: 10.1101/2024.03.26.586797.

Abstract

Machine learning (ML) for protein design requires large protein fitness datasets generated by high-throughput experiments for training, fine-tuning, and benchmarking models. However, most models do not account for the experimental noise inherent in these datasets, which harms model performance and changes model rankings in benchmarking studies. Here, we develop FLIGHTED, a Bayesian method for generating fitness landscapes with calibrated errors from noisy high-throughput experimental data. We apply FLIGHTED to single-step selection assays such as phage display and to a novel high-throughput assay, DHARMA, that ties fitness to base editing activity. Our results show that FLIGHTED robustly generates fitness landscapes with accurate errors. We demonstrate that FLIGHTED improves model performance and enables the generation of protein fitness datasets of up to 10^6 variants with DHARMA. FLIGHTED can be used on any high-throughput assay and makes it easy for ML scientists to account for experimental noise when modeling protein fitness.

Keywords: Bayesian inference; Machine learning; fitness landscape; protein design.

Publication types

  • Preprint