Text-mined phenotype annotation and vector-based similarity to improve identification of similar phenotypes and causative genes in monogenic disease patients

Hum Mutat. 2018 May;39(5):643-652. doi: 10.1002/humu.23413. Epub 2018 Mar 15.

Abstract

The genetic diagnosis of rare monogenic diseases using exome/genome sequencing requires the true causal variant(s) to be identified from tens of thousands of observed variants. Typically a virtual gene panel approach is taken whereby only variants in genes known to cause phenotypes resembling the patient under investigation are considered. With the number of known monogenic gene-disease pairs exceeding 5,000, manual curation of personalized virtual panels using exhaustive knowledge of the genetic basis of the human monogenic phenotypic spectrum is challenging. We present improved probabilistic methods for estimating phenotypic similarity based on Human Phenotype Ontology annotation. A limitation of existing methods for evaluating a disease's similarity to a reference set is that reference diseases are typically represented as a series of binary (present/absent) observations of phenotypic terms. We evaluate a quantified disease reference set, using term frequency in phenotypic text descriptions to approximate term relevance. We demonstrate an improved ability to identify related diseases through the use of a quantified reference set, and that vector space similarity measures perform better than established information content-based measures. These improvements enable the generation of bespoke virtual gene panels, facilitating more accurate and efficient interpretation of genomic variant profiles from individuals with rare Mendelian disorders. These methods are available online at https://atlas.genetics.kcl.ac.uk/~jake/cgi-bin/patient_sim.py.

Keywords: HPO; Mendelian; genetic diagnosis; monogenic; phenotype similarity; rare disease; variant prioritization; whole exome sequencing; whole genome sequencing.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Computer Simulation
  • Data Curation*
  • Data Mining*
  • Databases, Genetic*
  • Disease / genetics*
  • Genetic Predisposition to Disease*
  • Humans
  • Logistic Models
  • Phenotype
  • Probability
  • Search Engine