Using co-occurrence network structure to extract synonymous gene and protein names from MEDLINE abstracts

BMC Bioinformatics. 2005 Apr 22;6:103. doi: 10.1186/1471-2105-6-103.

Abstract

Background: Text mining can help biomedical researchers cope with information overload by extracting useful knowledge from large collections of text. We developed a novel text-mining method that analyzes the network structure created by symbol co-occurrences as a way to extend the capabilities of knowledge extraction. The method was applied to the task of automatically extracting gene and protein name synonyms.
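
To make the idea of a symbol co-occurrence network concrete, here is a minimal sketch. It is not the scoring used in the paper, which the abstract does not describe. It assumes that two symbols are linked when they appear in the same abstract, and it uses neighborhood overlap (Jaccard similarity) as one plausible structural signal of synonymy. The function names and the toy abstracts are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence_network(abstract_symbols):
    """Map each symbol to the set of symbols it shares an abstract with."""
    neighbors = defaultdict(set)
    for symbols in abstract_symbols:
        # Link every unordered pair of distinct symbols in one abstract.
        for a, b in combinations(sorted(set(symbols)), 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
    return neighbors


def neighbor_overlap(neighbors, a, b):
    """Jaccard overlap of two symbols' neighborhoods, ignoring each other.

    Assumption for illustration only: synonyms rarely co-occur directly
    but tend to co-occur with the same other symbols.
    """
    na = neighbors[a] - {b}
    nb = neighbors[b] - {a}
    union = na | nb
    return len(na & nb) / len(union) if union else 0.0


# Hypothetical toy input: the symbols already extracted from each abstract.
abstracts = [
    ["TP53", "MDM2", "CDKN1A"],
    ["p53", "MDM2", "CDKN1A"],
    ["TP53", "BAX"],
    ["p53", "BAX"],
    ["EGFR", "KRAS"],
]

network = build_cooccurrence_network(abstracts)
score_syn = neighbor_overlap(network, "TP53", "p53")
score_other = neighbor_overlap(network, "TP53", "EGFR")
print(score_syn, score_other)
```

In this toy data, TP53 and p53 never appear together but share all of their neighbors, so they score higher than the unrelated pair TP53 and EGFR.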

Results: Performance was measured on a test set consisting of about 50,000 abstracts from one year of MEDLINE. Synonyms retrieved from curated genomics databases were used as a gold standard. The system obtained a maximum F-score of 22.21% (23.18% precision and 21.36% recall), with high efficiency in the use of seed pairs.
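
As an illustration of how precision, recall, and F-score can be computed against a gold standard of synonym pairs, here is a minimal sketch. It assumes unordered pairs and the balanced F-score (harmonic mean of precision and recall). The `evaluate` function and the example pairs are hypothetical and do not reproduce the paper's evaluation data.

```python
def evaluate(predicted_pairs, gold_pairs):
    """Return (precision, recall, F-score) for unordered synonym pairs."""
    # frozenset makes ("TP53", "p53") and ("p53", "TP53") the same pair.
    predicted = {frozenset(p) for p in predicted_pairs}
    gold = {frozenset(p) for p in gold_pairs}
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical system output and gold standard, for illustration only.
predicted = [("TP53", "p53"), ("EGFR", "ERBB1"), ("KRAS", "BAX"), ("MDM2", "HDM2")]
gold = [
    ("p53", "TP53"),
    ("ERBB1", "EGFR"),
    ("CDKN1A", "p21"),
    ("HDM2", "MDM2"),
    ("BAX", "BCL2L4"),
]

precision, recall, f_score = evaluate(predicted, gold)
print(f"precision={precision:.2%} recall={recall:.2%} F={f_score:.2%}")
```

Three of the four predicted pairs appear in the gold standard, and three of the five gold pairs are recovered, so precision is 75% and recall is 60%.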

Conclusion: The method performs comparably with other studied methods, does not rely on sophisticated named-entity recognition, and requires little initial seed knowledge.

Publication types

  • Research Support, N.I.H., Extramural
  • Research Support, U.S. Gov't, P.H.S.

MeSH terms

  • Algorithms
  • Artificial Intelligence
  • Automation
  • Computational Biology / methods*
  • Computer Graphics
  • Computers
  • Database Management Systems
  • Databases, Bibliographic
  • Databases, Genetic
  • Gene Expression Regulation, Neoplastic
  • Genome
  • Humans
  • Information Storage and Retrieval
  • Information Systems
  • MEDLINE
  • Natural Language Processing
  • Neoplasms / genetics
  • Neural Networks, Computer
  • Pattern Recognition, Automated
  • Programming Languages
  • Reproducibility of Results
  • Software Design
  • Software*
  • Terminology as Topic
  • Vocabulary, Controlled