An analysis of the Protein Data Bank in search of temporal and global trends

Bioinformatics. 1999 Oct;15(10):807-31. doi: 10.1093/bioinformatics/15.10.807.

Abstract

Motivation: Biological databases, with their rapidly expanding contents, are indispensable tools in the quest to understand biological function. However, a serious user of a database that comprises a large collection of data gathered over a long period will likely be struck by inconsistencies in how individual data items are reported. This paper takes a critical look at the Protein Data Bank (PDB) to assess the seriousness of the problem in one particular data set and to explore the implications for those actively engaged in comparative analysis of these data.

Results: Averaged over the complete corpus, the stereochemical quality of atomic models has, in the past few years, moved towards ideal values. At the same time, there are inconsistencies in how data are reported. Water content is not reported consistently, and the percentage of data collected in the reported high-resolution shell varies, which detracts from the value of resolution as a yardstick for assessing the quality of a structure. A more detailed analysis of these inconsistencies is hampered by the lack of machine-readable experimental data. For users of macromolecular structure data, this suggests that structural details beyond the standard quality measures of resolution and R value should be considered when using coordinate sets for further derivation or for inferring biological function. For the curators of the PDB, it suggests the need to capture more of the data associated with each experiment in a form that permits straightforward parsing.

Publication types

  • Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

  • Computational Biology
  • Databases, Factual* / trends
  • Humans
  • Internet
  • Models, Molecular
  • Protein Conformation
  • Proteins / chemistry*
  • Stereoisomerism
  • Time Factors

Substances

  • Proteins