Open-access synthetic spike-in mRNA-seq data for cancer gene fusions

BMC Genomics. 2014 Sep 30;15(1):824. doi: 10.1186/1471-2164-15-824.

Abstract

Background: Oncogenic fusion genes underlie the mechanism of several common cancers. RNA-seq analyses based on next-generation sequencing have revealed an increasing number of recurrent fusions in a variety of cancers. However, the absence of publicly available RNA-seq data focused on gene fusions impedes comparative assessment and collaborative development of novel gene-fusion detection algorithms. We have generated nine synthetic poly-adenylated RNA transcripts that correspond to previously reported oncogenic gene fusions. These synthetic RNAs were spiked into total RNA at known molarities spanning a wide range prior to construction of next-generation sequencing mRNA libraries, from which RNA-seq data were generated.

Results: Leveraging a priori knowledge of the replicates and molarity of each synthetic fusion transcript, we demonstrate the utility of this dataset for comparing the detection ability of multiple gene-fusion algorithms. In general, more fusions are detected at higher molarity, indicating that our constructs performed as expected. However, systematic detection differences are observed that depend on molarity or on algorithm-specific characteristics. Detection also differs by fusion sequence, indicating that for applications where specific sequences are being investigated, additional constructs may be added to provide quantitative data specific to the sequence of interest.
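As an illustration of the kind of comparison described in the Results, the sketch below tabulates per-molarity sensitivity for several fusion callers when the spiked fusions and replicates are known in advance. It is a minimal sketch under stated assumptions, not the authors' analysis: the function name, the fusion labels (FUSION_A, FUSION_B), the caller names, the molarity values (arbitrary units) and the toy calls are all hypothetical placeholders and are not taken from the dataset.

```python
from collections import defaultdict


def sensitivity_by_molarity(samples, spiked_fusions, calls):
    """Fraction of spiked fusions recovered per caller and molarity.

    samples        : list of (molarity, replicate) pairs that were sequenced
    spiked_fusions : set of fusion names present in every sample
    calls          : dict mapping caller name -> set of
                     (molarity, replicate, fusion) tuples the caller reported
    """
    # Number of true fusion instances expected at each molarity:
    # one per spiked fusion per replicate.
    expected = defaultdict(int)
    for molarity, _replicate in samples:
        expected[molarity] += len(spiked_fusions)

    sample_set = set(samples)
    table = {}
    for caller, reported in calls.items():
        hits = defaultdict(int)
        for molarity, replicate, fusion in set(reported):
            # Count only calls that match a spiked fusion in a known sample;
            # anything else is ignored here (it would count as a false positive).
            if (molarity, replicate) in sample_set and fusion in spiked_fusions:
                hits[molarity] += 1
        table[caller] = {m: hits[m] / expected[m] for m in sorted(expected)}
    return table


# Hypothetical toy input: two fusions, two molarities, two replicates each.
spiked = {"FUSION_A", "FUSION_B"}
samples = [(1.0, 1), (1.0, 2), (0.01, 1), (0.01, 2)]
calls = {
    "caller_x": {
        (1.0, 1, "FUSION_A"), (1.0, 1, "FUSION_B"),
        (1.0, 2, "FUSION_A"), (1.0, 2, "FUSION_B"),
        (0.01, 1, "FUSION_A"),
    },
    "caller_y": {
        (1.0, 1, "FUSION_A"), (1.0, 2, "FUSION_A"),
        (1.0, 2, "UNRELATED_FUSION"),  # not spiked, so not counted as a hit
    },
}

table = sensitivity_by_molarity(samples, spiked, calls)
for caller, row in table.items():
    print(caller, row)
```

Counting one expected instance per fusion per replicate keeps the denominator fixed across callers, so differences in the table reflect the callers rather than the sampling. Restricting the same tally to a single fusion would give the sequence-specific view mentioned in the Results.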

Conclusions: To our knowledge, this is the first publicly available synthetic RNA-seq dataset that specifically leverages known cancer gene fusions. The proposed method of designing multiple gene-fusion constructs over a wide range of molarity allows granular performance analyses of multiple fusion-detection algorithms. The community can leverage and augment this publicly available dataset to further the collaborative development of analytical tools and performance assessment frameworks for detecting gene fusions in next-generation sequencing data.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Carcinogenesis / genetics
  • Cell Line, Tumor
  • Gene Fusion*
  • Genes, Neoplasm / genetics*
  • High-Throughput Nucleotide Sequencing / methods*
  • Humans
  • Polyadenylation
  • RNA, Messenger / genetics
  • RNA, Messenger / metabolism
  • Sequence Analysis, RNA / methods*

Substances

  • RNA, Messenger