Methodology to identify a gene expression signature by merging microarray datasets

Comput Biol Med. 2023 Jun:159:106867. doi: 10.1016/j.compbiomed.2023.106867. Epub 2023 Apr 11.

Abstract

A vast number of microarray datasets have been produced to identify differentially expressed genes and gene expression signatures. A better understanding of the underlying biological processes can help in the diagnosis and prognosis of diseases, as well as in the therapeutic response to drugs. However, most of the available datasets contain only a small number of samples, leading to low statistical, predictive and generalization power. One way to overcome this problem is to merge several microarray datasets into a single dataset, which is typically a challenging task. Statistical methods or supervised machine learning algorithms are usually used to determine gene expression signatures. Nevertheless, statistical methods require an arbitrary threshold to be defined, and supervised machine learning methods can be ineffective when applied to high-dimensional datasets such as microarrays. We propose a methodology to identify gene expression signatures by merging microarray datasets. The methodology uses statistical methods to obtain several sets of differentially expressed genes and then uses supervised machine learning algorithms to select the gene expression signature from among them. It was validated in two distinct research applications: one using heart failure microarray datasets and the other using autism spectrum disorder microarray datasets. For the first, we obtained a gene expression signature composed of 117 genes, with a classification accuracy of approximately 98%. For the second, we obtained a gene expression signature composed of 79 genes, with a classification accuracy of approximately 82%. The methodology was implemented in R and is available, under the MIT licence, at https://github.com/bioinformatics-ua/MicroGES.
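The two-stage idea described in the abstract (several candidate sets of differentially expressed genes from a statistical test, then a classifier to choose among them) can be illustrated with a minimal sketch. This is not the MicroGES implementation, which is written in R. Everything here is an assumption made for illustration: the function names, the use of Welch's t statistic with thresholds 2, 4 and 8, the nearest-centroid classifier standing in for the LSVM, random forest and neural network models named in the keywords, leave-one-out accuracy as the selection criterion, and the synthetic data. The sketch also assumes the datasets have already been merged into one samples-by-genes matrix.

```python
import random
import statistics


def welch_t(a, b):
    """Welch's t statistic for two lists of floats (0.0 if the standard error is zero)."""
    se = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
    return 0.0 if se == 0 else (statistics.mean(a) - statistics.mean(b)) / se


def candidate_gene_sets(X, y, thresholds):
    """Stage 1: one candidate set of gene indices per |t| threshold."""
    scores = []
    for g in range(len(X[0])):
        cases = [row[g] for row, label in zip(X, y) if label == 1]
        controls = [row[g] for row, label in zip(X, y) if label == 0]
        scores.append(abs(welch_t(cases, controls)))
    return {t: [g for g, s in enumerate(scores) if s >= t] for t in thresholds}


def predict(train_X, train_y, x, genes):
    """Nearest-centroid prediction restricted to the given genes (stand-in classifier)."""
    best_label, best_dist = None, float("inf")
    for label in (0, 1):
        rows = [r for r, l in zip(train_X, train_y) if l == label]
        centroid = [statistics.mean([r[g] for r in rows]) for g in genes]
        dist = sum((x[g] - m) ** 2 for g, m in zip(genes, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def loo_accuracy(X, y, genes):
    """Leave-one-out classification accuracy using only the given genes."""
    hits = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        hits += predict(train_X, train_y, X[i], genes) == y[i]
    return hits / len(X)


def select_signature(X, y, thresholds):
    """Stage 2: keep the non-empty candidate set with the best accuracy (ties: fewer genes)."""
    candidates = [g for g in candidate_gene_sets(X, y, thresholds).values() if g]
    scored = [(loo_accuracy(X, y, genes), -len(genes), genes) for genes in candidates]
    if not scored:
        return [], 0.0
    accuracy, _, signature = max(scored)
    return signature, accuracy


def make_toy_data(seed=0, n_per_class=10, n_genes=20, n_informative=3, shift=6.0):
    """Synthetic merged matrix: the first n_informative genes are shifted in cases."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
            if label == 1:
                for g in range(n_informative):
                    row[g] += shift
            X.append(row)
            y.append(label)
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    signature, accuracy = select_signature(X, y, thresholds=[2.0, 4.0, 8.0])
    print("signature genes:", signature)
    print("leave-one-out accuracy:", accuracy)
```

Letting a classifier choose among several thresholds avoids committing to a single arbitrary cut-off, which is the limitation of purely statistical selection that the abstract points out. Note that in this sketch the genes are ranked on all samples before cross-validation, so the reported accuracy is optimistic; an honest estimate would need a held-out set or gene selection repeated inside each fold.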

Keywords: Autism spectrum disorder; Gene expression signature; Heart failure; LSVM; Microarray data; Neural network; Random forest.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Autism Spectrum Disorder*
  • Gene Expression Profiling* / methods
  • Humans
  • Oligonucleotide Array Sequence Analysis / methods
  • Transcriptome