Information-incorporated sparse convex clustering for disease subtyping

Bioinformatics. 2023 Jul 1;39(7):btad417. doi: 10.1093/bioinformatics/btad417.

Abstract

Motivation: Heterogeneity in human diseases presents clinical challenges in accurate disease characterization and treatment. Recently available high throughput multi-omics data may offer a great opportunity to explore the underlying mechanisms of diseases and improve disease heterogeneity assessment throughout the treatment course. In addition, increasingly accumulated data from existing literature may be informative about disease subtyping. However, the existing clustering procedures, such as Sparse Convex Clustering (SCC), cannot directly utilize the prior information even though SCC produces stable clusters.

Results: We develop a clustering procedure, information-incorporated Sparse Convex Clustering, to respond to the need for disease subtyping in precision medicine. Utilizing the text mining approach, the proposed method leverages the existing information from previously published studies through a group lasso penalty to improve disease subtyping and biomarker identification. The proposed method allows taking heterogeneous information, such as multi-omics data. We conduct simulation studies under several scenarios with various accuracy of the prior information to evaluate the performance of our method. The proposed method outperforms other clustering methods, such as SCC, K-means, Sparse K-means, iCluster+, and Bayesian Consensus Clustering. In addition, the proposed method generates more accurate disease subtypes and identifies important biomarkers for future studies in real data analysis of breast and lung cancer-related omics data. In conclusion, we present an information-incorporated clustering procedure that allows coherent pattern discovery and feature selection.

Availability and implementation: The code is available upon request.

Publication types

  • Research Support, N.I.H., Extramural

MeSH terms

  • Bayes Theorem
  • Cluster Analysis
  • Data Analysis
  • Humans
  • Multiomics
  • Neoplasms*
  • Precision Medicine*