Clustering Sparse Data With Feature Correlation With Application to Discover Subtypes in Cancer

Jipeng Qiang; Wei Ding; Marieke Kuijjer; John Quackenbush; Ping Chen

doi:10.1109/access.2020.2982569

IEEE Access. 2020;8:67775-67789. doi: 10.1109/access.2020.2982569. Epub 2020 Mar 26.

Authors

Jipeng Qiang¹ ², Wei Ding², Marieke Kuijjer³, John Quackenbush⁴, Ping Chen²

Affiliations

¹ Department of Computer Science, Yangzhou University, Yangzhou 225127, China.
² Department of Computer Science, University of Massachusetts Boston, Boston, MA 02125, USA.
³ Centre for Molecular Medicine Norway, University of Oslo Faculty of Medicine, 0318 Oslo, Norway.
⁴ Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA 02115, USA.

Abstract

In this paper, given data with high-dimensional features, we study how to calculate the similarity between two samples by considering a feature interaction network, which represents the relationships between features. This differs from traditional methods that learn similarities from a sample network, which represents the relationships between samples. To overcome the data sparseness problem, we propose a novel network-based similarity metric for computing the similarity between samples that incorporates the knowledge in the feature interaction network. Our similarity metric uses a new Feature Alignment Similarity measure, which does not compute the similarities among samples directly. Instead, it projects each sample onto the feature interaction network and measures the similarity between two samples using the similarities between the vertices those samples occupy in the network. As a result, even when two samples share no common features, they are likely to receive a higher similarity value if their features lie in similar regions of the network. To test whether the metric is useful in a real-world application, we apply it to subtype discovery in tumor mutational data by incorporating information from the gene interaction network. Our experimental results on synthetic data and real-world tumor mutational data show that our approach outperforms the top competitors in cancer subtype discovery. Furthermore, our approach can identify cancer subtypes in real cancer data that other clustering algorithms cannot detect.
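The general idea of comparing samples through the network positions of their features, rather than through shared features, can be illustrated with a minimal Python sketch. This is not the paper's Feature Alignment Similarity measure, whose definition the abstract does not give. The vertex similarity (Jaccard overlap of closed neighborhoods), the best-match averaging, the toy network `GRAPH`, and the gene names `g1` to `g6` are all assumptions chosen for illustration.

```python
# Hypothetical toy feature interaction network as a symmetric adjacency dict.
# g1, g2, g3 form one tightly connected region; g4, g5, g6 form another.
GRAPH = {
    "g1": {"g2", "g3"},
    "g2": {"g1", "g3"},
    "g3": {"g1", "g2"},
    "g4": {"g5"},
    "g5": {"g4", "g6"},
    "g6": {"g5"},
}


def overlap_similarity(sample_a, sample_b):
    """Baseline: Jaccard overlap of the feature sets, ignoring the network."""
    union = sample_a | sample_b
    return len(sample_a & sample_b) / len(union) if union else 0.0


def closed_neighborhood(graph, vertex):
    """Return the vertex together with its direct neighbors."""
    return {vertex} | set(graph.get(vertex, ()))


def vertex_similarity(graph, u, v):
    """Assumed vertex similarity: Jaccard overlap of closed neighborhoods."""
    nu, nv = closed_neighborhood(graph, u), closed_neighborhood(graph, v)
    return len(nu & nv) / len(nu | nv)


def network_similarity(graph, sample_a, sample_b):
    """Assumed sample similarity: symmetric average of best vertex matches."""
    if not sample_a or not sample_b:
        return 0.0

    def directed(src, dst):
        # For each feature in src, take its most similar feature in dst.
        best = [max(vertex_similarity(graph, u, v) for v in dst) for u in src]
        return sum(best) / len(best)

    return 0.5 * (directed(sample_a, sample_b) + directed(sample_b, sample_a))


if __name__ == "__main__":
    s1, s2, s3 = {"g1"}, {"g2"}, {"g5"}
    print(overlap_similarity(s1, s2))         # no shared features
    print(network_similarity(GRAPH, s1, s2))  # same network region
    print(network_similarity(GRAPH, s1, s3))  # different network regions
```

In this toy setting, the samples `{"g1"}` and `{"g2"}` share no features, so the plain overlap baseline cannot distinguish them from an unrelated pair, while the network-based score rates them as closer than `{"g1"}` and `{"g5"}` because their features sit in the same region of the network.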

Keywords: Cancer subtype; feature interaction network; similarity metric; somatic mutational data.

Grants and funding

R35 CA220523/CA/NCI NIH HHS/United States