Composite attention mechanism network for deep contrastive multi-view clustering

Neural Netw. 2024 May 3:176:106361. doi: 10.1016/j.neunet.2024.106361. Online ahead of print.

Abstract

Contrastive learning-based deep multi-view clustering methods have become a mainstream solution for unlabeled multi-view data. These methods usually rely on a basic structure that combines autoencoders, contrastive learning, and/or MLP projectors to generate more representative latent representations for the final clustering stage. However, existing deep contrastive multi-view clustering methods overlook two key points: (i) latent representations projected through one or more MLP layers, or obtained directly from an autoencoder, fail to mine the inherent relationships within each view or across views; (ii) most existing frameworks employ only a single or dual contrastive learning module, i.e., view- and/or category-oriented, which can leave latent representations and clustering assignments without communication. This paper proposes a new composite attention framework for contrastive multi-view clustering that addresses both challenges. Our method learns latent representations with a composite attention structure, i.e., a Hierarchical Transformer for each view and a Shared Attention module across all views, rather than a simple MLP. The learned representations thus simultaneously preserve important features within each view and balance the contributions across views. In addition, we introduce a communication loss into our dual contrastive framework: by pushing clustering assignments closer to the fused latent representations, common semantics are carried into the assignments. Our method therefore produces higher-quality clustering assignments for partitioning unlabeled multi-view data. Extensive experiments on several real-world datasets demonstrate that the proposed method outperforms many state-of-the-art clustering algorithms, most notably with an average accuracy improvement of 10% on the Caltech dataset and its subsets.
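To make the architectural idea concrete, below is a minimal PyTorch-style sketch of a composite attention encoder: a per-view Transformer stack (the "Hierarchical Transformer") followed by one attention module shared across views (the "Shared Attention"). The class name, the token-sequence construction, and all hyperparameters are illustrative assumptions, not the authors' actual implementation.

import torch
import torch.nn as nn

class CompositeAttentionEncoder(nn.Module):
    # Hypothetical sketch; dimensions and depths are illustrative.
    def __init__(self, view_dims, d_model=128, n_heads=4, n_tokens=8, depth=2):
        super().__init__()
        # Per-view: project raw features into a short token sequence, then
        # refine it with a Transformer stack ("Hierarchical Transformer").
        self.proj = nn.ModuleList(
            nn.Linear(d, n_tokens * d_model) for d in view_dims)
        mk_layer = lambda: nn.TransformerEncoderLayer(
            d_model, n_heads, batch_first=True)
        self.view_encoders = nn.ModuleList(
            nn.TransformerEncoder(mk_layer(), num_layers=depth)
            for _ in view_dims)
        # Across views: one attention module shared by all views
        # ("Shared Attention") balances each view's contribution.
        self.shared_attn = nn.MultiheadAttention(
            d_model, n_heads, batch_first=True)
        self.n_tokens, self.d_model = n_tokens, d_model

    def forward(self, views):  # views: list of (B, view_dim) tensors
        per_view = []
        for x, proj, enc in zip(views, self.proj, self.view_encoders):
            t = proj(x).view(x.size(0), self.n_tokens, self.d_model)
            per_view.append(enc(t).mean(dim=1))   # (B, d_model) per view
        z = torch.stack(per_view, dim=1)          # (B, V, d_model)
        z, _ = self.shared_attn(z, z, z)          # cross-view attention
        return z, z.mean(dim=1)                   # view-wise and fused codes

Mean pooling over views is one simple fusion choice here; any weighted or attention-based pooling would fit the same interface.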
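The communication loss is named but not specified in the abstract. One plausible reading, sketched below, pulls the soft clustering assignments toward pseudo-assignments derived from the fused latent representations. The function name, the cluster-center parameterization, and the temperature tau are all assumptions for illustration.

import torch
import torch.nn.functional as F

def communication_loss(fused, assignments, centers, tau=0.5):
    # fused: (B, d) fused latent representations from the encoder
    # assignments: (B, K) soft clustering assignments (rows sum to 1)
    # centers: (K, d) learnable cluster centers (an assumed parameterization)
    fused = F.normalize(fused, dim=1)
    centers = F.normalize(centers, dim=1)
    # Pseudo-assignments from feature-to-center cosine similarity.
    target = F.softmax(fused @ centers.t() / tau, dim=1)
    # F.kl_div takes log-probabilities first and computes
    # KL(target || assignments); detaching the target makes the gradient
    # push the assignments toward the fused representation, not vice versa.
    return F.kl_div(assignments.clamp_min(1e-8).log(), target.detach(),
                    reduction="batchmean")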

Keywords: Contrastive learning; Multi-view clustering; Transformer.