Self-Supervised Learning by Cross-Modal Audio-Video. Based on this intuition, we propose Cross-Modal Deep Clustering (XDC), a novel self-supervised method that leverages unsupervised clustering in one modality (e.g. audio) as a supervisory signal for the other modality (e.g. video).

This cross-modal supervision helps XDC exploit both the semantic correlation and the differences between the two modalities. Our experiments show that XDC outperforms single-modality clustering.
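The cross-modal idea can be illustrated with a small sketch: cluster the features of each modality separately, then use the cluster assignments of one modality as pseudo-labels for the other. Everything below is an assumption made for illustration and is not taken from the text: the toy feature vectors, the function names `kmeans` and `cross_modal_pseudo_labels`, the choice of k-means with a deterministic farthest-point initialization, and the omission of the encoder training step that would consume the pseudo-labels.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=10):
    """Plain k-means (assumed here as the clustering step).

    Uses a deterministic farthest-point initialization so the toy
    example is reproducible. Returns one cluster index per point.
    """
    centers = [list(points[0])]
    while len(centers) < k:
        # Next center: the point farthest from all centers chosen so far.
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(far))

    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center for every point.
        labels = [
            min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels


def cross_modal_pseudo_labels(audio_feats, video_feats, k):
    """Swap cluster assignments across modalities.

    The video model would be trained to predict the audio clusters and
    the audio model to predict the video clusters (training omitted).
    """
    audio_clusters = kmeans(audio_feats, k)
    video_clusters = kmeans(video_feats, k)
    return {"video_targets": audio_clusters, "audio_targets": video_clusters}


# Hypothetical per-clip features; row i of each list describes the same clip.
audio_feats = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]]
video_feats = [[1.0], [1.1], [9.0], [1.2]]

targets = cross_modal_pseudo_labels(audio_feats, video_feats, k=2)
print(targets)
```

In this toy data the two modalities agree on the first three clips and disagree on the last one, so the targets handed to each modality differ from the clusters it would have found on its own. That disagreement is the kind of cross-modal difference the text refers to.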