Abstract: To address data scarcity, weak model generalization, and low deployment efficiency in modern enterprise networks, this paper proposes a unified framework for fault prediction and security threat identification based on Multimodal Self-Supervised Knowledge Distillation (MM-SSKD). We design an Augmented Multimodal Masked Autoencoder (Augmented MM-MAE) with cross-modal consistency constraints and introduce random modality dropout to improve robustness when modalities are missing. We further propose Class-Conditional Correlation Alignment (C-CORAL), which aligns second-order statistics at the class level using confidence-based filtering and class re-weighting. On the target domain, multi-task joint learning is performed with limited annotations, combined with temperature scaling for probability calibration and threshold-based decision-making. Experiments show that, compared with the best public baselines, our approach reduces RMSE by about 15.8% and improves F1-score by about 4.0%, with larger gains under low-label conditions, while remaining stable in both missing-modality and cross-domain scenarios. The distilled lightweight student model preserves key accuracy metrics and retains this robustness, making it suitable for deployment in edge and resource-constrained environments. Analyses of representations (t-SNE, CKA) and calibration (ECE, Brier score), together with simulations of operational KPIs (MTTD/MTTR), further support the efficiency and practicality of the proposed method in real-world applications.
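
To make the class-level alignment idea concrete, the following is a minimal sketch of a class-conditional CORAL-style loss. It is an illustration under stated assumptions, not the paper's implementation: the function names, the confidence threshold of 0.8, the inverse-class-frequency weighting, and the 1/(4d^2) scaling (borrowed from the standard CORAL loss) are all hypothetical choices, since the abstract does not specify them.

```python
def covariance(rows):
    """Unbiased sample covariance of a list of equal-length feature vectors."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[sum((r[a] - mean[a]) * (r[b] - mean[b]) for r in rows) / (n - 1)
             for b in range(d)] for a in range(d)]


def coral_distance(cov_s, cov_t):
    """Squared Frobenius distance between two covariances, scaled by 1/(4 d^2)."""
    d = len(cov_s)
    sq = sum((cov_s[a][b] - cov_t[a][b]) ** 2
             for a in range(d) for b in range(d))
    return sq / (4 * d * d)


def class_conditional_coral(src_x, src_y, tgt_x, tgt_probs, conf_threshold=0.8):
    """Hypothetical C-CORAL loss: per-class covariance alignment.

    src_x, src_y : labeled source features and integer class labels
    tgt_x        : unlabeled target features
    tgt_probs    : predicted class probabilities for each target sample
    """
    # Confidence-based filtering: keep only target samples whose top
    # predicted probability reaches the threshold, grouped by pseudo-label.
    kept = {}
    for x, p in zip(tgt_x, tgt_probs):
        c = max(range(len(p)), key=p.__getitem__)
        if p[c] >= conf_threshold:
            kept.setdefault(c, []).append(x)

    # Group source samples by their true class.
    by_class = {}
    for x, y in zip(src_x, src_y):
        by_class.setdefault(y, []).append(x)

    # A covariance needs at least two samples on each side.
    usable = [c for c in by_class
              if len(by_class[c]) >= 2 and len(kept.get(c, [])) >= 2]
    if not usable:
        return 0.0

    # Class re-weighting (assumed scheme): inverse source class frequency,
    # normalized so the weights sum to one.
    raw = {c: 1.0 / len(by_class[c]) for c in usable}
    total = sum(raw.values())
    return sum(raw[c] / total
               * coral_distance(covariance(by_class[c]), covariance(kept[c]))
               for c in usable)
```

In this sketch the loss is zero when each class has the same covariance in both domains, and it falls back to zero when no target sample passes the confidence filter, so unreliable pseudo-labels do not drive the alignment. In training, such a term would typically be added to the task losses with a weighting coefficient.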