Abstract: To address the challenges of significant feature ambiguity, high multi-scale missed-detection rates, and insufficient robustness to environmental interference in negative obstacle detection under complex environments, an improved multimodal collaborative sensing model named YOLOv10-MCS is proposed based on the YOLOv10 framework. In the backbone network, the Receptive-Field Attention Convolution (RFAConv) module replaces conventional convolution operations, leveraging dynamic multi-branch receptive fields and spatial attention mechanisms to enhance low-contrast edge feature extraction. The Context Guided Block (CGB) enables adaptive fusion of global semantics and local details, effectively resolving missed detections caused by boundary ambiguity. The Cross-Scale Feature Fusion Module (CCFM) reconstructs the neck network using channel normalization and cross-layer concatenation, improving multi-scale feature consistency while achieving a lightweight design. Integrated channel-spatial recalibration via the Global Attention Mechanism (GAM) significantly suppresses background interference. Experimental results show that the YOLOv10-MCS model achieves 88.13% precision, 85.80% mean Average Precision (mAP), and a computational cost of 5.7 GFLOPs. Compared with the baseline model, these figures represent a 5.96% improvement in precision, a 3.3% gain in mAP, and a 32.1% reduction in computation. YOLOv10-MCS establishes a new technical framework for object detection in complex scenes through cross-modal feature interaction. With its high-precision, lightweight architecture, the proposed framework demonstrates deployment potential in autonomous driving perception systems and robotic systems for dynamic obstacle avoidance.
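To make the channel-spatial recalibration step concrete, the following is a minimal PyTorch sketch of a GAM-style attention block of the kind the abstract describes. It is illustrative only: the class name, reduction ratio, and layer sizes are assumptions, not the paper's exact configuration, and the published GAM design should be consulted for the definitive architecture.

```python
# Sketch of GAM-style channel-spatial recalibration (illustrative;
# reduction ratio and kernel sizes are assumed, not taken from the paper).
import torch
import torch.nn as nn

class GAMAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        hidden = channels // reduction
        # Channel attention: an MLP applied per spatial position after
        # moving channels to the last dimension.
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, channels),
        )
        # Spatial attention: two 7x7 convolutions that squeeze and then
        # re-expand the channel dimension.
        self.spatial = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=7, padding=3),
            nn.BatchNorm2d(hidden),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, kernel_size=7, padding=3),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Channel recalibration: gate each channel by a learned weight.
        attn = self.channel_mlp(x.permute(0, 2, 3, 1))    # (B, H, W, C)
        x = x * torch.sigmoid(attn.permute(0, 3, 1, 2))   # (B, C, H, W)
        # Spatial recalibration: gate each spatial position.
        x = x * torch.sigmoid(self.spatial(x))
        return x

# Usage: recalibrate a neck feature map before the detection head,
# suppressing background responses while preserving tensor shape.
feat = torch.randn(1, 64, 40, 40)
out = GAMAttention(64)(feat)
assert out.shape == feat.shape
```

Placing such a block after feature fusion is one plausible way to suppress background interference, since both the channel gate and the spatial gate multiplicatively down-weight uninformative responses without changing the feature map's shape.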