Abstract: Monocular depth estimation, which uses a single camera for its simplicity and ease of installation, is widely applied in robotics and unmanned aerial vehicles. However, monocular depth estimation algorithms typically adopt complex deep neural networks based on encoder-decoder architectures, which limits real-time inference efficiency on edge devices. Consequently, a network architecture is proposed to enable real-time depth estimation on edge devices. This architecture features an encoder built from inverted residual blocks and a decoder redesigned with residual depth-wise separable convolution and nearest-neighbor interpolation. These modifications significantly reduce the model's parameters and computational load. Moreover, features from the encoder and decoder are fused through cross-layer connections to enhance the representation of fine-grained edge details in the depth map. Experimental results demonstrate an 82% reduction in model parameters and a 92% reduction in computational load, while achieving state-of-the-art performance on the KITTI dataset. Notably, the proposed architecture achieves a real-time inference speed of 50 frames per second on the Jetson TX2 platform.
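To illustrate why depth-wise separable convolution shrinks the decoder, the sketch below compares parameter counts of a standard convolution and its depth-wise separable factorization. The function names and channel sizes are illustrative assumptions, not taken from the paper; actual savings depend on the network's specific layer configuration.

```python
def conv_params(c_in: int, c_out: int, k: int) -> int:
    """Parameters of a standard k x k convolution (bias omitted):
    c_out filters, each spanning all c_in input channels."""
    return c_in * c_out * k * k

def dw_separable_params(c_in: int, c_out: int, k: int) -> int:
    """Parameters of a depth-wise separable convolution (bias omitted):
    a k x k depth-wise filter per input channel, followed by a
    1 x 1 point-wise convolution mixing channels."""
    return c_in * k * k + c_in * c_out

# Hypothetical decoder layer: 3x3 convolution, 256 -> 256 channels.
std = conv_params(256, 256, 3)           # 589824
dws = dw_separable_params(256, 256, 3)   # 2304 + 65536 = 67840
print(f"reduction: {1 - dws / std:.1%}")  # roughly 88% fewer parameters
```

For a 3x3 kernel the factorization cuts per-layer parameters (and multiply-accumulates) by close to an order of magnitude, which is consistent in spirit with the abstract's reported 82% and 92% overall reductions.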