Abstract: Stereo volume construction largely determines the completeness and accuracy of models in multi-view 3D reconstruction, yet existing plane sweep-based depth estimation methods suffer from insufficient accuracy in low-texture regions and high redundancy in multi-view fusion. To address these problems, this paper studies an improved stereo volume construction method built on an improved "depth-wise plane sweeping" scheme. Within the plane sweep depth estimation framework, a series of virtual planes parallel to the image plane of the reference camera is constructed to aggregate multi-view features, forming a sparse and efficient plane sweep volume. Combined with multi-scale feature extraction and context-aware cost aggregation, this yields high-precision depth maps. ICP registration based on improved particle swarm optimization and entropy-driven supervoxel fusion are then used to reconstruct a consistent stereo volume from multi-view point clouds. Experiments on several public datasets and real-world scenes show that on the ETH3D dataset the depth estimation RMSE is reduced to 4.21 mm and the accuracy in low-texture regions rises to 90.2%; on the Scenes11 dataset the volume redundancy is reduced to 7.8%; and real-time depth estimation at 28 fps is achieved in urban building scenes. Compared with traditional plane sweeping, DPSNet, and other methods, the proposed method is more robust on high-resolution inputs and textureless regions. It can meet the requirements of multi-view 3D reconstruction in real-world scenarios such as urban buildings, balancing reconstruction accuracy and real-time performance.
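As a rough illustration of the plane sweep step mentioned in the abstract, the sketch below computes the standard plane-induced homographies for a set of virtual planes parallel to the reference image plane. It is not the paper's implementation: the function names (`sample_depths`, `plane_homographies`, `warp_pixel`), the inverse-depth spacing of the planes, and the pose convention X_src = R X_ref + t are all assumptions made for this example.

```python
import numpy as np


def sample_depths(d_min, d_max, num_planes):
    """Depth hypotheses spaced uniformly in inverse depth (an assumed choice)."""
    inv_depths = np.linspace(1.0 / d_min, 1.0 / d_max, num_planes)
    return 1.0 / inv_depths


def plane_homographies(K_ref, K_src, R, t, depths):
    """Homographies mapping reference pixels to source pixels, one per plane.

    Each plane is z = d in the reference camera frame (normal n = [0, 0, 1]).
    Assumed pose convention: X_src = R @ X_ref + t.
    """
    n = np.array([0.0, 0.0, 1.0])
    K_ref_inv = np.linalg.inv(K_ref)
    t = np.asarray(t, dtype=float).reshape(3)
    return np.stack(
        [K_src @ (R + np.outer(t, n) / d) @ K_ref_inv for d in depths]
    )


def warp_pixel(H, uv):
    """Apply a 3x3 homography to a single pixel (u, v)."""
    p = H @ np.array([uv[0], uv[1], 1.0])
    return p[:2] / p[2]
```

In a full pipeline, each homography would be used to warp source-view features onto the reference view, and the per-plane matching costs would be stacked into the plane sweep volume; that part is omitted here.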