Abstract: Commonly used sensors for roadside traffic target perception include cameras, millimeter-wave radars, and lidars. Lidar can perceive 3D information, but it is expensive and susceptible to interference in rain, fog, and dust. Cameras are inexpensive and capture rich information, but they are severely affected by lighting conditions and clutter. Millimeter-wave radar can operate all day and in all weather, but it performs poorly at detecting stationary targets. To meet the all-day, all-weather requirements for efficient and accurate perception in traffic systems, this paper proposes RV-YOLOX, a fusion detection framework that achieves better detection results than any single-source sensor by effectively fusing information from cameras and millimeter-wave radar. The radar spatial attention module designed in RV-YOLOX combines the strengths of concatenation fusion and element-wise addition fusion: by transferring the spatial information of the radar to the visual features, it promotes the extraction of more informative features. In addition, this paper lightweights RV-YOLOX through structural re-parameterization, enabling faster inference while maintaining the original accuracy. Finally, the algorithm is trained and tested on a self-built dataset and the nuScenes dataset. Compared with the YOLOX algorithm, RV-YOLOX improves AP by about 3 to 4 points, and the lightweight RV-YOLOX increases inference speed while achieving detection accuracy comparable to that of RV-YOLOX.
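The abstract only names the radar spatial attention module. As a minimal sketch of how such a module could combine multiplicative (attention-style) interaction with element-wise addition fusion, the code below gates visual features with a spatial attention map derived from radar features and then adds a projected radar feature map. The function name, the sigmoid gating, and the 1x1-projection weights are illustrative assumptions, not the paper's actual implementation, and the radar map is assumed to be already projected onto the image feature grid.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def radar_spatial_attention(visual_feat, radar_feat, w_attn, w_proj):
    """Hypothetical radar-to-vision fusion.

    visual_feat: (C_v, H, W) image features
    radar_feat:  (C_r, H, W) radar features on the same spatial grid
    w_attn:      (C_r,)      weights of a 1x1 conv producing the attention map
    w_proj:      (C_v, C_r)  weights of a 1x1 conv projecting radar channels
    """
    # (C_r, H, W) -> (H, W): a 1x1 "convolution" as a channel-weighted sum.
    attn_logits = np.tensordot(w_attn, radar_feat, axes=([0], [0]))
    attn = sigmoid(attn_logits)[None, :, :]          # (1, H, W) spatial map
    gated = visual_feat * attn                       # spatial gating of vision
    radar_proj = np.tensordot(w_proj, radar_feat, axes=([1], [0]))  # (C_v, H, W)
    return gated + radar_proj                        # element-wise addition fusion

C_v, C_r, H, W = 8, 4, 16, 16
rng = np.random.default_rng(0)
visual = rng.normal(size=(C_v, H, W))
radar = rng.normal(size=(C_r, H, W))
w_attn = rng.normal(size=(C_r,))
w_proj = rng.normal(size=(C_v, C_r))
fused = radar_spatial_attention(visual, radar, w_attn, w_proj)
```

Under this sketch the radar branch both modulates where the visual features are emphasized and contributes its own residual signal, which is one plausible reading of "absorbing the characteristics of concatenation fusion and element-wise addition fusion."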
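Structural re-parameterization is mentioned only by name. A common instance of the technique (as in RepVGG) trains parallel 3x3, 1x1, and identity branches, then merges them into a single 3x3 convolution for inference, which speeds up the network without changing its output. The sketch below demonstrates the kernel-merging arithmetic in NumPy; it illustrates the general technique (biases and batch normalization omitted), not the paper's specific lightweight design.

```python
import numpy as np

def merge_branches(k3, k1, in_ch, out_ch):
    """Merge a 3x3 kernel, a 1x1 kernel, and an identity branch into a
    single 3x3 kernel (RepVGG-style structural re-parameterization).
    Kernels have shape (out_ch, in_ch, kh, kw)."""
    merged = k3.copy()
    merged[:, :, 1, 1] += k1[:, :, 0, 0]     # 1x1 kernel pads to the 3x3 centre
    if in_ch == out_ch:                      # identity branch = 1 at the centre
        for c in range(out_ch):
            merged[c, c, 1, 1] += 1.0
    return merged

def conv2d(x, k):
    """Naive stride-1 'same' 2D convolution: x is (C_in, H, W),
    k is (C_out, C_in, 3, 3), with zero padding of 1."""
    c_out, c_in, kh, kw = k.shape
    _, h, w = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((c_out, h, w))
    for o in range(c_out):
        for i in range(c_in):
            for dy in range(kh):
                for dx in range(kw):
                    out[o] += k[o, i, dy, dx] * xp[i, dy:dy + h, dx:dx + w]
    return out
```

Because convolution is linear, one pass with the merged kernel reproduces the sum of the three branch outputs exactly, which is why inference speed improves while accuracy is preserved.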