Abstract: To address the shortcomings of existing scene text detection methods, a scene text detection method based on pixel allocation is proposed, in which a cross-attention module and a multi-scale feature adaptive module optimize feature extraction along the spatial and channel dimensions, respectively. To enrich feature representations at different scales, the multi-scale feature adaptive module automatically assigns weights to the features of each scale. To capture contextual information effectively, the features extracted by the feature extraction network are fed into the cross-attention module, which collects contextual information for each pixel along its horizontal and vertical paths; by applying this operation recurrently, each pixel obtains contextual information from the whole image. A fully convolutional network is trained in a multi-task learning framework to learn the geometric features of text instances, the outputs of the tasks are combined to assign pixels to text boxes, and the polygonal bounding box of each text instance is reconstructed after simple post-processing. On the arbitrary-shape public dataset Total-Text, the method achieves a recall, precision, and F-measure of 75.71%, 89.15%, and 81.89%, respectively. It also performs well on the multi-oriented public dataset ICDAR2015, with a recall, precision, and F-measure of 79.06%, 89.24%, and 83.84%, respectively, which demonstrates the effectiveness of the proposed method.
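The claim that repeating the horizontal-and-vertical aggregation lets every pixel reach the whole image can be checked with a small reachability sketch. This is not the attention module of the paper: it ignores attention weights and features and only tracks, as booleans, which source pixels could have contributed to each pixel. The function names `init_reach` and `criss_cross_pass` and the 4 x 5 grid size are assumptions made for illustration.

```python
import numpy as np


def init_reach(h, w):
    """reach[i, j, k, l] is True when pixel (i, j) holds information from pixel (k, l).

    Initially every pixel only knows about itself.
    """
    reach = np.zeros((h, w, h, w), dtype=bool)
    for i in range(h):
        for j in range(w):
            reach[i, j, i, j] = True
    return reach


def criss_cross_pass(reach):
    """One aggregation pass: each pixel gathers from all pixels in its row and column."""
    # Information held by any pixel in row i, shape (h, 1, h, w).
    row_info = reach.any(axis=1, keepdims=True)
    # Information held by any pixel in column j, shape (1, w, h, w).
    col_info = reach.any(axis=0, keepdims=True)
    # Broadcasting combines both paths into shape (h, w, h, w).
    return row_info | col_info


after_one = criss_cross_pass(init_reach(4, 5))
after_two = criss_cross_pass(after_one)

print("full-image context after one pass: ", bool(after_one.all()))
print("full-image context after two passes:", bool(after_two.all()))
```

After one pass a pixel sees only its own row and column; after the second pass it also sees everything those pixels had gathered, which covers the full grid. Under this simplified model two passes are therefore enough for full-image context.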