Human action recognition is an active research topic, owing to its wide-ranging applications in automatic video analysis, video retrieval, and beyond. However, existing action recognition methods focus on the non-static parts of the video and largely discard the static parts, which hurts the accuracy of both action recognition and localization. In this paper, we propose a new hierarchical space-time segments representation, designed for both action recognition and localization, that incorporates multi-grained representations of the body parts and the whole body in a hierarchical way. The proposed algorithm comprises three major steps. First, we apply hierarchical segmentation to each video frame to obtain a set of segment trees, each of which is treated as a candidate segment tree of the human body. Second, we prune the candidates by exploiting several cues, such as shape, the structure of articulated objects, and global foreground color. Finally, we track each segment of the remaining segment trees both forward and backward in time. Experimental results show that our method outperforms state-of-the-art action recognition methods on two challenging benchmark datasets, UCF-Sports and HighFive, while also producing good action localization results.
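The three-step pipeline described above can be sketched as follows. This is a minimal illustrative skeleton, not the paper's implementation: all function names, the `SegmentTree` structure, and the single `foreground_score` standing in for the shape, articulated-structure, and color cues are assumptions made for clarity.

```python
# Hypothetical sketch of the three-step pipeline from the abstract.
# All names and data structures here are illustrative, not from the paper.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SegmentTree:
    """A candidate segment tree: a root (whole-body) segment plus part segments."""
    frame: int
    segments: List[str]          # segment ids, root first
    foreground_score: float     # stand-in for the shape/structure/color cues


def hierarchical_segmentation(frame_idx: int) -> List[SegmentTree]:
    # Step 1 (stub): hierarchically segment the frame into candidate
    # segment trees of the human body.
    return [SegmentTree(frame_idx, [f"f{frame_idx}-s{i}"], score)
            for i, score in enumerate([0.9, 0.2, 0.7])]


def prune(candidates: List[SegmentTree], threshold: float = 0.5) -> List[SegmentTree]:
    # Step 2 (stub): discard candidates whose combined cue score
    # (collapsed here into one number) falls below the threshold.
    return [t for t in candidates if t.foreground_score >= threshold]


def track(trees: List[SegmentTree], n_frames: int) -> Dict[str, List[int]]:
    # Step 3 (stub): extend each surviving segment forward and backward
    # in time, yielding one space-time segment per tree.
    return {t.segments[0]: list(range(n_frames)) for t in trees}


candidates = hierarchical_segmentation(0)
kept = prune(candidates)
tracks = track(kept, n_frames=5)
```

Running the stub keeps the two candidates with high foreground scores and produces one space-time track per surviving tree; in the actual method each stub would be replaced by the corresponding segmentation, pruning, and tracking stage.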