Viewpoint variation is a challenging problem in vision-based human action recognition. With the richer information provided by three-dimensional (3-D) point clouds, made available by the advent of 3-D depth cameras, spatial variations in human actions can be analyzed effectively. In this paper, we propose a volumetric spatial feature representation (VSFR) that measures the density of 3-D point clouds for view-invariant human action recognition from depth image sequences. Using VSFR, we construct a self-similarity matrix (SSM) that graphically represents temporal variations in a depth sequence. Each entry of the SSM is the squared Euclidean distance between the VSFRs of a pair of frames in the video sequence; the SSM therefore captures the spatial dissimilarity between every pair of frames in a sequence captured from an arbitrary viewpoint. Furthermore, because features are encoded with a bag-of-features method, the proposed approach efficiently handles variations in action speed and length. Hence, our method is robust to both changes in viewpoint and differences in the length of action sequences. We evaluated the proposed method against state-of-the-art methods on three public datasets, ACT42, MSRAction3D, and MSRDailyActivity3D, where it achieved the highest accuracies.
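The SSM construction described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes each frame's VSFR has already been flattened into a D-dimensional vector, and the function and variable names (`self_similarity_matrix`, `vsfr`) are illustrative, not from the paper.

```python
import numpy as np

def self_similarity_matrix(vsfr):
    """Build an SSM whose (i, j) entry is the squared Euclidean
    distance between the VSFR vectors of frames i and j.

    vsfr: array of shape (T, D) -- T frames, D-dimensional features
          (assumed precomputed; the paper derives them from 3-D
          point-cloud density).
    """
    # ||x_i - x_j||^2 = ||x_i||^2 + ||x_j||^2 - 2 <x_i, x_j>
    sq_norms = np.sum(vsfr ** 2, axis=1)
    ssm = sq_norms[:, None] + sq_norms[None, :] - 2.0 * vsfr @ vsfr.T
    # Clamp tiny negative values caused by floating-point round-off.
    return np.maximum(ssm, 0.0)

# Usage: 5 frames with random 8-D features.
feats = np.random.default_rng(0).normal(size=(5, 8))
ssm = self_similarity_matrix(feats)
```

By construction the SSM is symmetric with a zero diagonal; identical frames produce zero entries, while large values mark frames whose spatial configurations differ strongly.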