Human action recognition in indoor environments can be crucial for avoiding serious accidents and damage. Application domains span from monitoring solitary elders or persons with disabilities to monitoring persons working alone in a chamber or in an isolated industrial environment. These scenarios demand automatic, near real-time activity recognition and alerting to save lives and assets. Since the sensing modality must operate round the clock in a non-intrusive manner, we opt for a thermal infrared (IR) camera, which captures the heat emitted by objects in the scene and generates an image. Motivated by the recent success of convolutional neural networks (CNNs) for human action recognition in IR images, we extend this line of work by incorporating one additional dimension, namely temporal information. We design and implement a 3D-CNN that learns both spatial and sequential features from thermal IR videos. Eight action classes are considered: Walking, Standing, Falling, Lying, Sitting, Falling from chair, Sitting up (recovering from a fall from a sitting posture), and Getting up (recovering from a fall from a lying posture). To evaluate the proposed framework, IR videos of the different actions were recorded in three diverse home environments: a study room, a bedroom, and a garden. The dataset comprises 2641 training and 894 test IR videos, each half a second long, performed by more than 50 volunteers. The 3D-CNN consists of two blocks, each with two convolution layers and one max-pool layer, and automatically constructs features from raw data, incorporating both spatial and temporal information to learn actions. Network parameters are learned in a supervised manner using the back-propagation algorithm.
Experimental results show 85% classification accuracy for the proposed spatio-temporal deep learning architecture on the 894 complex test videos of the IR action dataset.
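The architecture described above (two blocks, each with two 3D convolutions and one max pool, trained with back-propagation on eight classes) can be sketched as follows. This is a minimal illustration, not the paper's exact implementation: the channel widths, kernel sizes, input resolution, and frame count per half-second clip are all assumptions.

```python
import torch
import torch.nn as nn

class Action3DCNN(nn.Module):
    """Hypothetical sketch of the described 3D-CNN: two blocks of
    (conv3d, conv3d, maxpool3d) followed by a linear classifier over
    the eight action classes. Layer sizes are assumptions."""

    def __init__(self, num_classes: int = 8):
        super().__init__()
        self.features = nn.Sequential(
            # Block 1: two 3D convolutions + one max pool
            nn.Conv3d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv3d(16, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            # Block 2: two 3D convolutions + one max pool
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv3d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
        )
        # After two 2x poolings, an assumed 16x64x64 clip becomes 4x16x16
        self.classifier = nn.Linear(32 * 4 * 16 * 16, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)              # (N, C, T, H, W) feature volume
        return self.classifier(x.flatten(1))

# Example: one half-second clip, assumed here to be 16 grayscale IR
# frames of 64x64 pixels; input layout is (batch, channels, T, H, W).
clip = torch.randn(1, 1, 16, 64, 64)
logits = Action3DCNN()(clip)              # shape (1, 8), one score per class
```

Training would then minimize a cross-entropy loss over the eight class logits with any standard optimizer, which is what supervised back-propagation amounts to here.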