Visual object classification has long been studied in the visible spectrum using conventional cameras. Because the number of labeled images has recently increased, it is now possible to train deep Convolutional Neural Networks (CNNs) with a large number of parameters. As infrared (IR) sensor technology has improved over the last two decades, labeled images captured by IR sensors have begun to be used for object detection and recognition tasks. We address the problem of infrared object recognition and detection by exploiting 15K real-field images captured with long-wave and mid-wave IR sensors. For feature learning, a stacked denoising autoencoder is trained on this IR dataset. To recognize objects, the trained stacked denoising autoencoder is fine-tuned with a binary classification loss for the target object. Once training is complete, test samples are propagated through the network, and the probability that each test sample belongs to the target class is computed. Moreover, the trained classifier is used in a detect-by-classification method, in which classification is performed on a set of candidate object boxes and the maximum confidence score at a particular location is accepted as the score of the detected object. To reduce computational complexity, detection is not run at every frame; instead, an efficient correlation-filter-based tracker follows the object, and detection is performed only when the tracker confidence falls below a predefined threshold. Experiments conducted on real-field images demonstrate that the proposed detection and tracking framework gives satisfactory results for detecting tanks against cluttered backgrounds.
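
The interplay between detection and tracking described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`detect_by_classification`, `track_with_redetection`), the callables `propose_boxes`, `classify`, and `tracker_step`, and the default threshold of 0.5 are all hypothetical placeholders. Here `classify` stands in for the fine-tuned stacked denoising autoencoder, and `tracker_step` stands in for a correlation-filter tracker that returns an updated box and a confidence value. For simplicity, the sketch takes the maximum score over all candidate boxes in a frame.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height); layout is an assumption


def detect_by_classification(
    frame: object,
    candidates: Sequence[Box],
    classify: Callable[[object, Box], float],
) -> Optional[Tuple[Box, float]]:
    """Score every candidate box and return the one with the highest score.

    `classify` is a hypothetical stand-in for the fine-tuned classifier; it
    should return the probability that the box contains the target object.
    """
    if not candidates:
        return None
    scored = [(box, classify(frame, box)) for box in candidates]
    return max(scored, key=lambda pair: pair[1])


def track_with_redetection(
    frames: Sequence[object],
    propose_boxes: Callable[[object], Sequence[Box]],
    classify: Callable[[object, Box], float],
    tracker_step: Callable[[object, Box], Tuple[Box, float]],
    conf_threshold: float = 0.5,  # assumed value, not taken from the paper
) -> List[Tuple[str, Optional[Box]]]:
    """Track on every frame; run the detector only when confidence is low.

    Returns one (mode, box) entry per frame, where mode is "detect" or "track".
    """
    box: Optional[Box] = None
    history: List[Tuple[str, Optional[Box]]] = []
    for frame in frames:
        confidence = 0.0
        if box is not None:
            # Cheap step: update the current box with the tracker.
            box, confidence = tracker_step(frame, box)
        if box is None or confidence < conf_threshold:
            # Expensive step: classify all candidate boxes in this frame.
            result = detect_by_classification(frame, propose_boxes(frame), classify)
            if result is not None:
                box = result[0]
            history.append(("detect", box))
        else:
            history.append(("track", box))
    return history
```

In this sketch the detector runs on the first frame, because no box exists yet, and again on any frame where the tracker reports a confidence below `conf_threshold`; all other frames are handled by the tracker alone.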