Real-time object detection and classification is essential for outdoor surveillance. Current state-of-the-art real-time object detection CNNs are trained on natural image datasets. However, outdoor surveillance images have very different characteristics: objects tend to be small and difficult to distinguish (averaging only 3% of image size). In addition, images come in different modalities; for example, nighttime surveillance images are grayscale thermal images that represent heat emission rather than light reflection. Our dataset of images acquired from surveillance videos comprises approximately 640 daytime (DAY) color images and approximately 360 nighttime grayscale THERMAL images. The dataset includes three object categories: animals, people and vehicles. Because large datasets for these scenarios are lacking, we evaluated the use of the much larger VOC dataset to augment our datasets. We conducted a study to determine the best combination of images to include in the training dataset, and how the different types of images (i.e., DAY, THERMAL and VOC) affect each other's performance. We examined state-of-the-art object detection and classification CNN architectures, focusing on accuracy and real-time performance. The best results were obtained by combining the different image types (THERMAL, DAY and 1200 VOC images) in one dataset and applying transfer learning to YOLO-V3 with SPP, achieving 89.5 mAP for DAY images and 79.53 mAP for THERMAL images while running at 35 fps. This setup provides a robust solution for many surveillance scenarios: nighttime and daytime, distant small objects as well as zoomed-in large objects.
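As an illustration of the combined training set described above, the following is a minimal sketch, not the authors' pipeline. It assumes that the 1200 VOC images are drawn by uniform random sampling (the selection method is not stated here), uses the approximate DAY and THERMAL counts as exact values, and relies on placeholder file names; the function name `build_training_set` is hypothetical.

```python
import random

def build_training_set(day, thermal, voc, n_voc=1200, seed=0):
    """Merge all DAY and THERMAL samples with a random subset of VOC samples.

    Assumption: the VOC subset is chosen uniformly at random; the source
    text does not say how its 1200 VOC images were selected.
    """
    rng = random.Random(seed)
    voc_subset = rng.sample(voc, min(n_voc, len(voc)))
    combined = (
        [(path, "DAY") for path in day]
        + [(path, "THERMAL") for path in thermal]
        + [(path, "VOC") for path in voc_subset]
    )
    rng.shuffle(combined)  # interleave modalities so batches are mixed
    return combined

# Placeholder file lists (hypothetical names, counts taken from the text).
day = [f"day_{i:04d}.jpg" for i in range(640)]
thermal = [f"thermal_{i:04d}.jpg" for i in range(360)]
voc = [f"voc_{i:05d}.jpg" for i in range(5000)]

train = build_training_set(day, thermal, voc)
counts = {}
for _, source in train:
    counts[source] = counts.get(source, 0) + 1
print(len(train), counts)
```

Tagging each sample with its source makes it straightforward to report accuracy separately for DAY and THERMAL images after training on the mixed set.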