Terror attacks often target civilians gathered in one location (e.g., the Boston Marathon bombing). Distinguishing such 'malicious' scenes from 'normal' ones is difficult: the two are semantically different, yet both contain large groups of people and are visually very similar. To overcome this difficulty, previous methods exploited various kinds of contextual information, such as language-driven keywords or relevant objects. Although useful, these approaches require additional human effort or additional datasets. In this paper, we show that more sophisticated and deeper Convolutional Neural Networks (CNNs) can achieve better classification accuracy even without using any additional information from outside the image domain. We conduct a comparative study in which we train and compare seven CNN architectures (AlexNet, VGG-M, VGG16, GoogLeNet, ResNet-50, ResNet-101, and ResNet-152). Based on the experimental analyses, we find that deeper networks typically achieve better accuracy, and that GoogLeNet is the most favorable of the seven architectures for the task of malicious event classification.