This paper aims at comparing 4 top models for crowd counting and evaluating their highlights based on their performance. In DSNet, the distended convolution block network was proposed, where the distended layers are densely connected to each other in order to preserve information from continuously varied scales. Three blocks are cascaded and linked to dense residual connections to widen the range of levels covered by network and also a novel loss of consistency at multi-scale density level was introduced to improve performance. In SFANet, two foremost elements with VGG backbone CNN and two-way path multi-scale fusion networks were suggested for the front end feature extractor and back end to make density map in which one path highlights crowded regions present in images. The other direction is responsible for the fusion of multi-scale features and for the generation of the final high-quality high-density maps. In MANet (Multi-scale Attention Network), a new mechanism of soft attention was presented, which learns a series of masks and a level-conscious loss feature was introduced to regularize and direct the learning of different branches to specialize on a specific scale. In Bayesian Loss, a novel loss function was used to generate a density contribution model from the point annotations. We also analyzed the results of the 4 convolutional neural networks, extracted the pattern of convolutional neural network structure and found promising pathways for researchers in this fast-growing area.
Crowd counting is an important part of crowd analysis, which is of great significance to crowd control and management. The convolutional neural network (CNN) based crowd counting method is widely used to solve the problem of insufficient counting accuracy due to heavy occlusion, background clutters, head scale and perspective changes in crowd scenes. The multi-column convolutional neural network (MCNN) is a CNN-based method for crowd counting, which adapts to head scale variation of crowd scenes by constructing multi-column convolutional neural network composing of three single-column networks corresponding to the convolution kernel with different sizes (large, medium and small). However, as the MCNN network is relatively shallow, its receptive field is also limited, which affects the adaptability to large scale variations. In addition, due to insufficient training data, it is necessary to carry out a pre-training strategies which pre-trains the single-column convolutional neural network individually and combines the cumbersome. In this paper, a crowd counting method based on multi-column dilated convolutional neural network was proposed. Dilated convolution was used to enhance the receptive field of the network, so as to be better adaptive to the head scale variations. The image patches were obtained by randomly clipping from the original training data set images in the process of each iterative training to further expand the training data, while the training could be achieved without tedious pre-training. The experimental results on ShanghaiTech public dataset showed that the accuracy of crowd counting proposed in this paper was better than that of MCNN, which proved that this method is more robust to head scale variations in crowd scenes.