Nowadays, due to various challenges such as large scale variation in population, mutual occlusion, and perspective distortion, crowd counting has gradually become a hot issue in computer vision. To address the large scale variation that exists in images, in this paper we propose a novel multi-scale network called MSNet, which aims to maintain continuous scale variations and count the number of pedestrians accurately. While most state-of-the-art multi-scale and multi-column networks aim to integrate the scale information of heads of different sizes, much research is still needed to handle continuous scale variations. Specifically, in MSNet the first ten layers of the Visual Geometry Group network (VGG) are used as the backbone to extract rough image features, and a multi-scale block containing several receptive kernels is employed to maintain the scale information and cope better with scale variation. Inspired by the knowledge that replacing a single large receptive field kernel with multiple small receptive field kernels yields better performance, we utilize two dilated convolutions with a receptive field of 5 to replace the large kernel. MSNet incurs only a moderate increase in computation, and we evaluate our method on three benchmark datasets, including ShanghaiTech (Part A: MAE=59.6, RMSE=96.1; Part B: MAE=7.5, RMSE=12.1), UCF-CC-50 (MAE=207.9, RMSE=273.8) and UCF-QNRF (MAE=93, RMSE=158), to show the superior performance of our method.
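The kernel-replacement argument in the abstract can be checked with simple receptive-field arithmetic — a minimal sketch (not the authors' code), assuming stride-1 convolutions throughout:

```python
# Effective receptive field of a stack of stride-1 convolutions:
# each layer widens the field by (kernel_size - 1) * dilation.

def receptive_field(layers):
    """layers: list of (kernel_size, dilation) pairs, stride 1 assumed."""
    rf = 1
    for k, d in layers:
        rf += (k - 1) * d
    return rf

# A single 3x3 kernel with dilation 2 covers a 5x5 area:
print(receptive_field([(3, 2)]))          # 5
# Two such layers stacked cover 9x9, matching one large 9x9 kernel
# with far fewer weights per channel (2 * 3 * 3 = 18 vs. 9 * 9 = 81):
print(receptive_field([(3, 2), (3, 2)]))  # 9
```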
Pneumonia, an infectious disease that affects the lungs, is a serious topic in the medical field, so correctly classifying images of pneumonia is very important. The limitations of traditional machine learning algorithms and the significant improvement of computing performance have made deep learning widely used, and at present, using a convolutional neural network to classify pneumonia is still the mainstream method. This paper provides a modified capsule network to detect and classify pneumonia from X-ray images. The model consists of two parts: an encoder and a decoder. The encoder contains a convolutional layer, a primary capsule layer, and a digit capsule layer. The primary and digit capsule layers convert scalars into vectors and then cluster vectors of the same category by dynamic routing. The decoder contains a deconvolutional layer: the image is reconstructed by up-sampling the vector generated by the encoder, and the reconstructed image is compared with the original image to make the features extracted by the encoder more representative. Training and testing take place on the dataset "Labeled Optical Coherence Tomography (OCT) and Chest XRay Images for Classification," which contains a total of 5856 images; we divide the images into training and testing sets at a ratio of 8:2. The accuracy on this dataset is 98.6%. This model has a more straightforward structure and fewer parameters than other popular models, which means it can be more easily deployed under various conditions in practical applications.
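The routing step between the primary and digit capsule layers can be sketched in NumPy. This is a minimal illustration of the routing-by-agreement idea, not the paper's implementation; the capsule counts and dimensions are arbitrary:

```python
import numpy as np

def squash(v, axis=-1, eps=1e-8):
    # Squash non-linearity: keeps direction, maps vector length into [0, 1)
    n2 = np.sum(v * v, axis=axis, keepdims=True)
    return (n2 / (1.0 + n2)) * v / np.sqrt(n2 + eps)

def dynamic_routing(u_hat, iterations=3):
    """u_hat: predictions, shape (num_primary, num_classes, dim).
    Returns the output capsules, shape (num_classes, dim)."""
    b = np.zeros(u_hat.shape[:2])                             # routing logits
    for _ in range(iterations):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # coupling coeffs
        s = (c[..., None] * u_hat).sum(axis=0)                # weighted vote sum
        v = squash(s)
        b = b + (u_hat * v[None]).sum(-1)                     # agreement update
    return v

rng = np.random.default_rng(0)
u_hat = rng.normal(size=(32, 2, 16))  # 32 primary capsules, 2 classes, 16-dim
v = dynamic_routing(u_hat)
print(v.shape)  # (2, 16); each output capsule's length acts as a class score
```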
This paper aims at comparing four top models for crowd counting and evaluating their highlights based on their performance. In DSNet, a dilated convolution block network was proposed, in which the dilated layers are densely connected to each other in order to preserve information from continuously varying scales; three blocks are cascaded and linked by dense residual connections to widen the range of scales covered by the network, and a novel multi-scale density level consistency loss was introduced to improve performance. In SFANet, two foremost components were suggested: a VGG backbone CNN as the front-end feature extractor and a dual-path multi-scale fusion network as the back end to produce the density map, in which one path highlights the crowded regions present in images, while the other path is responsible for fusing multi-scale features and generating the final high-quality density maps. In MANet (Multi-scale Attention Network), a new soft attention mechanism was presented, which learns a set of masks, and a scale-aware loss was introduced to regularize and direct the learning of the different branches so that each specializes on a specific scale. In Bayesian Loss, a novel loss function was used to construct a density contribution model from the point annotations. We also analyzed the results of the four convolutional neural networks, extracted patterns in their structures, and identified promising pathways for researchers in this fast-growing area.
Crowd counting is an important part of crowd analysis and is of great significance to crowd control and management. Convolutional neural network (CNN) based crowd counting methods are widely used to address the insufficient counting accuracy caused by heavy occlusion, background clutter, head scale variation and perspective changes in crowd scenes. The multi-column convolutional neural network (MCNN) is a CNN-based crowd counting method that adapts to head scale variation by constructing a multi-column network composed of three single-column networks whose convolution kernels have different sizes (large, medium and small). However, as the MCNN network is relatively shallow, its receptive field is limited, which affects its adaptability to large scale variations. In addition, due to insufficient training data, MCNN requires a pre-training strategy that pre-trains each single-column network individually and then combines them, which is cumbersome. In this paper, a crowd counting method based on a multi-column dilated convolutional neural network was proposed. Dilated convolution was used to enlarge the receptive field of the network, so as to better adapt to head scale variations. Image patches were obtained by randomly cropping the original training images during each training iteration to further expand the training data, so that training could proceed without tedious pre-training. Experimental results on the public ShanghaiTech dataset showed that the counting accuracy of the proposed method was better than that of MCNN, which demonstrates that this method is more robust to head scale variations in crowd scenes.
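The on-the-fly patch augmentation described above can be sketched as follows — an illustrative NumPy snippet (the patch size and count are arbitrary), not the paper's code:

```python
import numpy as np

def random_patches(image, patch_size, num, rng=None):
    """Randomly crop `num` patches of size `patch_size` from `image`,
    re-sampled every training iteration to expand the training data."""
    if rng is None:
        rng = np.random.default_rng()
    h, w = image.shape[:2]
    ph, pw = patch_size
    patches = []
    for _ in range(num):
        y = rng.integers(0, h - ph + 1)   # top-left corner, chosen uniformly
        x = rng.integers(0, w - pw + 1)
        patches.append(image[y:y + ph, x:x + pw])
    return np.stack(patches)

img = np.arange(100 * 150).reshape(100, 150)
batch = random_patches(img, (64, 64), num=8, rng=np.random.default_rng(0))
print(batch.shape)  # (8, 64, 64)
```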
Since accurate early detection of malignant lung nodules can greatly enhance patient survival, detecting early stage lung cancer with chest computed tomography (CT) scans has been a major problem for the last couple of decades, and automated lung cancer detection techniques are therefore important. However, it is a significant challenge to accurately detect lung cancer at an early stage due to substantial similarities between the structures of benign and malignant lung nodules; the major task is to reduce the false positive and false negative results in lung cancer detection. Recent advances in convolutional neural network (CNN) models have improved image detection and classification for many tasks. In this study, we present a deep learning-based framework for automated lung cancer detection. The proposed framework works in multiple stages on 3D lung CT scans to detect nodules and determine their malignancy. Considering the 3D nature of lung CT data and the compactness of the mixed link network (MixNet), a deep 3D Faster R-CNN and a U-Net encoder-decoder with MixNet were designed to detect lung nodules and learn their features, respectively. For the classification of the nodules, a gradient boosting machine (GBM) with 3D MixNet was proposed. The system was tested, using statistical measures, against manually drawn radiologist contours on 1200 images obtained from LIDC-IDRI containing 3250 nodules; this set comprises equal numbers of benign and malignant lung nodules. Evaluated on this data set, the proposed system achieved a sensitivity of 94%, a specificity of 90% and an area under the receiver operating characteristic curve of 0.99, obtaining better results than existing methods.
Speech recognition has always been one of the research focuses in the field of human-computer communication and interaction. The main purpose of automatic speech recognition (ASR) is to convert speech waveform signals into text. The acoustic model is the main component of an ASR system; it connects the observed features of the speech signal with the speech modeling units. In recent years, deep learning has become the mainstream technology in the field of speech recognition. In this paper, a convolutional neural network architecture composed of VGG and the Connectionist Temporal Classification (CTC) loss function was proposed as the acoustic model for speech recognition. Traditional acoustic model training is based on frame-level labels with the cross-entropy criterion, which requires a tedious label alignment procedure. The CTC loss was adopted to automatically learn the alignments between speech frames and label sequences, making the training process end-to-end. The architecture can exploit the temporal and spectral structures of speech signals simultaneously. Batch normalization (BN) was used to normalize each layer's input and reduce internal covariate shift, and dropout was used during training to prevent overfitting and improve the network's generalization ability. The speech signal was transformed into a spectral image through a series of processing steps to serve as the input of the neural network. The input features have 200 dimensions, and the acoustic model outputs 415 Chinese pronunciation units without tone. The experimental results demonstrate that the proposed model achieves character error rates (CER) of 17.97% and 23.86% on the public Mandarin speech corpora AISHELL-1 and ST-CMDS-20170001_1, respectively.
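The alignment-free property of CTC rests on a simple decoding rule — merge repeated labels, then drop blanks — which a few lines of Python can illustrate (a sketch of standard CTC collapsing, not the paper's code):

```python
def ctc_collapse(path, blank=0):
    """Greedy CTC decoding: merge consecutive repeats, then remove blanks."""
    out, prev = [], None
    for p in path:
        if p != prev and p != blank:
            out.append(p)
        prev = p
    return out

# The frame-level path "- a a - b b b -" collapses to the labels "a b":
print(ctc_collapse([0, 1, 1, 0, 2, 2, 2, 0]))  # [1, 2]
# Repeats separated by a blank stay distinct: "a - a" -> "a a":
print(ctc_collapse([1, 0, 1]))                 # [1, 1]
```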
An image encryption method combining a chaotic map and the Arnold transform in the gyrator transform domain was
proposed. Firstly, the original secret image is XOR-ed with a random binary sequence generated by a logistic map. Then,
the gyrator transform is performed. Finally, the amplitude and phase of the gyrator transform are permutated by Arnold
transform. The decryption procedure is the inverse operation of encryption. The secret keys used in the proposed method
include the control parameter and the initial value of the logistic map, the rotation angle of the gyrator transform, and the
transform number of the Arnold transform. Therefore, the key space is large, while the key data volume is small. The
numerical simulation was conducted to demonstrate the effectiveness of the proposed method and the security analysis
was performed in terms of the histogram of the encrypted image, the sensitivity to the secret keys, decryption upon
ciphertext loss, and resistance to the chosen-plaintext attack.
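The XOR and permutation stages of this scheme can be sketched in NumPy. This is an illustrative toy: the gyrator transform stage is omitted, the key values (mu, x0, iteration counts) are hypothetical, and the point is only that both stages are exactly invertible given the keys:

```python
import numpy as np

def logistic_keystream(shape, mu=3.99, x0=0.37, burn=100):
    # Chaotic logistic map x -> mu * x * (1 - x); (mu, x0) act as secret keys.
    x, n = x0, int(np.prod(shape))
    for _ in range(burn):          # discard transient iterations
        x = mu * x * (1.0 - x)
    ks = np.empty(n)
    for i in range(n):
        x = mu * x * (1.0 - x)
        ks[i] = x
    return (ks.reshape(shape) * 256).astype(np.uint8)

def arnold(img, times=1):
    # Arnold cat map on an N x N image: (x, y) -> (x + y, x + 2y) mod N.
    n, out = img.shape[0], img
    for _ in range(times):
        x, y = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
        nxt = np.empty_like(out)
        nxt[(x + y) % n, (x + 2 * y) % n] = out[x, y]
        out = nxt
    return out

def arnold_inverse(img, times=1):
    n, out = img.shape[0], img
    for _ in range(times):
        x, y = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
        nxt = np.empty_like(out)
        nxt[x, y] = out[(x + y) % n, (x + 2 * y) % n]
        out = nxt
    return out

img = (np.arange(64, dtype=np.uint8).reshape(8, 8) * 3) % 256
cipher = arnold(img ^ logistic_keystream(img.shape), times=5)
plain = arnold_inverse(cipher, times=5) ^ logistic_keystream(img.shape)
print(np.array_equal(plain, img))  # True: decryption inverts encryption
```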
For gyrator transform-based image encryption, besides the random operations, the rotation angles used in the gyrator transforms are also taken as secret keys, which makes such cryptosystems more secure. To analyze the security of such cryptosystems, one may start by analyzing the security of a single gyrator transform. In this paper, the security of gyrator transform-based image encryption against chosen-plaintext attack was discussed in theory. By using impulse functions as the chosen plaintexts, it was concluded that: (1) for a single gyrator transform, the rotation angle can be obtained very easily and efficiently with a chosen plaintext; (2) for image encryption with a single random phase encoding and a single gyrator transform, it is hard to find the rotation angle directly with a chosen-plaintext attack. However, assuming the value of one element in the random phase mask is known, the rotation angle can be obtained very easily with a chosen-plaintext attack, and the random phase mask can also be recovered. Furthermore, by exhaustively searching the value of one element in the random phase mask, the rotation angle as well as the random phase mask may be recovered. The relationship obtained here between the rotation angle and the random phase mask, for image encryption with a single random phase encoding and a single gyrator transform, may be useful for further study of the security of iterative random operations in the gyrator transform domains.
An image hiding method based on cascaded iterative Fourier transform and public-key encryption
algorithm was proposed. Firstly, the original secret image was encrypted into two phase-only masks
M1 and M2 via cascaded iterative Fourier transform (CIFT) algorithm. Then, the public-key
encryption algorithm RSA was adopted to encrypt M2 into M2'. Finally, a host image was
enlarged by extending one pixel into 2×2 pixels and each element in M1 and M2' was
multiplied with a superimposition coefficient and added to or subtracted from two different elements in
the 2×2 pixels of the enlarged host image. To recover the secret image from the stego-image, the
two masks were extracted from the stego-image without the original host image. By applying
public-key encryption algorithm, the key distribution was facilitated. Moreover, compared with image
hiding methods based on optical interference, the proposed method can achieve higher robustness by
employing the characteristics of the CIFT algorithm. Computer simulations show that this method has
good robustness against image processing.
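The RSA step that turns M2 into M2' can be illustrated with textbook RSA on toy primes (the primes, exponent, and sample value below are hypothetical; real use requires large primes and proper padding):

```python
# Textbook RSA key generation and a single encrypt/decrypt round trip.
p, q = 61, 53                  # toy primes (hypothetical)
n = p * q                      # public modulus
phi = (p - 1) * (q - 1)
e = 17                         # public exponent, coprime with phi
d = pow(e, -1, phi)            # private exponent via modular inverse

m = 123                        # stands in for one element of mask M2
c = pow(m, e, n)               # encryption: c = m^e mod n -> element of M2'
print(pow(c, d, n) == m)       # True: decryption recovers the element
```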
In this paper, a double-random phase-encoding based image hiding method was employed to encrypt and hide text. The
ASCII codes of the secret text were converted to binary and then transformed into a 2-dimensional array in the
form of an image. Each element in the transformed array has a value between 0 and 255; either the highest 2 bits or the
highest 4 bits store the binary bits of the text information, while the lower bits are filled with binary bits.
Then, the double-random phase-encoding method was used to encode the transformed array, and the encoded array was
hidden into an expanded cover image to achieve text information hiding. Experimental results show that the secret text
can be recovered accurately, with ratios of 100% and 99.89% when the text bits are stored in the
highest 2 bits and the highest 4 bits of the transformed array, respectively. By employing the optical information
processing method, the proposed method can improve the security of text information transmission while keeping the hidden text highly imperceptible.
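The high-bits storage step can be sketched as a round trip in Python (an illustrative sketch, not the paper's code; here the unused lower bits are simply zero-padded):

```python
import numpy as np

def pack_text(text, bits=2):
    # Turn ASCII codes into a bit stream, then store `bits` bits per array
    # element in the element's highest bits; lower bits are zero here.
    bitstream = []
    for ch in text.encode("ascii"):
        bitstream.extend((ch >> i) & 1 for i in range(7, -1, -1))
    while len(bitstream) % bits:
        bitstream.append(0)
    vals = []
    for i in range(0, len(bitstream), bits):
        chunk = 0
        for b in bitstream[i:i + bits]:
            chunk = (chunk << 1) | b
        vals.append(chunk << (8 - bits))   # place chunk in the highest bits
    return np.array(vals, dtype=np.uint8)

def unpack_text(arr, bits=2):
    # Read the highest bits of each element back into bytes.
    bitstream = []
    for v in arr:
        top = int(v) >> (8 - bits)
        bitstream.extend((top >> i) & 1 for i in range(bits - 1, -1, -1))
    chars = []
    for i in range(0, len(bitstream) - len(bitstream) % 8, 8):
        byte = 0
        for b in bitstream[i:i + 8]:
            byte = (byte << 1) | b
        if byte:                            # skip zero padding
            chars.append(chr(byte))
    return "".join(chars)

arr = pack_text("secret", bits=2)
print(unpack_text(arr, bits=2))  # secret
```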
The binary phase-only filter (BPOF) based watermarking for image authentication was proposed earlier and shows good performance. In this paper, three image self-authentication algorithms based on the BPOF, with watermark embedding methods different from those in Reference 5, are proposed. The BPOF of an image is used as the watermark and embedded into the image's Fourier spectrum, either by adding it to the magnitude or by quantizing the magnitude of the Fourier spectrum. For image authentication, either the correlation between the BPOF of the test image and its Fourier magnitude spectrum is computed, or the correlation between the phase information of the test image and the candidate watermark extracted from it is calculated, according to the embedding method used in the embedding stage. These embedding methods expand the feasible range of watermark embedding strength, which is very important for the security and robustness of digital watermarking. The performance of the three algorithms is evaluated via computer simulation.
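One plausible reading of the BPOF construction and the additive magnitude embedding can be sketched in NumPy (the binarization rule, the embedding strength alpha, and the idealized extraction are assumptions, not the paper's exact algorithm):

```python
import numpy as np

def bpof(img):
    # Binary phase-only filter: binarize the Fourier phase by the sign of
    # the spectrum's real part (one common choice of binarization).
    F = np.fft.fft2(img.astype(float))
    return np.where(F.real >= 0, 1.0, -1.0)

rng = np.random.default_rng(1)
img = rng.integers(0, 256, size=(16, 16))
w = bpof(img)                        # the watermark is the image's own BPOF

# Additive embedding into the magnitude spectrum; alpha is a hypothetical
# embedding strength (larger alpha: more robust, less imperceptible).
F = np.fft.fft2(img.astype(float))
alpha = 2.0
stego_mag = np.abs(F) + alpha * w

# Idealized extraction (assumes the original magnitude is known), followed
# by correlation-based authentication against the BPOF watermark:
extracted = np.sign(stego_mag - np.abs(F))
corr = np.mean(extracted * w)
print(corr)  # 1.0 for an unmodified stego spectrum
```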