COVID-19 detection and heatmap generation in chest x-ray images

Abstract. Purpose: The outbreak of COVID-19, or coronavirus disease, was first reported in 2019. It has spread widely and rapidly around the world. Detecting COVID-19 cases is one of the important factors in stopping the epidemic, because infected individuals must be quarantined. One reliable way to detect COVID-19 cases is to use chest x-ray images, where signals of the infection are located in the lung areas. We propose a solution to automatically classify COVID-19 cases in chest x-ray images. Approach: The ResNet-101 architecture, with more than 44 million parameters, is adopted as the main network. The whole network is trained on large 1500 × 1500 x-ray images. A heatmap restricted to the region of interest of the segmented lungs is constructed to visualize and emphasize signals of COVID-19 in each input x-ray image. Lungs are segmented using a pretrained U-Net. The confidence score of being COVID-19 is also calculated for each classification result. Results: The proposed solution is evaluated on COVID-19 and normal cases. It is also tested on unseen classes to validate the generalization of the constructed model. These include other normal cases, where chest x-ray images are normal without any disease but with some small remarks, and other abnormal cases, where chest x-ray images are abnormal with some other diseases containing remarks similar to COVID-19. The proposed method achieves sensitivity, specificity, and accuracy of 97%, 98%, and 98%, respectively. Conclusions: The proposed solution can detect COVID-19 in a chest x-ray image. The heatmap and confidence score of the detection are also demonstrated, so that users or human experts can use them for a final diagnosis in practice.

The rest of this paper is organized as follows. Section 2 explains the proposed method of classifying COVID-19 in chest x-ray images. Section 3 illustrates the experimental results in various scenarios. Then, results are discussed in Sec. 4, and conclusions are summarized in Sec. 5.

Materials and Methods
Figure 1 shows an overview of the proposed framework. The training, validation, and testing chest x-ray images are resized to 1500 × 1500 pixels. 20 Then, real-time data augmentation is applied to the original training images. Both original and augmented training images are used to train the classification model, based on the backbone architecture explained below. The trained model is then validated on the original validation images. If the validation result has converged or the maximum number of epochs is reached, the training and validation processes stop and the final model is kept. Otherwise, the process returns to the data augmentation step and repeats for the next epoch.
In the testing phase, the trained model is applied to each resized chest x-ray image to compute the predicted class, its confidence score, and a heatmap. The details are explained in the following sections.
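The train/validate loop described above can be sketched as follows. This is a minimal illustration, not the authors' code: `train_one_epoch` and `validate` are hypothetical callables, and "converged" is assumed here to mean that the validation loss has not improved by more than `tol` for `patience` consecutive epochs, since the paper does not state its convergence criterion.

```python
def fit(train_one_epoch, validate, max_epochs=100, patience=5, tol=1e-4):
    """Run epochs until the validation loss stops improving or max_epochs is hit.

    train_one_epoch(epoch) -> None  (hypothetical: augments data and trains)
    validate(epoch) -> float        (hypothetical: returns the validation loss)
    Returns (best_loss, epochs_run).
    """
    best_loss = float("inf")
    stale = 0          # consecutive epochs without a meaningful improvement
    epochs_run = 0
    for epoch in range(max_epochs):
        train_one_epoch(epoch)      # augmentation is re-drawn inside each epoch
        loss = validate(epoch)
        epochs_run = epoch + 1
        if loss < best_loss - tol:
            best_loss = loss
            stale = 0
        else:
            stale += 1
        if stale >= patience:       # assumed convergence criterion
            break
    return best_loss, epochs_run
```

With `patience=2` and validation losses 1.0, 0.5, 0.4, 0.4, 0.4, the loop stops after the fifth epoch and keeps 0.4 as the best loss.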

Backbone Architecture
The ResNet-101 21,22 is adopted in our proposed method as the backbone architecture of the COVID-19 classification model. It is a very deep network whose layers are listed in Table 1, 23 with more than 44 million parameters.
As shown in Table 1, each square bracket represents a building block of convolutional layers, each parametrized by its kernel size and number of filters. For example, a kernel size of 7 × 7 means that the height and width of the two-dimensional convolution window are both 7. The number of filters is the dimensionality of the output space, that is, the number of output channels of the convolutional layer. 24 The third column gives the number of times each building block is repeated in the corresponding layer group. For example, in the layer named "conv5," each building block contains three convolutional layers connected in sequence, as shown in Fig. 2.
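To make the kernel-size and filter notation concrete, the sketch below counts the convolution weights in one three-layer building block. The widths used (1 × 1 with 512 filters, 3 × 3 with 512 filters, 1 × 1 with 2048 filters) are assumed from the standard ResNet-101 "conv5" stage rather than taken from Table 1, and biases, batch normalization, and shortcut projections are ignored.

```python
def conv_weights(kernel, in_channels, filters):
    """Number of weights in a kernel x kernel convolution (bias excluded)."""
    return kernel * kernel * in_channels * filters


def bottleneck_weights(in_channels, mid, out):
    """Weights in a 1x1 -> 3x3 -> 1x1 building block."""
    return (conv_weights(1, in_channels, mid)
            + conv_weights(3, mid, mid)
            + conv_weights(1, mid, out))


# Assumed standard ResNet-101 conv5 widths; the first block receives 1024
# channels from the previous stage, later blocks receive 2048.
first_block = bottleneck_weights(1024, 512, 2048)
later_block = bottleneck_weights(2048, 512, 2048)
```

Under these assumptions the first block holds 3,932,160 weights and each later block 4,456,448, which shows how a few wide blocks account for a large share of the network's parameters.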
The input to the network is a chest x-ray image, as shown in Fig. 1. A version of ResNet-101 is available with weights pretrained on the ImageNet dataset, a large-scale classification dataset containing 1.2 million training images from 1000 classes of objects. 22 However, the input image size is then limited to the pretraining requirement of 224 × 224 pixels. Such a low resolution may not be sufficient to differentiate the normal class from the other-normal class.
In this paper, the ResNet-101 is trained from scratch using large input images of 1500 × 1500 pixels. The top part (i.e., classification part) of ResNet-101 is replaced with global average pooling, softmax, and output layers. Five types of data augmentation are applied to the training dataset: zoom, rotate, shear, flip, and shift. 25 The proposed solution develops two types of models, which differ in the output layer. The first model classifies the COVID-19 class against any non-COVID-19 class and has two nodes in the output layer. The second model classifies a chest x-ray image into three classes and has three nodes in the output layer: COVID-19, normal without any diseases, and other normal with some other diseases or remarks.
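The replaced top part can be illustrated with a small, dependency-free sketch: global average pooling reduces each feature map to one number, a dense layer maps these to one logit per class, and softmax turns the logits into confidence scores. The function names, the toy feature maps, and the weights are illustrative assumptions, not the trained model.

```python
import math


def global_average_pool(feature_maps):
    """feature_maps: list of 2-D lists (one per channel) -> list of channel means."""
    return [sum(sum(row) for row in fm) / (len(fm) * len(fm[0]))
            for fm in feature_maps]


def dense(x, weights, bias):
    """weights[j][i] multiplies input i when computing output node j."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax; the outputs sum to 1."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy example: two 2x2 feature maps and a two-node output layer
# (COVID-19 vs. non-COVID-19); identity weights are chosen for readability.
pooled = global_average_pool([[[1, 3], [5, 7]], [[0, 0], [0, 4]]])
scores = softmax(dense(pooled, [[1, 0], [0, 1]], [0, 0]))
```

For the three-class model, only the number of rows in `weights` and entries in `bias` changes from two to three.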

Lung Segmentation
In this paper, lung segmentation is required in the heatmap visualization step, where the color maps are shown only in the area of the segmented lungs. A pretrained U-Net-based model 26 is adopted for lung segmentation, since it has been used successfully for medical image segmentation. The U-Net contains two main operations: convolution and transposed convolution. The transposed convolution increases the spatial resolution of its input (upsampling). The network is called U-Net because its architecture looks like a U shape, where the front side of the U-shape contains convolution layers for downsampling and the back side of the U-shape contains transposed convolution layers for upsampling. The convolution and transposed convolution layers of the U-shape are summarized in Table 2.
The input layer is connected to the first building block of the front side of the U-shape, and the output layer of two nodes (i.e., lung and non-lung nodes) is connected to the last building block of the back side of the U-shape. The pretrained U-Net is adopted for the lung segmentation. 27 It was reported to achieve Dice similarity coefficients of 0.985 and 0.972 on the Montgomery and JSRT datasets, 28 respectively. In addition, the average size of the segmented lungs is about 29.8% of the original size of the input images.
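For reference, the Dice similarity coefficient between a predicted mask and a ground-truth mask is twice the overlap divided by the sum of the two mask areas. The sketch below computes it, together with the share of the image covered by the lung mask, on binary masks stored as nested lists. The helper names and toy masks are illustrative.

```python
def dice_coefficient(pred, truth):
    """Dice = 2|A ∩ B| / (|A| + |B|) for two binary masks of equal shape."""
    inter = pred_sum = truth_sum = 0
    for prow, trow in zip(pred, truth):
        for p, t in zip(prow, trow):
            inter += 1 if (p and t) else 0
            pred_sum += 1 if p else 0
            truth_sum += 1 if t else 0
    if pred_sum + truth_sum == 0:
        return 1.0  # two empty masks are treated as a perfect match
    return 2.0 * inter / (pred_sum + truth_sum)


def lung_fraction(mask):
    """Share of pixels labelled as lung (1) in a binary mask."""
    total = sum(len(row) for row in mask)
    return sum(1 for row in mask for v in row if v) / total


pred = [[1, 1, 0], [0, 1, 0]]
truth = [[1, 0, 0], [0, 1, 1]]
dice = dice_coefficient(pred, truth)   # 2 * 2 / (3 + 3)
share = lung_fraction(pred)            # 3 of 6 pixels
```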

Heatmap Generation
As shown in Fig. 1, the heatmap is generated for each test x-ray image. Since there are many layers and a large number of filters, the average of the filters' weights of the last convolutional layer is calculated and visualized. This is because they could represent the feature maps directly.
The key steps are listed below.
• A test chest x-ray image is fed into the trained ResNet-101 model. The predicted filters' weights are also computed at this stage.
• All the filters' weights in the last convolutional layer are extracted.
• The average weight from all filters' weights is calculated.
• The average weight is used as a mask on the test chest x-ray image to generate the heatmap.
• The heatmap is visualized only on the lung areas segmented by the pretrained U-Net.
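The steps above can be sketched as follows. This is a simplified illustration under stated assumptions: the "filters' weights" are interpreted as the per-filter output maps of the last convolutional layer, the averaged map is min-max normalized and enlarged by nearest-neighbor upsampling (the paper does not specify either choice), and all function names are hypothetical.

```python
def average_maps(feature_maps):
    """Element-wise mean over all per-filter maps of the last conv layer."""
    n = len(feature_maps)
    h, w = len(feature_maps[0]), len(feature_maps[0][0])
    return [[sum(fm[r][c] for fm in feature_maps) / n for c in range(w)]
            for r in range(h)]


def normalize(m):
    """Min-max scale a 2-D map to [0, 1]; a constant map becomes all zeros."""
    flat = [v for row in m for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0.0 for _ in row] for row in m]
    return [[(v - lo) / (hi - lo) for v in row] for row in m]


def upsample_nearest(m, out_h, out_w):
    """Nearest-neighbor resize of a 2-D map to out_h x out_w."""
    h, w = len(m), len(m[0])
    return [[m[r * h // out_h][c * w // out_w] for c in range(out_w)]
            for r in range(out_h)]


def lung_heatmap(feature_maps, lung_mask):
    """Average, normalize, upsample, then keep values only inside the lung mask."""
    out_h, out_w = len(lung_mask), len(lung_mask[0])
    heat = upsample_nearest(normalize(average_maps(feature_maps)), out_h, out_w)
    return [[heat[r][c] if lung_mask[r][c] else 0.0 for c in range(out_w)]
            for r in range(out_h)]
```

Here `lung_mask` stands for the binary output of the pretrained U-Net at the resolution of the input image; pixels outside the lungs are set to zero so that no color is drawn there.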

Results
This section explains and discusses our experimental results in different scenarios. Both our own dataset and a published dataset 4 are used in the experiments, as shown in Table 3. From the published dataset, only chest x-ray images with COVID-19 are used, because they serve to validate the cross-dataset scenario of COVID-19 detection.
In addition, in our D1 dataset, the 142 images of COVID-19 cases cover three levels of severity: (1) 22 images at the severe level, (2) 13 images at the moderate level, and (3) 107 images at the mild level. Each patient case has only one image taken in each instance. In the training and testing processes, each individual image is fed as input to the CNN-based model one at a time.
The five datasets listed in Table 3 are used in our four experimental scenarios, as described below. The results are reported in terms of confusion matrix, accuracy, sensitivity, and specificity.
• Scenario 1. Two-class prediction: COVID-19 (class 1) and non-COVID-19 (class 2); train and validate: 100 images from D1 (class 1) and 100 images from D2 (class 2); test: 42 images from D1 (class 1), 5118 images from D2 (class 2), 100 images from D3 (class 2), 100 images from D4 (class 2).
Images from the five datasets (D1 to D5) are independently split into two subsets, (1) a training and validation set and (2) a testing set, as specified in each scenario. The training and validation set is then further randomly split into a training set and a validation set with proportions of 90% and 10%, respectively, in each epoch of the CNN training phase. Thus, in all cases, the images in the training, validation, and testing sets are independent and non-overlapping. The numbers of independent images in the training, validation, and testing arrangements of each scenario are summarized in Table 4.
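The per-epoch 90%/10% split can be sketched with the standard library alone. The function name and the use of integer identifiers in place of images are illustrative assumptions; the sketch only shows that the two subsets drawn in one epoch are disjoint and together cover the whole training and validation pool.

```python
import random


def split_train_val(items, val_fraction=0.1, rng=None):
    """Shuffle a copy of items and return (train, val) with val_fraction held out."""
    rng = rng or random.Random()
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_val = round(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]


# Scenario 1 pool: 100 images from D1 plus 100 images from D2,
# represented here by integer ids 0..199.
train_ids, val_ids = split_train_val(range(200), rng=random.Random(0))
```

For the 200 images of scenario 1, each epoch therefore trains on 180 images and validates on 20, while the test images are never part of this pool.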
In this paper, the positive class is assigned when the confidence value predicted by the trained CNN-based model is higher than the cut-off score. Descriptive statistical analysis is used to report the results in terms of sensitivity, specificity, and accuracy. These performances are also compared with other existing methods in the literature.
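As a minimal sketch of this decision rule and of the reported metrics, the code below thresholds confidence scores at a cut-off and derives sensitivity, specificity, and accuracy from the resulting confusion counts. The scores and labels are made-up toy values for illustration only, not data from the paper.

```python
def classify(scores, cutoff=0.5):
    """Positive (COVID-19) when the confidence score exceeds the cut-off."""
    return [s > cutoff for s in scores]


def confusion(pred_positive, is_positive):
    """Return (tp, fp, tn, fn) for boolean predictions and labels."""
    tp = fp = tn = fn = 0
    for p, y in zip(pred_positive, is_positive):
        if p and y:
            tp += 1
        elif p and not y:
            fp += 1
        elif not p and y:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def metrics(tp, fp, tn, fn):
    """Assumes at least one positive and one negative sample."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


# Toy values (illustrative only): three COVID-19 and three non-COVID-19 images.
scores = [0.95, 0.92, 0.60, 0.55, 0.20, 0.05]
labels = [True, True, True, False, False, False]
at_50 = metrics(*confusion(classify(scores, 0.5), labels))
at_90 = metrics(*confusion(classify(scores, 0.9), labels))
```

In this toy case, raising the cut-off from 50% to 90% lifts the specificity from 2/3 to 1 but lowers the sensitivity from 1 to 2/3, which is the kind of trade-off a higher cut-off introduces.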

Scenario 1
This scenario is designed to validate the constructed model on the two classes of COVID-19 and non-COVID-19. The model is trained, validated, and tested on chest x-ray images of COVID-19 cases and of normal cases without any diseases or remarks. It is also tested on the unseen (untrained) datasets D3 and D4, which contain other diseases or remarks similar to COVID-19. The confusion matrix is shown in Table 5. Considering only the trained datasets (i.e., D1 and D2), the sensitivity and specificity are 97% and 98%, respectively. However, when both seen and unseen datasets are taken into consideration (i.e., D1, D2, D3, and D4), the specificity drops, especially on the unseen datasets D3 and D4. Rather than using the predictions directly from the output layer (as in Table 5), the COVID-19 predictions can instead be made using a cut-off of 90% on the confidence score. The resulting confusion matrix is shown in Table 6. Since the cut-off score for COVID-19 is raised, the specificity increases on both seen and unseen datasets. However, the specificity on the unseen datasets is still not promising. A side effect is that the sensitivity decreases, and lowering the sensitivity is not sensible for medical diagnosis. Although the unseen datasets contain only non-COVID-19 images, these images can be confused with the COVID-19 class because they show diseases or remarks on the lungs similar to COVID-19 cases. Therefore, in scenario 2, the datasets D3 and D4 are included in the non-COVID-19 class for training.

Scenario 2
Scenario 2 is designed to extend scenario 1 by adding the datasets D3 and D4 to the training process, so that the model can also learn the non-COVID-19 cases that show remarks of other diseases on the lungs. The confusion matrix is shown in Table 7. The specificity on the datasets D3 and D4 is now significantly higher than in scenario 1, because these datasets are now also used in the learning process. However, the sensitivity is lower than in scenario 1. This could be because the COVID-19 images are confused with the images of other diseases. This can be addressed by splitting the problem into three classes instead of two, which is discussed in scenario 4.
Additional experiments are conducted in this scenario to examine the tradeoff between accuracy and training time when increasing the size of the input images. The results are as follows: (1) using 1500 × 1500 pixels, the accuracy of predicting the COVID-19 class is 73%, the accuracy of predicting the non-COVID-19 class is 93%, and the training time is 2 h and 27 min; (2) using 1000 × 1000 pixels, the accuracy of predicting the COVID-19 class is 40%, the accuracy of predicting the non-COVID-19 class is 100%, and the training time is 1 h and 13 min; (3) using 500 × 500 pixels, the accuracy of predicting the COVID-19 class is 0%, the accuracy of predicting the non-COVID-19 class is 100%, and the training time is 34 min. The models are trained using an NVIDIA V100 Tensor Core GPU. The testing time, however, is very short and does not differ significantly among these three input sizes. Therefore, in this paper, the size of 1500 × 1500 pixels is used, as it is the maximum size that our machine's memory can handle in the training process.

Scenario 3
Scenario 3 is designed to test the trained model from scenario 1 on the COVID-19 chest x-ray images of an unseen dataset (i.e., D5). The classification results are calculated using two different cut-off values of 50% and 90% on the confidence scores, as shown in Table 8. The proposed solution achieves a high sensitivity of 93% in the cross-dataset scenario, where D1 and D2 are used for training and validation but the unseen D5 is used for testing. This shows the generalization of the constructed COVID-19 classification model.

Scenario 4
The scenario 4 is designed for the experiment of classifying chest x-ray images into three classes including COVID-19 (class 1), normal (class 2), and other normal (class 3). The confusion matrix is shown in Table 9.
As shown in Table 9, class 1 (COVID-19) and class 3 (other normal) are confused with each other to some extent. This is because class 3 contains chest x-ray images with remarks similar to COVID-19. They were recorded from elderly patients with minimal fibrosis and spondylosis of the spine, and from patients with other diseases including tuberculosis, pneumonia, and pulmonary edema. In addition, class 2 is confused with class 3 because they share the common features of non-COVID-19.

Comparisons
Table 10 shows the experimental results of the proposed method and other existing methods in the literature. This is considered an indirect comparison, since the methods are tested on different datasets. The performances of all methods are comparable. However, the proposed method achieves the best average score (97.7%) over the three values of sensitivity, specificity, and accuracy.
Using large input images, our proposed method achieves better performance than with smaller input images. There are two main reasons. First, the signals of COVID-19 in each image span a larger number of pixels. This is useful in the training process, especially when the proportion of COVID-19 signals is small. Second, the distortion from reducing the size of the original image is smaller because the reduction ratio is smaller.
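The first reason can be made concrete with a short calculation. Assuming the reported average lung share of 29.8% holds at any resolution (an idealization), the sketch compares the number of lung pixels at the 1500 × 1500 input used here with the 224 × 224 input required by the ImageNet-pretrained weights. The function name is illustrative.

```python
LUNG_FRACTION = 0.298  # average segmented-lung share reported in the paper


def lung_pixels(side, fraction=LUNG_FRACTION):
    """Approximate number of lung pixels in a side x side image."""
    return side * side * fraction


large = lung_pixels(1500)   # about 670,500 pixels
small = lung_pixels(224)    # about 14,952 pixels
ratio = large / small       # (1500 / 224) ** 2, roughly 44.8
```

Under this assumption, the lungs occupy roughly 45 times more pixels at the larger input size, so small findings are represented by correspondingly more pixels.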

Heatmaps and Confidence Scores
As shown in Fig. 1, the heatmap is computed for each test chest x-ray image to emphasize high-weight signals of COVID-19. The filters' weights on the final convolutional layer are extracted to compute the final heatmap. Sample filters' weights of one test chest x-ray image are shown in Fig. 3. In this example, the high weights (i.e., yellow color) are located around the lung regions, because COVID-19 could damage the lungs.
Then, the final heatmap is generated by averaging these filters' weights. It is computed for each individual test chest x-ray image. Sample final heatmaps are shown in Fig. 4. The first three heatmaps are computed from chest x-ray images with COVID-19. The high-weight yellow patches are regions detected by the trained model as signals of COVID-19. Medical experts can concentrate on these regions for a final check of the disease. The confidence scores of test images with COVID-19 and of test images with non-COVID-19 are shown in Fig. 5. It is clearly seen that the test images with COVID-19 are classified correctly as COVID-19 with very high confidence scores of above 90%. Also, the test images with non-COVID-19 are classified correctly, with very low confidence scores of being COVID-19 of below 30% or, in other words, with very high confidence scores of being non-COVID-19 of above 70%.

Discussion
The original model trained with the proposed method classifies a chest x-ray image into the two classes of COVID-19 and non-COVID-19. The training and validation samples of the non-COVID-19 class are normal chest x-ray images without any remarks or diseases. This model achieves high performance when tested on COVID-19 and normal chest x-ray images, mainly because the patterns of COVID-19 and normal cases are seen in the training process. However, its performance drops significantly when it is tested on chest x-ray images with other remarks or diseases such as fibrosis, spondylosis of the spine, tuberculosis, pneumonia, and pulmonary edema. This is expected, because patterns of other remarks or diseases are not seen and learned in the training process. Also, they occur in the lung regions, as COVID-19 does, so this model can easily confuse them with COVID-19. This can result in many false-positive detections, which would not be acceptable in practice. Therefore, the model is further improved by adding sample chest x-ray images containing other remarks and diseases to the training and validation processes. This lets the model learn the differences between patterns of COVID-19 and patterns of other diseases. It increases the specificity and reduces the false detection of COVID-19. However, the sensitivity is also lower than that of the original trained model.
Then, the next version of the developed model is trained to classify a chest x-ray image into three classes of COVID-19, normal, and other diseases. This could maintain the sensitivity score and increase the specificity score. This is because separating the other diseases class from the non-COVID-19 class can reduce the confusion between COVID-19 and other diseases, and the confusion between normal and other diseases. As shown in Fig. 5, the COVID-19 cases are clearly separated from the non-COVID-19 cases (i.e., normal and other diseases), with very high confidence scores.
As additionally reviewed by the expert, the generated heatmaps of the COVID-19 cases could identify areas of COVID-19 correctly. However, some heatmaps of the false-positive cases are reported incorrectly, as shown in Fig. 6. The first two images are normal cases and the last image contains another abnormality. The confidence scores of being COVID-19 of the three images are all higher than 50%. However, with the cut-off of 90%, only the second image is wrongly classified as COVID-19. In addition, for these three cases, the heatmaps incorrectly highlight areas of the non-COVID-19 images as COVID-19 (i.e., yellow areas).
However, none of the developed models achieves a perfect performance of 100% accuracy. Thus, they should be adopted for the prefiltering of normal cases, by cutting off chest x-ray images that are classified as COVID-19 with very low scores, that is, images that have high confidence of being non-COVID-19. In this way, the models can be used to reduce the number of chest x-ray images that must be manually diagnosed by human experts.
In addition, the heatmap is generated to emphasize possible areas of being COVID-19 in each chest x-ray image. This can be an assistive tool for human experts to be used together with the computed confidence score, to conclude the final diagnosis.

Conclusions
This paper presents a solution for COVID-19 classification in chest x-ray images. Its backbone CNN architecture is developed using ResNet-101. The model is trained from scratch with a large network input size of 1500 × 1500 pixels. Data augmentation is also applied to the original training images to enhance the regularization of the model. The solution is developed in two versions of classification: two-classes-based and three-classes-based. The two-classes-based version classifies chest x-ray images into COVID-19 and non-COVID-19. The three-classes-based version classifies chest x-ray images into COVID-19, normal, and other abnormal. The proposed solution achieves very promising sensitivity, specificity, and accuracy of 97%, 98%, and 98%, respectively. The developed solution can also generate a heatmap with a confidence score of being COVID-19, to emphasize the result on each test image. The heatmap is visualized only on the lung regions segmented using U-Net.