1. INTRODUCTION

Esophageal cancer is the sixth leading cause of cancer-related mortality and the eighth most common cancer worldwide [1]. It affects more than 500,000 people worldwide, and its incidence is rising rapidly [2]. To date, the barium esophagram remains a universal test for patients with dysphagia because it can simultaneously detect morphologic and functional abnormalities in the pharynx and esophagus [3]. Because it is inexpensive, noninvasive, and widely available, the barium esophagram often takes priority over other modalities such as endoscopy in clinical diagnostic selection [4]. Although esophagoscopy is the gold standard for diagnosing esophageal cancer, it is invasive and expensive [5]. The barium esophagram is therefore a cost-effective approach that can serve as an important initial diagnostic test for patients with dysphagia [6,7]. A previous study suggested that the barium esophagram is a sensitive modality for diagnosing esophageal cancer [8]. However, because examinations are interpreted manually by radiologists using conventional visual assessment, inconsistent inter-observer interpretations are inevitable. Furthermore, multi-directional barium esophagrams are often time-consuming to interpret.

Van der Sommen et al. presented a novel algorithm for the automatic detection of early cancerous tissue in high-definition (HD) endoscopy [9]. By computing local color and texture features from the original and Gabor-filtered images, the algorithm achieved a recall of 0.95 and a precision of 0.75 on 38 lesions. Van der Sommen et al. also presented an algorithm that employed specific texture and color filters together with machine learning to detect early neoplastic lesions in Barrett's esophagus [10]. On 100 images from 44 patients with Barrett's esophagus, the system achieved a sensitivity and specificity of 0.83 at the image level and a sensitivity and specificity of 0.86 and 0.87 at the patient level. In the literature above, computer-aided diagnosis systems for esophageal cancer mainly target esophagoscopy. Zhang et al. designed and developed an automatic deep learning system to detect esophageal cancer [5]. That study marked the first use of deep learning to detect esophageal cancer on barium esophagography. While the system achieved remarkable recall, it struggled with barium esophagograms showing irregular, complex morphology, particularly in distinguishing benign esophageal strictures from malignant esophageal carcinoma.

Deep learning algorithms, in particular convolutional neural networks (CNNs), have rapidly become a methodology of choice for analyzing medical images. Girshick et al. proposed R-CNN, an object detection method that combines regions of interest (RoIs) with a CNN [11]. It was the first landmark object detection algorithm based on deep learning. R-CNN first generates 2000 category-independent RoIs containing suspicious objects from the input image and then feeds each RoI into a CNN to extract a fixed-length feature vector. These feature vectors are passed to an SVM, which outputs the probability that each RoI belongs to a given object category. Although R-CNN realized deep-learning-based object detection, the convolution layer parameters cannot be shared across RoIs during feature extraction.
Inspired by SPP-Net [12], Girshick proposed the Fast Region-based Convolutional Neural Network (Fast R-CNN) [13]. This algorithm introduced the RoIPooling layer, which allows the convolution layers used for RoI feature extraction to share parameters. However, the Selective Search algorithm, which runs on the CPU, is still used to extract RoIs, making the overall pipeline slow. Ren et al. proposed Faster R-CNN [14], which uses a Region Proposal Network (RPN) to generate region proposals. The main steps of object detection, namely extracting region proposals from images, extracting features from the proposals, and classifying and regressing the bounding boxes, are all integrated into one network. Since its introduction, Faster R-CNN has shown clear advantages in medical image detection.

Figure 1 shows the main challenges of detecting esophageal cancer in barium esophagram images. Motivated by these challenges and inspired by the work above, we propose a deep learning system to identify positive patient cases and detect esophageal cancer. The system consists of two networks: a Faster R-CNN detection network with Deformable FPN (FRDF) and a barium esophagram classification network with an attention module (BEAM). The remainder of this paper is organized as follows: the dataset and proposed methodology are given in Section 2; the experimental settings, parameter optimization, performance metrics, results analysis, and ablation study are presented in Section 3; a brief discussion is provided in Section 4; and the conclusion and future work are presented in Section 5.

2. METHODS

2.1 Dataset
In this research, the dataset was collected from patients who underwent barium esophagography at Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology from January 2017 to June 2019. Positive and negative groups were defined based on clinical, radiographic, endoscopic, and surgical findings. Before the experiment commenced, cases with poor image quality or with mismatches between the radiographic report and esophagoscopy or pathology were excluded. X-ray images were taken at multiple positions for a comprehensive evaluation of the condition of the esophagus. For the vast majority of patients, images were obtained at three positions: (1) anteroposterior view, (2) right anterior oblique view, and (3) left anterior oblique view. All data were stored in Digital Imaging and Communications in Medicine (DICOM) format. In total, 6445 images were obtained from 200 patients with pathologically confirmed esophageal cancer, and 11352 images were obtained from 299 patients without esophageal cancer. All esophageal cancer lesions were annotated by a single board-certified radiologist. Using the LabelImg software, a rectangular bounding box was drawn on each barium esophagram for esophageal cancer detection. During annotation, the radiographic report, esophagoscopy, or surgical pathology results were consulted side-by-side. The labeled images were then reviewed by another experienced radiologist (with 12 years of experience).
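LabelImg saves each rectangular annotation as a Pascal VOC-style XML file. Below is a minimal sketch of how such a file could be parsed into (label, bounding-box) pairs for training; the tag layout follows the VOC convention, and the label name is hypothetical, since the text does not specify the labels used in this dataset.

```python
import xml.etree.ElementTree as ET

def load_voc_boxes(root):
    """Collect (label, (xmin, ymin, xmax, ymax)) pairs from a VOC annotation."""
    boxes = []
    for obj in root.iter("object"):
        bb = obj.find("bndbox")
        box = tuple(int(float(bb.findtext(k)))
                    for k in ("xmin", "ymin", "xmax", "ymax"))
        boxes.append((obj.findtext("name"), box))
    return boxes

# Toy annotation in the layout LabelImg writes; the label name is hypothetical.
# For a real file, use ET.parse(path).getroot() instead of ET.fromstring.
xml = """<annotation><object><name>esophageal_cancer</name>
<bndbox><xmin>120</xmin><ymin>340</ymin><xmax>260</xmax><ymax>610</ymax></bndbox>
</object></annotation>"""
print(load_voc_boxes(ET.fromstring(xml)))
# [('esophageal_cancer', (120, 340, 260, 610))]
```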
2.2 Algorithm flow
The FRDF extracts features with its backbone network to locate and classify RoIs. This backbone is deep and better suited to locating RoIs than to classifying them. Therefore, this paper proposes a deep learning system that locates and classifies RoIs with separate networks rather than unifying both tasks in the FRDF. First, the barium esophagram images are input into the detection network, which outputs RoIs with confidence scores. RoIs with low confidence are removed, and the rest are sent to the classification network, which judges whether they contain tumors. A barium esophagram image is considered negative if the detection network detects no RoIs or if all detected RoIs are classified as negative; otherwise, it is considered positive. Each case consists of multiple positions, and each position consists of multiple barium esophagram images. We propose a majority voting method that classifies each position from its image-level results: a position is considered positive if 50% or more of its images are classified as positive, and negative otherwise. Finally, a patient is classified as negative if and only if all positions are negative; otherwise, the patient is classified as positive. For positive cases, the extracted RoIs are the lesions.
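The image-, position-, and patient-level decision rules above can be written in a few lines. A minimal sketch, assuming image-level labels have already been produced by the detection and classification networks; the data layout and position names are hypothetical.

```python
def classify_patient(case):
    """Patient-level decision by majority voting over positions.

    `case` maps each position name to a list of image-level labels
    (True = positive); the structure is an illustrative assumption.
    """
    def position_positive(image_labels):
        # A position is positive when >= 50% of its images are positive.
        return sum(image_labels) >= 0.5 * len(image_labels)

    # A patient is negative if and only if every position is negative.
    return any(position_positive(labels) for labels in case.values())

# Example: two positions; the right anterior oblique view votes positive.
case = {"anteroposterior": [False, False, True],
        "right_anterior_oblique": [True, True, False]}
print(classify_patient(case))  # True -> positive patient
```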
2.3 Detection network
As shown in Figure 2, the FRDF (Faster R-CNN detection network with Deformable FPN) is mainly composed of a Deformable ResNet50 Backbone Network, a Feature Pyramid Network, a Region Proposal Network, a MultiScale-RoIAlign Layer, and a Fast R-CNN Predictor Network. The barium esophagram images are input into the Deformable ResNet50 Backbone Network to extract features. The Feature Pyramid Network fuses features from different layers, and the Region Proposal Network generates region proposals from these features. For region proposals of different sizes, the MultiScale-RoIAlign Layer maps their feature vectors to the same size. Finally, the feature vectors of these region proposals are input into the Fast R-CNN Predictor Network, which outputs RoIs and their confidence scores. Each improved component is introduced below.

2.3.1 Deformable ResNet50 Backbone Network
In the FRDF, the backbone network extracts the features. The many convolution layers of ResNet50 give it strong feature extraction ability, but ResNet50 remains limited on these images because the fixed geometric structure of its convolution layers cannot adapt to esophageal deformation. Therefore, a Deformable ResNet50 Backbone Network is proposed, as shown in Figure 3a. It replaces the residual bottleneck blocks in the last convolutional group of ResNet50 with the Deformable residual bottleneck block shown in Figure 3b, whose 3×3 deformable convolutional layer is shown in Figure 3c.
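Figure 3 is not reproduced here, but the block described, a bottleneck whose 3×3 convolution is deformable, can be sketched with torchvision's DeformConv2d. This is a sketch under stated assumptions, not the authors' exact configuration: the channel sizes (2048→512→2048, matching a standard ResNet50 Conv5.x block) and the zero-initialized offset branch are assumptions.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableBottleneck(nn.Module):
    """1x1 -> deformable 3x3 -> 1x1 bottleneck with an identity shortcut."""

    def __init__(self, channels=2048, mid=512):
        super().__init__()
        self.reduce = nn.Sequential(nn.Conv2d(channels, mid, 1, bias=False),
                                    nn.BatchNorm2d(mid), nn.ReLU(inplace=True))
        # A plain 3x3 conv predicts the 2*3*3 sampling offsets per location;
        # zero init starts from the regular (non-deformed) sampling grid.
        self.offset = nn.Conv2d(mid, 2 * 3 * 3, 3, padding=1)
        nn.init.zeros_(self.offset.weight)
        nn.init.zeros_(self.offset.bias)
        self.deform = DeformConv2d(mid, mid, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(mid)
        self.expand = nn.Sequential(nn.Conv2d(mid, channels, 1, bias=False),
                                    nn.BatchNorm2d(channels))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.reduce(x)
        out = self.relu(self.bn2(self.deform(out, self.offset(out))))
        out = self.expand(out)
        return self.relu(out + x)  # residual connection

x = torch.randn(1, 2048, 16, 16)
print(DeformableBottleneck()(x).shape)  # torch.Size([1, 2048, 16, 16])
```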
2.3.2 Feature Pyramid Network
The Feature Pyramid Network (FPN) is shown in Figure 4. It merges convolution layers to enhance semantic features: the features of the last convolutional group (Deformable_conv5.x in Figure 4) are fused into the Conv2.x, Conv3.x, and Conv4.x groups, so these groups also contain features that have been adaptively sampled by Deformable_conv5.x. Replacing the residual bottleneck blocks in these groups with deformable ones would therefore be redundant, so this work makes the replacement only in the Conv5.x group. Specifically, the feature map from the last convolution layer of each convolutional group has its channels adjusted by a 1×1 convolution layer. Each lower-resolution feature map is then upsampled by nearest-neighbor interpolation and merged into the next higher-resolution feature map by element-wise addition. Finally, a 3×3 convolution layer generates the feature prediction layers (P3, P2, P1, R0), and P4 is generated from P3 by max pooling. The finest layer, R0, would generate very small anchors, which in turn would yield very small region proposals in the RPN. Since no such small lesions exist in the images, these useless proposals would only increase the computational cost. R0 is therefore used only for feature mapping of some region proposals rather than as a predictive feature layer.

2.3.3 MultiScale-RoIAlign Layer
The purpose of the MultiScale-RoIAlign Layer, shown in Figure 5, is to map feature vectors between the feature maps and the region proposals and then scale these vectors to the same size. Each region proposal mapped onto a feature map is divided into 7×7 bins. In this work, we find that the best performance is obtained with four sampling points, so each bin is divided evenly into 2×2 regions with no rounding. Bilinear interpolation computes the value at the center of each region, and the mean of the four center values is taken as the feature value of the bin. The same processing is applied to every bin, and the bin values are concatenated into the fixed-length feature vector of the region proposal.
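torchvision ships a MultiScaleRoIAlign operator that matches this description: output_size=7 yields the 7×7 bins, and sampling_ratio=2 places a 2×2 grid of bilinear sampling points (four points) in each bin and averages them. A minimal usage sketch; the feature-map names, channel count, and image size are placeholders, not values from the paper.

```python
import torch
from torchvision.ops import MultiScaleRoIAlign

# 7x7 bins; sampling_ratio=2 places 2x2 = 4 bilinear sampling points per bin.
roi_align = MultiScaleRoIAlign(featmap_names=["P1", "P2"], output_size=7,
                               sampling_ratio=2)

# Two pyramid levels for one 512x512 image (channel count is an assumption).
feats = {"P1": torch.randn(1, 256, 128, 128), "P2": torch.randn(1, 256, 64, 64)}
boxes = [torch.tensor([[30.0, 40.0, 200.0, 360.0]])]  # one proposal, xyxy
out = roi_align(feats, boxes, image_shapes=[(512, 512)])
print(out.shape)  # torch.Size([1, 256, 7, 7]) -- one fixed-size feature vector
```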
2.4 Barium esophagram classification network with attention module
In the FRDF, the feature vectors used for locating and classifying RoIs are both extracted by the Deformable ResNet50 Backbone Network. However, this backbone is deep and better suited to locating RoIs than to classifying them. Therefore, this paper uses an independent classification network to classify RoIs. State-of-the-art classification networks (MobileNetV3 [15], EfficientNetV2 [16], etc.) are usually deep, complex, and designed for natural images. We propose a novel classification network for RoIs, shown in Figure 6. The network is mainly composed of three blocks, one attention module, and two linear layers. Block1, block2, and block3 consist of two 3×3×64, two 3×3×128, and three 3×3×256 convolutional layers, respectively. These blocks extract features from the RoIs, the attention module enhances these features, and the two linear layers output the classification result.

The attention module CBAM combines a channel attention module and a spatial attention module in order. The channel attention module first performs average-pooling and max-pooling on the feature map F to extract its channel information and then generates the corresponding channel attention maps through a shared multi-layer perceptron (MLP). The two maps are added, and a sigmoid activation generates the channel attention Mc. The spatial attention module first applies average-pooling and max-pooling to the input F along the channel dimension and then uses a 7×7 convolutional layer to extract spatial context information. A sigmoid activation then generates the spatial attention Ms. The output of CBAM has the same height, width, and number of channels as the input F.
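A minimal sketch of the CBAM module just described. The reduction ratio of the shared MLP is not given in the text, so r=16, the value from the original CBAM paper, is assumed.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Channel attention followed by spatial attention, as described above."""

    def __init__(self, channels, r=16):  # r=16 is an assumed reduction ratio
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(channels, channels // r),
                                 nn.ReLU(inplace=True),
                                 nn.Linear(channels // r, channels))
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        b, c, _, _ = x.shape
        # Channel attention Mc: shared MLP over avg- and max-pooled descriptors.
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)
        # Spatial attention Ms: 7x7 conv over channel-wise avg and max maps.
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))

f = torch.randn(2, 256, 14, 14)
print(CBAM(256)(f).shape)  # torch.Size([2, 256, 14, 14]) -- same shape as F
```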
3. RESULTS

For the detection network, 5279 images with annotated esophageal cancer lesions from 160 patients are used as the training set, and 1166 images with annotated lesions from 40 patients are used as the testing set. The data details of the detection and classification networks are shown in Table 1 and Table 2, respectively.

Table 1. Data details of the detection network

Table 2. Data details of the classification network

This research is implemented in PyTorch on an NVIDIA GeForce RTX 2080 Ti GPU with 11 GB of onboard memory. The proposed classification network is first trained for 15 epochs on ImageNet2012 and then fine-tuned on the RoI dataset for 50 epochs. The optimizer is Adam (learning rate = 0.005, betas = (0.9, 0.999), weight decay = 0), and a "Step Learning Rate" schedule is used to adjust the learning rate. Since the classification network is small and shows no large fluctuations early in training, no "warm up" strategy is used.
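The optimizer settings above translate directly into PyTorch. A minimal sketch: the model here is a stand-in, and the step size and decay factor of the "Step Learning Rate" schedule are assumptions, as the text does not report them.

```python
import torch
import torch.nn as nn

# Stand-in classifier; the real BEAM architecture is described in Section 2.4.
model = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 2))

# Adam with the settings reported in the text.
optimizer = torch.optim.Adam(model.parameters(), lr=0.005,
                             betas=(0.9, 0.999), weight_decay=0)
# "Step Learning Rate": multiply the LR by `gamma` every `step_size` epochs;
# step_size=10 and gamma=0.1 are assumptions, not reported values.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

for epoch in range(50):   # the 50-epoch fine-tuning stage from the text
    # ... one pass over the RoI dataset would go here ...
    optimizer.step()      # placeholder for the usual training step
    scheduler.step()      # decay the learning rate on schedule
```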
3.1 Ablation study
We conduct ablation experiments on each improved module of the FRDF to verify its impact on performance. Table 3 shows that introducing the ResNet50+FPN module improves the overall performance of the detection network, because ResNet50 enlarges the receptive field and the FPN overcomes the difficulty of detecting small lesions. With the Deformable ResNet50+FPN module, the overall detection ability of the model improves greatly, because the Deformable ResNet50 backbone samples the irregular esophageal area adaptively. At IoU thresholds of 0.3 and 0.5, the AP of the model increases most markedly, by 5.19 and 6.16, respectively. These results show that the proposed Deformable ResNet50 Backbone Network substantially improves the performance of the FRDF. Using MultiScale-RoIAlign further improves the detection model, because the feature vectors of the region proposals are extracted by RoIAlign, which involves no rounding during the mapping process.

Table 3. Ablation experiments of the detection network

Ablation experiments on the classification network are conducted to explore the impact of the attention module on performance and to compare against state-of-the-art networks. Four networks are listed in Table 4: MobileNetV3, EfficientNetV2, our classification network without CBAM, and our classification network with CBAM. Comparing the third and fourth rows, adding CBAM increases the parameter count by only 0.02M while the AUC increases by 0.8 and all performance metrics improve significantly. Compared with the state-of-the-art MobileNetV3 and EfficientNetV2, our network has fewer parameters and surpasses them on many performance metrics.

Table 4. Ablation experiments of the classification network and comparison with state-of-the-art networks
3.2 Patient-level results
In the patient-level analysis, the sensitivity and specificity of the proposed algorithm on 40 positive cases and 53 negative cases are 100% (40/40) and 84.91% (45/53), respectively; that is, 40 of the 40 esophageal cancer patients and 45 of the 53 patients without esophageal cancer are identified correctly. For the 40 patients with esophageal cancer, at an IoU threshold of 0.3 the precision, recall, and AP of the FRDF are 88.51%, 87.91%, and 79.09, respectively. In other words, under this loose detection condition, 1158 lesions (1025 true positives and 133 false positives) are detected against the 1168 gold-standard lesions labeled by the radiologist.

4. DISCUSSION

High-resolution barium esophagram images need a deep network to extract features and locate RoIs well, whereas low-resolution RoIs need a shallow network for accurate classification, and the depth of a network should be proportional to the size of the dataset [17]. Yet the FRDF extracts features for both locating and classifying RoIs with the same Deformable ResNet50 Backbone Network. Therefore, this paper proposes a deep learning system containing a detection network and an independent classification network. The regression subnetwork of the detection network locates RoIs, and its classification subnetwork eliminates low-confidence RoIs; the filtered RoIs are then classified by the independent classification network. On this basis, a majority voting method identifies positive cases. For the detection network, we introduce the ResNet50+FPN module to better detect small lesions, incorporate Deformable Convolutional Networks into the ResNet50 backbone to adaptively sample the deformed esophageal area, and extract region-proposal feature vectors with RoIAlign, which involves no rounding, to avoid losing feature information during mapping. For the classification network, we introduce CBAM to enhance the salience of the esophageal region; compared with EfficientNetV2 and MobileNetV3, our network has far fewer parameters and markedly better performance. Finally, the proposed algorithm is evaluated comprehensively at the patient level: all 40 of 40 patients with esophageal cancer and 45 of 53 patients without it are correctly classified. On the 1166 images from the 40 esophageal cancer cases, the AP of the FRDF is 63.30, which is 6.59 higher than the 56.71 of Faster R-CNN. The experiments show that the proposed algorithm achieves promising results on the barium esophagram dataset.

5. CONCLUSION

The proposed detection network FRDF and the independent classification network BEAM are well suited to detecting esophageal cancer in barium esophagrams: FRDF detects RoIs and BEAM classifies them. The proposed method may also be applicable to other diseases of the esophagus. The Transformer is currently one of the hot spots of computer vision research [18-20], and its excellent modeling ability makes it perform well in vision tasks. However, Transformer training often requires large amounts of data, so its effectiveness on medical datasets remains to be studied.
In the future, we will try to combine the Transformer with CNNs and propose a method better suited to medical imaging datasets to achieve higher performance.

REFERENCES
[1] Shah, M. A., et al., "Improving outcomes in patients with oesophageal cancer," Nature Reviews Clinical Oncology 20(6), 390–407 (2023). https://doi.org/10.1038/s41571-023-00757-y
[2] Thrift, A. P., "Global burden and epidemiology of Barrett oesophagus and oesophageal cancer," Nature Reviews Gastroenterology & Hepatology 18(6), 432–443 (2021). https://doi.org/10.1038/s41575-021-00419-3
[3] Zambito, G., et al., "Is barium esophagram enough? Comparison of esophageal motility found on barium esophagram to high resolution manometry," The American Journal of Surgery 221(3), 575–577 (2021). https://doi.org/10.1016/j.amjsurg.2020.11.028
[4] Sanaka, M. R., et al., "Clinical success and correlation of Eckardt scores with barium esophagram after peroral endoscopic myotomy in achalasia," Journal of Gastrointestinal Surgery 25(1), 278–281 (2021). https://doi.org/10.1007/s11605-020-04763-8
[5] Zhang, P., et al., "Development of a deep learning system to detect esophageal cancer by barium esophagram," Frontiers in Oncology 12, 766243 (2022). https://doi.org/10.3389/fonc.2022.766243
[6] Hawkins, D., et al., "Dysphagia evaluation: the added value of concurrent MBS and esophagram," The Laryngoscope 131(12), 2666–2670 (2021). https://doi.org/10.1002/lary.v131.12
[7] Zambito, G., et al., "Is barium esophagram enough? Comparison of esophageal motility found on barium esophagram to high resolution manometry," The American Journal of Surgery 221(3), 575–577 (2021). https://doi.org/10.1016/j.amjsurg.2020.11.028
[8] DeWitt, J. M., et al., "Evaluation of timed barium esophagram after per-oral endoscopic myotomy to predict clinical response," Endoscopy International Open 9(11), E1692–E1701 (2021). https://doi.org/10.1055/a-1546-8415
[9] Van der Sommen, F., et al., "Supportive automatic annotation of early esophageal cancer using local Gabor and color features," Neurocomputing 144, 92–106 (2014). https://doi.org/10.1016/j.neucom.2014.02.066
[10] Van der Sommen, F., et al., "Computer-aided detection of early neoplastic lesions in Barrett's esophagus," Endoscopy 48(7), 617–624 (2016). https://doi.org/10.1055/s-00000012
[11] Girshick, R., et al., "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2014).
[12] He, K., et al., "Spatial pyramid pooling in deep convolutional networks for visual recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence 37(9), 1904–1916 (2015). https://doi.org/10.1109/TPAMI.2015.2389824
[13] Girshick, R., "Fast R-CNN," in Proceedings of the IEEE International Conference on Computer Vision (2015).
[14] Ren, S., et al., "Faster R-CNN: towards real-time object detection with region proposal networks," Advances in Neural Information Processing Systems 28 (2015).
[15] Howard, A., et al., "Searching for MobileNetV3," in 2019 IEEE/CVF International Conference on Computer Vision (ICCV), IEEE (2020).
[16] Tan, M. and Le, Q. V., "EfficientNetV2: smaller models and faster training," in International Conference on Machine Learning (2021).
[17] Cheng, S., Shang, G. and Zhang, L., in Tenth International Conference on Graphics and Image Processing (2019).
[18] Dosovitskiy, A., et al., "An image is worth 16x16 words: Transformers for image recognition at scale," in International Conference on Learning Representations (2021).
[19] Vaswani, A., Shazeer, N., Parmar, N., et al., "Attention is all you need," https://arxiv.org/abs/1706.03762 (2017).
[20] Carion, N., et al., "End-to-end object detection with Transformers," in European Conference on Computer Vision (2020).