1. INTRODUCTION

Image semantic segmentation evolved from traditional image segmentation methods. Before deep learning was adopted, research had already explored traditional unsupervised segmentation methods such as Otsu thresholding, FCM and N-Cut segmentation [1]. However, these methods only extract low-level image features, so the resulting segmentation carries no semantic information. With the growing application of deep learning in computer vision, image segmentation has shifted its focus towards semantic information, giving rise to image semantic segmentation. It is a crucial task in computer vision and the key to image understanding, allowing machines to gain a deeper understanding of the environment depicted in an image. Image semantic segmentation is a dense prediction task that requires pixel-by-pixel classification, with different categories distinguished by different colors. Semantic segmentation has extensive applications in fields such as geographic information systems [2], autonomous vehicle driving [3], and medical image analysis [4]. Nevertheless, the field still faces several challenges, including inadequate segmentation accuracy, loss of small-scale targets, and segmentation discontinuity. Deep learning based image semantic segmentation models are usually built on encoder-decoder (En-decoder) structures combined with atrous convolutions and pyramid modules, such as FCN [5], SegNet [6], UNet [7] and the DeepLab family [8-10]. However, a Convolutional Neural Network (CNN) can only extract local features, and its receptive field is limited to the range of the convolutional kernel size; this property limits the accuracy of image semantic segmentation. To address this problem, this paper proposes to incorporate the CBAM [11] attention module into the skip connections of the En-decoder model. This approach strengthens the extraction of local features by the CNN and combines them with global information, and recalibrating the features through the CBAM attention module effectively enhances the flow of information in the network.

The attention mechanism originates from the study of human vision, in which the retina tends to select a specific part of the visual field and focus on it in order to make rational use of visual processing resources. The primary objective of applying attention mechanisms to visual tasks is to selectively focus on an object or a part of it. The mechanism consists of two main stages: the first determines the crucial part of the input, and the second allocates the limited visual resources to that part. In computer vision, attention mechanisms can serve as independent modules in neural networks, selecting or assigning weights to input components. While traditional image processing techniques such as feature extraction, saliency detection, and sliding-window methods resemble attention mechanisms, their localized focus renders them inadequate for image semantic segmentation. Some representative works apply spatial attention, channel attention, or a combination of the two to CNNs to enhance the feature representation of the network without significantly increasing parameters and computation. For instance, SEBlock [12] is a plug-and-play substructure that constructs a channel attention module in two steps, Squeeze and Excitation.
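To make these two steps concrete, the following is a minimal Keras-style sketch of an SE-style channel attention block. The reduction ratio and layer arrangement are illustrative assumptions rather than the exact configuration used in [12].

```python
import tensorflow as tf
from tensorflow.keras import layers

def se_block(x, reduction=16):
    """Squeeze-and-Excitation channel attention (illustrative sketch).

    Squeeze: global average pooling collapses each channel to a single value.
    Excitation: a two-layer bottleneck MLP produces per-channel weights
    that rescale the input feature map.
    """
    channels = x.shape[-1]
    s = layers.GlobalAveragePooling2D()(x)                      # squeeze -> (B, C)
    e = layers.Dense(channels // reduction, activation="relu")(s)
    e = layers.Dense(channels, activation="sigmoid")(e)         # excitation -> (B, C)
    e = layers.Reshape((1, 1, channels))(e)
    return layers.Multiply()([x, e])                            # channel-wise recalibration
```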
While SEBlock enhances the classification effect, it unavoidably increases the number of parameters and the amount of computation. It is not a complete network structure but can be incorporated into classification and detection models. In contrast, the CBAM module used in this paper is a lightweight, plug-and-play module that improves the network's feature representation by modeling the relationships among individual channels and spatial locations.

2. RELATED WORK

2.1 En-decoder Network Model

In 2015, Ronneberger et al. [7] proposed the U-shaped En-decoder structure UNet, which fuses high-level semantic features with the corresponding low-level detail features from the encoder, effectively preserving the spatial information of the image. The UNet used in this paper differs from the original: the original UNet requires cropping of feature maps, whereas the version used here does not, which reduces the loss of features; in addition, its encoder adopts the pre-trained VGG16 network, and the arrangement of its En-decoder modules differs from the original UNet, making it a novel network suited to the task of this paper. As illustrated in figure 1, the network architecture follows the design principles of UNet, with a contracting path to gather contextual information and a symmetric expanding path to localize features accurately. The decoder comprises four blocks, each consisting of an Upsampling layer, a Fusion layer, a Convolutional layer, and a BN layer. The Upsampling layer restores the resolution of the feature map, and the Fusion layer concatenates the up-sampled feature maps with the max-pooled feature maps of the corresponding encoder block. Four such operations are executed in sequence until the feature map matches the resolution of the input image, followed by several convolutional and BN layers, an activation layer, a Reshape layer, and finally a softmax layer that produces the segmentation map. We call this network Enhanced UNet (EUNet); a minimal sketch of one decoder block is given below.
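The following Keras-style sketch illustrates one such decoder block (upsampling, fusion by concatenation with the corresponding encoder feature map, convolution and batch normalization). Filter counts and kernel sizes are assumptions for illustration; the paper does not specify them.

```python
import tensorflow as tf
from tensorflow.keras import layers

def decoder_block(x, skip, filters):
    """One EUNet-style decoder block (illustrative sketch).

    x       : feature map from the previous decoder stage
    skip    : feature map from the corresponding encoder block (skip connection)
    filters : number of output channels (assumed; not specified in the paper)
    """
    x = layers.UpSampling2D(size=(2, 2))(x)                 # restore resolution
    x = layers.Concatenate()([x, skip])                     # fusion layer
    x = layers.Conv2D(filters, 3, padding="same")(x)        # convolutional layer
    x = layers.BatchNormalization()(x)                      # BN layer
    x = layers.Activation("relu")(x)
    return x
```

In CBAMUNet (Section 3), the skip tensor would additionally pass through a CBAM module before the concatenation.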
2.2 Attention Mechanisms

Attention mechanisms were originally introduced in the field of computer vision. In 2014, Mnih et al. [13] proposed a distinctive recurrent neural network model that extracts features from images or videos by selectively choosing a succession of regions or locations and then processing the selected regions in high detail. In 2015, Bahdanau et al. [14] introduced the attention mechanism to Natural Language Processing (NLP), combining a sequence-to-sequence model with attention for machine translation and significantly improving its effectiveness. In 2017, the Google machine translation team [15] addressed sequence modeling using the attention mechanism alone; they presented two attention mechanisms, departed from the usual pattern of combining an En-decoder model with attention, and obtained strong results. Subsequently, the attention mechanism was also applied to image semantic segmentation. Hu et al. [16] proposed ACNet to selectively collect features from RGB and depth branches; its main contributions are the Attention Complementary Module (ACM) and an architecture with three parallel branches, which make it possible to exploit more high-quality features from different channels.

Attention mechanisms have been effective in various tasks, such as sequence learning [17], image localization and comprehension [18], image captioning [19], and lip reading [20]. In these applications, attention can be integrated into one or more layers as a separate module, providing a high-level abstraction between modalities. In 2018, Oktay et al. [21] proposed Attention UNet, which builds on UNet and places attention gates at the end of each skip connection to recalibrate the output features of the encoder. Chattopadhyay et al. [22] proposed MsAUNet (Multi-scale Attention UNet) for scene segmentation; it feeds the encoder features and the output of a pyramid pooling layer into attention gates, concatenates the result with the up-sampled output of the previous pyramid pooling layer, and passes the combination to the next layer, allowing the network to extract local features and their global context with higher precision. Thomas et al. [23] proposed Multi-Res-Attention UNet, which embeds an Attention Gating Block in the skip connections of UNet for effective segmentation of MRI images; the block is driven by an attention gating signal (GS) and enables UNet to segment targets of different shapes and sizes by suppressing features in irrelevant regions of the input image while highlighting salient features useful for the task, preventing the loss of fine features between intermediate and final layers.

3. CBAMUNET MODEL

Shallow features contain inaccurate semantic information, and directly superimposing deep and shallow features generates a large amount of noise, lowering the segmentation accuracy of the model. This paper applies an attention mechanism to assign pixel-level weights to shallow features, reducing the output of inaccurate semantic information and focusing on vital features while suppressing unnecessary ones [24]. In addition, integrating the channel attention mechanism (CAM) and the spatial attention mechanism (SAM) into the UNet skip connections enhances the feature expressive power. Since convolutional operations extract features by mixing cross-channel and spatial information, we utilize the CBAM module to accentuate important features in both the spatial and channel dimensions.

3.1 CBAMUNet Overall Architecture

The structure of the CBAMUNet model is illustrated in figure 2 (left). It is derived from the Enhanced UNet network by adding CBAM modules to the skip connections. The network is composed of four coding blocks, four decoding blocks, and four CBAM modules. On one hand, the output of coding block i serves as the input of coding block i+1; on the other hand, it is concatenated with the output of decoder i-1 as the input of decoder i. A CBAM module is incorporated into each encoder-decoder skip connection, applying the channel and spatial attention modules in sequence. This enables each branch of the skip connection to promote the effective flow of information within the network by emphasizing or suppressing "content" and "location" information along the channel and spatial axes, thereby improving the feature representation of the network. From a spatial perspective, channel attention focuses on global information while spatial attention focuses on local information.
Therefore, the two attention modules can be applied serially to construct a three-dimensional (3D) attention map. This model alleviates, to some extent, the problem of a single target being segmented into different parts, improves the segmentation quality and raises the mIoU.

3.2 CBAM Attention Mechanism

The Convolutional Block Attention Module (CBAM) is a simple but effective attention module for feed-forward CNNs. Given an intermediate feature map, the module sequentially infers attention maps along two separate dimensions (channel and spatial) and then adaptively refines the features by multiplying the attention maps with the input feature map. CBAM is a lightweight, general-purpose module that can be embedded into any CNN architecture with little additional overhead and trained end-to-end with the underlying CNN. As presented in figure 2 (left), the input to CBAM is an intermediate feature map F ∈ R^(C×H×W), from which a one-dimensional channel attention map Mc ∈ R^(C×1×1) and a two-dimensional spatial attention map Ms ∈ R^(1×H×W) are sequentially derived. The overall attention process can be summarized as

F′ = Mc(F) ⊗ F
F″ = Ms(F′) ⊗ F′

where ⊗ denotes element-wise multiplication; the channel attention values are broadcast along the spatial dimension, and the spatial attention values are broadcast along the channel dimension.

3.3 Channel Attention Module and Spatial Attention Module

As shown in figure 2 (upper right), the Channel Attention Module (CAM) exploits the inter-channel relationships of the features to produce a channel attention map. Each channel of the feature map acts as a distinct feature detector, and introducing attention over the channels enhances the CNN's capacity to represent features, emphasizing the more valuable ones. The channel attention is computed as

Mc(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F)))

where σ is the sigmoid function, AvgPool and MaxPool are global average and max pooling over the spatial dimensions, and MLP is a shared two-layer perceptron. As illustrated in figure 2 (lower right), the Spatial Attention Module (SAM) complements the Channel Attention Module. To compute the spatial attention, we first apply average and max pooling along the channel axis to produce compact feature descriptors; after concatenating the two descriptors, a convolutional layer produces the spatial attention map, which encodes the positions to emphasize or suppress and discerns the most significant portions of the feature map. The spatial attention is computed as

Ms(F) = σ(f^(7×7)([AvgPool(F); MaxPool(F)]))

where f^(7×7) denotes a convolution with a 7×7 kernel and [·;·] denotes concatenation along the channel axis. Max pooling and average pooling appear in both attention modules: max pooling encodes salient information while average pooling encodes global statistics, and connecting the channel and spatial attention maps in series works better than connecting them in parallel.
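To make the serial arrangement concrete, the following Keras-style sketch implements channel attention followed by spatial attention as described above. The 7×7 kernel and the shared MLP follow the CBAM design [11]; the reduction ratio and other details are illustrative assumptions rather than the authors' exact implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

def cbam_block(x, reduction=16, kernel_size=7):
    """CBAM: channel attention followed by spatial attention (illustrative sketch)."""
    channels = x.shape[-1]

    # Channel attention: Mc(F) = sigmoid(MLP(AvgPool(F)) + MLP(MaxPool(F)))
    mlp_hidden = layers.Dense(channels // reduction, activation="relu")
    mlp_out = layers.Dense(channels)
    avg = mlp_out(mlp_hidden(layers.GlobalAveragePooling2D()(x)))
    mx = mlp_out(mlp_hidden(layers.GlobalMaxPooling2D()(x)))
    mc = layers.Activation("sigmoid")(layers.Add()([avg, mx]))
    mc = layers.Reshape((1, 1, channels))(mc)
    x = layers.Multiply()([x, mc])                           # F' = Mc(F) * F

    # Spatial attention: Ms(F') = sigmoid(conv7x7([AvgPool(F'); MaxPool(F')]))
    avg_sp = layers.Lambda(lambda t: tf.reduce_mean(t, axis=-1, keepdims=True))(x)
    max_sp = layers.Lambda(lambda t: tf.reduce_max(t, axis=-1, keepdims=True))(x)
    ms = layers.Conv2D(1, kernel_size, padding="same", activation="sigmoid")(
        layers.Concatenate(axis=-1)([avg_sp, max_sp]))
    return layers.Multiply()([x, ms])                        # F'' = Ms(F') * F'
```

In CBAMUNet, such a block would be applied to each encoder output on the skip connection before it is concatenated with the up-sampled decoder features.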
4. EXPERIMENTAL PROCEDURE AND RESULTS ANALYSIS

4.1 Experimental Procedure

Datasets. We use the processed CamVid [29] dataset, divided into 367 training images, 100 validation images, and 233 testing images. The image resolution is 360×480, and the original 32 semantic categories are merged into 12 categories including background: tree, sky, building, car, light pole, road, sidewalk, pedestrian, fence, traffic light, cyclist, and background. In our experiments the images are cropped to 320×320, and the other settings of the dataset are kept unchanged. The form of the image semantic segmentation dataset is shown in figure 3, which displays four columns: the original image, the label map, the visualized label map, and the visualized prediction map. For image semantic segmentation, each pixel is labeled with an integer from 0 to the number of categories. A label map of this kind appears almost entirely black and grey, and as figure 3 shows, it conveys little visual information about the targets. For better visual clarity, each category in the experiments is therefore rendered in a specific color; in the visualized label map in figure 3, light poles are depicted in dark blue, trees in light blue, and so on. The network is trained on a large number of labeled training images, after which the trained model predicts the category of every pixel. The visualized prediction map in figure 3 presents the segmentation results with the same color scheme, and the experimental result figures in the following sections use the same presentation format.

Figure 3. The first column is the original image, the second column is the label map, the third column is the visualized label map, and the fourth column is the predicted map.

Loss Functions. Our experiments show that the customized composite loss function (custom_loss) segments small targets more effectively than the cross-entropy loss alone. The custom_loss used in the semantic segmentation network combines the categorical cross-entropy loss (categorical_crossentropy) with the generalized Dice loss (generalized_dice_loss), defined as

categorical_crossentropy = −(1/N) Σ_i Σ_j y_ij log(p_ij)

generalized_dice_loss = 1 − 2 (Σ_j w_j Σ_i y_ij p_ij) / (Σ_j w_j Σ_i (y_ij + p_ij))

where categorical_crossentropy is the cross-entropy loss mentioned above, N is the number of pixels in the image, M is the number of semantic categories (i = 1, ..., N indexes pixels and j = 1, ..., M indexes categories), y_ij is the true value, p_ij is the predicted probability, and w_j is the weight assigned to category j.

Evaluation Function. Two evaluation metrics are used in the experiments of this paper: mean Intersection-over-Union (mIoU) and Pixel Accuracy (PA) [28].
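Both metrics can be computed from a per-class confusion matrix. The following numpy sketch follows the standard definitions; the exact implementation used in the experiments is not specified, so this is illustrative only.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """Accumulate a num_classes x num_classes confusion matrix from label maps."""
    mask = (y_true >= 0) & (y_true < num_classes)
    idx = num_classes * y_true[mask].astype(int) + y_pred[mask].astype(int)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def pixel_accuracy(cm):
    """PA: correctly classified pixels divided by all pixels."""
    return np.diag(cm).sum() / cm.sum()

def mean_iou(cm):
    """mIoU: mean over classes of intersection / union."""
    intersection = np.diag(cm)
    union = cm.sum(axis=0) + cm.sum(axis=1) - intersection
    iou = intersection / np.maximum(union, 1)   # guard against empty classes
    return iou.mean()
```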
Data Augmentation. Deep learning methods rely on large amounts of data, but our training set contains only 367 images. Data augmentation is an effective way to expand the data volume and enrich the sample information so that the data can meet the needs of the model. The most common approach is to transform the original data with hand-crafted rules to enlarge the training set. The experiments in this paper use the Python Augmentor library to expand the dataset to 2202 images, mainly using seven kinds of transformations: image flipping, rotation, translation, brightness adjustment, chroma adjustment, contrast adjustment, and sharpness adjustment, as shown in figure 4.
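As a rough illustration of this step, the sketch below builds an Augmentor pipeline with a subset of the transformations listed above (flipping, rotation, brightness, chroma and contrast adjustment). The directory paths, probabilities and factor ranges are placeholders and assumptions; translation and sharpness adjustment would require additional handling not shown here.

```python
import Augmentor

# Source images and their pixel-level label maps (placeholder paths).
p = Augmentor.Pipeline("CamVid/train/images")
p.ground_truth("CamVid/train/labels")   # apply identical transforms to the label maps

# A subset of the augmentation rules described above (parameters are assumed).
p.flip_left_right(probability=0.5)
p.rotate(probability=0.5, max_left_rotation=10, max_right_rotation=10)
p.random_brightness(probability=0.3, min_factor=0.7, max_factor=1.3)
p.random_color(probability=0.3, min_factor=0.7, max_factor=1.3)       # chroma adjustment
p.random_contrast(probability=0.3, min_factor=0.7, max_factor=1.3)

# Generate the augmented training set (expanded to 2202 images in the paper).
p.sample(2202)
```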
4.2 Experimental Results and Analysis

Before Data Augmentation. We validate the effectiveness of the CBAM module on the CamVid dataset by running ablation experiments on the CBAMUNet network and on the improved UNet network, where the improved UNet model is referred to as the EUNet model. From the overall performance metric mIoU in table 1, the mIoU of CBAMUNet (the UNet model with CBAM embedded) improves by 2.1% on the testing set and 0.9% on the training set compared to the EUNet model, which verifies the effectiveness of the CBAM module for the UNet model.

Table 1. Quantitative evaluation results of the CBAMUNet and EUNet models on the testing and training sets.

Table 2 shows the per-category segmentation results of EUNet, CBAMUNet and other models on the CamVid testing set. As the table shows, EUNet's mIoU is better than that of the previous models for sky, building, car, road, sidewalk and traffic light, but lower for tree, pedestrian, and fence.

Table 2. Segmentation results of the EUNet and CBAMUNet models and other models for each category on the testing set.
According to figure 5 (left), comparing the last two columns shows that the CBAMUNet segmentation does not split the traffic light poles into separate parts in the close-up view; the connection between the street light and its pole is segmented more clearly in the distant view, and more of the pedestrians under the poles are segmented than in the third column (EUNet). Finally, the outlines of the figures are closer to the labeled image. In short, the segmentation results of the CBAMUNet model (fourth column) are superior to those of the EUNet model (third column) in both level of detail and segmentation accuracy. This demonstrates that the CBAM module enhances the stability and completeness of the segmentation of the En-decoder model: incorporating the attention mechanism into the UNet network (CBAMUNet) leads to a significant improvement in segmentation quality, and the CBAM attention module strengthens the CNN's feature extraction capability.

Figure 5. Semantic segmentation results of the CBAMUNet and EUNet models on the testing set (left) and training set (right); from left to right: original image, label map, EUNet segmentation and CBAMUNet segmentation.

The CBAMUNet model performs well on most categories, and the mIoU of trees, cars and light poles improves by 34.7%, 7.3% and 8.9% respectively compared to the EUNet model, from which we conclude that the CBAM module improves the performance of the UNet network in image semantic segmentation. Moreover, the CBAMUNet model achieves the best results among the compared models for the segmentation of trees, cars, sidewalks and light poles, further demonstrating its effectiveness. However, both models still need improvement in segmenting pedestrians and fences. As shown in figure 5 (right), the segmentation results on the training set are roughly similar to those on the testing set. However, table 3 shows that the CBAMUNet model fits the CamVid dataset very well, with the average mIoU on the training set higher than on the testing set; we speculate that, because of the small amount of data, the model suffers from some degree of overfitting, leading to poor generalization. This problem is addressed next using data augmentation, which in turn improves the robustness of the CBAMUNet model.

Table 3. Per-category segmentation results on the testing set for the EUNet, CBAMUNet and AugCBAMUNet models.
After Data Augmentation. As shown in table 3, the CBAMUNet model trained after data augmentation is labeled AugCBAMUNet. Compared to the CBAMUNet model, its mIoU on the segmented target fence increased by 16.9%; we attribute this to the significant increase in training samples containing fences after augmentation. The mIoU of trees decreased noticeably, which can be attributed to deformation of the tree targets during augmentation. Overall, the mIoU of the model after data augmentation is significantly higher than before augmentation. As shown in table 4 (first 3 columns), the PA of the AugCBAMUNet model on the testing set and training set differs by 2.3%, while the PA of the CBAMUNet model on the testing set and training set differs by 8.7%. These results show that the accuracy gap between the training and testing sets shrinks after augmentation, i.e. the generalization ability improves.

Table 4. The PA of the model on the training and testing sets before and after augmentation, together with the categories whose segmentation accuracy improved significantly.
As shown in table 4 (last 5 columns), the segmentation of the five targets fence, building, road, car and pedestrian is significantly improved after enlarging the dataset, with the mIoU of pedestrians improving most noticeably. As shown in table 5, on the training set the mIoU of every target is very high both before and after augmentation, and the mIoU of the augmented model is lower than that of the pre-augmentation model. Since our aim is to reduce overfitting to the training set rather than to fit it even more closely, this result supports our hypothesis.

Table 5. Per-category segmentation results of the CBAMUNet and AugCBAMUNet models on the training set.
As shown in figure 6 (left), the segmentation of the model after augmentation is clearly better than before augmentation. From left to right, the first column is the original image, the second is the label map, the third is the augmented model AugCBAMUNet, and the fourth is the pre-augmentation model CBAMUNet. In the first row, the segmentation of the light poles, walls and pedestrians inside the black box is clearly better in the third column than in the fourth. In the second row, the three pedestrians in the lower-left black box are segmented in the third column, while in the fourth column they are misidentified or missed. In the third row, the upper-right black box shows that the augmented model segments the light poles, buildings and trees noticeably better than the pre-augmentation model. In the fourth row, the region inside the right black box is more cluttered in the fourth column than in the third. Data augmentation therefore clearly improves the segmentation quality of the model.

Figure 6. Segmentation results of the CBAMUNet model on the testing set (left) and training set (right) before and after augmentation.

As can be seen in figure 6 (right), the segmentation results on the training set are excellent both before and after data augmentation, indicating that our model fits the training data very well; the generalization ability of the model improves somewhat after data augmentation, but there is still room for improvement.

5. CONCLUSION

To enhance the semantic segmentation performance of the UNet network, this paper proposes CBAMUNet, which incorporates the CBAM attention module into the skip connections of the En-decoder model UNet in order to improve the CNN's ability to express features. The CBAM module adaptively recalibrates the encoder output feature maps by explicitly modeling channel and spatial dependencies, enhancing the CNN's feature extraction across channels and spatial locations. In addition, the model uses a new loss function, custom_loss, which facilitates the prediction of small targets while retaining the classic and widely used cross-entropy loss. Comparison experiments with EUNet show that the CBAM module improves the level of detail and the accuracy of the UNet model's segmentation; the CBAMUNet model significantly improves image segmentation and raises semantic segmentation accuracy, with an mIoU increase of 2.1% over the model before embedding CBAM. On this basis, the generalization capability of the CBAMUNet model can be further improved by data augmentation. In the future, we will apply better En-decoder networks and attention mechanism modules more suitable for our task to improve segmentation accuracy on the CamVid dataset.

ACKNOWLEDGMENTS

This work was supported in part by a university-level research project of Sanya University, titled "Research on multimodal customer portrait in automobile precision sales" (Project Number: USYJSPY2242).

REFERENCES
[1] Huang, P., Zheng, Q., Liang, C., "Overview of image segmentation methods," Journal of Wuhan University (Science Edition), (2020).
[2] Qiu, C. P., Mou, L. C., Schmitt, M., et al., "Local climate zone-based urban land cover classification from multiseasonal Sentinel-2 images with a recurrent residual network," ISPRS Journal of Photogrammetry and Remote Sensing, 154, 151–162 (2019). https://doi.org/10.1016/j.isprsjprs.2019.05.004
[3] Sun, L., Yang, K. L., Hu, X. X., et al., "Real-time fusion network for RGB-D semantic segmentation incorporating unexpected obstacle detection for road-driving images," IEEE Robotics and Automation Letters, 5(4), 5558–5565 (2020).
[4] Jin, Q. G., Meng, Z. P., Pham, T. D., et al., "DUNet: A deformable network for retinal vessel segmentation," Knowledge-Based Systems, 178, 149–162 (2019). https://doi.org/10.1016/j.knosys.2019.04.025
[5] Long, J., Shelhamer, E., Darrell, T., "Fully convolutional networks for semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3431–3440 (2015).
[6] Badrinarayanan, V., Kendall, A., Cipolla, R., "SegNet: A deep convolutional encoder-decoder architecture for image segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12), 2481–2495 (2017).
[7] Ronneberger, O., Fischer, P., Brox, T., "U-Net: Convolutional networks for biomedical image segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention, 234–241 (2015).
[8] Chen, L. C., Papandreou, G., Kokkinos, I., et al., "Semantic image segmentation with deep convolutional nets and fully connected CRFs," arXiv:1412.7062 (2014).
[9] Chen, L. C., Papandreou, G., Kokkinos, I., et al., "DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs," IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4), 834–848 (2017). https://doi.org/10.1109/TPAMI.2017.2699184
[10] Chen, L. C., Papandreou, G., Schroff, F., et al., "Rethinking atrous convolution for semantic image segmentation," arXiv:1706.05587 (2017).
[11] Woo, S., Park, J., Lee, J. Y., et al., "CBAM: Convolutional block attention module," in Proceedings of the European Conference on Computer Vision (ECCV), 3–19 (2018).
[12] Hu, J., Shen, L., Sun, G., "Squeeze-and-excitation networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7132–7141 (2018).
[13] Mnih, V., Heess, N., Graves, A., et al., "Recurrent models of visual attention," in Advances in Neural Information Processing Systems (2014).
[14] Bahdanau, D., Cho, K., Bengio, Y., "Neural machine translation by jointly learning to align and translate," arXiv:1409.0473v5 (2014).
[15] Vaswani, A., Shazeer, N., Parmar, N., et al., "Attention is all you need," in Advances in Neural Information Processing Systems, 5998–6008 (2017).
[16] Hu, X., Yang, K., Fei, L., et al., "ACNet: Attention based network to exploit complementary features for RGBD semantic segmentation," in 2019 IEEE International Conference on Image Processing (ICIP) (2019).
[17] Miech, A., Laptev, I., Sivic, J., "Learnable pooling with context gating for video classification," arXiv:1706.06905 (2017).
[18] Jaderberg, M., Simonyan, K., Zisserman, A., et al., "Spatial transformer networks," arXiv:1506.02025 (2015).
[19] Xu, K., Ba, J., Kiros, R., et al., "Show, attend and tell: Neural image caption generation with visual attention," in International Conference on Machine Learning, 2048–2057 (2015).
[20] Chung, J. S., Senior, A., Vinyals, O., et al., "Lip reading sentences in the wild," in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 3444–3453 (2017).
[21] Oktay, O., Schlemper, J., Folgoc, L. L., et al., "Attention U-Net: Learning where to look for the pancreas," CoRR, abs/1804.03999 (2018).
[22] Chattopadhyay, S., Basak, H., "Multi-scale attention U-Net (MsAUNet): A modified U-Net architecture for scene segmentation," arXiv:2009.06911 (2020).
[23] Thomas, E., Pawan, S. J., Kumar, S., et al., "Multi-Res-Attention UNet: A CNN model for the segmentation of focal cortical dysplasia lesions from magnetic resonance images," IEEE Journal of Biomedical and Health Informatics (2020).
[24] Gao, X., Li, C. G., "Real-time semantic segmentation of images based on attention and multi-label classification," Journal of Computer Aided Design and Graphics, 33(01), 59–67 (2021). https://doi.org/10.3724/SP.J.1089.2021.18233
[25] Noh, H., Hong, S., Han, B., "Learning deconvolution network for semantic segmentation," in Proceedings of the IEEE International Conference on Computer Vision, 1520–1528 (2015).
[26] Visin, F., Ciccone, M., Romero, A., et al., "ReSeg: A recurrent neural network-based model for semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 41–48 (2016).
[27] Yu, F., Koltun, V., "Multi-scale context aggregation by dilated convolutions," arXiv:1511.07122 (2015).
[28] He, K., Zhang, X., Ren, S., et al., "Spatial pyramid pooling in deep convolutional networks for visual recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(9), 1904–1916 (2015). https://doi.org/10.1109/TPAMI.2015.2389824
[29] Cordts, M., Omran, M., Ramos, S., et al., "The cityscapes dataset for semantic urban scene understanding," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3213–3223 (2016).