High-resolution remote sensing images (HRRSIs) contain rich local spatial information and long-distance location dependence, both of which play an important role in semantic segmentation tasks and have received increasing research attention. However, HRRSIs often exhibit large intraclass variance and small interclass variance due to the diversity and complexity of ground objects, posing great challenges to semantic segmentation. In most networks, small-scale objects are frequently omitted and large-scale objects fragmented in the segmentation results because of insufficient local feature extraction and poor utilization of global information. We propose a network that cascades a convolutional neural network with a global–local attention transformer, called the CNN-transformer cascade network. First, convolution blocks and global–local attention transformer blocks are used to extract multiscale local features and long-range location information, respectively. Then a multilevel channel attention integration block is designed to fuse geometric and semantic features of different depths and to revise the channel weights through a channel attention module, resisting interference from redundant information. Finally, the smoothness of the segmentation is improved by implementing upsampling with a deconvolution operation. We compare our method with several state-of-the-art methods on the ISPRS Vaihingen and Potsdam datasets. Experimental results show that our method improves the integrity and independence of multiscale object segmentation results.
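The abstract's multilevel channel attention integration block revises channel weights via a channel attention module. The paper's exact design is not given here, but the general mechanism can be sketched as a squeeze-and-excitation-style module: global average pooling produces one descriptor per channel, a bottleneck MLP maps it to per-channel weights in (0, 1), and the feature map is rescaled channel-wise. In this NumPy sketch the two weight matrices are random stand-ins for learned parameters:

```python
import numpy as np

def channel_attention(x, reduction=4, rng=None):
    """Squeeze-and-excitation-style channel attention on a (C, H, W) feature map.

    Illustrative only: w1 and w2 are random stand-ins for learned weights,
    and the exact module in the paper may differ.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    c = x.shape[0]
    # Squeeze: global average pooling -> one descriptor per channel.
    z = x.mean(axis=(1, 2))                      # shape (C,)
    # Excitation: bottleneck MLP with ReLU then sigmoid.
    w1 = rng.standard_normal((c // reduction, c))
    w2 = rng.standard_normal((c, c // reduction))
    h = np.maximum(w1 @ z, 0.0)                  # ReLU
    s = 1.0 / (1.0 + np.exp(-(w2 @ h)))          # channel weights in (0, 1)
    # Rescale: reweight each channel of the input to suppress redundant ones.
    return x * s[:, None, None], s

feat = np.random.default_rng(1).standard_normal((8, 4, 4))
out, weights = channel_attention(feat)
```

The sigmoid keeps every channel weight strictly between 0 and 1, so redundant channels are attenuated rather than zeroed out, which is what lets such a block "resist the interference of redundant information" when fusing features of different depths.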
Keywords: Transformers; Image segmentation; Feature extraction; Semantics; Convolution; Remote sensing; Windows