Journal of Frontiers of Computer Science and Technology ›› 2025, Vol. 19 ›› Issue (1): 187-195.DOI: 10.3778/j.issn.1673-9418.2311071

• Graphics·Image •

Semantic Segmentation Algorithm for High Resolution Remote Sensing Images with Dual Encoder

WU Mengke, GAO Xindan   

  1. School of Computer and Control Engineering, Northeast Forestry University, Harbin 150000, China
  • Online: 2025-01-01  Published: 2024-12-31


Abstract: Remote sensing images are characterized by multi-scale objects, complex backgrounds, and imbalanced classes. Semantic segmentation algorithms based on convolutional neural networks (CNNs) struggle to capture the global features of an image, which degrades segmentation quality. To address these problems, this paper exploits the global feature extraction capability of Swin Transformer and proposes DEGFNet (dual encoders and global local transformer feature refinement network), a dual-encoder semantic segmentation algorithm for high-resolution remote sensing images. First, a feature fusion block (FFB) is designed to inject the global features captured by Swin Transformer into the encoder, addressing the challenge of multi-scale objects; meanwhile, a spatial interaction block (SIB) is designed within Swin Transformer to reduce the negative impact of complex background samples. Second, a global local transformer block (GLTB) and a feature refinement block (FRB) are introduced in the decoder to make better use of the information extracted by the encoders and to improve segmentation accuracy. Finally, the model is trained with a hybrid loss function combining cross-entropy loss and Dice loss to mitigate the negative impact of class imbalance. On the Vaihingen dataset, the macro-F1 (mF1), mean intersection over union (mIoU), and overall accuracy (OA) reach 91.9%, 84.8%, and 92.4%, respectively; on the LoveDA dataset, the mIoU reaches 55.0%. Both results demonstrate better semantic segmentation performance and good generalization.
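The cross-entropy/Dice combination mentioned in the abstract is a standard remedy for class imbalance: cross entropy averages over pixels (so large classes dominate), while soft Dice averages over classes. A minimal NumPy sketch is shown below; the equal weighting `alpha=0.5` and the smoothing constant are illustrative assumptions, as the paper's abstract does not state them:

```python
import numpy as np

def cross_entropy_loss(probs, onehot, eps=1e-7):
    # Mean pixel-wise cross entropy over an (N, C) array of class probabilities.
    return float(-np.mean(np.sum(onehot * np.log(probs + eps), axis=1)))

def dice_loss(probs, onehot, smooth=1.0):
    # Soft Dice loss averaged over classes; each class contributes equally,
    # which makes it less sensitive to class imbalance than cross entropy.
    inter = np.sum(probs * onehot, axis=0)
    union = np.sum(probs, axis=0) + np.sum(onehot, axis=0)
    dice = (2.0 * inter + smooth) / (union + smooth)
    return float(1.0 - np.mean(dice))

def hybrid_loss(probs, onehot, alpha=0.5):
    # alpha=0.5 (equal weighting) is an assumption for illustration only.
    return alpha * cross_entropy_loss(probs, onehot) + (1.0 - alpha) * dice_loss(probs, onehot)
```

A perfect prediction (`probs` equal to the one-hot labels) drives both terms to approximately zero, while predictions concentrated on the wrong class are penalized by both terms.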

Key words: high-resolution remote sensing images, semantic segmentation, convolutional neural network, Swin Transformer
