Cross-Modal Multi-level Feature Fusion for Semantic Segmentation of Remote Sensing Images

doi:10.3778/j.issn.1673-9418.2403082

Abstract

Multimodal semantic segmentation networks exploit complementary information from different modalities to improve segmentation accuracy, which makes them promising for land cover classification. However, most existing multimodal semantic segmentation models for remote sensing images overlook the geometric shape information in deep features and fuse the modalities without first making full use of multi-layer features. As a result, cross-modal feature extraction is insufficient and the fusion results are suboptimal. To address these issues, a remote sensing image semantic segmentation model based on multimodal feature extraction and multi-layer feature fusion is proposed. A dual-branch encoder separately extracts spectral information from the remote sensing image and elevation information from the normalized digital surface model (nDSM), and further mines the geometric shape information of the nDSM. A cross-layer enrichment module then refines and completes the features of each layer, making full use of multi-layer feature information from deep to shallow layers. The refined features are passed through an attention feature fusion module, which performs differential complementation and cross-fusion to mitigate the differences between the branch structures and fully exploit the advantages of multimodal features, thereby improving segmentation accuracy. In experiments on the ISPRS Vaihingen and Potsdam datasets, the model achieves mean F1 (mF1) scores of 90.88% and 93.41% and mean intersection over union (mIoU) scores of 83.49% and 87.85%, respectively. Compared with current mainstream algorithms, the model segments remote sensing images more accurately.

Key words: remote sensing images, normalized digital surface model (nDSM), semantic segmentation, feature extraction, feature fusion
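
The abstract reports results as mF1 and mIoU. As a reading aid, the following is a minimal NumPy sketch of how these two metrics are commonly computed from a confusion matrix. It assumes the usual per-class definitions (F1 = 2TP / (2TP + FP + FN), IoU = TP / (TP + FP + FN)) averaged uniformly over classes. The function names are hypothetical, and the abstract does not state the paper's exact evaluation protocol (for example, which classes are included in the average), so this is an illustration and not the authors' evaluation code.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, num_classes):
    """Count pixels for every (true class, predicted class) pair.

    Rows index the true class, columns the predicted class.
    Labels are assumed to be integers in [0, num_classes).
    """
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    flat_index = y_true * num_classes + y_pred
    counts = np.bincount(flat_index, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)


def mean_f1_and_miou(cm):
    """Return (mF1, mIoU) as fractions in [0, 1] from a confusion matrix.

    Assumption: a class with no true and no predicted pixels scores 0
    and still counts in the mean; other conventions skip such classes.
    """
    cm = cm.astype(np.float64)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp  # predicted as the class, but wrong
    fn = cm.sum(axis=1) - tp  # belongs to the class, but missed
    f1_den = 2 * tp + fp + fn
    iou_den = tp + fp + fn
    f1 = np.divide(2 * tp, f1_den, out=np.zeros_like(tp), where=f1_den > 0)
    iou = np.divide(tp, iou_den, out=np.zeros_like(tp), where=iou_den > 0)
    return f1.mean(), iou.mean()


if __name__ == "__main__":
    # Toy two-class example with four pixels.
    cm = confusion_matrix([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
    mf1, miou = mean_f1_and_miou(cm)
    print(f"mF1 = {mf1:.4f}, mIoU = {miou:.4f}")
```

Multiplying the returned fractions by 100 gives percentages in the form reported above.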

LI Zhijie, CHENG Xin, LI Changhua, GAO Yuan, XUE Jingyu, JIE Jun. Cross-Modal Multi-level Feature Fusion for Semantic Segmentation of Remote Sensing Images[J]. Journal of Frontiers of Computer Science and Technology, 2025, 19(4): 989-1000.

李智杰, 程鑫, 李昌华, 高元, 薛靖裕, 介军. 跨模态多层特征融合的遥感影像语义分割[J]. 计算机科学与探索, 2025, 19(4): 989-1000.
