Journal of Frontiers of Computer Science and Technology ›› 2025, Vol. 19 ›› Issue (12): 3290-3302. DOI: 10.3778/j.issn.1673-9418.2504054

• Graphics and Image •

CNN and Multi-scale Visual State Space Network for Semantic Segmentation of Remote Sensing Images

LIN Yueni, WANG Xili   

  1. School of Artificial Intelligence and Computer Science, Shaanxi Normal University, Xi’an 710119, China
  • Online: 2025-12-01  Published: 2025-12-01

Abstract: Existing methods for semantic segmentation of remote sensing images face significant challenges: convolutional neural network (CNN)-based methods lack long-range modeling capability, which limits their segmentation performance in complex remote sensing scenes, while Transformer-based methods have a computational complexity that grows quadratically with the input image size, making it difficult to balance segmentation performance and computational efficiency. Recently, the visual state space (VSS) model has attracted wide attention for its ability to model global dependencies with linear computational complexity. To address these problems, a semantic segmentation network for remote sensing images combining CNN and VSS is proposed, aiming to balance performance and efficiency. The network consists of a CNN-based encoder and a VSS-based decoder, which model local information and capture long-range contextual dependencies, respectively. Multi-scale depthwise convolution and a coordinate attention mechanism are introduced to construct a multi-scale feed-forward network (MSFFN) that replaces the feed-forward network (FFN) in the original VSS block, alleviating the spatial discontinuity of pixels within local 2D image regions caused by the sequential scanning mechanism while enhancing multi-scale feature representation. In addition, a spatial-channel aggregation enhancement module (SCAEM) is designed to fully fuse the shallow detail information of the encoder with the global semantic information of the decoder, achieving efficient feature aggregation. An auxiliary segmentation head is used to guide gradient propagation and feature learning, promoting more accurate segmentation results. Comparative experiments with several state-of-the-art semantic segmentation networks are conducted on the Vaihingen, Potsdam, and LoveDA datasets, and the results show that the proposed network outperforms the other networks on all three public datasets.
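To make the MSFFN design concrete, the following is a minimal PyTorch sketch of the idea, not the authors' implementation: the class names, the 3/5/7 depthwise kernel sizes, the expansion factor, and the coordinate-attention reduction ratio are illustrative assumptions. The parallel depthwise branches re-inject 2D neighborhood context that the 1D scan order of the state space model breaks apart, which is the discontinuity problem the abstract refers to.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Coordinate attention in the style of Hou et al. (2021): factorizes
    spatial attention into direction-aware 1D encodings along H and W."""
    def __init__(self, channels: int, reduction: int = 8):  # reduction is assumed
        super().__init__()
        hidden = max(8, channels // reduction)
        self.conv1 = nn.Conv2d(channels, hidden, kernel_size=1)
        self.bn = nn.BatchNorm2d(hidden)
        self.act = nn.ReLU(inplace=True)
        self.conv_h = nn.Conv2d(hidden, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(hidden, channels, kernel_size=1)

    def forward(self, x):
        b, c, h, w = x.shape
        # Pool along each spatial direction separately.
        x_h = x.mean(dim=3, keepdim=True)                       # (b, c, h, 1)
        x_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)   # (b, c, w, 1)
        y = self.act(self.bn(self.conv1(torch.cat([x_h, x_w], dim=2))))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                        # (b, c, h, 1)
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))    # (b, c, 1, w)
        return x * a_h * a_w

class MSFFN(nn.Module):
    """Multi-scale feed-forward network replacing the FFN in a VSS block:
    parallel multi-scale depthwise convolutions plus coordinate attention."""
    def __init__(self, channels: int, expansion: int = 4):  # expansion is assumed
        super().__init__()
        hidden = channels * expansion
        self.proj_in = nn.Conv2d(channels, hidden, kernel_size=1)
        self.branches = nn.ModuleList([
            nn.Conv2d(hidden, hidden, k, padding=k // 2, groups=hidden)
            for k in (3, 5, 7)  # assumed kernel sizes for the multi-scale branches
        ])
        self.ca = CoordinateAttention(hidden)
        self.act = nn.GELU()
        self.proj_out = nn.Conv2d(hidden, channels, kernel_size=1)

    def forward(self, x):
        y = self.act(self.proj_in(x))
        y = sum(branch(y) for branch in self.branches)  # fuse multi-scale context
        y = self.ca(y)                                  # direction-aware re-weighting
        return self.proj_out(y)
```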
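The abstract does not detail SCAEM's internals, so the sketch below is one plausible reading under stated assumptions: both inputs are first brought to the same channel count, and fusion applies a channel gate followed by a spatial gate, CBAM-style. None of these choices is confirmed by the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SCAEM(nn.Module):
    """Spatial-channel aggregation enhancement (sketch): fuses a shallow
    encoder feature (fine detail) with a decoder feature (global semantics).
    Assumes both inputs already have `channels` channels."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.align = nn.Conv2d(2 * channels, channels, kernel_size=1)
        # Channel gate: squeeze spatial dims, excite channels.
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )
        # Spatial gate: squeeze channels, excite locations.
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, enc_feat, dec_feat):
        # Upsample decoder semantics to the shallow feature's resolution.
        dec_feat = F.interpolate(dec_feat, size=enc_feat.shape[2:],
                                 mode="bilinear", align_corners=False)
        fused = self.align(torch.cat([enc_feat, dec_feat], dim=1))
        fused = fused * self.channel_gate(fused)
        s = torch.cat([fused.mean(dim=1, keepdim=True),
                       fused.amax(dim=1, keepdim=True)], dim=1)
        return fused * self.spatial_gate(s)
```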
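Deep supervision with an auxiliary segmentation head is a standard technique; this sketch assumes a cross-entropy objective and a 0.4 auxiliary loss weight, a common default in the literature (e.g., PSPNet) that is not stated in the abstract.

```python
import torch.nn as nn
import torch.nn.functional as F

class AuxHead(nn.Module):
    """Lightweight auxiliary head attached to an intermediate decoder stage;
    its loss is added with a small weight to stabilize gradient flow."""
    def __init__(self, in_channels: int, num_classes: int):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(in_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, num_classes, 1),
        )

    def forward(self, feat, target_size):
        logits = self.head(feat)
        return F.interpolate(logits, size=target_size,
                             mode="bilinear", align_corners=False)

def total_loss(main_logits, aux_logits, labels, aux_weight=0.4):
    # The auxiliary term only guides training; the head is dropped at inference.
    return (F.cross_entropy(main_logits, labels)
            + aux_weight * F.cross_entropy(aux_logits, labels))
```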

Key words: remote sensing images, semantic segmentation, visual state space, multi-scale features, convolutional neural network