Journal of Frontiers of Computer Science and Technology ›› 2024, Vol. 18 ›› Issue (10): 2762-2769. DOI: 10.3778/j.issn.1673-9418.2310068

• Artificial Intelligence · Pattern Recognition •

Multi-scale Visual Feature Extraction and Cross-Modality Alignment for Continuous Sign Language Recognition

GUO Leming, XUE Wanli, YUAN Tiantian   

  1. School of Computer Science and Engineering, Tianjin University of Technology, Tianjin 300384, China
  2. Technical College for the Deaf, Tianjin University of Technology, Tianjin 300384, China
  • Online: 2024-10-01  Published: 2024-09-29

Abstract: Effective visual feature representation is key to improving continuous sign language recognition performance. However, the varying temporal lengths of sign language actions and the weakly annotated nature of sign language data make effective visual feature extraction difficult. To address these problems, a multi-scale visual feature extraction and cross-modality alignment method for continuous sign language recognition (MECA) is proposed. The method consists of a multi-scale visual feature extraction module and a cross-modality alignment constraint. In the multi-scale visual feature extraction module, bottleneck residual structures with different dilation factors are fused in parallel to enrich the multi-scale temporal receptive field, so that visual features of different temporal lengths can be extracted; a hierarchical reuse design further strengthens the visual feature representation. In the cross-modality alignment constraint, dynamic time warping models the intrinsic relationship between sign language visual features and textual features, where textual feature extraction is performed jointly by a multilayer perceptron and a long short-term memory network. Experiments on the challenging public datasets RWTH-2014, RWTH-2014T, and CSL-Daily show that the proposed method achieves competitive performance. These results demonstrate that the multi-scale design captures sign language actions of distinct temporal lengths, and that the cross-modality alignment constraint is effective for continuous sign language recognition under weak supervision.
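
To make the multi-scale design above concrete, the following PyTorch sketch shows one plausible reading of the visual feature extraction module: parallel bottleneck residual branches, each with a different dilation factor, fused into a single temporal block. This is a minimal illustration, not the authors' implementation; the module names, the dilation set (1, 2, 3), the reduction ratio, and the concatenation-based fusion are all assumptions.

import torch
import torch.nn as nn

class DilatedBottleneck(nn.Module):
    """One bottleneck residual branch: 1x1 reduce -> dilated temporal
    conv -> 1x1 expand, with a residual connection (illustrative)."""
    def __init__(self, channels, dilation, reduction=4):
        super().__init__()
        hidden = channels // reduction
        self.branch = nn.Sequential(
            nn.Conv1d(channels, hidden, kernel_size=1),
            nn.BatchNorm1d(hidden),
            nn.ReLU(inplace=True),
            # A larger dilation widens the temporal receptive field,
            # covering sign actions of longer temporal length.
            nn.Conv1d(hidden, hidden, kernel_size=3,
                      padding=dilation, dilation=dilation),
            nn.BatchNorm1d(hidden),
            nn.ReLU(inplace=True),
            nn.Conv1d(hidden, channels, kernel_size=1),
            nn.BatchNorm1d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):  # x: (batch, channels, time)
        return self.relu(x + self.branch(x))

class MultiScaleTemporalBlock(nn.Module):
    """Fuses parallel branches with different dilation factors so one
    block mixes several temporal receptive fields (illustrative)."""
    def __init__(self, channels, dilations=(1, 2, 3)):
        super().__init__()
        self.branches = nn.ModuleList(
            DilatedBottleneck(channels, d) for d in dilations)
        self.fuse = nn.Conv1d(channels * len(dilations), channels,
                              kernel_size=1)

    def forward(self, x):  # x: (batch, channels, time)
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))

# Toy usage: 2 clips, 512-dimensional frame features, 64 frames.
feats = torch.randn(2, 512, 64)
print(MultiScaleTemporalBlock(512)(feats).shape)  # torch.Size([2, 512, 64])

Each branch keeps the sequence length unchanged (padding equals dilation for a kernel of size 3), so branches with different receptive fields can be concatenated and fused back to the original channel width.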

Key words: continuous sign language recognition, multi-scale, cross-modality alignment constraint, video visual features, textual features
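
The cross-modality alignment constraint is likewise described only conceptually in the abstract. The sketch below, assuming Euclidean frame costs, computes a plain dynamic time warping distance between a visual feature sequence and a textual feature sequence; a soft, differentiable DTW variant would be needed to back-propagate such a constraint during training. The function name and shapes are illustrative, not taken from the paper.

import torch

def dtw_distance(visual, textual):
    """Classic DTW between a visual sequence (T_v, D) and a textual
    sequence (T_t, D); returns the accumulated alignment cost."""
    tv, tt = visual.shape[0], textual.shape[0]
    cost = torch.full((tv + 1, tt + 1), float("inf"))
    cost[0, 0] = 0.0
    for i in range(1, tv + 1):
        for j in range(1, tt + 1):
            d = torch.norm(visual[i - 1] - textual[j - 1])  # frame cost
            cost[i, j] = d + torch.min(torch.stack([
                cost[i - 1, j],      # consume a visual frame
                cost[i, j - 1],      # consume a textual token
                cost[i - 1, j - 1],  # match the two
            ]))
    return cost[tv, tt]

# Toy usage: 64 visual frames vs. 12 gloss embeddings, both 256-d.
print(dtw_distance(torch.randn(64, 256), torch.randn(12, 256)))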