Multi-scale Visual Feature Extraction and Cross-Modality Alignment for Continuous Sign Language Recognition

doi:10.3778/j.issn.1673-9418.2310068

Abstract

Effective visual feature representation is key to improving continuous sign language recognition performance. However, sign language actions vary in temporal length and sign language videos are only weakly annotated, which makes effective visual feature extraction difficult. To address these problems, a method named multi-scale visual feature extraction and cross-modality alignment for continuous sign language recognition (MECA) is proposed. The method consists mainly of a multi-scale visual feature extraction module and a cross-modal alignment constraint. In the multi-scale visual feature extraction module, bottleneck residual structures with different dilation factors are fused in parallel to enrich the multi-scale temporal receptive field, so that visual features of sign language actions with different temporal lengths can be extracted. A hierarchical reuse design is also adopted to further strengthen the visual feature representation. In the cross-modal alignment constraint, dynamic time warping is used to model the intrinsic relationship between sign language visual features and textual features, where the textual features are extracted jointly by a multilayer perceptron and a long short-term memory network. Experiments on the challenging public datasets RWTH-2014, RWTH-2014T and CSL-Daily show that the proposed method achieves competitive performance. The results indicate that the multi-scale approach in MECA can capture sign language actions of different temporal lengths, and that constructing the cross-modal alignment constraint is a sound and effective strategy for continuous sign language recognition under weak supervision.

Key words: continuous sign language recognition, multi-scale, cross-modal alignment constraints, video visual features, text features
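
The cross-modal alignment idea in the abstract can be illustrated with a minimal sketch of classic dynamic time warping between a longer sequence of visual features and a shorter sequence of textual features. This is an illustration only, not the authors' implementation: the function name `dtw_cost`, the Euclidean frame-to-token distance, the shared two-dimensional embedding space, and the toy feature values are all assumptions, and a trainable model would likely need a differentiable variant rather than the hard minimum used here.

```python
import math


def dtw_cost(visual, text):
    """Accumulated cost of the best monotonic alignment between two
    sequences of feature vectors (hypothetical helper, classic DTW)."""
    n, m = len(visual), len(text)
    # acc[i][j] = minimal cost of aligning visual[:i] with text[:j]
    acc = [[math.inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Assumed local cost: Euclidean distance between one frame
            # feature and one text-token feature of the same dimension.
            d = math.dist(visual[i - 1], text[j - 1])
            # Extend the cheapest of the three allowed predecessor cells.
            acc[i][j] = d + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m]


# Toy data: four frame features that should map onto two gloss features.
visual_feats = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.0, 0.9]]
text_feats = [[0.0, 0.0], [1.0, 1.0]]
cost = dtw_cost(visual_feats, text_feats)
```

In this toy case the cheapest path assigns the first two frames to the first token and the last two frames to the second, which mirrors how several video frames typically correspond to a single gloss under weak annotation.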

GUO Leming, XUE Wanli, YUAN Tiantian. Multi-scale Visual Feature Extraction and Cross-Modality Alignment for Continuous Sign Language Recognition[J]. Journal of Frontiers of Computer Science and Technology, 2024, 18(10): 2762-2769.
