Journal of Frontiers of Computer Science and Technology

• Academic Research •


Point Cloud Action Recognition Method Based on Masked Self-supervised Learning

HE Yundong, LI Ping, PING Chenhao   

  1. School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou 310018, China


Abstract: Point cloud action recognition can provide precise 3D motion monitoring and recognition services, with broad application prospects in fields such as intelligent interaction, intelligent security, and healthcare. Existing methods typically train models on large amounts of annotated point cloud data. However, point cloud videos contain vast numbers of 3D coordinates, making precise annotation very expensive; moreover, point cloud videos are highly redundant and their information is unevenly distributed across frames, which further increases the difficulty of annotation. To address these issues and achieve better point cloud action recognition performance, a masked self-supervised action recognition method called MSTD-Transformer is proposed, which captures the spatiotemporal structure of point cloud videos without manual annotation. Specifically, the point cloud video is divided into point tubes, and adaptive video-level masks are generated according to tube importance; the appearance and motion features of the point cloud video are then learned through dual-stream self-supervised point cloud reconstruction and motion prediction. To better capture motion information, MSTD-Transformer extracts dynamic attention from the displacements of point cloud keypoints and embeds it into the Transformer, using a dual-branch structure for differentiated learning that captures motion information and global structure separately. Experimental results on the standard MSRAction-3D dataset show that the proposed method achieves 96.17% accuracy on 24-frame point cloud video action recognition, 2.09 percentage points higher than the best existing method, confirming the effectiveness of the masking strategy and dynamic attention.

Key words: action recognition, point cloud, self-supervised learning, mask, attention mechanism
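To make two ideas summarized in the abstract concrete, the NumPy sketch below illustrates importance-based adaptive masking of point tubes and dynamic attention weights derived from keypoint displacements. This is a minimal sketch under assumed tensor shapes; the function names, the mask ratio, and the use of mean displacement as the importance score are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np


def adaptive_tube_mask(tubes, mask_ratio=0.5):
    """Adaptively mask point tubes by importance (illustrative sketch).

    tubes: array of shape (M, L, P, 3) -- M point tubes, each spanning
    L frames with P points per frame. A tube's importance is estimated
    here as the mean frame-to-frame displacement of its points, so tubes
    with larger motion (more informative regions) are masked first.
    Returns a boolean mask of shape (M,), where True means masked.
    """
    disp = np.linalg.norm(np.diff(tubes, axis=1), axis=-1)  # (M, L-1, P)
    importance = disp.mean(axis=(1, 2))                     # (M,)
    n_mask = int(round(mask_ratio * len(tubes)))
    mask = np.zeros(len(tubes), dtype=bool)
    mask[np.argsort(-importance)[:n_mask]] = True           # mask top-n tubes
    return mask


def dynamic_attention(keypoints):
    """Attention weights over keypoints from their displacements.

    keypoints: array of shape (T, K, 3). Each keypoint's mean displacement
    magnitude across frames is converted into a softmax weight, so
    faster-moving keypoints receive more attention.
    """
    disp = np.linalg.norm(np.diff(keypoints, axis=0), axis=-1).mean(axis=0)  # (K,)
    w = np.exp(disp - disp.max())  # numerically stable softmax
    return w / w.sum()
```

Masking the highest-motion tubes makes the reconstruction and motion-prediction pretext tasks hardest exactly where the action information concentrates, which is the intuition behind importance-driven (rather than uniformly random) masking.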