Journal of Frontiers of Computer Science and Technology

• Academic Research •


Point Cloud Action Recognition Method Based on Masked Self-supervised Learning

HE Yundong, LI Ping, PING Chenhao   

  1. School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou 310018, China


Abstract: Point cloud action recognition can provide precise 3D motion monitoring and recognition services, with broad application prospects in fields such as intelligent interaction, intelligent security, and healthcare. Existing methods typically train models on large amounts of annotated point cloud data. However, point cloud videos contain vast numbers of 3D coordinates, making precise annotation very expensive; moreover, point cloud videos are highly redundant and their information is unevenly distributed across frames, which further increases the difficulty of annotation. To address these issues and achieve better point cloud action recognition performance, a masked self-supervised action recognition method called MSTD-Transformer is proposed, which captures the spatiotemporal structure of point cloud videos without manual annotation. Specifically, the point cloud video is divided into point tubes, and adaptive video-level masks are generated according to tube importance; the appearance and motion features of the point cloud video are then learned through dual-stream self-supervised point cloud reconstruction and motion prediction. To better capture motion information, MSTD-Transformer extracts dynamic attention from the displacements of point cloud keypoints and embeds it into the Transformer, using a dual-branch structure for differentiated learning that captures motion information and global structure separately. Experimental results on the standard MSRAction-3D dataset show that the proposed method achieves 96.17% accuracy on 24-frame point cloud video action recognition, 2.09 percentage points higher than the best existing method, confirming the effectiveness of the masking strategy and dynamic attention.

Key words: action recognition, point cloud, self-supervised learning, mask, attention mechanism
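To make two ideas summarized in the abstract concrete, the NumPy sketch below illustrates importance-based adaptive masking of point tubes and dynamic attention weights derived from keypoint displacements. This is a minimal sketch under assumed tensor shapes; the function names, the mask ratio, and the use of mean displacement as the importance score are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np


def adaptive_tube_mask(tubes, mask_ratio=0.5):
    """Adaptively mask point tubes by importance (illustrative sketch).

    tubes: array of shape (M, L, P, 3) -- M point tubes, each spanning
    L frames with P points per frame. A tube's importance is estimated
    here as the mean frame-to-frame displacement of its points, so tubes
    with larger motion (more informative regions) are masked first.
    Returns a boolean mask of shape (M,), where True means masked.
    """
    disp = np.linalg.norm(np.diff(tubes, axis=1), axis=-1)  # (M, L-1, P)
    importance = disp.mean(axis=(1, 2))                     # (M,)
    n_mask = int(round(mask_ratio * len(tubes)))
    mask = np.zeros(len(tubes), dtype=bool)
    mask[np.argsort(-importance)[:n_mask]] = True           # mask top-n tubes
    return mask


def dynamic_attention(keypoints):
    """Attention weights over keypoints from their displacements.

    keypoints: array of shape (T, K, 3). Each keypoint's mean displacement
    magnitude across frames is converted into a softmax weight, so
    faster-moving keypoints receive more attention.
    """
    disp = np.linalg.norm(np.diff(keypoints, axis=0), axis=-1).mean(axis=0)  # (K,)
    w = np.exp(disp - disp.max())  # numerically stable softmax
    return w / w.sum()
```

Masking the highest-motion tubes makes the reconstruction and motion-prediction pretext tasks hardest exactly where the action information concentrates, which is the intuition behind importance-driven (rather than uniformly random) masking.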