计算机科学与探索 (Journal of Frontiers of Computer Science and Technology)

• Academic Research •

三维重排MLP驱动的跨维交互式弱监督视频异常检测

张一帆,严豫,刘特立,陈鹏   

  1. 中国人民公安大学 信息网络安全学院,北京 100038
  2. 中国科学院 计算技术研究所,北京 100089

Three-Dimensional rearrangement MultiLayer Perceptron driven Cross-Dimensional Feature Interaction for Weakly-supervised Video Anomaly Detection

ZHANG Yifan,  YAN Yu,  LIU Teli,  CHEN Peng   

  1. College of Information and Cyber Security, People's Public Security University of China, Beijing 100038, China
  2. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100089, China

摘要: 视频异常检测(Video Anomaly Detection,VAD)已成为计算机视觉领域的一项重要任务。目前,弱监督视频异常检测(Weakly-supervised Video Anomaly Detection,WVAD)已经成为视频异常检测的主流方法之一。然而,现有方法存在将各视频段视为独立同分布的示例,忽略了视频级时空依赖关系的问题。针对上述问题,提出了一种三维重排MLP驱动的跨维度特征交互式弱监督视频异常检测方法(Three-Dimensional rearrangement MultiLayer Perceptron driven Cross-Dimensional Feature Interaction for Weakly-supervised Video Anomaly Detection,rMLP-WVAD)。首先,使用I3D编码器提取视频帧的多尺度特征,并通过三维重排MLP驱动的视频级特征交互及时空注意力机制(Video-level feature interaction and Spatio-temporal Attention,VSA)进行特征增强,以保留视频级的时空依赖关系并充分挖掘关键的跨维度异常特征。然后,随着跨维度特征被进一步挖掘和丰富,如何更加精准地定义并量化“异常”便成为能否有效检测视频异常的关键。为此,提出将特征与加权平均特征的差异(Divergence of Feature from Weighted Mean vector,DFWM)作为异常判别标准,以充分利用增强后的时空特征表达并更准确地量化“异常”并提升检测的性能。最后,在公开数据集上的实验结果显示,rMLP-WVAD在XD-Violence数据集上的AP达到86.39%;在UCF-Crime数据集上的AUC达到85.70%,验证了该方法的有效性。

关键词: 视频异常检测, 弱监督, 视频级特征, 时空注意力, 加权平均特征

Abstract: Video Anomaly Detection (VAD) has become an important task in computer vision, and Weakly-supervised Video Anomaly Detection (WVAD) is now one of its mainstream approaches. However, existing methods treat each video snippet as an independent and identically distributed instance and therefore ignore video-level spatio-temporal dependencies. To address this problem, a method named Three-Dimensional rearrangement MultiLayer Perceptron driven Cross-Dimensional Feature Interaction for Weakly-supervised Video Anomaly Detection (rMLP-WVAD) is proposed. First, multi-scale features of video frames are extracted with an I3D encoder and enhanced by a three-dimensional rearrangement MLP driven Video-level feature interaction and Spatio-temporal Attention (VSA) mechanism, which preserves video-level spatio-temporal dependencies and fully mines key cross-dimensional anomaly features. Then, as the cross-dimensional features are further mined and enriched, defining and quantifying "anomaly" more precisely becomes the key to detecting video anomalies effectively. To this end, the Divergence of Feature from Weighted Mean vector (DFWM) is proposed as the anomaly criterion, so as to make full use of the enhanced spatio-temporal feature representation, quantify anomalies more accurately, and improve detection performance. Finally, experiments on public datasets show that rMLP-WVAD achieves an AP of 86.39% on the XD-Violence dataset and an AUC of 85.70% on the UCF-Crime dataset, which verifies the effectiveness of the method.

Key words: Video Anomaly Detection, Weakly-supervised, Video-level Features, Spatio-Temporal Attention, Weighted Mean Feature
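
The abstract names the Divergence of Feature from Weighted Mean vector (DFWM) as the anomaly criterion but does not give its formula. The sketch below illustrates one plausible reading only: each snippet feature is scored by its Euclidean distance from a weighted mean of all snippet features in the same video. The function name `dfwm_scores`, the use of Euclidean distance, and the source of the weights (for example, attention outputs) are assumptions made for illustration, not the authors' definition.

```python
import math


def dfwm_scores(features, weights):
    """Score each snippet by its distance from the weighted mean feature.

    features: list of T feature vectors (each a list of D floats), one per snippet.
    weights:  list of T non-negative floats (hypothetical per-snippet weights).
    Returns a list of T non-negative scores; larger means further from the mean.
    """
    if not features or len(features) != len(weights):
        raise ValueError("features and weights must be non-empty and of equal length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    dim = len(features[0])
    # Weighted mean vector over the whole video (the video-level reference point).
    mean = [
        sum(w * f[d] for f, w in zip(features, weights)) / total
        for d in range(dim)
    ]
    # Assumed divergence measure: Euclidean distance to the weighted mean.
    return [math.dist(f, mean) for f in features]


# Toy video with four 2-D snippet features; the last snippet is the outlier.
feats = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [3.0, 4.0]]
scores = dfwm_scores(feats, [1.0, 1.0, 1.0, 1.0])
print(scores)
```

With uniform weights the mean is `[0.75, 1.0]`, so the three identical snippets sit close to it and the fourth receives the largest score. In a weakly supervised setting only video-level labels are available, so scores of this kind would still have to be aggregated per video before a loss is computed; the abstract does not describe that step.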