Detection of Health Data Based on Gaussian Mixture Generative Model

doi:10.3778/j.issn.1673-9418.2010055

Abstract:

With the growing popularity of smart wearable devices, sports bracelets provide a rich source of information for a comprehensive understanding of people's physical condition. However, the multidimensional activity data they provide contain unknown outliers, so outlier detection is necessary. Because of the curse of dimensionality, density estimation with traditional methods is difficult, which leads to poor detection performance. To address this problem, a health data detection method based on a Gaussian mixture generative model (GMGM) is used. First, the model trains a variational autoencoder (VAE) on the original data and extracts latent features by reducing the reconstruction error. Then, a deep belief network (DBN) predicts the mixture membership of each sample from the latent distribution and the extracted features. Next, the VAE, the DBN and a Gaussian mixture model (GMM) are optimized jointly, which avoids the effects of model decoupling. Finally, the GMM predicts the sample density of each data point, and samples whose density is higher than the threshold obtained in the training stage are regarded as outliers. The performance of GMGM is verified on the ODDS standard datasets. The results show that, compared with the deep autoencoding Gaussian mixture model (DAGMM), GMGM improves the AUC by 5.5 percentage points on average. Experimental results on a real dataset also demonstrate the effectiveness of the method.

Key words: variational autoencoder (VAE), deep belief network (DBN), Gaussian mixture model (GMM), health data, anomaly detection
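The final scoring step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the latent features and the soft memberships (which the DBN would predict) are made-up values, the covariances are restricted to diagonals for brevity, and the score is the DAGMM-style sample energy (negative log-likelihood under the mixture), which is assumed here to play the role of the "sample density" that is compared with the training-stage threshold. The function names and the 0.8 quantile are hypothetical.

```python
import math

def estimate_gmm(features, memberships):
    """Weights, means and diagonal variances from soft memberships.

    Assumes every component receives some positive membership mass.
    """
    n, dim, k = len(features), len(features[0]), len(memberships[0])
    weights, means, variances = [], [], []
    for j in range(k):
        mass = sum(m[j] for m in memberships)
        mean = [sum(m[j] * x[d] for m, x in zip(memberships, features)) / mass
                for d in range(dim)]
        var = [sum(m[j] * (x[d] - mean[d]) ** 2
                   for m, x in zip(memberships, features)) / mass + 1e-6
               for d in range(dim)]
        weights.append(mass / n)
        means.append(mean)
        variances.append(var)
    return weights, means, variances

def sample_energy(x, weights, means, variances):
    """Negative log-likelihood of x under the mixture (higher = less likely)."""
    log_terms = []
    for w, mean, var in zip(weights, means, variances):
        log_pdf = sum(-0.5 * (math.log(2 * math.pi * v) + (xi - mu) ** 2 / v)
                      for xi, mu, v in zip(x, mean, var))
        log_terms.append(math.log(w) + log_pdf)
    top = max(log_terms)  # log-sum-exp for numerical stability
    return -(top + math.log(sum(math.exp(t - top) for t in log_terms)))

def energy_threshold(energies, quantile):
    """Energy value at the given quantile of the training energies."""
    ordered = sorted(energies)
    return ordered[min(len(ordered) - 1, int(quantile * len(ordered)))]

# Hypothetical latent features and DBN-style soft memberships (illustration only).
features = [[0.0, 0.1], [0.1, 0.0], [-0.1, 0.05],
            [5.0, 5.1], [5.1, 4.9], [4.9, 5.0]]
memberships = [[0.9, 0.1]] * 3 + [[0.1, 0.9]] * 3

weights, means, variances = estimate_gmm(features, memberships)
train_energies = [sample_energy(x, weights, means, variances) for x in features]
threshold = energy_threshold(train_energies, 0.8)  # assumed quantile

far_point = [20.0, -20.0]
is_outlier = sample_energy(far_point, weights, means, variances) > threshold
```

In the joint optimization described in the abstract, the memberships and the mixture parameters would be learned together with the VAE rather than fixed by hand as they are here.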

CLC number:

TP18

ZHU Zhuangzhuang, ZHOU Zhiping. Detection of Health Data Based on Gaussian Mixture Generative Model[J]. Journal of Frontiers of Computer Science and Technology, 2022, 16(5): 1128-1135.

Figures/Tables (9)

Fig.1 Structure diagram of Gaussian mixture generative model

Table 1 Dataset information

Dataset	Number of samples	Dimensions	Number of outliers (proportion)
Ionosphere	351	33	126 (35.90%)
Arrhythmia	452	274	66 (14.60%)
Musk	3 062	166	97 (3.17%)
Speech	3 686	400	61 (1.65%)
Shuttle	49 097	9	3 511 (7.15%)
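As a quick arithmetic check, the outlier proportions in Table 1 follow directly from the sample and outlier counts. The short sketch below recomputes them; the dictionary layout and the function name are illustrative choices, not part of the original work.

```python
# (number of samples, number of outliers) as listed in Table 1
datasets = {
    "Ionosphere": (351, 126),
    "Arrhythmia": (452, 66),
    "Musk": (3062, 97),
    "Speech": (3686, 61),
    "Shuttle": (49097, 3511),
}

def outlier_ratio(n_samples, n_outliers):
    """Outlier proportion as a percentage of all samples."""
    return 100.0 * n_outliers / n_samples

ratios = {name: outlier_ratio(n, k) for name, (n, k) in datasets.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}%")
```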

Fig.2 ROC curves of each algorithm for Speech
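For readers unfamiliar with the AUC metric summarizing such ROC curves, the sketch below computes it from anomaly scores using its rank interpretation: the probability that a randomly chosen outlier receives a higher score than a randomly chosen normal sample, with ties counted as one half. The labels and scores are made-up values for illustration and are not taken from the experiments.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: 1 for outlier, 0 for normal sample.
    scores: anomaly scores, where higher means more anomalous.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one outlier and one normal sample")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical labels and anomaly scores.
labels = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.9, 0.7]
auc = roc_auc(labels, scores)
```

The pairwise loop is quadratic in the number of samples, which is fine for a demonstration; a sort-based rank computation would be preferable for datasets the size of Shuttle.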

Fig.3 AUC curves with different o for different datasets on GMGM

Table 2 Comparison of average detection time of each algorithm

Algorithm	Average detection time/s
SOS	2.50
VAE	0.52
DAGMM	1.01
Proposed algorithm	0.64

Table 3 Comparison of experimental results of different model structures

Model structure	Dataset	ACC	Recall	F1-score	AUC
Proposed model	Ionosphere	0.871	0.481	0.501	0.883
	Arrhythmia	0.821	0.422	0.497	0.832
	Musk	0.995	0.988	0.985	0.998
	Speech	0.920	0.969	0.958	0.918
	Shuttle	0.972	0.427	0.462	0.974
Models trained independently	Ionosphere	0.821	0.401	0.399	0.808
	Arrhythmia	0.698	0.312	0.386	0.701
	Musk	0.776	0.833	0.897	0.801
	Speech	0.835	0.871	0.834	0.881
	Shuttle	0.911	0.918	0.937	0.937
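The ACC, Recall and F1-score columns are standard classification metrics computed from thresholded predictions. The sketch below shows how they are derived from true labels and binary predictions, treating outliers as the positive class (an assumption, since the page does not state which class is positive). The example labels and predictions are made up for illustration.

```python
def detection_metrics(labels, predictions):
    """Accuracy, recall and F1-score with outliers (label 1) as the positive class."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, recall, f1

# Hypothetical ground truth and thresholded predictions.
labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
predictions = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
accuracy, recall, f1 = detection_metrics(labels, predictions)
```

Unlike AUC, these three metrics depend on the chosen threshold, so they can move in different directions when the threshold changes.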

Table 4 Comparison of experimental results of different algorithms

Dataset	Algorithm	ACC	Recall	F1-score	AUC
Ionosphere	SOS	0.727	0.356	0.393	0.763
	VAE	0.807	0.339	0.385	0.758
	DAGMM	0.834	0.436	0.446	0.838
	Proposed algorithm	0.871	0.481	0.501	0.883
Arrhythmia	SOS	0.675	0.227	0.315	0.577
	VAE	0.501	0.341	0.373	0.503
	DAGMM	0.804	0.254	0.411	0.622
	Proposed algorithm	0.821	0.422	0.497	0.832
Musk	SOS	0.677	0.743	0.698	0.740
	VAE	0.727	0.741	0.785	0.742
	DAGMM	0.814	0.826	0.852	0.828
	Proposed algorithm	0.995	0.988	0.985	0.998
Speech	SOS	0.756	0.783	0.757	0.727
	VAE	0.722	0.757	0.778	0.809
	DAGMM	0.888	0.890	0.923	0.921
	Proposed algorithm	0.920	0.969	0.958	0.918
Shuttle	SOS	0.884	0.442	0.269	0.773
	VAE	0.839	0.261	0.302	0.871
	DAGMM	0.991	0.307	0.393	0.960
	Proposed algorithm	0.972	0.427	0.463	0.974

Fig.4 Detection results by GMGM on health data

Fig.5 Detection results by DAGMM on health data
