Journal of Frontiers of Computer Science and Technology ›› 2024, Vol. 18 ›› Issue (5): 1243-1258. DOI: 10.3778/j.issn.1673-9418.2212038

• Theory·Algorithm •

Clustering Multivariate Time Series Data Based on Shape Extraction with Compactness Constraint

ZHANG Chi, CHEN Mei, ZHANG Jinhong   

  1. School of Electronic and Information Engineering, Lanzhou Jiaotong University, Lanzhou 730070, China
  • Online: 2024-05-01 Published: 2024-04-29

Abstract: To address the naturalness and structural complexity of multivariate time series (MTS) data, as well as the inability of existing algorithms to accurately identify clusters in high-dimensional time series data, this paper proposes C-Shape, a multivariate time series clustering algorithm based on shape extraction under a compactness constraint. First, C-Shape applies Largest Triangle Three Buckets (LTTB) down-sampling to the complex MTS data, so that fewer data points are used while the original shape of the series is preserved. It then computes the time series compactness between the raw data and the down-sampled data to assess whether the chosen reduced dimensionality is reasonable. Next, with the shape integrity of the data preserved, new cluster centers are determined by shape extraction, and the final clusters are formed by iteration. By fully accounting for the shape similarity between the processed data and the raw data, C-Shape resolves the difficulty that traditional down-sampling algorithms have in determining the dimensionality of the low-dimensional space. To validate its performance, C-Shape is compared with two classical algorithms and seven well-regarded time series clustering algorithms proposed in recent years, on eight regular and four imbalanced MTS datasets whose dimensions range from tens to thousands. Experimental results show that C-Shape outperforms all nine baseline algorithms in clustering ability, with an average improvement of 16.33% in Rand index (RI) and an average improvement of 69.71% in time performance. C-Shape is therefore an accurate and efficient multivariate time series clustering algorithm.
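
To make the down-sampling step concrete, the following is a minimal sketch of the standard Largest Triangle Three Buckets procedure for a single one-dimensional series. It is not the authors' implementation: the function name `lttb`, the parameter `n_out`, and the use of sample indices as the time axis are assumptions made for illustration, and the abstract does not say how the procedure is applied across the variables of an MTS.

```python
import numpy as np

def lttb(y, n_out):
    """Down-sample a 1-D series to n_out points with Largest Triangle Three Buckets.

    Returns (indices, values) of the retained samples. Hypothetical helper:
    the sample index is used as the time axis.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    if n_out < 3:
        raise ValueError("n_out must be at least 3")
    if n_out >= n:
        return np.arange(n), y.copy()

    x = np.arange(n, dtype=float)
    # Interior samples 1..n-2 are split into n_out - 2 buckets;
    # the first and last samples are always kept.
    edges = np.linspace(1, n - 1, n_out - 1).astype(int)
    keep = [0]
    for i in range(n_out - 2):
        lo, hi = edges[i], edges[i + 1]
        # Reference point C: mean of the next bucket, or the last sample
        # when the current bucket is the final interior bucket.
        if i + 2 < n_out - 1:
            nlo, nhi = edges[i + 1], edges[i + 2]
        else:
            nlo, nhi = n - 1, n
        cx, cy = x[nlo:nhi].mean(), y[nlo:nhi].mean()
        # Reference point A: the sample kept from the previous bucket.
        ax, ay = x[keep[-1]], y[keep[-1]]
        # Twice the area of triangle (A, B, C) for every candidate B in the bucket.
        area = np.abs((ax - cx) * (y[lo:hi] - ay) - (ax - x[lo:hi]) * (cy - ay))
        keep.append(lo + int(np.argmax(area)))
    keep.append(n - 1)
    idx = np.array(keep)
    return idx, y[idx]

# Usage: a flat series with one spike; the spike survives down-sampling.
series = np.zeros(100)
series[50] = 10.0
idx, values = lttb(series, 10)
```

Because each bucket keeps the sample that forms the largest triangle with its neighbours, peaks and turning points tend to survive, which is the property the abstract relies on when it says the original shape is preserved.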

Key words: multivariate time series clustering, down-sampling, similarity measurement, shape extraction, time series compactness
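
The evaluation metric named in the abstract, the Rand index, can be sketched as follows. This is the standard pair-counting definition: the fraction of sample pairs on which the predicted clustering and the ground-truth labelling agree (both place the pair together, or both place it apart). The function name `rand_index` is illustrative, and whether the authors use exactly this unadjusted form is an assumption.

```python
from itertools import combinations

def rand_index(labels_true, labels_pred):
    """Unadjusted Rand index: share of sample pairs on which two labelings agree."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    pairs = list(combinations(range(len(labels_true)), 2))
    if not pairs:
        raise ValueError("need at least two samples")
    agree = sum(
        (labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
        for i, j in pairs
    )
    return agree / len(pairs)

# Usage: one of the six pairs (samples 2 and 3) is split by the prediction.
score = rand_index([0, 0, 1, 1], [0, 0, 1, 2])
```

The score depends only on which samples are grouped together, not on the cluster label values themselves, so permuting the predicted cluster IDs leaves it unchanged.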