Survey on Feature Representation and Similarity Measurement of Time Series

doi:10.3778/j.issn.1673-9418.2003063

Abstract

A time series is a sequence of random values obtained by arranging the observations of a single indicator in chronological order. With the rapid development of science and technology, time series have become increasingly widely used in data mining. This paper surveys recent literature on time series data mining and reviews methods for feature representation and similarity measurement. Feature representation methods are discussed in three groups: non-data-adaptive methods, data-adaptive methods, and model-based methods. The research status, advantages and disadvantages, application fields, characteristics, and limitations of the main methods are compared and analyzed. Similarity measurement methods are described systematically in three groups: shape-based methods, model-based methods, and data-compression-based methods. The advantages, disadvantages, and application fields of the main methods are introduced, and the methods are compared on whether they support comparison between time series of unequal length, whether they support translation (shifting), and whether they satisfy the triangle inequality. Finally, future research directions for time series are discussed.

Key words: data mining, time series, feature representation, similarity measurement
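
The comparison criteria named in the abstract (support for unequal-length series, the triangle inequality) can be illustrated with two well-known shape-based measures. The sketch below is not taken from the paper: it assumes Euclidean distance and a basic dynamic time warping (DTW) with an absolute-difference local cost and no window constraint, and the function names `euclidean` and `dtw` are chosen here for illustration only.

```python
import math


def euclidean(a, b):
    """Euclidean distance; defined only for series of equal length."""
    if len(a) != len(b):
        raise ValueError("Euclidean distance needs equal-length series")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw(a, b):
    """Basic DTW distance with |x - y| as the local cost (illustrative choice)."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # repeat b[j-1]
                                 cost[i][j - 1],      # repeat a[i-1]
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]


x, y, z = [0], [1], [2, 2, 2]

# DTW accepts series of different lengths, where Euclidean distance does not.
d_xy, d_yz, d_xz = dtw(x, y), dtw(y, z), dtw(x, z)

# For these three series the triangle inequality fails under DTW:
# d_xz is larger than d_xy + d_yz.
violates_triangle = d_xz > d_xy + d_yz
```

In this toy case DTW compares series of lengths 1 and 3 directly, while `euclidean(x, z)` would raise an error. The price is that DTW is not a metric, which matters for index structures that rely on the triangle inequality for pruning.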

SUN Dongpu, QU Li. Survey on Feature Representation and Similarity Measurement of Time Series[J]. Journal of Frontiers of Computer Science and Technology, 2021, 15(2): 195-205.
