Journal of Frontiers of Computer Science and Technology ›› 2018, Vol. 12 ›› Issue (7): 1055-1063. DOI: 10.3778/j.issn.1673-9418.1705036

• Academic Research •

Parallel Time Series Decomposition Algorithm Based on Spark

LI Yong, TENG Fei, HUANG Qichuan, LI Tianrui   

  1. School of Information Science and Technology, Southwest Jiaotong University, Chengdu 611756, China
    2. State Key Laboratory of Rail Transit Engineering Informatization (China Railway First Survey and Design Institute), Xi'an 710043, China
  • Online: 2018-07-01 Published: 2018-07-06

Abstract:

This paper proposes a parallel time series decomposition model based on Spark, a distributed in-memory computing framework, to address the challenge of decomposing time series in the era of big data. The parallel algorithm consists of three steps. Firstly, the complete time series is split into a sequence of sub-series, and redundant data are appended to both ends of each sub-series so that the interior data are protected from end-point pollution. Secondly, each padded sub-series is distributed to a worker node of the Spark cluster, where it is processed by a sequential time series decomposition algorithm such as STL or SSA. Thirdly, the redundant parts are removed from each decomposition result, and the results are merged in order. Experiments on an instance of the model demonstrate that it can analyze large-scale time series efficiently and accurately.
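The three-step scheme above maps naturally onto Spark's data-parallel primitives. Below is a minimal PySpark sketch of the idea, not the authors' implementation: the chunk length, padding width, helper names, and the use of statsmodels' STL as the per-chunk decomposition algorithm are all illustrative assumptions.

# A minimal sketch (not the authors' implementation) of the three-step
# scheme: split with redundant padding, decompose each chunk on a Spark
# worker, trim the padding, and merge. Assumes PySpark and statsmodels
# are installed on every node; PERIOD, CHUNK, and PAD are illustrative.
import numpy as np
from pyspark.sql import SparkSession
from statsmodels.tsa.seasonal import STL

PERIOD = 24        # assumed seasonal period of the input series
CHUNK = 2048       # core length of each sub-series
PAD = 2 * PERIOD   # redundant points attached to both ends of a chunk

def split_with_padding(series):
    # Step 1: cut the series into chunks; each chunk carries redundant
    # neighbouring data so its interior is immune to end-point effects.
    chunks = []
    for start in range(0, len(series), CHUNK):
        lo = max(0, start - PAD)
        hi = min(len(series), start + CHUNK + PAD)
        core_len = min(CHUNK, len(series) - start)
        chunks.append((start - lo, series[lo:hi], core_len))
    return chunks

def decompose_chunk(chunk):
    # Step 2: run a sequential decomposition (STL here; SSA would fill
    # the same slot) on one padded chunk, then trim the redundant ends
    # so only the unpolluted core is returned.
    left_pad, values, core_len = chunk
    result = STL(values, period=PERIOD).fit()
    core = slice(left_pad, left_pad + core_len)
    return result.trend[core], result.seasonal[core], result.resid[core]

spark = SparkSession.builder.appName("parallel-decomposition-sketch").getOrCreate()
series = np.sin(2 * np.pi * np.arange(20000) / PERIOD) + 0.1 * np.random.randn(20000)

# Step 3: decompose all chunks in parallel and merge the trimmed results
# (parallelize/map/collect preserve element order).
parts = (spark.sparkContext
         .parallelize(split_with_padding(series))
         .map(decompose_chunk)
         .collect())
trend = np.concatenate([p[0] for p in parts])
seasonal = np.concatenate([p[1] for p in parts])
residual = np.concatenate([p[2] for p in parts])
spark.stop()

The padding width trades boundary accuracy against redundant computation: it should be wide enough to cover the smoothing window of the chosen decomposition algorithm so that the retained core of each chunk is unaffected by the cut points.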

Key words: time series decomposition, Spark, cloud computing, parallel computing, STL, SSA