Content of Database Technology in our journal

    Dynamic Configurable Write-Ahead Logging Framework for Memory Table
    ZHU Haiming, HUANG Xiangdong, QIAO Jialin, WANG Jianmin
    Journal of Frontiers of Computer Science and Technology    2023, 17 (11): 2777-2783.   DOI: 10.3778/j.issn.1673-9418.2208103
    In NoSQL database management systems, databases or data partitions are typically bound at startup to one or more fixed write-ahead log (WAL) files, forming a tightly coupled relationship. Since the database logical model and partition configuration are determined by the application business and computing environment, this tight coupling prevents the database management system from optimizing performance simply by tuning parameters such as the number and size of the write-ahead logs. To address this problem, this paper proposes a dynamically configurable write-ahead logging framework for memory tables. The framework records redo logs, and memory tables can be dynamically allocated to different write-ahead logging queues, supporting a mutable mapping and decoupling write-ahead logging from applications. The framework is implemented on the time series database Apache IoTDB and evaluated experimentally. Experimental results show that, compared with tightly coupled write-ahead logging, the dynamically configurable framework can find a better configuration and improve write performance by 8% to 19%, indicating that it achieves dynamic performance tuning for different computing environments and application loads.
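    The decoupling idea in the abstract can be sketched in a few lines: memory tables write redo records through a router whose memtable-to-queue mapping is mutable at runtime. This is an illustrative sketch, not Apache IoTDB's actual implementation; all names here (WALRouter, WALQueue, assign, log) are hypothetical.

```python
import threading

class WALQueue:
    """One append-only log queue; entries would normally be flushed to a file."""
    def __init__(self, queue_id):
        self.queue_id = queue_id
        self.entries = []
        self.lock = threading.Lock()

    def append(self, memtable_id, record):
        with self.lock:
            self.entries.append((memtable_id, record))

class WALRouter:
    """Routes each memtable's redo records to a WAL queue; the mapping is
    mutable, so memtables can be reassigned without restarting the system."""
    def __init__(self, num_queues):
        self.queues = [WALQueue(i) for i in range(num_queues)]
        self.assignment = {}  # memtable_id -> queue index

    def assign(self, memtable_id, queue_idx):
        self.assignment[memtable_id] = queue_idx

    def log(self, memtable_id, record):
        # default: hash-based assignment when none was set explicitly
        idx = self.assignment.get(memtable_id, hash(memtable_id) % len(self.queues))
        self.queues[idx].append(memtable_id, record)
        return idx

router = WALRouter(num_queues=2)
router.assign("mt-a", 0)
q1 = router.log("mt-a", {"op": "insert", "key": 1})
router.assign("mt-a", 1)        # dynamic reassignment at runtime
q2 = router.log("mt-a", {"op": "insert", "key": 2})
```

    Tuning then amounts to changing `num_queues` and the assignment, with no restart and no code change in the write path.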
    Research on Asynchronous Global Index of Financial Distributed Database
    JIN Panshi, LI Bohan, QIN Xiaolin, XING Lei, LI Xiaodong, WANG Jin
    Journal of Frontiers of Computer Science and Technology    2023, 17 (11): 2784-2794.   DOI: 10.3778/j.issn.1673-9418.2208104
    With the rapid development of mobile payment, the traditional centralized databases used by core financial businesses face challenges in performance and availability. For this reason, some researchers have proposed distributed database solutions based on an architecture that separates computing from storage, building a physically decentralized but logically centralized distributed database management system over the network. A global index is an important means for a distributed database to improve query efficiency, but current global indexes are mainly maintained through synchronization mechanisms over traditional networks. Under the typical transaction and batch-import scenarios of financial data management, this mechanism faces urgent problems such as a small number of indexes per table, reduced throughput, and jittery transaction response times. Therefore, an asynchronous global index architecture for distributed databases is proposed. It adds an MQ message queue and a distributed cache, and uses an RDMA network to realize an asynchronous global index that meets the demands of real financial scenarios. Comparative experiments with Oracle and CockroachDB show that performance improves by more than 60% over existing methods, and the demand for system resources is reduced by more than 20%, in the batch import and transaction processing of financial core business data.
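    The asynchronous maintenance pattern described above can be illustrated with a toy model: base-table writes enqueue index updates into a message queue, and the secondary index catches up later when a consumer drains the queue. This is a minimal sketch under assumed names (AsyncGlobalIndex, drain); the real system uses MQ, a distributed cache, and RDMA rather than an in-process queue.

```python
from queue import Queue

class AsyncGlobalIndex:
    """Base-table writes return immediately; index maintenance is deferred
    to a message queue (here drained explicitly, for illustration, instead
    of by a background consumer)."""
    def __init__(self):
        self.base = {}          # primary key -> row
        self.index = {}         # secondary key -> set of primary keys
        self.mq = Queue()

    def write(self, pk, row, indexed_col):
        self.base[pk] = row
        self.mq.put((row[indexed_col], pk))   # cheap enqueue, no index I/O

    def drain(self):
        while not self.mq.empty():
            sk, pk = self.mq.get()
            self.index.setdefault(sk, set()).add(pk)

idx = AsyncGlobalIndex()
idx.write(1, {"acct": "A", "amt": 10}, "acct")
idx.write(2, {"acct": "A", "amt": 20}, "acct")
stale = idx.index.get("A", set())   # the index may lag behind the base table
idx.drain()
fresh = idx.index["A"]
```

    The trade-off is visible in the sketch: writes never wait for index I/O, but an index read between `write` and `drain` can observe stale results.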
    Evaluation for Instructional Interaction Using Bipartite Network Representation Learning
    WANG Xuecen, ZHANG Yu, ZHAO Changkuan, CHEN Mo, YU Ge
    Journal of Frontiers of Computer Science and Technology    2023, 17 (6): 1463-1472.   DOI: 10.3778/j.issn.1673-9418.2109109
    With the development of “Internet plus Education”, online education has become an important teaching mode. Research shows that interaction in online education provides effective help for learners, and evaluating that interaction is key to achieving high-quality online learning. The interaction between learners and learning resources in online education forms a bipartite interaction network, and network representation learning is a powerful tool for network modeling and prediction. Based on the above analysis, an evaluation algorithm based on bipartite interaction network representation learning (EABINRL) is proposed. The algorithm combines the topological structure of the bipartite interaction network with the interaction information between nodes, and learns low-dimensional vector representations of the two node types by modeling both explicit and implicit interaction behavior, applying different weights to different interaction types. The model is then further optimized, and the interaction evaluation results are obtained through a Frobenius-norm calculation. Learner state prediction experiments conducted on real public datasets demonstrate the effectiveness of the algorithm.
    Sequence Recommendation with Dual Channel Heterogeneous Graph Neural Network
    WU Jinchen, YANG Xingyao, YU Jiong, LI Ziyang, HUANG Shanhang, SUN Xinjie
    Journal of Frontiers of Computer Science and Technology    2023, 17 (6): 1473-1486.   DOI: 10.3778/j.issn.1673-9418.2205053
    A recommendation system based on user behavior sequences aims to predict the user's next click from the order of the most recent sequence. Current research generally understands user preferences from item transitions within the behavior sequence, but other valid information in the sequence, such as the user profile, is ignored, so models fail to capture users' specific preferences. This paper proposes a user behavior sequence recommendation with a dual-channel heterogeneous graph neural network (DC-HetGNN). The method uses a heterogeneous graph neural network channel and a heterogeneous graph line channel to learn behavior sequence embeddings and capture users' specific preferences. DC-HetGNN constructs heterogeneous graphs containing various node types from behavior sequences, capturing dependencies among items, users, and sequences. The two channels then capture the complex transitions of items and the interactions between sequences, learning item embeddings that incorporate user information. Finally, considering the influence of users' long-term and short-term preferences, local and global sequence embeddings are combined through an attention network to obtain the final sequence embedding. Extensive experiments on Diginetica and Tmall, two real e-commerce user behavior sequence datasets, show that DC-HetGNN improves mean reciprocal rank (MRR) and Recall by 2.08% and 0.78% on average over the recent model FGNN, and MRR@n and Recall@n by 2.70% and 0.49% over the recent model TGSRec.
    RDMA Optimization Technology for Two-Phase Locking Concurrency Control
    LI Jingyao, ZHANG Qian, ZHAO Zhanhao, LU Wei, ZHANG Xiao, DU Xiaoyong
    Journal of Frontiers of Computer Science and Technology    2023, 17 (5): 1201-1209.   DOI: 10.3778/j.issn.1673-9418.2107032
    Performance optimization of distributed transactions is one of the hottest topics in academia and industry. Concurrency control based on two-phase locking guarantees the correctness of concurrent transaction scheduling, and is therefore widely implemented in mainstream commercial and open-source distributed databases. However, existing work shows that the performance bottleneck of distributed transaction processing built on the traditional TCP/IP protocol and a share-nothing architecture comes from low CPU utilization at the transaction scheduler and high network latency between the scheduler and storage nodes. To address these two problems, an optimization of two-phase locking (2PL) concurrency control based on the new RDMA (remote direct memory access) hardware is proposed, which improves distributed transaction performance by exploiting RDMA's high bandwidth, low latency, and kernel bypass (eliminating the CPU overhead of the TCP/IP protocol stack). The main technical contributions of this paper are the rewriting and optimization of network communication operators based on RDMA, and the atomicity guarantee when acquiring and releasing read-write locks with one-sided RDMA operations. Experimental results on the YCSB benchmark show that the one-sided mutex lock algorithm and the one-sided mutex/shared lock algorithm have relative advantages under low- and high-contention workloads respectively, and that RDMA-based 2PL concurrency control protocols achieve up to 5.3x and 10.6x performance gains for NO WAIT and WAIT DIE respectively under high-contention workloads.
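    The NO WAIT and WAIT DIE policies compared in the experiments differ only in how a lock conflict is resolved, which a small lock-table sketch can make concrete. This is a textbook illustration with exclusive locks only; it does not model the paper's one-sided RDMA lock operations.

```python
class LockTable:
    """Exclusive locks only, for brevity. NO WAIT aborts the requester on
    any conflict; WAIT DIE lets the requester wait only if it is older
    than the holder (smaller timestamp = older transaction), which makes
    deadlock impossible because waits only go from old to young."""
    def __init__(self, policy):
        assert policy in ("NO_WAIT", "WAIT_DIE")
        self.policy = policy
        self.holders = {}  # lock key -> timestamp of holding transaction

    def acquire(self, key, ts):
        holder = self.holders.get(key)
        if holder is None:
            self.holders[key] = ts
            return "GRANTED"
        if self.policy == "NO_WAIT":
            return "ABORT"
        # WAIT_DIE: older requester waits, younger requester dies
        return "WAIT" if ts < holder else "ABORT"

    def release(self, key):
        self.holders.pop(key, None)

nw = LockTable("NO_WAIT")
nw.acquire("x", ts=5)
r1 = nw.acquire("x", ts=3)   # any conflict aborts under NO WAIT

wd = LockTable("WAIT_DIE")
wd.acquire("x", ts=5)
r2 = wd.acquire("x", ts=3)   # older than holder -> allowed to wait
r3 = wd.acquire("x", ts=9)   # younger than holder -> dies
```

    Under high contention, NO WAIT retries aggressively while WAIT DIE queues some requests, which is consistent with the two protocols benefiting differently from the RDMA optimizations.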
    SQL-Detector: SQL Plagiarism Detection Technique Based on Coding Features
    XU Jia, MO Xiaokun, YU Ge, LYU Pin, WEI Tingting
    Journal of Frontiers of Computer Science and Technology    2022, 16 (9): 2030-2040.   DOI: 10.3778/j.issn.1673-9418.2103011

    Mastering structured query language (SQL) is key to learning database technology. However, teaching practice shows that some students plagiarize when doing SQL exercises. Existing SQL plagiarism detection techniques either detect plagiarized submissions simply by matching the similarity of students' SQL submissions, or identify plagiarism by analyzing the simple coding features displayed in the submitted SQL code; neither makes good use of the rich coding features students exhibit when writing SQL, which limits detection accuracy. In view of this, this paper proposes an SQL plagiarism detection technique based on students' coding features, named SQL-Detector. SQL-Detector first extracts both exercise-specific coding features for particular SQL exercises and generalized coding features based on each student's coding habits, so as to profile the students. It then identifies plagiarism groups by clustering the exercise coding features of all students. Finally, it determines copiers and givers by comparing the consistency between a student's exercise generalized coding features and his or her historical generalized coding features. Experimental evaluation on a dataset collected from real classroom practice shows that the plagiarism detection accuracy of SQL-Detector is on average 14.0% higher than that of the state-of-the-art technique.
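    The notion of "coding features" can be illustrated with a toy extractor that profiles stylistic habits (keyword casing, alias usage, comma spacing) and compares two submissions. The features and the similarity formula below are invented for illustration and are much simpler than SQL-Detector's.

```python
import re

def coding_features(sql):
    """A toy feature vector of stylistic habits in one SQL submission."""
    tokens = sql.split()
    kw = [t for t in tokens
          if t.upper() in {"SELECT", "FROM", "WHERE", "JOIN", "GROUP", "ORDER", "BY"}]
    upper_ratio = sum(t.isupper() for t in kw) / len(kw) if kw else 0.0
    return {
        "kw_upper_ratio": upper_ratio,                                  # keyword casing habit
        "uses_as_alias": 1.0 if re.search(r"\bAS\b", sql, re.IGNORECASE) else 0.0,
        "space_after_comma": 1.0 if ", " in sql else 0.0,               # spacing habit
    }

def similarity(f1, f2):
    """1 minus the mean absolute difference over the feature keys."""
    keys = f1.keys()
    return 1.0 - sum(abs(f1[k] - f2[k]) for k in keys) / len(keys)

a = coding_features("SELECT name, age FROM users AS u WHERE age > 18")
b = coding_features("SELECT name, age FROM users AS u WHERE age > 20")
c = coding_features("select id from t where id>0")
sim_ab = similarity(a, b)   # same habits, near-identical queries
sim_ac = similarity(a, c)   # different habits
```

    The key point such features capture is that two submissions can match in habits even when their literal text differs, and vice versa.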

    Optimized Number of Reverse Neighbor Clustering Algorithm by Voronoi Diagram in Obstacle Space
    HE Yunbin, LIU Wanxu, WAN Jing
    Journal of Frontiers of Computer Science and Technology    2022, 16 (9): 2041-2049.   DOI: 10.3778/j.issn.1673-9418.2102013

    To solve the problem that existing obstacle-space clustering algorithms require manually selecting cluster centers and setting threshold values, an OBRK-means (obstacle based on nearest K-means) clustering algorithm based on the Voronoi diagram is proposed. The algorithm is analyzed from three aspects: selection of cluster centers, selection of outliers, and the generalized covering circle. Firstly, the Voronoi diagram is introduced to compute reverse nearest neighbor counts and determine the cluster centers. Secondly, the Voronoi diagram and the density of sample points are used to screen and prune outliers in the dataset. Finally, the generalized covering circle is introduced for initial clustering, and inner and outer boundary points are proposed to correct the inaccuracy of the initial clustering results; exclusion points and expansion points are computed from the inner and outer boundary points respectively to improve clustering accuracy. Theoretical analysis and experimental results show that the algorithm processes data in obstacle space more efficiently and produces better clustering results.
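    The reverse-nearest-neighbor count used to pick cluster centers can be computed brute-force for intuition (the paper obtains it efficiently via the Voronoi diagram): a point that many other points name as their nearest neighbor is a dense, central point.

```python
def nearest_neighbor(points, i):
    """Index of the nearest other point to points[i] (Euclidean, brute force)."""
    best, best_d = None, float("inf")
    for j, q in enumerate(points):
        if j == i:
            continue
        d = (points[i][0] - q[0]) ** 2 + (points[i][1] - q[1]) ** 2
        if d < best_d:
            best, best_d = j, d
    return best

def reverse_nn_counts(points):
    """counts[j] = number of points whose nearest neighbor is points[j];
    high counts mark natural cluster-center candidates."""
    counts = [0] * len(points)
    for i in range(len(points)):
        counts[nearest_neighbor(points, i)] += 1
    return counts

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
counts = reverse_nn_counts(pts)
center = counts.index(max(counts))   # most "pointed-at" point
```

    The brute-force version is O(n^2); the Voronoi diagram restricts each nearest-neighbor lookup to adjacent cells, which is where the paper's efficiency gain comes from.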

    Improved Parallel Random Forest Algorithm Combining Information Theory and Norm
    MAO Yimin, GENG Junhao
    Journal of Frontiers of Computer Science and Technology    2022, 16 (5): 1064-1075.   DOI: 10.3778/j.issn.1673-9418.2010064

    Aiming at the problems of excessive redundant and irrelevant features, low information content of training features, and low parallelization efficiency in MapReduce-based random forest algorithms for big data, this paper proposes a parallel random forest algorithm based on information theory and norms (PRFITN). Firstly, the algorithm designs a DRIGFN (dimension reduction based on information gain and Frobenius norm) strategy to reduce the number of redundant and irrelevant features. Secondly, a feature grouping strategy based on information theory (FGSIT) is proposed; features are grouped according to FGSIT and stratified sampling is adopted to guarantee the information content of the training features when constructing decision trees in the random forest, improving the accuracy of the classification results. Finally, to improve the parallel efficiency of the cluster, a redistribution of key-value pairs (RSKP) is presented to achieve a fast, uniform distribution of key-value pairs and obtain the global classification result. Experimental results show that the algorithm achieves better classification in big data environments, especially for datasets with many features.
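    The information-gain ranking underlying a dimension-reduction strategy like DRIGFN can be shown on a toy example: a feature that perfectly predicts the label gets gain equal to the label entropy, while a feature independent of the label gets zero. This sketch covers only the information-gain part, not the Frobenius-norm component.

```python
import math

def entropy(labels):
    """Shannon entropy of a label list, in bits."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def information_gain(feature_values, labels):
    """Label entropy minus the weighted entropy after splitting on each
    distinct feature value; higher gain = more informative feature."""
    total = entropy(labels)
    n = len(labels)
    splits = {}
    for v, y in zip(feature_values, labels):
        splits.setdefault(v, []).append(y)
    return total - sum(len(s) / n * entropy(s) for s in splits.values())

labels = [1, 1, 0, 0]
informative = [0, 0, 1, 1]   # perfectly predicts the label
noise = [0, 1, 0, 1]         # independent of the label
g_informative = information_gain(informative, labels)
g_noise = information_gain(noise, labels)
```

    Ranking features by such a score and dropping the low-gain tail is the standard way to remove irrelevant features before training.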

    Top-k Average Utility Co-location Pattern Mining of Fuzzy Features
    LI Jinhong, WANG Lizhen, ZHOU Lihua
    Journal of Frontiers of Computer Science and Technology    2022, 16 (5): 1053-1063.   DOI: 10.3778/j.issn.1673-9418.2011003

    A spatial co-location pattern is a subset of non-empty spatial features whose instances are frequently located together in spatial neighborhoods. Researchers have studied top-k spatial co-location pattern mining for deterministic and uncertain data, but top-k average utility co-location pattern mining for fuzzy features has not been investigated. Therefore, this paper proposes top-k average utility co-location pattern mining for fuzzy features. Firstly, the relevant concepts of top-k average utility co-location patterns of fuzzy features are defined, and the downward closure property of the pattern's extended fuzzy average utility is analyzed. Secondly, an algorithm for mining top-k average utility co-location patterns based on the extended fuzzy average utility value is designed, solving the problem that the fuzzy average utility itself does not satisfy downward closure. Thirdly, a pruning method based on a locally extended fuzzy average utility is proposed, which effectively reduces the search space of top-k average utility co-location pattern mining and further improves the efficiency of the mining algorithm. Finally, the practicability, efficiency, and robustness of the proposed algorithm are verified on real and synthetic datasets.

    Research on User Similarity Calculation of Collaborative Filtering for Sparse Data
    WU Sen, DONG Yaxian, WEI Guiying, GAO Xiaonan
    Journal of Frontiers of Computer Science and Technology    2022, 16 (5): 1043-1052.   DOI: 10.3778/j.issn.1673-9418.2011062

    User-based collaborative filtering makes recommendations for a target user based on the preferences of the user's nearest neighbors, so how user similarity is calculated is critical. Traditional rating similarity relies on the scores of commonly rated items; as the user-item rating matrix grows sparser, it becomes difficult to measure similarity between users accurately, and thus to select reliable nearest neighbors for the target user, which degrades the final recommendation performance. Structural similarity, another commonly used similarity measure in recommendation, is mostly computed from the proportion of users' commonly rated items; it is easy to calculate and less affected by data sparsity, but its outputs are usually close to each other, so different user pairs cannot be clearly distinguished. To solve the similarity calculation difficulty for collaborative filtering caused by data sparsity, a sparse cosine similarity is proposed in this paper. Firstly, a new structural similarity, the sparse set similarity, is formulated to divide users into two groups, high-correlation users and low-correlation users. Then, different rating similarity calculation methods are developed for the two kinds of users, eliminating the misleading results that traditional rating similarity produces when the data is sparse. Finally, the sparse cosine similarity is constructed by combining the proposed rating similarity and structural similarity. Experimental results show that, compared with seven similarity calculation methods, the proposed sparse cosine similarity yields more accurate user similarities and improves recommendation performance, overcoming both the severe sensitivity of traditional rating methods to data sparsity and the indistinct results produced by structural methods.
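    The combination of structural and rating similarity might be sketched as follows; the gating rule and the 0.5 threshold are invented for illustration and are not the paper's exact formulas. The structural term (a Jaccard over rated-item sets) decides how much the rating term can be trusted.

```python
import math

def structural_similarity(items_u, items_v):
    """Jaccard similarity over the sets of rated items."""
    union = items_u | items_v
    return len(items_u & items_v) / len(union) if union else 0.0

def rating_cosine(ratings_u, ratings_v):
    """Cosine similarity computed over co-rated items only."""
    common = ratings_u.keys() & ratings_v.keys()
    if not common:
        return 0.0
    num = sum(ratings_u[i] * ratings_v[i] for i in common)
    den = math.sqrt(sum(ratings_u[i] ** 2 for i in common)) * \
          math.sqrt(sum(ratings_v[i] ** 2 for i in common))
    return num / den

def sparse_cosine(ratings_u, ratings_v, threshold=0.5):
    """Low-correlation pairs (few shared items) get their rating term
    damped by the structural term, so a perfect cosine over a single
    shared item no longer dominates."""
    s = structural_similarity(set(ratings_u), set(ratings_v))
    r = rating_cosine(ratings_u, ratings_v)
    return s * r if s < threshold else r

u = {"i1": 5, "i2": 3, "i3": 4}
v = {"i1": 4, "i2": 3, "i3": 5}   # three shared items, similar tastes
w = {"i9": 5, "i1": 1}            # one shared item, misleading cosine of 1.0
high = sparse_cosine(u, v)
low = sparse_cosine(u, w)
```

    Without the structural gate, `rating_cosine(u, w)` would be 1.0 from a single co-rated item, exactly the sparse-data artifact the paper targets.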

    Workload Characterization of Online and Offline Services in Co-located Data Centers
    CHEN Shenglei, QIU Yitao, JIANG Congfeng, ZHANG Jilin, YU Jun, LIN Jiangbin, YAN Longchuan, REN Zujie, WAN Jian
    Journal of Frontiers of Computer Science and Technology    2022, 16 (4): 822-834.   DOI: 10.3778/j.issn.1673-9418.2009098

    To reduce cost and energy consumption and improve resource utilization, many cloud data centers now co-locate online services with offline batch workloads. Although co-location brings many benefits to the data center, it adds complexity to task scheduling and raises challenges such as high reliability and low latency. This paper delves into the operation of all online services and offline batch workloads on a 4034-server Alibaba data center cluster over a period of 8 days. The data analysis supports the following conclusions. Firstly, from the perspective of online services, the average CPU utilization of all containers varies cyclically: it stays high from 8:00 a.m. to 9:00 p.m. each day and falls to its lowest point at 4:00 a.m. Secondly, for offline tasks, except on the first and eighth days, task submission peaks are concentrated at the same time each day; 95% of instances finish within 199 s, but 0.052% run for more than an hour or even several days. Thirdly, the number of containers deployed per application varies widely: one application uses as many as 629 containers, another as few as 1. Finally, cluster analysis of servers, online tasks, and batch instances shows that containers with relatively high resource utilization account for the vast majority of all containers, while instances with low resource utilization and short execution times account for the vast majority of all instances. The findings and recommendations in this paper can help data center managers understand the characteristics of co-located workloads in more detail, thereby improving resource utilization and fault tolerance for each task.

    Mining Spatial Prevalent Co-location Patterns Based on Graph Databases
    HU Zisong, WANG Lizhen, Vanha Tran, ZHOU Lihua
    Journal of Frontiers of Computer Science and Technology    2022, 16 (4): 806-821.   DOI: 10.3778/j.issn.1673-9418.2010015

    A spatial prevalent co-location pattern (SPCP) is a subset of spatial features whose instances frequently appear together in geographic space. Memory-based materialization of neighbor relationships is an efficient way to search for pattern instances, but it stores instance information repeatedly. Graph database technology can efficiently model data with complex associations, so it is natural to consider materializing neighbor relationships in a graph database (i.e., constructing the neighborhood graph); however, directly transplanting existing mining methods cannot exploit the advantages of graph traversal. To solve this problem, this paper explores a graph database-based approach to mining spatial prevalent co-location patterns. Firstly, the graph database is used to model spatial instances and their neighbor relations: instances and relations are stored in the graph database to construct the neighborhood graph. Then, a basic algorithm called subgraph (or clique) search is designed over the graph database; it uses a clique-search strategy to generate a pattern's table instance and obtain its participating instances, avoiding the inefficient combination and join operations of traditional methods. Because collecting participating instances by generating table instances is inefficient, a participating-instance verification algorithm is further designed, with filtering and verification phases. The filtering phase checks whether the features in a center instance's neighborhood fully contain the features of the pattern, and the verification phase checks whether a pattern instance containing the center instance exists. The verification algorithm confirms as many participating instances as possible in each pass, effectively reducing the search space and the number of clique searches. In addition, the correctness and completeness of the proposed algorithms are proven. Finally, extensive experiments on real and synthetic datasets verify the efficiency and effectiveness of the proposed algorithms.

    Groups Nearest Neighbor Query of Mixed Data in Spatial Database
    JIANG Yiying, ZHANG Liping, JIN Feihu, HAO Xiaohong
    Journal of Frontiers of Computer Science and Technology    2022, 16 (2): 348-358.   DOI: 10.3778/j.issn.1673-9418.2009032

    Existing group nearest neighbor query methods mainly abstract spatial data objects as points or line segments. In real applications, however, such simple abstraction often hurts the accuracy and efficiency of the query. To address the inability of existing methods to handle group nearest neighbor queries over mixed data directly and effectively, a group nearest neighbor query method for mixed data in spatial databases is proposed in this paper. Firstly, the concept and properties of the mixed-data Voronoi diagram are proposed, and the mixed dataset is pruned based on it; pruning algorithms are given both for the case of a single query object and for the case of multiple query objects. The proposed pruning effectively removes data objects that cannot appear in the result, yielding a candidate set. In the refinement step, a distance calculation method appropriate to the positional relationship between data objects is given, and the correct query result is obtained by comparing, for each candidate, the sum of its distances to the query objects. Theoretical analysis and experiments show that the proposed algorithm handles the group nearest neighbor query problem over mixed data accurately and effectively.
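    In the simplest point-only case, the refinement step reduces to an aggregate nearest neighbor search: pick the data object minimizing the summed distance to all query objects. A brute-force sketch (the paper prunes candidates with the mixed-data Voronoi diagram first, and handles non-point objects with specialized distance functions):

```python
def group_nearest_neighbor(data_points, query_points):
    """Return the data point minimizing the sum of Euclidean distances
    to all query points, plus that minimum cost (brute force)."""
    def dist(p, q):
        return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5

    best, best_cost = None, float("inf")
    for p in data_points:
        cost = sum(dist(p, q) for q in query_points)
        if cost < best_cost:
            best, best_cost = p, cost
    return best, best_cost

data = [(0, 0), (5, 5), (10, 0)]
queries = [(4, 4), (6, 6)]
winner, cost = group_nearest_neighbor(data, queries)
```

    The pruning step exists precisely because this exhaustive scan over all data objects does not scale; the Voronoi-based candidate set shrinks the loop to a few objects.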

    Optimization for Large-Scale Dimension Table Connection Technology in Distributed Environment
    ZHAO Hengtai, ZHAO Yuhai, YUAN Ye, JI Hangxu, QIAO Baiyou, WANG Guoren
    Journal of Frontiers of Computer Science and Technology    2022, 16 (2): 337-347.   DOI: 10.3778/j.issn.1673-9418.2009100

    Large-scale dimension table join in a distributed environment is a key technology in online big data analysis, widely used in real-time recommendation, real-time analytics, and other fields. A dimension table join connects stream data with dimension tables stored offline for further processing. Firstly, this paper studies existing dimension table join technology and surveys the design of related optimizations and mainstream distributed engines. The traditional way of improving performance is to optimize dimension table data lookup, but such optimization is limited by the scale of the dimension table and the data stream rate. Secondly, addressing how poorly existing optimizations exploit the whole cluster in a distributed environment, this paper puts forward a computing model suited to hybrid processing of offline batch data and real-time stream data, and proposes a dimension-table-associated data cache: dimension table data is read on a single node, then partitioned, distributed, and processed. The computing logic of the dimension table join is also optimized so that much larger dimension tables can be handled, overcoming the data connection limitation. Finally, both the proposed and the traditional dimension table join technology are implemented in Apache Flink, and the optimization for dimension table joins in distributed stream computing is verified via experiments comparing throughput and latency on a dataset from Alibaba Group's Double 11 Shopping Carnival.
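    The cache idea behind a stream-to-dimension-table join can be sketched with a single-node toy: each stream record is enriched from a local cache that falls back to a (here, simulated) remote lookup on a miss. Function and field names are hypothetical; the paper distributes cache segments across nodes rather than keeping one dict.

```python
def stream_dimension_join(stream, dim_cache, lookup):
    """Enrich each stream record with its dimension row, fetching through
    `lookup` only on a cache miss."""
    out = []
    for rec in stream:
        key = rec["dim_key"]
        if key not in dim_cache:
            dim_cache[key] = lookup(key)   # cache miss: fetch once, keep
        out.append({**rec, **dim_cache[key]})
    return out

calls = []
def fake_lookup(key):
    calls.append(key)                      # counts simulated remote fetches
    return {"name": f"item-{key}"}

cache = {}
stream = [{"dim_key": 1}, {"dim_key": 2}, {"dim_key": 1}]
joined = stream_dimension_join(stream, cache, fake_lookup)
```

    Three stream records trigger only two lookups; with skewed real-world streams the hit rate, and hence the saved dimension-table I/O, is far higher.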

    Hybrid Parallel Frequent Itemsets Mining Algorithm by Using N-List Structure
    LIU Weiming, ZHANG Chi, MAO Yimin
    Journal of Frontiers of Computer Science and Technology    2022, 16 (1): 120-136.   DOI: 10.3778/j.issn.1673-9418.2008068

    Aiming at the problems of load imbalance, inefficient N-list merging, and redundant search at each node in the parallel frequent itemset mining algorithm MRPrePost (a parallel PrePost algorithm based on MapReduce), this paper proposes a hybrid parallel frequent itemset mining algorithm based on the N-list structure (HP-FIMBN). Firstly, a load estimation function (LE) is designed to calculate the load of each item in the F-list, and a grouping method based on a greedy strategy (GM-GS) is proposed, which both mitigates load imbalance during data partitioning and decreases the size of the sub-PPC-tree on each local node. Secondly, to speed up the merging of two N-list structures, an early abandon strategy (EAS) is designed, which efficiently avoids invalid computation during the merging process and does not need to traverse the initial N-list structures to obtain the final result. Finally, with the set-enumeration tree as the search space, a superset equivalence strategy (SES) is proposed to avoid redundant search during mining, after which the final mining results are generated. Experimental results show that the improved algorithm performs better at mining frequent itemsets in big data environments.
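    An early abandon strategy for N-list merging can be illustrated as follows. Each N-list entry is a (pre-order, post-order, count) triple from the PPC-tree, and one node subsumes another iff it is its ancestor. The sketch stops merging as soon as the achievable support falls below the threshold; it is a simplified illustration of the idea, not the paper's exact EAS.

```python
def merge_nlists(nl1, nl2, min_support):
    """Each entry is (pre_order, post_order, count). Node n1 is an ancestor
    of n2 on the PPC-tree iff n1.pre < n2.pre and n1.post > n2.post.
    Early abandon: once current support plus the count still reachable
    from the unscanned entries cannot reach min_support, stop."""
    merged = []
    remaining = sum(c for _, _, c in nl2)
    support = 0
    for pre2, post2, cnt2 in nl2:
        for pre1, post1, _ in nl1:
            if pre1 < pre2 and post1 > post2:      # ancestor found
                merged.append((pre1, post1, cnt2))
                support += cnt2
                break
        remaining -= cnt2
        if support + remaining < min_support:       # early abandon
            return None, support
    return (merged if support >= min_support else None), support

# Toy PPC-tree fragment: node (1,9) with count 4 is an ancestor of both
# nodes in the second N-list.
ok1, s1 = merge_nlists([(1, 9, 4)], [(2, 3, 2), (4, 5, 1)], min_support=3)
ok2, s2 = merge_nlists([(1, 9, 4)], [(2, 3, 2), (4, 5, 1)], min_support=4)
```

    With `min_support=4` the merge is abandoned after the first entry, since at most 3 can still be reached; this is the invalid computation EAS is designed to skip.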

    Optimization Method of Projection and Order for Multiple Tables Join
    ZONG Fengbo, ZHAO Yuhai, WANG Guoren, JI Hangxu
    Journal of Frontiers of Computer Science and Technology    2022, 16 (1): 106-119.   DOI: 10.3778/j.issn.1673-9418.2009099

    Joining multiple tables is a common operation in big data processing. As with joins in database systems, the order of a multi-table join greatly affects the consumption of computing and transmission resources. Optimizing the join order of multiple tables is a classical optimization problem, and the size of each table's projection at each join also affects the volume of data transmitted between nodes; thus the overall join order and the projection kept at each join together have a significant impact on join efficiency. Traditional optimization strategies, however, often ignore the choice of intermediate projections and their influence on the optimal join strategy. To solve this problem, this paper establishes a connection-relation index that can adjust the projection of each join while constructing the optimized join strategy, deleting redundant columns promptly to reduce the consumption of transmission resources. At the same time, an optimization strategy that adjusts the join order based on the projection relation reduces the consumption of transmission and computing resources as much as possible. The optimization strategy is implemented and tested in the Flink system, and the results show a significant optimization effect.

    Correlation-Based Method for Tracing Multi-dimensional Time Series Data Anomalies
    WANG Muxian, DING Xiaoou, WANG Hongzhi, LI Jianzhong
    Journal of Frontiers of Computer Science and Technology    2021, 15 (11): 2142-2150.   DOI: 10.3778/j.issn.1673-9418.2008100

    This paper proposes a multi-dimensional time series anomaly detection method based on correlation analysis that traces the cause of detected anomalies: data reflecting system failures and data reflecting sensor quality problems are distinguished, and genuine system failures are then identified to avoid false detection. Firstly, a time series correlation graph model is proposed, which is further summarized as a time series correlation loop model. A time series correlation set is obtained by extracting features from the correlation loops, the cause of each anomaly is detected, and system failures are judged from the result. Extensive experiments on real industrial datasets verify the effectiveness of the method in tracing anomaly sources in high-dimensional time series data. Comparative experiments verify that the method outperforms baseline algorithms based on statistics and machine learning models in stability and efficiency, and the higher the dimensionality of the time series, the more obvious the improvement over the baselines. The method not only saves cost but also accurately identifies multi-dimensional abnormal data.
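    The core signal, historically correlated dimensions whose correlation collapses in a new window, can be sketched with plain Pearson correlation. The classification rule below (a dimension that loses all of its correlated partners is a sensor-quality suspect, since a genuine system failure would typically shift correlated dimensions together) is a simplified stand-in for the paper's correlation-loop model.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0

def broken_correlations(reference, window, threshold=0.5):
    """Flag dimensions whose correlation with EVERY historically
    correlated partner collapses in the new window."""
    dims = list(reference)
    suspects = []
    for d in dims:
        partners = [e for e in dims
                    if e != d and abs(pearson(reference[d], reference[e])) > threshold]
        if partners and all(abs(pearson(window[d], window[e])) < threshold
                            for e in partners):
            suspects.append(d)
    return suspects

reference = {"s1": [1, 2, 3, 4, 5], "s2": [2, 4, 6, 8, 10], "s3": [5, 4, 3, 2, 1]}
window    = {"s1": [1, 2, 3, 4, 5], "s2": [2, 4, 6, 8, 10], "s3": [3, 3, 9, 1, 3]}
suspects = broken_correlations(reference, window)
```

    Here s3 was strongly anti-correlated with s1 and s2 historically but decorrelates in the new window while its partners stay in sync, so s3 alone is flagged.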

    Geo-Socially Tenuous Group Query
    LI Na, ZHU Huaijie, LIU Wei, YIN Jian
    Journal of Frontiers of Computer Science and Technology    2021, 15 (11): 2151-2160.   DOI: 10.3778/j.issn.1673-9418.2008099

    Compared with dense groups/subgraphs, there are few studies on tenuous groups. Although existing work has begun to study tenuous group queries, the geo-socially tenuous group query has not been studied, while location-based services have many demands for it in real life. It is therefore valuable to study the geo-socially tenuous group query, which finds a group of users that not only satisfies a certain sparsity among users (i.e., the social distance between any two users is greater than [k]), but also minimizes the distance between the users and the query location. To address this problem, this paper first proposes a basic processing algorithm based on c-neighbors (baseline), which uses stored c-neighbor information and distance pruning to obtain query results quickly. However, the baseline algorithm uses too much space, and its query efficiency is low when [k>c]. To solve these problems, a query optimization algorithm based on c-neighbors and reverse c-neighbors (ICN) is proposed, which utilizes both stored c-neighbor information and reverse c-neighbor information to effectively filter out invalid users and obtain query results quickly. Theoretical analysis and experimental results show that the two proposed query processing methods are correct and effective.
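
    The query semantics can be made concrete with an exhaustive reference implementation. This is not the paper's baseline or ICN algorithm, merely a brute-force sketch of the problem definition (pairwise social distance greater than k, minimum total spatial distance to the query point); the graph and coordinates are invented:

```python
import math
from collections import deque
from itertools import combinations

def social_distances(adj, src):
    """BFS hop distances from src in the social graph."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def tenuous_group(adj, loc, q_loc, m, k):
    """Among all size-m user sets whose pairwise social distance exceeds k
    (unreachable pairs count as infinitely distant), return the set with
    minimum total Euclidean distance to the query location.
    Exponential in m; for illustration only."""
    users = list(adj)
    dists = {u: social_distances(adj, u) for u in users}
    best, best_cost = None, math.inf
    for group in combinations(users, m):
        if all(dists[u].get(v, math.inf) > k
               for u, v in combinations(group, 2)):
            cost = sum(math.dist(loc[u], q_loc) for u in group)
            if cost < best_cost:
                best, best_cost = group, cost
    return best, best_cost
```

    Enumerating all size-m subsets is clearly infeasible at scale, which is why the paper precomputes c-neighbor (and, in ICN, reverse c-neighbor) information to prune socially close users before any group is formed.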

    Optimized Density Peak Clustering Algorithm by Natural Reverse Nearest Neighbor
    LIU Juan, WAN Jing
    Journal of Frontiers of Computer Science and Technology    2021, 15 (10): 1888-1899.   DOI: 10.3778/j.issn.1673-9418.2007017

    The density peak clustering algorithm is a density-based clustering algorithm. Its shortcomings are sensitivity to parameters and poor clustering results on complex manifold data sets. This paper proposes a novel density peak clustering algorithm based on the natural reverse nearest neighbor structure. First of all, reverse nearest neighbors are introduced to calculate the local density of data objects. Then, the initial cluster centers are selected by combining representative points and density. Furthermore, the density-adaptive distance is used to measure the distance between initial cluster centers; a decision graph is constructed over the initial cluster centers using the local density computed from reverse nearest neighbors and the density-adaptive distance, and the final cluster centers are selected according to the decision graph. Finally, each remaining data object is assigned to the cluster of its nearest initial cluster center. Experimental results show that, compared with the comparison algorithms, the proposed algorithm achieves better clustering effect and accuracy on synthetic data sets and UCI real data sets, and has greater advantages in dealing with complex manifold data sets.
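
    The first step, replacing density peak clustering's cutoff-radius density with a reverse nearest neighbor count, can be sketched as below. This shows only the density estimate, not the full algorithm (center selection and density-adaptive distance are omitted), and `k` is an illustrative parameter rather than the paper's natural-neighbor construction:

```python
import numpy as np

def reverse_knn_density(X, k=5):
    """Local density of each point as its reverse k-NN count: the number
    of other points that list it among their k nearest neighbors."""
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)              # a point is not its own neighbor
    knn = np.argsort(d, axis=1)[:, :k]       # k nearest neighbors per point
    density = np.zeros(n, dtype=int)
    for i in range(n):
        for j in knn[i]:
            density[j] += 1                  # i "votes" for each of its neighbors
    return density
```

    Points deep inside a cluster are named as a neighbor by many others and accumulate high counts, while outliers receive few or none, so the estimate adapts to local structure instead of depending on a hand-tuned cutoff radius.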

    Research on Increased Data Repair with Confidence Value Token
    HUANG Hui, LI Hailin
    Journal of Frontiers of Computer Science and Technology    2021, 15 (10): 1900-1911.   DOI: 10.3778/j.issn.1673-9418.2006079

    In the era of big data, data contain great value and have become an important strategic resource in today's information society. However, a large number of inconsistent data arise during data update and management, causing unpredictable side effects for enterprises. There are three kinds of repair methods based on functional dependencies. The first two strongly rely on master data or the confidence values of given tuples provided by enterprises, which are hard to obtain in real applications, while the third, based on the minimal-deletion principle, causes loss of information. Moreover, when resolving violations of [X→Y], existing methods only support modifying the Y attribute. In view of these shortcomings, and for the situation where tuple confidence is missing, this paper proposes an incremental data repair method with confidence value tokens, which consists of two parts: the first part generates confidence value tokens automatically by analyzing operation logs and knowledge rules, and the second part is an incremental repair strategy that decides whether to repair the X or Y attributes according to the confidence value tokens. Meanwhile, the target value used to repair dirty data is chosen in combination with conditional probability. Experimental results show that the proposed method has high reliability and scalability.
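
    The key distinction from prior work, letting a confidence token decide whether to repair X or Y, can be sketched as follows. This is not the paper's method: the per-cell confidence scores, the highest-total-confidence target rule, and the `None` marker for an X-side repair are all simplifying assumptions for illustration:

```python
from collections import defaultdict

def repair_fd(tuples, x_attr, y_attr, conf):
    """tuples: list of dicts; conf[(i, attr)]: confidence of that cell
    (standing in for a token derived from, e.g., operation logs).
    For each group sharing an X value but disagreeing on Y, the Y value
    with the highest summed confidence becomes the target; a violating
    tuple whose X cell is trusted gets Y repaired, while one whose X cell
    is less trusted than its Y cell gets X marked for repair instead
    (choosing the new X value would need further evidence)."""
    groups = defaultdict(list)
    for i, t in enumerate(tuples):
        groups[t[x_attr]].append(i)
    for x_val, idxs in groups.items():
        if len({tuples[i][y_attr] for i in idxs}) <= 1:
            continue                          # no violation of X -> Y here
        score = defaultdict(float)
        for i in idxs:
            score[tuples[i][y_attr]] += conf[(i, y_attr)]
        target = max(score, key=score.get)    # best-supported Y value
        for i in idxs:
            if tuples[i][y_attr] == target:
                continue
            if conf[(i, x_attr)] >= conf[(i, y_attr)]:
                tuples[i][y_attr] = target    # trust X, repair Y
            else:
                tuples[i][x_attr] = None      # distrust X: mark X for repair
    return tuples
```

    Note how the fourth path differs from Y-only repair: a tuple with a low-confidence X cell keeps its Y value and has its X cell flagged, rather than having a possibly correct Y value overwritten.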
