High Performance Computing content in this journal

    TEB: Efficient SpMV Storage Format for Matrix Decomposition and Reconstruction on GPU
    WANG Yuhua, ZHANG Yuqi, HE Junfei, XU Yuezhu, CUI Huanyu
    Journal of Frontiers of Computer Science and Technology    2024, 18 (4): 1094-1108.   DOI: 10.3778/j.issn.1673-9418.2304039
    Sparse matrix-vector multiplication (SpMV) is a crucial computing process in the fields of science and engineering. The CSR (compressed sparse row) format is one of the most commonly used storage formats for sparse matrices. When parallel SpMV is implemented on the graphics processing unit (GPU), CSR stores only the non-zero elements of the sparse matrix, avoiding the computational redundancy caused by zero-element filling and saving storage space, but it suffers from load imbalance, which wastes computing resources. To address these issues, recent storage formats with good performance are studied and a row-by-row decomposition and reorganization storage format, TEB (threshold-exchange order block), is proposed. The format first uses a heuristic threshold-selection algorithm to determine an appropriate segmentation threshold, and combines it with a reordering-based row-merging algorithm to decompose and reconstruct the sparse matrix so that the numbers of non-zero elements in different blocks are as close as possible. Furthermore, combined with CUDA (compute unified device architecture) threading, a parallel SpMV algorithm between sub-blocks based on the TEB storage format is proposed, which allocates computing resources reasonably and solves the load-imbalance problem, thus improving the parallel computing efficiency of SpMV. To verify the effectiveness of the TEB storage format, experiments are conducted on the NVIDIA Tesla V100 platform. The results show that compared with the PBC (partition-block-CSR), AMF-CSR (adaptive multi-row folding of CSR), CSR-Scalar (compressed sparse row-scalar), and CSR5 (compressed sparse row 5) storage formats, TEB improves SpMV time performance by an average of 3.23×, 5.83×, 2.33×, and 2.21×, respectively; in terms of floating-point computing performance, the average improvements are 3.36×, 5.95×, 2.29×, and 2.13×.
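    As background for the comparison above, the following is a minimal CUDA sketch of the CSR-scalar baseline (one thread per row); the array names are illustrative and this is not the proposed TEB kernel.

```cuda
// Minimal CSR-scalar SpMV sketch: one thread per matrix row.
// Illustrates the CSR baseline discussed in the abstract, not TEB;
// array names and launch configuration are assumptions.
__global__ void spmv_csr_scalar(int num_rows,
                                const int   *row_ptr,   // size num_rows + 1
                                const int   *col_idx,   // column index per non-zero
                                const float *values,    // non-zero values
                                const float *x,         // dense input vector
                                float       *y)         // output vector
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < num_rows) {
        float sum = 0.0f;
        for (int k = row_ptr[row]; k < row_ptr[row + 1]; ++k)
            sum += values[k] * x[col_idx[k]];
        y[row] = sum;   // a long row keeps its thread busy while others finish early
    }
}
```

    Rows with very different non-zero counts leave some threads idle while others keep working, which is exactly the load imbalance that TEB's threshold-based decomposition and row merging target.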
    Research on Method of Log Pattern Extracting in High-Performance Computing Environment
    WANG Xiaodong, ZHAO Yining, XIAO Haili, WANG Xiaoning, CHI Xuebin
    Journal of Frontiers of Computer Science and Technology    2022, 16 (10): 2264-2272.   DOI: 10.3778/j.issn.1673-9418.2103066

    Log analysis plays an important role in the stable operation of computer systems. However, logs are usually unstructured, which is not conducive to automatic analysis, so automatically categorizing logs and turning them into structured data is of great practical significance. In this paper, the LDmatch algorithm is proposed, a log pattern extraction algorithm based on word matching rate. Traditional log matching algorithms use one-to-one word matching when computing similarity, while the proposed LDmatch algorithm calculates the similarity between logs according to the longest common subsequence (LCS) of the words contained in two logs, and classifies logs based on the LCS. The LDmatch algorithm can also obtain and update log templates in real time. In addition, the pattern warehouse of the algorithm uses a hash-table-based data structure for storage, which refines the classification of logs and reduces the number of comparisons during log matching, thus improving the matching efficiency of the algorithm. To verify the advantages of the algorithm, it is applied to open-source data sets and actual log data generated by CNGrid, and a variety of other log pattern extraction algorithms are used for comparison. Finally, the advantages of the algorithm in accuracy, robustness and efficiency are demonstrated.
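    To make the similarity measure concrete, the host-side sketch below computes a word-level LCS between two tokenized logs and a normalized similarity; the normalization and any grouping threshold are assumptions, not the LDmatch implementation.

```cuda
// Host-side sketch of word-level LCS similarity between two tokenized logs.
// The normalization below is an assumption for illustration only.
#include <string>
#include <vector>
#include <algorithm>

static size_t lcs_length(const std::vector<std::string> &a,
                         const std::vector<std::string> &b)
{
    // Classic dynamic programming over words instead of characters.
    std::vector<std::vector<size_t>> dp(a.size() + 1,
                                        std::vector<size_t>(b.size() + 1, 0));
    for (size_t i = 1; i <= a.size(); ++i)
        for (size_t j = 1; j <= b.size(); ++j)
            dp[i][j] = (a[i - 1] == b[j - 1])
                           ? dp[i - 1][j - 1] + 1
                           : std::max(dp[i - 1][j], dp[i][j - 1]);
    return dp[a.size()][b.size()];
}

// Similarity as LCS length over the longer log; logs whose similarity
// exceeds a chosen threshold would be grouped under the same pattern.
static double log_similarity(const std::vector<std::string> &a,
                             const std::vector<std::string> &b)
{
    if (a.empty() || b.empty()) return 0.0;
    return static_cast<double>(lcs_length(a, b)) /
           static_cast<double>(std::max(a.size(), b.size()));
}
```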

    GPU-Oriented Parallel Algorithm for Histogram Statistical Image Enhancement
    XIAO Han, SUN Lupeng, LI Cailin, ZHOU Qinglei
    Journal of Frontiers of Computer Science and Technology    2022, 16 (10): 2273-2285.   DOI: 10.3778/j.issn.1673-9418.2103059

    Histogram statistics has important applications in image enhancement and target detection. However, as image sizes grow and real-time requirements rise, the histogram-statistics local enhancement algorithm becomes too slow to reach the expected speed. To address this deficiency, this paper implements parallel processing of the histogram-statistics image enhancement algorithm on the graphics processing unit (GPU) platform, which improves the processing speed for large-format digital images. Firstly, data-access efficiency is improved by making full use of compute unified device architecture (CUDA) active thread blocks and active threads to process different sub-image blocks and pixels in parallel. Then, the parallelization of the histogram-statistics image enhancement algorithm on the GPU platform is realized by optimizing kernel configuration parameters and applying data-parallel computing techniques. Finally, an efficient data transmission mode between the host and the device is adopted, which further shortens the execution time of the system on the heterogeneous computing platform. The results show that for images of different sizes, the parallel histogram-statistics algorithm is two orders of magnitude faster than the CPU serial algorithm; processing an image of size 3241×3685 takes 787.11 ms, a 261.35× speedup over the serial version. This lays a good foundation for real-time large-scale image processing.
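    For illustration, a minimal CUDA sketch of per-block histogram accumulation with shared-memory atomics is shown below; the grid-stride loop and the merge step are generic choices, not the paper's exact local-enhancement kernel.

```cuda
// Sketch of a per-block 256-bin histogram for an 8-bit grayscale image.
// Shared-memory bins are accumulated with atomics and then merged into the
// global histogram; sizing and names are assumptions, not the paper's kernel.
__global__ void histogram256(const unsigned char *image, int num_pixels,
                             unsigned int *global_hist)
{
    __shared__ unsigned int local_hist[256];

    // Zero the shared bins cooperatively.
    for (int i = threadIdx.x; i < 256; i += blockDim.x)
        local_hist[i] = 0;
    __syncthreads();

    // Grid-stride loop over pixels; shared-memory atomics are cheap.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < num_pixels;
         i += gridDim.x * blockDim.x)
        atomicAdd(&local_hist[image[i]], 1u);
    __syncthreads();

    // Merge this block's bins into the global histogram.
    for (int i = threadIdx.x; i < 256; i += blockDim.x)
        atomicAdd(&global_hist[i], local_hist[i]);
}
```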

    Parallel Architecture Design for OpenVX Kernel Image Processing Functions
    PAN Fengrui, LI Tao, XING Lidong, ZHANG Haocong, WU Guanzhong
    Journal of Frontiers of Computer Science and Technology    2022, 16 (7): 1570-1582.   DOI: 10.3778/j.issn.1673-9418.2012085

    Although traditional programmable processors are highly flexible, their processing speed and performance are inferior to those of application-specific integrated circuits (ASIC). Image processing is often a diverse, intensive and repetitive workload, so a processor must balance speed, performance and flexibility. OpenVX is an open standard for the preprocessing or auxiliary processing of image processing, graph computing and deep learning applications. Targeting the kernel vision function library of the OpenVX 1.3 standard, this paper designs and implements a programmable and extensible OpenVX parallel processor. The architecture adopts an application-specific instruction processor (ASIP). After analyzing and comparing the topological characteristics of various interconnection networks, the hierarchically cross-connected Mesh+ (HCCM+), which shows outstanding performance, is chosen as the backbone of the ASIP, with a processing element (PE) placed at each network node. The PE array is built to support dynamic configuration, and a parallel processor is designed to realize programmable image processing based on efficient routing and communication. The proposed architecture is suitable for data-parallel computing and emerging graph computing; the two computing modes can be configured separately or mixed. The kernel vision functions and a graph computing model are mapped onto the parallel processor to verify the two modes and to compare image processing speed under different numbers of PEs. The results show that the OpenVX parallel processor can complete the mapping of kernel functions and of a high-complexity graph computing model with linear speedup; the average speedup of scheduling 16 PEs across various functions is approximately 15.0375. When implemented on an FPGA board with a 20 nm XCVU440 device, the prototype runs at a frequency of 125 MHz.

    Parallel Implementation of OpenVX Feature Extraction Functions in Programmable Processing Architecture
    ZHANG Haocong, LI Tao, XING Lidong, PAN Fengrui
    Journal of Frontiers of Computer Science and Technology    2022, 16 (7): 1583-1593.   DOI: 10.3778/j.issn.1673-9418.2012080

    To address the heavy computation and slow speed of the serial computation structure in digital image processing, parallel implementations of the underlying feature-extraction kernel functions in the latest open-source OpenVX specification 1.3 are completed and verified on a self-designed OpenVX programmable parallel processor. For the underlying feature extraction of an image, the basic pixel-processing function Color Convert and the local image-processing functions Gaussian Filter and Median Filter of OpenVX specification 1.3 are selected for filtering and smoothing, while Harris Corners and Canny Edge Detector are selected for feature extraction. By dividing complex nodes with a large amount of computation into several simple nodes, different graph execution models are constructed and mapped onto the OpenVX parallel processor to realize image edge detection and feature-point extraction. The hardware circuit is designed in Verilog, and comprehensive verification on the Xilinx FPGA chip xcvu440-flga-2892-2-e shows that, compared with the serial mapping structure, the parallel speedup of the selected kernel functions on the OpenVX programmable parallel processor reaches up to 14.269. Experimental results show that the kernel functions of OpenVX specification 1.3, especially the complex ones, achieve the expected acceleration in this parallel processing structure, and the speedup of the parallel structure over the serial structure increases linearly.
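    For orientation, the sketch below builds a small OpenVX graph with the standard Khronos C API, chaining Color Convert, a Y-channel extract, and a Gaussian Filter; image sizes and formats are arbitrary, and this is a host-API illustration rather than the paper's mapping onto the self-designed parallel processor.

```cuda
// Minimal OpenVX graph sketch: Color Convert -> Channel Extract (Y) -> Gaussian 3x3.
// Sizes, formats, and the node chain are arbitrary illustration choices.
#include <VX/vx.h>

int main(void)
{
    vx_context context = vxCreateContext();
    vx_graph   graph   = vxCreateGraph(context);

    vx_image rgb  = vxCreateImage(context, 640, 480, VX_DF_IMAGE_RGB);
    vx_image iyuv = vxCreateImage(context, 640, 480, VX_DF_IMAGE_IYUV);
    vx_image y    = vxCreateImage(context, 640, 480, VX_DF_IMAGE_U8);
    vx_image out  = vxCreateImage(context, 640, 480, VX_DF_IMAGE_U8);

    // Each call adds one node to the graph; the runtime schedules them.
    vxColorConvertNode(graph, rgb, iyuv);
    vxChannelExtractNode(graph, iyuv, VX_CHANNEL_Y, y);
    vxGaussian3x3Node(graph, y, out);

    if (vxVerifyGraph(graph) == VX_SUCCESS)
        vxProcessGraph(graph);              // execute the whole graph

    vxReleaseGraph(&graph);
    vxReleaseContext(&context);
    return 0;
}
```

    Splitting a heavy node (such as Canny Edge Detector) into several simpler nodes, as the abstract describes, gives the scheduler more graph nodes to distribute across PEs.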

    Practice on Program Energy Consumption Optimization by Energy Measurement and Analysis Using FPowerTool
    WEI Guang, QIAN Depei, YANG Hailong, LUAN Zhongzhi
    Journal of Frontiers of Computer Science and Technology    2022, 16 (6): 1291-1303.   DOI: 10.3778/j.issn.1673-9418.2102046

    Energy-aware programming (EAP) is a new approach to reducing the energy consumption of computing systems. It introduces energy as one of the main design metrics in the software development process, reducing program energy consumption by adjusting the way programs are written. Implementing EAP faces difficulties in finding energy-consumption hot spots, identifying the main factors that cause excessive energy consumption, and locating inappropriate code segments in a program. To address these issues, this paper proposes a new method called EPC (energy-performance correlation) for the joint measurement and analysis of energy consumption and performance events during program execution. Firstly, the basic principles of EPC are introduced and FPowerTool, an EPC-based tool for program energy consumption measurement and analysis, is presented. Then, the method of energy-performance event correlation analysis for identifying the main factors influencing energy consumption is presented. Finally, a set of programs is used as case studies to show how to locate code segments related to high energy consumption through correlation analysis, and how to change coding, data placement and data access to reduce program energy consumption. The experimental results show that, based on the energy-awareness and analysis capabilities provided by the EPC method, program performance and energy efficiency can be improved by improving data definition, assignment, placement, and access methods.
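    As an illustration of the correlation idea behind EPC, the host-side sketch below computes a Pearson correlation between sampled energy readings and a performance-event series collected over the same intervals; it is a generic sketch, not FPowerTool or its interfaces.

```cuda
// Generic energy-performance correlation sketch: Pearson correlation between
// an energy time series and a performance-event time series sampled over the
// same intervals. Illustration only; not the FPowerTool implementation.
#include <cmath>
#include <cstddef>
#include <vector>

static double pearson(const std::vector<double> &energy,
                      const std::vector<double> &event)
{
    const std::size_t n = std::min(energy.size(), event.size());
    if (n == 0) return 0.0;

    double mean_e = 0.0, mean_v = 0.0;
    for (std::size_t i = 0; i < n; ++i) { mean_e += energy[i]; mean_v += event[i]; }
    mean_e /= n;  mean_v /= n;

    double cov = 0.0, var_e = 0.0, var_v = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        cov   += (energy[i] - mean_e) * (event[i] - mean_v);
        var_e += (energy[i] - mean_e) * (energy[i] - mean_e);
        var_v += (event[i] - mean_v) * (event[i] - mean_v);
    }
    // Events whose correlation with energy is close to 1 are candidate
    // explanations for an energy hot spot.
    return cov / std::sqrt(var_e * var_v);
}
```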

    High-Performance Implementation and Optimization of Cooley-Tukey FFT Algorithm
    GUO Jinxin, ZHANG Guangting, ZHANG Yunquan, CHEN Zehua, JIA Haipeng
    Journal of Frontiers of Computer Science and Technology    2022, 16 (6): 1304-1315.   DOI: 10.3778/j.issn.1673-9418.2011092

    The fast Fourier transform (FFT) algorithm is an important element of a processor's basic software ecosystem and is widely applied in engineering, science, physics and mathematics, where the performance requirements on FFT continue to rise. It is therefore significant to study the high-performance implementation of the FFT algorithm, especially of large FFT radices on ARMv8 and X86-64, and to improve its computational performance. In view of the architectural features of the ARMv8 and X86-64 computing platforms, this paper studies high-performance implementation and optimization methods for the FFT algorithm. By applying butterfly-network optimization, reducing the number of network stages for large radices, optimizing large-radix butterfly computation, SIMD (single instruction multiple data) assembly optimization, and register-usage optimization, this paper effectively improves the performance of the FFT algorithm, considerably improves the computational performance of large FFT radices, and alleviates the performance bottleneck caused by insufficient register resources. Finally, a set of high-performance implementation strategies and optimization schemes for the Cooley-Tukey FFT algorithm is summarized. The experimental results indicate that on ARM and X86-64 processors, the implemented FFT algorithm achieves a significant performance improvement over ARMPL (ARM performance library), Intel MKL (math kernel library) and FFTW (fastest Fourier transform in the West), and the large radices achieve a significant performance improvement compared with small and medium radices.
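    For reference, a textbook recursive radix-2 Cooley-Tukey FFT is sketched below to illustrate the butterfly decomposition that large-radix, SIMD-optimized kernels build on; it is not the optimized implementation described in the paper.

```cuda
// Textbook recursive radix-2 Cooley-Tukey FFT (host code), shown only to
// illustrate the even/odd decomposition and butterfly step; not the paper's
// large-radix SIMD-optimized implementation.
#include <complex>
#include <vector>

static void fft_radix2(std::vector<std::complex<double>> &a)
{
    const std::size_t n = a.size();          // n must be a power of two
    if (n <= 1) return;

    std::vector<std::complex<double>> even(n / 2), odd(n / 2);
    for (std::size_t i = 0; i < n / 2; ++i) {
        even[i] = a[2 * i];
        odd[i]  = a[2 * i + 1];
    }
    fft_radix2(even);                        // recursively transform both halves
    fft_radix2(odd);

    const double pi = 3.14159265358979323846;
    for (std::size_t k = 0; k < n / 2; ++k) {
        // Twiddle factor W_n^k = exp(-2*pi*i*k/n)
        std::complex<double> w = std::polar(1.0, -2.0 * pi * k / n) * odd[k];
        a[k]         = even[k] + w;          // butterfly
        a[k + n / 2] = even[k] - w;
    }
}
```

    Larger radices split the input into more than two sub-transforms per stage, reducing the number of stages and memory passes, which is where the paper's butterfly-network and register-usage optimizations apply.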
