Journal of Frontiers of Computer Science and Technology ›› 2021, Vol. 15 ›› Issue (2): 305-314. DOI: 10.3778/j.issn.1673-9418.1912048

• Artificial Intelligence •

Document Classification Method Based on Context Awareness and Hierarchical Attention Network

REN Jianhua, LI Jing, MENG Xiangfu   

  1. School of Electronics and Information Engineering, Liaoning Technical University, Huludao, Liaoning 125105, China
  • Online: 2021-02-01 Published: 2021-02-01

上下文感知与层级注意力网络的文档分类方法

任建华, 李静, 孟祥福

  1. 辽宁工程技术大学 电子与信息工程学院, 辽宁 葫芦岛 125105

Abstract:

Document classification is a fundamental problem in natural language processing (NLP). Although hierarchical attention networks have made progress on this task in recent years, they encode each sentence independently, so the bidirectional encoder used in the model can take into account only the sentences adjacent to the one being encoded, remains focused on the sentence currently being encoded, and does not effectively integrate knowledge of document structure into the architecture. To address this problem, a document classification method based on context awareness and a hierarchical attention network (CAHAN) is proposed. The method uses a hierarchical model to represent the hierarchical structure of a document, and uses attention mechanisms to identify the important sentences in the document and the important words in each sentence. At both the word level and the sentence level, it relies not only on the bidirectional encoder to obtain context information but also introduces a context vector into the word-level attention mechanism, so that word-level attention decisions are made on the basis of context information; this captures the context of the text more fully and thereby extracts deep document features. In addition, a gating mechanism is used to determine how much context information should be taken into account. Experimental results on two standard data sets show that the proposed CAHAN model classifies better than long short-term memory (LSTM), convolutional neural network (CNN), and hierarchical attention network (HAN) models, improving the accuracy of document classification.

Key words: natural language processing (NLP), document classification, context-aware, hierarchical attention, gating mechanism
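
The word-level mechanism described in the abstract (attention scores that depend on a context vector, with a gate deciding how much of that context to use) can be illustrated with a small sketch. The abstract gives no equations, so everything below is an assumption made for illustration and not the authors' implementation: the additive scoring form, the scalar sigmoid gate, and all parameter names (`W`, `W_c`, `b`, `u`, `w_gate_h`, `w_gate_c`, `b_gate`) are hypothetical. Pure Python is used so the example stays self-contained.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(M, v):
    return [dot(row, v) for row in M]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_aware_attention(H, c, W, W_c, b, u, w_gate_h, w_gate_c, b_gate):
    """Gated, context-aware word attention (illustrative assumption).

    H: list of word annotations from the encoder, each of length d.
    c: context vector summarising the rest of the document.
    Returns (attention weights, attention-weighted sentence vector).
    """
    Wc_c = matvec(W_c, c)  # context projection, shared by all words
    scores = []
    for h in H:
        # Scalar gate in (0, 1): how much context enters this word's score.
        gate = sigmoid(dot(w_gate_h, h) + dot(w_gate_c, c) + b_gate)
        Wh = matvec(W, h)
        hidden = [math.tanh(wh + gate * wc + bi)
                  for wh, wc, bi in zip(Wh, Wc_c, b)]
        scores.append(dot(u, hidden))
    alpha = softmax(scores)
    d = len(H[0])
    sentence = [sum(a * h[j] for a, h in zip(alpha, H)) for j in range(d)]
    return alpha, sentence

# Toy usage with hand-picked (hypothetical) parameters.
identity = [[1.0, 0.0], [0.0, 1.0]]
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
c = [0.5, -0.5]
alpha, sentence = context_aware_attention(
    H, c, identity, identity, [0.0, 0.0], [1.0, 1.0],
    [0.0, 0.0], [0.0, 0.0], 0.0)
print(alpha, sentence)
```

In this sketch, a gate value near 0 reduces the score to ordinary word attention over the encoder outputs, while a value near 1 lets the context vector shift the scores fully; in a trained model the gate parameters would be learned rather than set by hand.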

摘要:

文档分类是自然语言处理(NLP)领域中的一个基本问题。近年来,尽管针对这一问题的层级注意力网络已经取得了进展,但由于每条句子被独立编码,使得模型中使用的双向编码器仅能考虑到所编码句子的相邻句子,仍然集中于当前所编码的句子,并没有有效地将文档结构知识整合到体系结构中。针对此问题,提出一种上下文感知与层级注意力网络的文档分类方法(CAHAN)。该方法采用分层结构来表示文档的层次结构,使用注意力机制考虑文档中重要的句子和句子中重要的单词因素,在单词级和句子级不仅依赖双向编码器来获取上下文信息,还通过在单词级注意机制中引入上下文向量,使单词级编码器基于上下文信息做出注意决策全面获取文本的上下文信息,从而提取出深度文档特征。此外,还利用门控机制准确地决定应该考虑多少上下文信息。在两个标准数据集上的实验结果表明,提出的CAHAN模型较长短时记忆网络(LSTM)、卷积神经网络(CNN)、分层注意网络(HAN)等模型分类效果更好,能够提高文档分类任务的准确度。

关键词: 自然语言处理(NLP), 文档分类, 上下文感知, 层级注意力, 门控机制