Journal of Frontiers of Computer Science and Technology, 2024, Vol. 18, Issue (9): 2370-2383. DOI: 10.3778/j.issn.1673-9418.2311063

• Theory·Algorithm •

Research on Processing and Application of Imbalanced Textual Data on Social Platforms

JIANG Yuqi, HOU Zhiwen, WANG Yifan, ZHAI Hanming, BU Fanliang   

  1. College of Information Network Security, People’s Public Security University of China, Beijing 100240, China
  • Online: 2024-09-01   Published: 2024-09-01

社交平台不平衡文本数据处理与应用研究

姜钰棋,侯智文,王一帆,翟晗名,卜凡亮   

  1. 中国人民公安大学 信息网络安全学院,北京 100240

Abstract: As society becomes increasingly information-driven, using natural language processing (NLP) tools to extract useful information from the massive textual data available online is of great practical value. However, texts collected from social platforms suffer from a scarcity of valuable samples and from class imbalance. This paper proposes two methods to address these problems: SimDyFeFL (SimBERT & dynamic feedback Focal Loss), intended for crisis-related information recognition in Chinese, and EdaDyFeFL (EDA & dynamic feedback Focal Loss), intended for cyber troll detection in English. Specifically, SimBERT and EDA (easy data augmentation) are used to augment the original data, whose class sizes differ greatly, until the classes contain similar numbers of samples; a Focal Loss function incorporating a dynamic feedback process is then applied to weight each class. BERT (bidirectional encoder representations from transformers), RoBERTa (robustly optimized BERT pre-training approach), and BERT_DPCNN (BERT with deep pyramid convolutional neural networks) are used as text classification models in three-stage comparative experiments to validate the proposed methods. Extensive experiments on two real-world datasets, one Chinese and one English, show that text classification models improved with SimDyFeFL and EdaDyFeFL perform significantly better overall: the accuracy of the Chinese model increases by up to 7.70 percentage points, and that of the English model by up to 5.15 percentage points. Compared with the best previously reported results on the Kaggle platform, the English model's accuracy is 2.92 percentage points higher, and its Macro F1 and Weighted F1 scores are 2.83 and 2.95 percentage points higher, respectively.

Key words: text classification on social platforms, processing of imbalanced data, SimBERT, EDA (easy data augmentation), Focal Loss
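
The class-weighted Focal Loss mentioned in the abstract can be illustrated with a minimal sketch. The loss below is the standard form, FL = -alpha_t * (1 - p_t)^gamma * log(p_t). The abstract does not describe the dynamic feedback rule, so `update_class_weights` is a hypothetical stand-in (an assumption for illustration, not the authors' method) that raises the weight of classes with low validation recall. All function names and the `gamma` and `floor` values are likewise assumptions.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_total = math.log(sum(math.exp(z - m) for z in logits))
    return [z - m - log_total for z in logits]


def focal_loss(logits, target, class_weights, gamma=2.0):
    """Standard focal loss for one example: -alpha_t * (1 - p_t)^gamma * log(p_t)."""
    log_p_t = log_softmax(logits)[target]
    p_t = math.exp(log_p_t)
    return -class_weights[target] * (1.0 - p_t) ** gamma * log_p_t


def update_class_weights(per_class_recall, floor=0.05):
    """Hypothetical feedback rule (an assumption, not taken from the paper):
    classes with lower validation recall receive larger weights,
    and the weights are normalized so that their mean is 1."""
    raw = [1.0 - r + floor for r in per_class_recall]
    mean = sum(raw) / len(raw)
    return [w / mean for w in raw]


# A confidently correct example contributes far less loss than a misclassified one.
weights = [1.0, 1.0]
easy = focal_loss([4.0, 0.0], target=0, class_weights=weights)
hard = focal_loss([0.0, 4.0], target=0, class_weights=weights)

# After a validation pass, the class with lower recall (class 1) gets the larger weight.
weights = update_class_weights([0.95, 0.60])
```

With `gamma=0` and unit weights the function reduces to ordinary cross-entropy, which is a convenient sanity check when wiring such a loss into a classifier.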
