Journal of Frontiers of Computer Science and Technology ›› 2015, Vol. 9 ›› Issue (6): 719-725. DOI: 10.3778/j.issn.1673-9418.1409014

User-Type Classification in Micro-Blog Based on Information of Authenticated User

HUANG Lei, LI Shoushan+, WANG Jingjing   

  1. School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006, China
  • Online: 2015-06-01 Published: 2015-06-04


Abstract: Micro-blog users can be categorized into two types: human and nonhuman users. Automatically classifying these two user types is a basic task for many real applications, such as intelligent advertising and personality analysis. This paper proposes a machine-learning-based classification method for this task. The distinguishing feature of the method is that the corpus of authenticated users serves as naturally labeled training data, so no manual labeling is required, and the resulting classifier is applied to non-authenticated users. Specifically, each user is represented by the username and the message text that the user has published, and a maximum entropy algorithm performs the classification. Experiments on Sina Weibo demonstrate that the proposed method is effective for user-type classification.

Key words: natural language processing, micro-blog, user-type classification, authentication

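Abstract (translated from the Chinese): Micro-blog users fall into two types: individual users and non-individual users. Automatically classifying these two types is a basic task for applications such as intelligent advertising and user personality analysis. To address this task, an automatic classification method based on machine learning is proposed. Its distinguishing feature is that no manually labeled samples are needed: the corpus of authenticated user types in micro-blogs serves as training data for a classifier, which is then used to classify the types of non-authenticated users. In the implementation, the username and the micro-blog text published by a user form the sample that represents that user, and a maximum entropy algorithm performs the classification. Experiments show that classifying non-authenticated users by means of authenticated users achieves good results.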

Key words (translated from the Chinese): natural language processing, micro-blog, user classification, authentication
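
The pipeline outlined in the abstract (authenticated accounts as labeled data, username plus message text as features, a maximum entropy classifier) might be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the account names, messages, tokenizer, learning rate, and epoch count are all hypothetical, and the toy data is ASCII so that a simple regular expression can stand in for Chinese word segmentation. For two classes, a maximum entropy model is equivalent to logistic regression, which is what the sketch trains.

```python
import math
import re
from collections import Counter

HUMAN, NONHUMAN = 1, 0


def extract_features(username, messages):
    """Bag-of-words features from username and message text, L2-normalized.

    The "name:" / "msg:" prefixes keep the two sources in separate feature
    spaces. The tokenizer is a placeholder for ASCII toy data; Chinese text
    would need a word segmenter or character n-grams instead.
    """
    counts = Counter()
    for prefix, text in (("name", username), ("msg", " ".join(messages))):
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            counts[f"{prefix}:{token}"] += 1
    norm = math.sqrt(sum(v * v for v in counts.values())) or 1.0
    return {f: v / norm for f, v in counts.items()}


def predict_proba(weights, bias, feats):
    """P(human | user) under the binary maximum entropy (logistic) model."""
    z = bias + sum(weights.get(f, 0.0) * v for f, v in feats.items())
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
    return 1.0 / (1.0 + math.exp(-z))


def train_maxent(samples, epochs=200, lr=0.5):
    """Fit weights by stochastic gradient ascent on the log-likelihood."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for feats, label in samples:
            err = label - predict_proba(weights, bias, feats)
            bias += lr * err
            for f, v in feats.items():
                weights[f] = weights.get(f, 0.0) + lr * err * v
    return weights, bias


def classify(model, username, messages):
    """Return HUMAN or NONHUMAN for one account."""
    weights, bias = model
    p = predict_proba(weights, bias, extract_features(username, messages))
    return HUMAN if p >= 0.5 else NONHUMAN


# Hypothetical authenticated accounts: the verification type supplies the label,
# so no manual annotation is involved.
AUTHENTICATED = [
    ("li_xiaoming", ["had noodles with my family today", "i love my new bike"], HUMAN),
    ("anna_wang", ["my cat woke me up again", "so tired after work today"], HUMAN),
    ("suzhou_daily_news", ["breaking news new metro line opens",
                           "official report released today"], NONHUMAN),
    ("acme_official_store", ["official sale on all products",
                             "new products released order now"], NONHUMAN),
]

model = train_maxent([(extract_features(n, m), y) for n, m, y in AUTHENTICATED])

# Hypothetical non-authenticated accounts to classify with the trained model.
print(classify(model, "best_deals_official", ["official sale on new products"]))
print(classify(model, "tom_li", ["my family and my cat are so tired today"]))
```

Prefixing features by source lets the model learn that a token such as "official" in a username may carry different weight than the same token in a message. A real system would likely add regularization and far more training accounts than this toy set.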