Journal of Frontiers of Computer Science and Technology ›› 2022, Vol. 16 ›› Issue (5): 1043-1052. DOI: 10.3778/j.issn.1673-9418.2011062

• Database Technology •

Research on User Similarity Calculation of Collaborative Filtering for Sparse Data

WU Sen, DONG Yaxian, WEI Guiying, GAO Xiaonan

  1. School of Economics and Management, University of Science and Technology Beijing, Beijing 100083, China
  • Received: 2020-11-23 Revised: 2021-03-15 Online: 2022-05-01 Published: 2022-05-19
  • About the authors: WU Sen, born in 1971, Ph.D., professor. Her research interests include data mining, personalized recommendation, etc.
    DONG Yaxian, born in 1996, M.S. candidate. Her research interests include data processing and analysis, personalized recommendation, etc.
    WEI Guiying, born in 1969, Ph.D., associate professor. Her research interests include data mining, personalized recommendation, etc.
    GAO Xiaonan, born in 1996, Ph.D. candidate. Her research interests include data processing and analysis, personalized recommendation, etc.
  • Supported by:
    National Natural Science Foundation of China (71971025)

面向稀疏数据的协同过滤用户相似度计算研究

武森, 董雅贤, 魏桂英, 高晓楠

  1. 北京科技大学 经济管理学院,北京 100083
  • 通讯作者: E-mail: weigy@manage.ustb.edu.cn
  • 作者简介:武森(1971—),女,辽宁开原人,博士,教授,主要研究方向为数据挖掘、个性化推荐等。
    董雅贤(1996—),女,天津人,硕士研究生,主要研究方向为数据处理与分析、个性化推荐等。
    魏桂英(1969—),女,河北承德人,博士,副教授,主要研究方向为数据挖掘、个性化推荐等。
    高晓楠(1996—),女,山西长治人,博士研究生,主要研究方向为数据处理与分析、个性化推荐等。
  • 基金资助:
    国家自然科学基金(71971025)

Abstract:

User-based collaborative filtering recommends items to a target user based on the preferences of that user's nearest neighbors, so the way user similarity is calculated is critical. Traditional rating similarity relies on the scores that two users have given to the items both have rated. As the user-item rating matrix becomes sparser, rating similarity measures the similarity between users less accurately, which makes it difficult to select reliable nearest neighbors for the target user and degrades recommendation performance. Structural similarity, another commonly used approach in recommendation tasks, is mostly measured by the proportion of items that two users have both rated. It is simple to calculate and less affected by data sparsity, but its values tend to lie close together, so different user pairs are hard to distinguish. To address the similarity calculation problems that data sparsity causes for collaborative filtering, this paper proposes a sparse cosine similarity. First, a new structural similarity, the sparse set similarity, is defined to divide users into two groups: high-correlation users and low-correlation users. Then, different rating similarity calculations are designed for the different types of users, which mitigates the misleading results that traditional rating similarity produces on sparse data. Finally, the sparse cosine similarity is constructed by combining the proposed rating similarity with the structural similarity.
Experimental results show that, compared with seven similarity calculation methods, the proposed sparse cosine similarity measures user similarity more accurately and improves recommendation performance, overcoming both the severe sensitivity of traditional rating similarity to data sparsity and the low discriminability of structural similarity.
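
The two traditional families of measures contrasted above can be illustrated with a minimal Python sketch. It is not the sparse cosine similarity proposed in the paper, whose formula the abstract does not give. It assumes cosine similarity over co-rated items as a representative rating similarity and the Jaccard coefficient as a representative structural similarity. The function names and the toy ratings are hypothetical.

```python
from math import sqrt


def cosine_on_common(ratings_u, ratings_v):
    """Rating similarity: cosine computed only over co-rated items."""
    common = ratings_u.keys() & ratings_v.keys()
    if not common:
        return 0.0
    dot = sum(ratings_u[i] * ratings_v[i] for i in common)
    norm_u = sqrt(sum(ratings_u[i] ** 2 for i in common))
    norm_v = sqrt(sum(ratings_v[i] ** 2 for i in common))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def jaccard(ratings_u, ratings_v):
    """Structural similarity: share of co-rated items among all rated items."""
    union = ratings_u.keys() | ratings_v.keys()
    if not union:
        return 0.0
    return len(ratings_u.keys() & ratings_v.keys()) / len(union)


# Hypothetical toy data: item id -> rating on a 1-5 scale.
alice = {"i1": 5, "i2": 1, "i3": 4}
bob = {"i1": 1, "i4": 5, "i5": 2}             # shares one item with alice
carol = {"i1": 5, "i2": 1, "i3": 4, "i6": 3}  # shares three items with alice

for name, other in (("bob", bob), ("carol", carol)):
    print(name,
          round(cosine_on_common(alice, other), 3),
          round(jaccard(alice, other), 3))
```

With a single co-rated item and positive ratings, the cosine over co-rated items is 1 whatever the two ratings are, so bob looks as similar to alice as carol does even though they disagree on the only item they share. The Jaccard coefficient separates the two pairs here, but it ignores the rating values entirely.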

Key words: similarity measure, collaborative filtering, sparse data, recommendation system

摘要:

基于用户的协同过滤通过获取最近邻的偏好实现对目标用户偏好的预测推荐,相似度计算为其核心步骤。传统数值相似度计算依赖于用户共同评分项的评分数值,用户-项目评分矩阵稀疏程度的加剧导致数值相似度计算准确性降低,难以为目标用户选取可靠的最近邻,影响推荐效果;现有结构相似度大多利用用户共同评分项占比度量,计算简单,受数据稀疏影响较小但区分度低。针对上述协同过滤任务中数据稀疏带来的相似度计算问题,提出一种稀疏余弦相似度。首先定义新的结构相似度——稀疏集合相似度,将用户区分为高相关用户与低相关用户,并进一步针对不同类型用户设计差异化的数值相似度计算方式,以缓解传统数值相似度在面临数据稀疏时的不足,最终综合数值相似度与结构相似度形成稀疏余弦相似度。实验结果表明,与七种相似度计算方法相比,稀疏余弦相似度解决了传统数值相似度受数据稀疏影响严重和结构相似度计算结果区分度低的问题,可更准确计算用户相似度,提升推荐效果。

关键词: 相似度计算, 协同过滤, 稀疏数据, 推荐系统

CLC Number: