Research on Spammer Detection Techniques in Social Networks
Abstract
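The rapid rise of social networks has made them an important platform for obtaining and sharing information. However, their massive user bases have also attracted large numbers of profit-driven spam users (spammers), who cause serious harm to legitimate users and to the platforms themselves.
User features in social networks are highly varied, so choosing suitable features is one of the key problems in spammer detection. In addition, current spammer detection techniques mostly rely on machine learning algorithms. Unsupervised detection algorithms need no labeled data, but their accuracy is too low to meet detection requirements; supervised detection algorithms need large amounts of manually labeled data and are easily bypassed when spammers change their strategies, which makes them inefficient. To address these problems, this thesis covers the following work: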
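1. For the feature selection problem in social network spammer detection, this thesis designs CFR-GA (Comprehensive Filter Ranking-Genetic Algorithm), a feature selection algorithm that combines Comprehensive Filter Ranking (CFR) with a Genetic Algorithm (GA), and uses it in the subsequent spammer detection step. The algorithm first uses the filter-based CFR algorithm to compute a comprehensive score for each feature, sorts the features in descending order of score, and deletes the lowest-ranked features to narrow the search space of the GA. It then uses each feature's comprehensive score to guide the initialization of the GA population, which improves the running efficiency of the GA. Finally, the GA searches for the best feature subset. Experiments show that the resulting feature subset has fewer dimensions and higher classification performance, and that the algorithm runs more efficiently than the traditional GA.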
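2. For the manual data labeling problem in social network spammer detection, this thesis designs a spammer detection algorithm based on OSHCM (OPTICS and SVM based Hybrid Classification Model), a hybrid classification model that combines OPTICS (Ordering Points To Identify the Clustering Structure) with SVM (Support Vector Machine). The algorithm first clusters the data with OPTICS to obtain initial class labels. It then selects a set of reliable learning samples according to how sparse the samples are within each cluster. Next, it uses the CFR-GA algorithm designed earlier to choose the optimal feature subset. Finally, the training samples and the optimal feature subset are used to train an SVM classifier, which then classifies the original data. Experiments show that the algorithm's classification metrics are close to those of SVM and considerably better than those of the unsupervised OPTICS detection algorithm, without requiring manually labeled data.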
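The CFR-GA procedure in item 1 can be illustrated with a minimal Python sketch. The abstract does not say which filter measures CFR combines or how the scores steer the GA, so every concrete choice below is an assumption made for illustration: the two filters (absolute Pearson correlation and a Fisher-style score), the min-max averaging, the bit probability `0.25 + 0.5 * score`, the nearest-centroid wrapper fitness with a 0.01 penalty per feature, and the toy data. Function names such as `cfr_scores` and `cfr_ga` are hypothetical.

```python
import random

def pearson_abs(col, y):
    # Filter 1 (assumed): absolute Pearson correlation between a feature and the label.
    n = len(y)
    mx, my = sum(col) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(col, y))
    vx = sum((a - mx) ** 2 for a in col)
    vy = sum((b - my) ** 2 for b in y)
    return abs(cov) / ((vx * vy) ** 0.5 or 1.0)

def fisher(col, y):
    # Filter 2 (assumed): squared class-mean gap over the summed within-class variance.
    g0 = [a for a, b in zip(col, y) if b == 0]
    g1 = [a for a, b in zip(col, y) if b == 1]
    m0, m1 = sum(g0) / len(g0), sum(g1) / len(g1)
    v0 = sum((a - m0) ** 2 for a in g0) / len(g0)
    v1 = sum((a - m1) ** 2 for a in g1) / len(g1)
    return (m0 - m1) ** 2 / (v0 + v1 or 1.0)

def minmax(scores):
    lo, hi = min(scores), max(scores)
    return [(s - lo) / ((hi - lo) or 1.0) for s in scores]

def cfr_scores(X, y):
    # Comprehensive score (assumed): mean of the min-max normalized filter scores.
    cols = list(zip(*X))
    p = minmax([pearson_abs(c, y) for c in cols])
    f = minmax([fisher(c, y) for c in cols])
    return [(a + b) / 2 for a, b in zip(p, f)]

def fitness(mask, kept, X, y):
    # Wrapper fitness (assumed): nearest-centroid accuracy minus a small size penalty.
    feats = [k for k, bit in zip(kept, mask) if bit]
    if not feats:
        return 0.0
    cent = {}
    for c in (0, 1):
        rows = [x for x, t in zip(X, y) if t == c]
        cent[c] = [sum(r[f] for r in rows) / len(rows) for f in feats]
    hits = 0
    for x, t in zip(X, y):
        d = {c: sum((x[f] - m) ** 2 for f, m in zip(feats, cent[c])) for c in (0, 1)}
        hits += (min(d, key=d.get) == t)
    return hits / len(y) - 0.01 * len(feats)

def cfr_ga(X, y, keep=3, pop_size=8, generations=15, seed=0):
    rng = random.Random(seed)
    scores = cfr_scores(X, y)
    # Pre-screening: keep only the top-ranked features to shrink the GA search space.
    kept = sorted(range(len(scores)), key=lambda j: -scores[j])[:keep]
    n = len(kept)
    # Score-guided initialization: one individual with every kept feature,
    # the rest with higher-scored features more likely to be switched on.
    pop = [[1] * n] + [[int(rng.random() < 0.25 + 0.5 * scores[k]) for k in kept]
                       for _ in range(pop_size - 1)]
    fit = lambda m: fitness(m, kept, X, y)
    for _ in range(generations):
        pop.sort(key=fit, reverse=True)
        nxt = pop[:2]  # elitism: carry the two best individuals over unchanged
        while len(nxt) < pop_size:
            a = max(rng.sample(pop, 2), key=fit)  # tournament selection
            b = max(rng.sample(pop, 2), key=fit)
            cut = rng.randrange(1, n)             # one-point crossover
            child = [bit ^ (rng.random() < 0.1) for bit in a[:cut] + b[cut:]]  # mutation
            nxt.append(child)
        pop = nxt
    best = max(pop, key=fit)
    return kept, [k for k, bit in zip(kept, best) if bit]

# Toy data: features 0-2 carry decreasing amounts of label signal, 3-4 carry none.
X, y = [], []
for i in range(60):
    label, q = i % 2, i // 2
    X.append([2 * label + 0.1 * (i % 5 - 2),
              label + 0.25 * (q % 5 - 2),
              0.5 * label + (q % 3 - 1),
              float(i % 3),
              float(q % 2)])
    y.append(label)
kept, best = cfr_ga(X, y)
```

On this toy data the pre-screening step discards the two uninformative features before the GA runs, which is the point of the CFR stage: the GA only has to search subsets of the three surviving features.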
Keywords: social network, spammer detection, feature selection, machine learning
Abstract
As social networks have prospered, they have become important platforms for people to obtain, share, and disseminate information. However, social networks with vast numbers of users have also attracted many profit-seeking spammers, who cause severe harm to legitimate users and social platforms.
Social networks offer many kinds of user features, so selecting appropriate features is one of the key problems in spammer detection. Meanwhile, current spammer detection techniques are mainly based on machine learning algorithms. Unsupervised detection algorithms do not need labeled data, but their accuracy is too low to meet detection requirements; supervised detection algorithms require a large amount of manually labeled data, and spammers can change strategies to bypass the detection system, which leads to low efficiency. To address these problems, this thesis makes the following contributions:
1. To address the feature selection problem in spammer detection, this thesis proposes CFR-GA, a feature selection algorithm that combines comprehensive filter ranking (CFR) with a genetic algorithm (GA), and applies it in the spammer detection algorithm. First, the filter-based CFR algorithm computes a comprehensive score for each feature; the features are sorted in descending order of score, and the lowest-ranked features are deleted to reduce the search range of the GA. Second, the comprehensive scores guide the initialization of the GA population, which improves the running efficiency of the GA. Finally, the GA searches for the optimal feature subset. The experimental results show that the feature subset obtained by CFR-GA has fewer dimensions and better classification performance, and that CFR-GA is more efficient than the traditional GA.
2. To address the problem of manually labeling data in spammer detection, this thesis proposes a spammer detection algorithm based on OSHCM, a hybrid classification model that combines ordering points to identify the clustering structure (OPTICS) with a support vector machine (SVM). First, the OPTICS algorithm clusters the data to obtain initial class labels. Second, reliable learning samples are selected according to the density of the samples in each cluster. Third, CFR-GA generates the optimal feature subset. Finally, the training samples and the feature subset are used to train an SVM classifier, and the trained classifier reclassifies the original data. The experimental results show that the detection performance of the algorithm is close to that of SVM and much better than that of OPTICS, without requiring a manually labeled dataset.
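The OSHCM pipeline in item 2 can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the thesis implementation: OPTICS is run with an unbounded neighbourhood radius and clusters are extracted with a fixed reachability cut; "reliable" samples are taken to be the densest half of each cluster (smallest core distance); the SVM is a linear one trained by sub-gradient descent on the hinge loss; the clustering is assumed to yield exactly two clusters; and `feature_idx` merely stands in for the subset that CFR-GA would supply. All function names, parameter values, and the toy points are hypothetical.

```python
import math

def optics_order(X, min_pts):
    # OPTICS with eps = infinity: visiting order, reachability and core distances.
    n = len(X)
    d = [[math.dist(a, b) for b in X] for a in X]
    core = [sorted(row)[min_pts - 1] for row in d]  # each row includes the point itself
    reach, done, order, cur = [math.inf] * n, [False] * n, [], 0
    for _ in range(n):
        done[cur] = True
        order.append(cur)
        for j in range(n):
            if not done[j]:
                reach[j] = min(reach[j], max(core[cur], d[cur][j]))
        rest = [j for j in range(n) if not done[j]]
        if rest:
            cur = min(rest, key=lambda j: reach[j])  # linear scan instead of a heap
    return order, reach, core

def extract_clusters(order, reach, core, cut):
    # DBSCAN-style extraction from the ordering; -1 marks noise.
    labels, cid = [-1] * len(order), -1
    for p in order:
        if reach[p] > cut:
            if core[p] <= cut:
                cid += 1
                labels[p] = cid
        else:
            labels[p] = cid
    return labels

def reliable_samples(labels, core, keep_ratio):
    # Assumed rule: keep the densest fraction of each cluster as training samples.
    picked = []
    for cid in sorted(set(labels) - {-1}):
        members = sorted((i for i, l in enumerate(labels) if l == cid),
                         key=lambda i: core[i])
        picked += members[:math.ceil(keep_ratio * len(members))]
    return picked

def train_linear_svm(X, y, lr=0.01, lam=0.01, epochs=200):
    # Sub-gradient descent on the L2-regularized hinge loss (labels are -1 / +1).
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for x, t in zip(X, y):
            margin = t * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            w = [wi * (1 - lr * lam) for wi in w]
            if margin < 1:
                w = [wi + lr * t * xi for wi, xi in zip(w, x)]
                b += lr * t
    return w, b

def oshcm_sketch(X, min_pts=3, cut=2.0, keep_ratio=0.5, feature_idx=None):
    order, reach, core = optics_order(X, min_pts)
    labels = extract_clusters(order, reach, core, cut)       # step 1: initial labels
    if len(set(labels) - {-1}) != 2:
        raise ValueError("this sketch assumes exactly two clusters")
    train = reliable_samples(labels, core, keep_ratio)       # step 2: reliable samples
    cols = feature_idx or range(len(X[0]))                   # step 3: stand-in for CFR-GA
    proj = [[x[c] for c in cols] for x in X]
    w, b = train_linear_svm([proj[i] for i in train],
                            [1 if labels[i] == 1 else -1 for i in train])
    # Step 4: reclassify every original point, including noise and unreliable ones.
    return [1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1 for x in proj]

# Toy data: two dense groups plus one outlier that OPTICS marks as noise.
X = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
     (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5),
     (5, 20)]
labels = oshcm_sketch(X)
```

No hand-labeled data enters the pipeline: the SVM learns only from cluster-derived labels, and the final pass also assigns a class to points that the clustering left as noise or did not select as reliable.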
Keywords: social network, spammer detection, feature selection, machine learning
Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Research Background and Significance
1.2 Research Status at Home and Abroad
1.3 Research Content of the Thesis
1.4 Structure of the Thesis
Chapter 2 Overview of Related Technologies
2.1 Overview of Spammers
2.2 Feature Selection
2.2.1 Overview of Feature Selection
2.2.2 Typical Filter Feature Selection Algorithms
2.2.3 Typical Wrapper Feature Selection Algorithms
2.3 Machine Learning
2.3.1 OPTICS Algorithm
2.3.2 SVM Algorithm
2.4 Experimental Dataset and Evaluation Metrics
2.4.1 Data Crawling
2.4.2 Dataset Description
2.4.3 Evaluation Metrics
2.5 Chapter Summary
Chapter 3 Analysis and Selection of Microblog User Features
3.1 Introduction
3.2 Feature Analysis
3.2.1 Content-Based Feature Analysis
3.2.2 User-Behavior-Based Feature Analysis
3.3 A Feature Selection Algorithm Based on CFR-GA
3.3.1 Overall Description of the CFR-GA Algorithm
3.3.2 CFR-Based Feature Pre-Screening Stage
3.3.3 GA-Based Wrapper Stage
3.4 Experiments and Result Analysis
3.4.1 Experimental Environment
3.4.2 Implementation of the CFR-GA Algorithm
3.4.3 Algorithm Evaluation and Comparison
3.5 Chapter Summary
Chapter 4 Spammer Detection Method in Social Networks
4.1 Introduction
4.2 OSHCM-Based Spammer Detection Framework for Social Networks
4.3 Design of the Spammer Detection Algorithm
4.3.1 Related Definitions
4.3.2 OSHCM Hybrid Classification Model
4.3.3 OSHCM-Based Spammer Detection Algorithm
4.4 Experiments and Result Analysis
4.4.1 Experimental Environment
4.4.2 Implementation of OSHCM-Based Spammer Detection
4.4.3 Algorithm Evaluation and Comparison
4.5 Chapter Summary
Chapter 5 Summary and Outlook
5.1 Summary of the Thesis
5.2 Future Work
References
Acknowledgments
Research Work and Achievements During the Master's Program
