Research on Parallelization Based on an Improved Canopy-Kmeans Algorithm

Authors: Wang Lin, Jia Junchen
Affiliation: School of Automation and Information Engineering, Xi'an University of Technology, Xi'an 710048
CLC number: TP301.6
Funding: Key Project of the Shaanxi Provincial Science and Technology Program (2017ZDCXL-GY-05-03)

    Abstract:

    With the rapid growth of Internet data, the original Kmeans algorithm can no longer meet the clustering needs of large-scale data. To address this, an improved Canopy-Kmeans clustering algorithm is proposed. First, to overcome the random selection of center points in the Canopy algorithm, the "max-min principle" is introduced to optimize the selection of Canopy centers. Next, the triangle inequality is used to optimize the Kmeans algorithm, reducing redundant distance calculations and accelerating convergence. Finally, the improved Canopy-Kmeans algorithm is parallelized on the MapReduce framework. The optimized algorithm is tested on a purpose-built Weibo dataset. The experimental results show that, across Weibo datasets of different sizes, the accuracy of the optimized algorithm is about 15% higher than that of the Kmeans algorithm and about 7% higher than that of the original Canopy-Kmeans algorithm, and that its execution efficiency and scalability are also considerably improved.
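
The two single-machine ideas named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `max_min_centers` and `assign_with_pruning`, the choice of the first point as the initial seed, and the use of Euclidean distance are all assumptions made for illustration, and the MapReduce parallelization is not covered.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length points (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def max_min_centers(points, k):
    """Max-min seeding: repeatedly pick the point whose distance to its
    nearest already-chosen center is largest."""
    centers = [points[0]]  # assumption: the first point is the initial seed
    nearest = [dist(p, centers[0]) for p in points]
    while len(centers) < k:
        idx = max(range(len(points)), key=lambda i: nearest[i])
        centers.append(points[idx])
        # Refresh each point's distance to its nearest chosen center.
        nearest = [min(d, dist(p, points[idx])) for d, p in zip(nearest, points)]
    return centers


def assign_with_pruning(points, centers):
    """Nearest-center assignment that skips a candidate center c_j whenever
    d(c_best, c_j) >= 2 * d(x, c_best): by the triangle inequality,
    d(x, c_j) >= d(c_best, c_j) - d(x, c_best) >= d(x, c_best)."""
    # Center-to-center distances are computed once per assignment pass.
    cc = [[dist(a, b) for b in centers] for a in centers]
    labels, skipped = [], 0
    for p in points:
        best, best_d = 0, dist(p, centers[0])
        for j in range(1, len(centers)):
            if cc[best][j] >= 2 * best_d:
                skipped += 1  # distance d(p, c_j) is never computed
                continue
            d = dist(p, centers[j])
            if d < best_d:
                best, best_d = j, d
        labels.append(best)
    return labels, skipped
```

Calling `max_min_centers(points, k)` and passing the result to `assign_with_pruning(points, centers)` yields the same labels as an exhaustive nearest-center search, while the `skipped` counter reports how many point-to-center distance evaluations the pruning rule avoided.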

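Cite this article

Wang Lin, Jia Junchen. Research on Parallelization Based on an Improved Canopy-Kmeans Algorithm[J]. Computer Measurement & Control, 2021, 29(2): 176-179.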

History
  • Received: 2020-06-22
  • Last revised: 2020-07-07
  • Accepted: 2020-07-07
  • Published online: 2021-02-08