Construction and application of differential privacy fuzzy C-means clustering algorithm based on Gaussian kernel function
10.3969/j.issn.1672-8270.2024.08.020
- VernacularTitle:基于高斯核函数的差分隐私模糊C均值聚类算法的构建与应用
- Author: Zixiong CAO (1); Yuxian CHEN; Xiumei JIANG
Author Information
1. Information and Statistics Center, The Second People's Hospital of Huai'an, Huai'an 223003, China
- Keywords:
Data privacy;
Differential privacy;
Fuzzy C-means clustering algorithm;
Gaussian kernel function;
Data mining;
Privacy budget
- From:
China Medical Equipment
2024;21(8):106-112
- Country: China
- Language:Chinese
- Abstract:
Objective: To propose a differential privacy fuzzy C-means clustering algorithm based on the Gaussian kernel function (DPFCM_GF) that addresses the data privacy and security issues raised by the analysis and mining of medical data in the context of big data, and to provide a theoretical basis for data privacy protection.
Methods: Random initialization of the fuzzy C-means membership matrix reduces the accuracy of the algorithm. To address this, the maximum distance method was used to determine the initial center points, the Gaussian kernel function value of each cluster center was used to calculate the privacy budget allocation ratio, and Laplace noise was added to provide differential privacy protection, yielding DPFCM_GF. The effectiveness of DPFCM_GF was verified on the public heart disease, breast cancer, thyroid disease and diabetes datasets from the machine learning repository of the University of California, Irvine. Its usability was verified on gastric cancer and lung cancer datasets collected from The Second People's Hospital of Huai'an. The results were compared with those of the fuzzy C-means clustering algorithm (FCM) and the differentially private fuzzy C-means clustering algorithm (DPFCM).
Results: On the public heart disease, breast cancer, thyroid disease and diabetes datasets, the best clustering performance of DPFCM_GF and DPFCM was equivalent to that of FCM. Compared with DPFCM, DPFCM_GF had a shorter iteration time and converged markedly faster, and the differences were statistically significant (t=4.01, 4.71, 4.01, 12.38, P<0.05). On the lung cancer and gastric cancer datasets, as the privacy budget ε increased, the correct recognition rate of DPFCM_GF gradually converged to 91.9% and 93.9%, and the AUC values converged to 0.79 and 0.81, respectively. When the privacy budget ε was 0.1, 0.5, 1 and 2 (ε<3), the clustering performance of DPFCM_GF was better than that of DPFCM, and the differences were statistically significant (χ²=12.25, 87.12, 68.58, 7.76, P<0.05; χ²=4.74, 43.51, 42.47, 4.89, P<0.05).
Conclusion: DPFCM_GF is an effective method for protecting the privacy of medical data while still supporting data analysis and mining tasks, and it has some research significance and application prospects.