Identification of nucleosome positioning using support vector machine method based on comprehensive DNA sequence feature.
10.7507/1001-5515.201911064
- Author:
Ying CUI 1,2; Zelong XU 2; Jianzhong LI 1,3
Author Information
1. Electronic Engineering College, Heilongjiang University, Harbin 150080, P.R. China.
2. School of Bioinformatics Sciences and Technology, Harbin Medical University, Harbin 150081, P.R. China.
3. School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, P.R. China.
- Publication Type:Journal Article
- Keywords:
Euclidean distance;
nucleosome;
position weight matrix;
sequence feature;
support vector machine;
z-curve
- From:
Journal of Biomedical Engineering
2020;37(3):496-501
- Country:China
- Language:Chinese
- Abstract:
In this article, a model for nucleosome sequences was constructed based on z-curve theory and the position weight matrix (PWM). The nucleosome sequence dataset was transformed into three-dimensional coordinates, the PWM of the nucleosome sequences was calculated, and a similarity score was obtained. Integrating these yielded a nucleosome feature model based on comprehensive DNA sequence features, named CSeqFM. We calculated the Euclidean distance between each nucleosome sequence candidate or linker sequence and the CSeqFM model to form the feature dataset, which was fed into a support vector machine (SVM) for training and testing with ten-fold cross-validation. The results showed that the sensitivity, specificity, accuracy and Matthews correlation coefficient (MCC) for identifying nucleosome positioning were 97.1%, 96.9%, 94.2% and 0.89, respectively, and the area under the receiver operating characteristic curve (AUC) was 0.9801. Compared with another z-curve method, our method identified nucleosome positioning more effectively and performed better on every evaluation metric. The CSeqFM method was also applied to identify nucleosome positioning in three other species. The AUCs for all three species were higher than 0.90, and CSeqFM also showed better stability and effectiveness than the iNuc-STNC and iNuc-PseKNC methods, further demonstrating that the CSeqFM method is reliable and identifies nucleosome positioning well.
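
The two sequence encodings named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the pseudocount, the uniform background used for the PWM log-odds, and the way a distance is taken between coordinate vectors are all assumptions made for illustration, since the abstract does not give these details. The z-curve coordinates follow the standard cumulative definition (purine/pyrimidine, amino/keto, weak/strong hydrogen bonding).

```python
import math

BASES = "ACGT"

def z_curve(seq):
    """Cumulative z-curve coordinates (x_n, y_n, z_n) for each prefix of seq."""
    counts = dict.fromkeys(BASES, 0)
    coords = []
    for base in seq.upper():
        counts[base] += 1
        a, c, g, t = (counts[b] for b in BASES)
        coords.append((
            (a + g) - (c + t),  # x: purine vs. pyrimidine
            (a + c) - (g + t),  # y: amino vs. keto
            (a + t) - (c + g),  # z: weak vs. strong hydrogen bonding
        ))
    return coords

def pwm(seqs, pseudocount=1.0):
    """Log-odds PWM from equal-length sequences.

    Assumption: uniform background (0.25 per base) and an additive pseudocount.
    """
    total = len(seqs) + 4 * pseudocount
    matrix = []
    for i in range(len(seqs[0])):
        column = [s[i].upper() for s in seqs]
        matrix.append({
            b: math.log2(((column.count(b) + pseudocount) / total) / 0.25)
            for b in BASES
        })
    return matrix

def pwm_score(seq, matrix):
    """Similarity score of seq against the PWM (sum of per-position log-odds)."""
    return sum(matrix[i][b] for i, b in enumerate(seq.upper()))

def euclidean(u, v):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(u, v)))
```

For example, `z_curve("ACGT")` ends at the origin because each base occurs exactly once, and `euclidean` can compare a candidate's final z-curve point (or any other feature vector) with a reference vector. How CSeqFM actually combines the z-curve coordinates with the PWM score before the SVM step is not stated in the abstract.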