Construction and application of a random forest-based classification model for DNA double-strand break induced by ionizing radiation

Jinhua CHEN; Xiaoting HUANG; Jiaying JIN; Boyang DING; Ran ZHU; Wenyan LI; Fenju LIU; Jiahua YU

VernacularTitle:基于随机森林的电离辐射诱导DNA双链断裂分类模型的构建与应用
Author: Jinhua CHEN¹; Xiaoting HUANG; Jiaying JIN; Boyang DING; Ran ZHU; Wenyan LI; Fenju LIU; Jiahua YU
Author Information

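1. School of Radiation Medicine and Protection, Soochow University; State Key Laboratory of Radiation Medicine and Protection; Collaborative Innovation Center of Radiological Medicine of Jiangsu Higher Education Institutions, 215123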
Keywords: Ionizing radiation; DNA double-strand break; Random forest; Classification model; Epigenetics
From: Chinese Journal of Radiological Medicine and Protection 2021;41(6):413-417
Country: China
Language: Chinese
Abstract: Objective: To construct a random forest classification model of DNA double-strand breaks (DSB) induced by ionizing radiation and to investigate the genome-wide distribution of DSB. Methods: The GRCh38 reference genome was divided into 50-kilobase fragments. These fragments were then classified as low-level or high-level regions of ionizing radiation-induced DSB according to sequencing data from MCF-7 cells. Data on eight epigenetic features were used as input. Two thirds of the data were randomly assigned to the training set, and the remainder to the test set. A random forest classification model with 100 decision trees was constructed, and the importance of each epigenetic feature in the model was analyzed and displayed. Results: On the test set, the random forest classification model achieved an accuracy of 99.4%, a precision of 98.9%, and a recall of 99.9%. The area under the receiver operating characteristic curve was 0.994. Among the eight epigenetic features, the H3K36me3 and DNase markers were the most important variables. Enrichment of both markers was much higher in DSB high-level regions than in DSB low-level regions. Conclusions: The random forest classification model could accurately predict genome-wide levels of ionizing radiation-induced DSB in 50-kilobase windows based on epigenetic features. The analysis suggests that these DSB may be distributed primarily at actively transcribed sites in the genome.
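
Illustrative sketch (not part of the original record): the bookkeeping steps named in the Methods and Results can be written out in a few lines of Python. The block below tiles a chromosome into 50-kilobase windows, performs a random two-thirds/one-third split, and computes accuracy, precision, and recall from binary labels. All function names, the toy chromosome length, and the toy labels are assumptions made for illustration; the random forest itself, the epigenetic feature tracks, and the authors' actual pipeline are not reproduced here.

```python
import random

WINDOW = 50_000  # 50 kb window size, as stated in the abstract


def make_windows(chrom_length, window=WINDOW):
    """Tile a chromosome into half-open (start, end) windows.

    The last window is truncated at the chromosome end (an assumption;
    the abstract does not say how partial windows were handled).
    """
    return [(start, min(start + window, chrom_length))
            for start in range(0, chrom_length, window)]


def split_two_thirds(items, seed=0):
    """Shuffle items and assign two thirds to training, the rest to test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = (2 * len(shuffled)) // 3
    return shuffled[:cut], shuffled[cut:]


def binary_scores(y_true, y_pred):
    """Return (accuracy, precision, recall), with 1 = DSB high-level region."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Toy usage with made-up numbers (not data from the study).
windows = make_windows(120_000)
train, test = split_two_thirds(range(9))
scores = binary_scores([1, 1, 0, 0], [1, 0, 0, 1])
```

In practice the classifier step would sit between the split and the scoring, typically fitted with an existing library rather than by hand; the abstract specifies only that the forest used 100 decision trees.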