Performance validation and result analysis of bioinformatics procedure for metagenomic next-generation sequencing
10.3760/cma.j.cn114452-20240314-00135
- VernacularTitle:病原宏基因组测序生物信息学分析流程的性能验证及结果分析
- Author:Wen XI¹; Yang XIAO¹; Shangdong YANG¹; Zhe LIU¹; Fang WANG¹; Xiaoqin WANG¹
- Author Information:1. Department of Clinical Laboratory, The First Affiliated Hospital of Xi'an Jiaotong University, Xi'an 710061, China
- Publication Type:Journal Article
- Keywords:Metagenome; Bioinformatics; Pathogen; Simulated datasets; Performance verification
- From:Chinese Journal of Laboratory Medicine 2025;48(1):117-124
- Country:China
- Language:Chinese
- Abstract:
Objective:To establish a preliminary performance validation protocol for the bioinformatics procedure of metagenomic next-generation sequencing (mNGS) in clinical laboratories.
Methods:Three types of simulated datasets were designed. The CatⅠ dataset consisted mainly of pathogen reference genomes and human sequences. CatⅠA comprised common pathogens mixed with human sequences and was used to evaluate the inclusiveness, accuracy, recall rate, precision, F1 score, and other indicators of the mNGS bioinformatics procedure. CatⅠB comprised closely related species of common pathogens mixed with human sequences and was used to evaluate the ability of the procedure to discriminate closely related species by calculating their detection rates and relative abundance ratios. Real data from 200 clinical samples were selected to construct CatⅡ, a simulated dataset consisting of colonizing bacteria, laboratory environment bacteria, reagent-derived engineered bacteria, pathogen reference genomes, and human sequences, which was used to evaluate the sensitivity, specificity, and accuracy of the bioinformatics pipeline for pathogen detection. The CatⅢ dataset was obtained by mixing sequences of 20 rare pathogens into sequencing data from negative bronchoalveolar lavage fluid (BALF) and was used to evaluate the positive detection rate and recall rate of rare pathogens.
Results:In the CatⅠA dataset, the positive consistency rate, inclusiveness, precision, and accuracy of the bioinformatics procedure were all greater than 99% under the three sequence gradients, with a recall rate of 72.31% (95% CI 69.61%-75.01%) and an F1 score of 82.00% (95% CI 79.77%-84.22%). In the CatⅠB dataset, closely related species were effectively detected at all sequence and proportion gradients, and their relative abundance ratio was within twofold of the designed ratio except for coronavirus, Haemophilus, primate bocaparvovirus, human respiratory syncytial virus, and Eimeria, indicating a good ability to distinguish closely related species. All 24 pathogen species included in the CatⅡ dataset were effectively detected, with sensitivity, specificity, and accuracy all greater than 90%. All rare pathogens in the CatⅢ dataset were detected, a detection rate of 100%.
Conclusions:Using simulated datasets, a performance validation scheme for mNGS bioinformatics analysis was preliminarily established. It can evaluate the accuracy of sequence classification, the ability to distinguish closely related species, and the ability to detect common and rare pathogens, and may serve as a reference for establishing mNGS workflows.
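The abstract reports precision, recall rate, and F1 score for CatⅠA and a twofold abundance-ratio criterion for CatⅠB. As a rough illustration only, the Python sketch below shows how such indicators are commonly computed from read-level counts. The `ReadCounts` structure, the function names, the example counts, and the reading of "within twofold" as an observed-to-designed ratio between 0.5 and 2 are all assumptions made here; the abstract does not describe the authors' actual implementation or averaging scheme.

```python
from dataclasses import dataclass


@dataclass
class ReadCounts:
    """Read-level counts for one simulated sample (hypothetical structure)."""
    true_positive: int   # pathogen reads assigned to the correct taxon
    false_positive: int  # reads assigned to a taxon they do not belong to
    false_negative: int  # pathogen reads unclassified or assigned elsewhere


def precision(c: ReadCounts) -> float:
    """Fraction of reported assignments that are correct."""
    denom = c.true_positive + c.false_positive
    return c.true_positive / denom if denom else 0.0


def recall(c: ReadCounts) -> float:
    """Fraction of true pathogen reads that were correctly assigned."""
    denom = c.true_positive + c.false_negative
    return c.true_positive / denom if denom else 0.0


def f1_score(c: ReadCounts) -> float:
    """Harmonic mean of precision and recall."""
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if (p + r) else 0.0


def within_fold(observed_ratio: float, designed_ratio: float,
                fold: float = 2.0) -> bool:
    """Check whether an observed abundance ratio lies within `fold` of the
    designed ratio (assumed here to mean 1/fold <= observed/designed <= fold)."""
    if designed_ratio <= 0 or observed_ratio < 0:
        raise ValueError("ratios must be non-negative and designed_ratio > 0")
    deviation = observed_ratio / designed_ratio
    return 1.0 / fold <= deviation <= fold


if __name__ == "__main__":
    # Made-up counts for illustration; not data from the study.
    sample = ReadCounts(true_positive=720, false_positive=5, false_negative=280)
    print(f"precision={precision(sample):.4f}")
    print(f"recall={recall(sample):.4f}")
    print(f"F1={f1_score(sample):.4f}")
    print("ratio acceptable:", within_fold(observed_ratio=1.5, designed_ratio=1.0))
```

A pipeline with high precision but lower recall, as reported for CatⅠA, would correspond to few false positives and a larger share of unassigned pathogen reads in this kind of calculation.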