Gclust: A Parallel Clustering Tool for Microbial Genomic Data
- Author: Li RUILIN; He XIAOYU; Dai CHUANGCHUANG; Zhu HAIDONG; Lang XIANYU; Chen WEI; Li XIAODONG; Zhao DAN; Zhang YU; Han XINYIN; Niu TIE; Zhao YI; Cao RONGQIANG; He RONG; Lu ZHONGHUA; Chi XUEBIN; Li WEIZHONG; Niu BEIFANG
- Keywords: Microbial genome clustering; Parallelization; Sparse suffix array; Maximal exact match; Segment extension
- From: Genomics, Proteomics & Bioinformatics 2019;17(5):496-502
- Country: China
- Language: Chinese
- Abstract: The accelerating growth of the public microbial genomic data imposes substantial burden on the research community that uses such resources. Building databases for non-redundant reference sequences from massive microbial genomic data based on clustering analysis is essential. However, existing clustering algorithms perform poorly on long genomic sequences. In this article, we present Gclust, a parallel program for clustering complete or draft genomic sequences, where clustering is accelerated with a novel parallelization strategy and a fast sequence comparison algorithm using sparse suffix arrays (SSAs). Moreover, genome identity measures between two sequences are calculated based on their maximal exact matches (MEMs). In this paper, we demonstrate the high speed and clustering quality of Gclust by examining four genome sequence datasets. Gclust is freely available for non-commercial use at https://github.com/niu-lab/gclust. We also introduce a web server for clustering user-uploaded genomes at http://niulab.scgrid.cn/gclust.
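
To illustrate the idea of a maximal exact match mentioned in the abstract, the following is a minimal Python sketch. It is not Gclust's code: it uses a brute-force scan rather than sparse suffix arrays, and the function names `find_mems` and `mem_coverage`, the `min_len` threshold, and the coverage-based score are all assumptions chosen for illustration. The abstract does not give the actual identity formula used by Gclust.

```python
def find_mems(a, b, min_len):
    """Return all maximal exact matches between a and b as (i, j, length).

    Brute-force illustration only (O(len(a) * len(b)) start pairs);
    a real tool would use an index such as a sparse suffix array.
    """
    mems = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            # Skip matches that could be extended to the left.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            # Extend to the right until a mismatch or a sequence end.
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            if k >= min_len:
                mems.append((i, j, k))
    return mems


def mem_coverage(a, b, min_len):
    """Fraction of positions in a covered by at least one MEM with b.

    Hypothetical similarity score, used here only as a stand-in
    for a MEM-based identity measure.
    """
    if not a:
        return 0.0
    covered = [False] * len(a)
    for i, _, length in find_mems(a, b, min_len):
        for pos in range(i, i + length):
            covered[pos] = True
    return sum(covered) / len(a)


if __name__ == "__main__":
    print(find_mems("TTACGTGG", "CCACGTAA", 3))
    print(mem_coverage("TTACGTGG", "CCACGTAA", 3))
```

In this toy input the two sequences share the substring `ACGT`, which cannot be extended in either direction, so it is the only match of length at least 3.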