1. Gclust: A Parallel Clustering Tool for Microbial Genomic Data
Li RUILIN ; He XIAOYU ; Dai CHUANGCHUANG ; Zhu HAIDONG ; Lang XIANYU ; Chen WEI ; Li XIAODONG ; Zhao DAN ; Zhang YU ; Han XINYIN ; Niu TIE ; Zhao YI ; Cao RONGQIANG ; He RONG ; Lu ZHONGHUA ; Chi XUEBIN ; Li WEIZHONG ; Niu BEIFANG
Genomics, Proteomics & Bioinformatics 2019;17(5):496-502
The accelerating growth of public microbial genomic data imposes a substantial burden on the research community that uses such resources. Building databases of non-redundant reference sequences from massive microbial genomic data through clustering analysis is therefore essential. However, existing clustering algorithms perform poorly on long genomic sequences. In this article, we present Gclust, a parallel program for clustering complete or draft genomic sequences, in which clustering is accelerated by a novel parallelization strategy and a fast sequence comparison algorithm using sparse suffix arrays (SSAs). Moreover, genome identity measures between two sequences are calculated based on their maximal exact matches (MEMs). We demonstrate the high speed and clustering quality of Gclust by examining four genome sequence datasets. Gclust is freely available for non-commercial use at https://github.com/niu-lab/gclust. We also introduce a web server for clustering user-uploaded genomes at http://niulab.scgrid.cn/gclust.
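The idea of deriving an identity measure from maximal exact matches can be illustrated with a minimal sketch. It is not Gclust's implementation: the abstract does not give the identity formula or the clustering procedure, so the coverage-based identity, the greedy longest-first assignment, and the names `maximal_exact_matches`, `mem_identity`, and `greedy_cluster` are all assumptions made for illustration. The sketch also replaces the sparse suffix array with a simple quadratic dynamic-programming scan, which is only practical for short toy sequences.

```python
def maximal_exact_matches(a, b, min_len):
    """Return (start_a, start_b, length) for each maximal exact match >= min_len.

    Naive O(len(a) * len(b)) scan; Gclust itself uses sparse suffix arrays.
    """
    mems = []
    prev = [0] * (len(b) + 1)  # common-suffix lengths for the previous row
    for i in range(len(a)):
        cur = [0] * (len(b) + 1)
        for j in range(len(b)):
            if a[i] == b[j]:
                # Longest common suffix ending at a[i], b[j]; this makes the
                # match left-maximal by construction.
                cur[j + 1] = prev[j] + 1
                # Right-maximal if a sequence ends or the next characters differ.
                at_end = i + 1 == len(a) or j + 1 == len(b)
                if (at_end or a[i + 1] != b[j + 1]) and cur[j + 1] >= min_len:
                    n = cur[j + 1]
                    mems.append((i - n + 1, j - n + 1, n))
        prev = cur
    return mems


def mem_identity(query, ref, min_len=4):
    """Assumed identity measure: fraction of query bases covered by MEMs."""
    if not query:
        return 0.0
    covered = [False] * len(query)
    for start_q, _, n in maximal_exact_matches(query, ref, min_len):
        for k in range(start_q, start_q + n):
            covered[k] = True
    return sum(covered) / len(query)


def greedy_cluster(seqs, threshold=0.9, min_len=4):
    """Assumed greedy scheme: longest sequences become representatives first."""
    clusters = {}  # representative name -> list of member names
    for name, seq in sorted(seqs.items(), key=lambda kv: -len(kv[1])):
        for rep in clusters:
            if mem_identity(seq, seqs[rep], min_len) >= threshold:
                clusters[rep].append(name)
                break
        else:
            clusters[name] = [name]  # no representative is similar enough
    return clusters


if __name__ == "__main__":
    toy = {"g1": "ACGTACGTAC", "g2": "ACGTACGT", "g3": "TTTTGGGGCC"}
    print(greedy_cluster(toy))
```

In this toy input, `g2` is an exact substring of `g1` and so joins its cluster, while `g3` shares no match of at least `min_len` bases with `g1` and becomes its own representative. The representatives together form the non-redundant set.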