Gclust: A Parallel Clustering Tool for Microbial Genomic Data

Li RUILIN; He XIAOYU; Dai CHUANGCHUANG; Zhu HAIDONG; Lang XIANYU; Chen WEI; Li XIAODONG; Zhao DAN; Zhang YU; Han XINYIN; Niu TIE; Zhao YI; Cao RONGQIANG; He RONG; Lu ZHONGHUA; Chi XUEBIN; Li WEIZHONG; Niu BEIFANG

Authors: Li RUILIN¹; He XIAOYU; Dai CHUANGCHUANG; Zhu HAIDONG; Lang XIANYU; Chen WEI; Li XIAODONG; Zhao DAN; Zhang YU; Han XINYIN; Niu TIE; Zhao YI; Cao RONGQIANG; He RONG; Lu ZHONGHUA; Chi XUEBIN; Li WEIZHONG; Niu BEIFANG
Author Information

1. Computer Network Information Center
Keywords: Microbial genome clustering; Parallelization; Sparse suffix array; Maximal exact match; Segment extension
From: Genomics, Proteomics & Bioinformatics 2019;17(5):496-502
Country: China
Language: Chinese
Abstract: The accelerating growth of public microbial genomic data imposes a substantial burden on the research community that uses such resources. Building databases of non-redundant reference sequences from massive microbial genomic data through clustering analysis is essential. However, existing clustering algorithms perform poorly on long genomic sequences. In this article, we present Gclust, a parallel program for clustering complete or draft genomic sequences, in which clustering is accelerated by a novel parallelization strategy and a fast sequence comparison algorithm based on sparse suffix arrays (SSAs). Moreover, genome identity measures between two sequences are calculated from their maximal exact matches (MEMs). We demonstrate the high speed and clustering quality of Gclust by examining four genome sequence datasets. Gclust is freely available for non-commercial use at https://github.com/niu-lab/gclust. We also introduce a web server for clustering user-uploaded genomes at http://niulab.scgrid.cn/gclust.
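
To make the notion of a maximal exact match concrete, the following is a minimal Python sketch, not Gclust's implementation. It finds MEMs by brute force rather than with a sparse suffix array, and the function names (`maximal_exact_matches`, `mem_coverage`), the `min_len` parameter, and the coverage-based score are assumptions made for illustration only. The abstract does not give Gclust's actual identity formula or its segment extension step, so neither is reproduced here.

```python
def maximal_exact_matches(a, b, min_len=3):
    """Return (start_in_a, start_in_b, length) for each exact match between
    a and b that cannot be extended to the left or right.

    Brute-force O(len(a) * len(b)) illustration; a suffix-array index
    would be used for real genomes.
    """
    mems = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            # Skip positions that merely continue a match begun earlier
            # on the same diagonal (the match is not left-maximal).
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            # Extend to the right until a mismatch or the end of a sequence.
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            if k >= min_len:
                mems.append((i, j, k))
    return mems


def mem_coverage(a, b, min_len=3):
    """Fraction of positions in a covered by at least one MEM with b.

    A hypothetical, simplified proxy for a MEM-based identity measure.
    """
    if not a:
        return 0.0
    covered = [False] * len(a)
    for start, _, length in maximal_exact_matches(a, b, min_len):
        for pos in range(start, start + length):
            covered[pos] = True
    return sum(covered) / len(a)


if __name__ == "__main__":
    print(maximal_exact_matches("AAAACCCC", "AAAAGGGG", min_len=4))
    print(mem_coverage("AAAACCCC", "AAAAGGGG", min_len=4))
```

Because repeated regions can be covered by matches from several different offsets, this simple coverage score can overestimate similarity; it is meant only to show what a MEM is, not how identity should be scored.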