1. A Brief Review of Software Tools for Pangenomics
Xiao JINGFA ; Zhang ZHEWEN ; Wu JIAYAN ; Yu JUN
Genomics, Proteomics & Bioinformatics 2015;(1):73-76
Since pangenomic study was first proposed, a dozen software tools have come into active use for pangenomic analysis. By the end of 2014, Panseq and the pan-genomes analysis pipeline (PGAP) ranked as the two most popular packages by cumulative citations in peer-reviewed scientific publications. The functions of these packages and tools vary, but include categorizing orthologous genes, calculating pangenomic profiles, integrating gene annotations, and constructing phylogenies. As epigenomic elements are gradually being revealed in prokaryotes, pangenomic databases and toolkits are expected to need extension to handle detailed functional annotations for genes and non-protein-coding sequences, including non-coding RNAs, insertion elements, and conserved structural elements. To develop better bioinformatic tools, user feedback and the integration of novel features are both essential.
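As an illustration of what "calculating pangenomic profiles" can mean in practice, the following is a minimal Python sketch, not the method used by Panseq or PGAP. It assumes orthologous genes have already been grouped into clusters and that each genome is represented by the set of cluster identifiers it carries; the function name `pangenome_profile` and the example strain and cluster names are hypothetical.

```python
from typing import Dict, Set


def pangenome_profile(genomes: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Partition ortholog clusters into pan, core, dispensable, and unique sets.

    `genomes` maps a strain name to the set of ortholog-cluster IDs it carries
    (an assumed input format; real tools derive this from sequence clustering).
    """
    if not genomes:
        return {"pan": set(), "core": set(), "dispensable": set(), "unique": set()}

    gene_sets = list(genomes.values())
    pan = set().union(*gene_sets)            # every cluster seen in any genome
    core = set.intersection(*gene_sets)      # clusters shared by all genomes

    # Count how many genomes carry each cluster.
    counts = {gene: sum(gene in s for s in gene_sets) for gene in pan}

    # Strain-specific clusters occur in exactly one genome (and are not core).
    unique = {gene for gene, n in counts.items() if n == 1} - core
    dispensable = pan - core - unique        # present in some but not all genomes

    return {"pan": pan, "core": core, "dispensable": dispensable, "unique": unique}


if __name__ == "__main__":
    example = {
        "strain1": {"a", "b", "c"},
        "strain2": {"a", "b", "d"},
        "strain3": {"a", "c", "e"},
    }
    profile = pangenome_profile(example)
    for category, genes in profile.items():
        print(category, sorted(genes))
```

Using plain sets keeps the partition explicit; a presence/absence matrix would scale better for large collections but obscures the definitions.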
2. Ribogenomics: the Science and Knowledge of RNA
Wu JIAYAN ; Xiao JINGFA ; Zhang ZHANG ; Wang XUMIN ; Hu SONGNIAN ; Yu JUN
Genomics, Proteomics & Bioinformatics 2014;(2):57-63
Ribonucleic acid (RNA) deserves not only a dedicated field of biological research (a discipline or branch of knowledge) but also explicit definitions of its roles in cellular processes and molecular mechanisms. Ribogenomics is the study of the biology of cellular RNAs, including their origin, biogenesis, structure, and function. On the informational track, messenger RNAs (mRNAs) are the major component of ribogenomes; they encode proteins, serve as one of the four major components of the translation machinery, and have their expression regulated at multiple levels by other operational RNAs. On the operational track, there are several diverse types of RNAs (their length distribution is perhaps the simplest stratification) involved in major cellular activities, such as chromosomal structure and organization, DNA replication and repair, transcriptional/post-transcriptional regulation, RNA processing and routing, translation, and cellular energy/metabolism regulation. An all-out effort exceeding the magnitude of the Human Genome Project is essential to construct just the mammalian transcriptomes in multiple contexts, including embryonic development, circadian and seasonal rhythms, defined life-span stages, pathological conditions, and anatomy-driven tissue/organ/cell types.
3. Compositional Variability and Mutation Spectra of Monophyletic SARS-CoV-2 Clades
Teng XUFEI ; Li QIANPENG ; Li ZHAO ; Zhang YUANSHENG ; Niu GUANGYI ; Xiao JINGFA ; Yu JUN ; Zhang ZHANG ; Song SHUHUI
Genomics, Proteomics & Bioinformatics 2020;18(6):648-663
COVID-19 and its causative pathogen SARS-CoV-2 have plunged the world into a staggering pandemic within a few months, and a global fight against both has been intensifying. Here, we describe an analysis procedure where genome composition and its variables are related, through the genetic code, to molecular mechanisms, based on an understanding of RNA replication and its feedback loop from mutation to viral proteome sequence fraternity, including effective sites on the replicase-transcriptase complex. Our analysis starts with primary sequence information, an identity-based phylogeny built from 22,051 SARS-CoV-2 sequences, and evaluation of sequence variation patterns as mutation spectra and their 12 permutations among organized clades. All are tailored to two key mechanisms: strand-biased and function-associated mutations. Our findings are as follows: 1) The most dominant mutation is the C-to-U permutation, whose abundant second-codon-position counts alter amino acid composition toward higher molecular weight and lower hydrophobicity, albeit assumed to be mostly slightly deleterious. 2) The second most abundant group includes three negative-strand mutations (U-to-C, A-to-G, and G-to-A) and a positive-strand mutation (G-to-U) due to DNA repair mechanisms after cellular abasic events. 3) A clade-associated biased mutation trend is found, attributable to an elevated level of negative-sense strand synthesis. 4) Within-clade permutation variation is very informative for associating non-synonymous mutations and viral proteome changes. These findings demand a platform where emerging mutations are mapped onto mostly subtle but fast-adjusting viral proteomes and transcriptomes, to provide biological and clinical information after logical convergence for effective pharmaceutical and diagnostic applications. Such actions are desperately needed, especially in the middle of the war against COVID-19.
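The "12 permutations" above are the 12 possible single-nucleotide substitutions among A, C, G, and U. A minimal Python sketch of tallying such a spectrum is given below. It is not the authors' pipeline: it assumes sequences are already aligned to a reference at equal length, counts every observed difference per sequence (rather than unique mutation events placed on a phylogeny), and skips gaps and ambiguous bases. The function name `mutation_spectrum` and the toy sequences are hypothetical.

```python
from collections import Counter
from itertools import permutations
from typing import Dict, Iterable

BASES = "ACGU"
# All ordered pairs of distinct bases: 4 * 3 = 12 substitution types.
SUBSTITUTION_TYPES = [f"{a}-to-{b}" for a, b in permutations(BASES, 2)]


def mutation_spectrum(reference: str, aligned_seqs: Iterable[str]) -> Dict[str, int]:
    """Count each of the 12 substitution types relative to a reference.

    Assumes every sequence is pre-aligned to the reference (same length).
    DNA-style 'T' is converted to 'U'; gaps and ambiguity codes are ignored.
    """
    ref = reference.upper().replace("T", "U")
    spectrum = Counter({sub: 0 for sub in SUBSTITUTION_TYPES})

    for seq in aligned_seqs:
        seq = seq.upper().replace("T", "U")
        if len(seq) != len(ref):
            raise ValueError("sequence length differs from reference; align first")
        for ref_base, seq_base in zip(ref, seq):
            if ref_base in BASES and seq_base in BASES and ref_base != seq_base:
                spectrum[f"{ref_base}-to-{seq_base}"] += 1

    return dict(spectrum)


if __name__ == "__main__":
    ref = "ACGUAC"
    isolates = ["AUGUAC", "ACGUAU", "ACAUAC"]
    counts = mutation_spectrum(ref, isolates)
    for sub in SUBSTITUTION_TYPES:
        print(sub, counts[sub])
```

Comparing such counts between clades, or between codon positions, is one simple way to look for the strand-biased trends the abstract describes; doing so rigorously would require counting independent mutation events rather than raw differences.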
4. Genome Warehouse: A Public Repository Housing Genome-scale Data
Chen MEILI ; Ma YINGKE ; Wu SONG ; Zheng XINCHANG ; Kang HONGEN ; Sang JIAN ; Xu XINGJIAN ; Hao LILI ; Li ZHAOHUA ; Gong ZHENG ; Xiao JINGFA ; Zhang ZHANG ; Zhao WENMING ; Bao YIMING
Genomics, Proteomics & Bioinformatics 2021;19(4):584-589
The Genome Warehouse (GWH) is a public repository housing genome assembly data for a wide range of species and delivering a series of web services for genome data submission, storage, release, and sharing. As one of the core resources in the National Genomics Data Center (NGDC), part of the China National Center for Bioinformation (CNCB; https://ngdc.cncb.ac.cn), GWH accepts both full and partial (chloroplast, mitochondrion, and plasmid) genome sequences with different assembly levels, as well as updates of existing genome assemblies. For each assembly, GWH collects detailed genome-related metadata on the biological project, biological sample, and genome assembly, in addition to the genome sequence and annotation. To archive high-quality genome sequences and annotations, GWH is equipped with a uniform and standardized procedure for quality control. Besides basic browse and search functionalities, all released genome sequences and annotations can be visualized with JBrowse. As of May 21, 2021, GWH had received 19,124 direct submissions covering 1108 species and had released 8772 of them. Collectively, GWH serves as an important resource for genome-scale data management and provides free and publicly accessible data to support research activities throughout the world. GWH is publicly accessible at https://ngdc.cncb.ac.cn/gwh.
5. The Genome Sequence Archive Family: Toward Explosive Data Growth and Diverse Data Types
Chen TINGTING ; Chen XU ; Zhang SISI ; Zhu JUNWEI ; Tang BIXIA ; Wang ANKE ; Dong LILI ; Zhang ZHEWEN ; Yu CAIXIA ; Sun YANLING ; Chi LIANJIANG ; Chen HUANXIN ; Zhai SHUANG ; Sun YUBIN ; Lan LI ; Zhang XIN ; Xiao JINGFA ; Bao YIMING ; Wang YANQING ; Zhang ZHANG ; Zhao WENMING
Genomics, Proteomics & Bioinformatics 2021;19(4):578-583
The Genome Sequence Archive (GSA) is a data repository for archiving raw sequence data, which provides data storage and sharing services for worldwide scientific communities. In view of explosive data growth and diverse data types, here we present the GSA family, an expansion into a set of raw-data archiving resources with different purposes, namely, GSA (https://ngdc.cncb.ac.cn/gsa/), GSA for Human (GSA-Human, https://ngdc.cncb.ac.cn/gsa-human/), and Open Archive for Miscellaneous Data (OMIX, https://ngdc.cncb.ac.cn/omix/). Compared with the 2017 version, GSA has been significantly updated in its data model, online functionalities, and web interfaces. GSA-Human, as a new partner of GSA, is a data repository specialized in human genetics-related data with controlled access and security. OMIX, as a critical complement to the two resources mentioned above, is an open archive for miscellaneous data. Together, these resources form a family dedicated to archiving explosively growing data of diverse types, accepting data submissions from all over the world, and providing free open access to all publicly available data in support of worldwide research activities.
6. The Global Landscape of SARS-CoV-2 Genomes, Variants, and Haplotypes in 2019nCoVR
Song SHUHUI ; Ma LINA ; Zou DONG ; Tian DONGMEI ; Li CUIPING ; Zhu JUNWEI ; Chen MEILI ; Wang ANKE ; Ma YINGKE ; Li MENGWEI ; Teng XUFEI ; Cui YING ; Duan GUANGYA ; Zhang MOCHEN ; Jin TONG ; Shi CHENGMIN ; Du ZHENGLIN ; Zhang YADONG ; Liu CHUANDONG ; Li RUJIAO ; Zeng JINGYAO ; Hao LILI ; Jiang SHUAI ; Chen HUA ; Han DALI ; Xiao JINGFA ; Zhang ZHANG ; Zhao WENMING ; Xue YONGBIAO ; Bao YIMING
Genomics, Proteomics & Bioinformatics 2020;18(6):749-759
On January 22, 2020, the China National Center for Bioinformation (CNCB) released the 2019 Novel Coronavirus Resource (2019nCoVR), an open-access information resource for the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). 2019nCoVR features a comprehensive integration of sequence and clinical information for all publicly available SARS-CoV-2 isolates, which are manually curated with value-added annotations and quality-evaluated by an automated in-house pipeline. Of particular note, 2019nCoVR offers systematic analyses to generate a dynamic landscape of SARS-CoV-2 genomic variations at a global scale. It provides all identified variants and their detailed statistics for each virus isolate, and gathers the quality score, functional annotation, and population frequency for each variant. Spatiotemporal change for each variant can be visualized, and historical viral haplotype network maps for the course of the outbreak are also generated based on all complete and high-quality genomes available. Moreover, 2019nCoVR provides a full collection of SARS-CoV-2-relevant literature on the coronavirus disease 2019 (COVID-19), including published papers from PubMed as well as preprints from services such as bioRxiv and medRxiv through Europe PMC. Furthermore, by linking with relevant databases in CNCB, 2019nCoVR offers data submission services for raw sequence reads and assembled genomes, and data sharing with NCBI. Collectively, 2019nCoVR is updated daily to collect the latest information on genome sequences, variants, haplotypes, and literature for a timely reflection, making it a valuable resource for the global research community. 2019nCoVR is accessible at https://bigd.big.ac.cn/ncov/.
7. The Elements of Data Sharing
Zhang ZHANG ; Shuhui SONG ; Jun YU ; Wenming ZHAO ; Jingfa XIAO ; Yiming BAO
Genomics, Proteomics & Bioinformatics 2020;18(1):1-4
8. Whole Genome Analyses of Chinese Population and De Novo Assembly of A Northern Han Genome
Zhenglin DU ; Liang MA ; Hongzhu QU ; Wei CHEN ; Bing ZHANG ; Xi LU ; Weibo ZHAI ; Xin SHENG ; Yongqiao SUN ; Wenjie LI ; Meng LEI ; Qiuhui QI ; Na YUAN ; Shuo SHI ; Jingyao ZENG ; Jinyue WANG ; Yadong YANG ; Qi LIU ; Yaqiang HONG ; Lili DONG ; Zhewen ZHANG ; Dong ZOU ; Yanqing WANG ; Shuhui SONG ; Fan LIU ; Xiangdong FANG ; Hua CHEN ; Xin LIU ; Jingfa XIAO ; Changqing ZENG
Genomics, Proteomics & Bioinformatics 2019;17(3):229-247
Unraveling the genetic mechanisms of disease and physiological traits requires comprehensive sequencing analysis of large samples from Chinese populations. Here, we report the primary results of the Chinese Academy of Sciences Precision Medicine Initiative (CASPMI) project launched by the Chinese Academy of Sciences, including the de novo assembly of a northern Han reference genome (NH1.0) and whole genome analyses of 597 healthy people from most areas of China. Given that the two existing reference genomes for Han Chinese (YH and HX1) were both from the south, we constructed NH1.0, a new reference genome from a northern individual, by combining the sequencing strategies of PacBio, 10× Genomics, and Bionano mapping. Using this integrated approach, we obtained an N50 scaffold size of 46.63 Mb for the NH1.0 genome and performed a comparative genome analysis of NH1.0 with YH and HX1. To generate a genomic variation map of Chinese populations, we performed whole-genome sequencing of 597 participants and identified 24.85 million (M) single nucleotide variants (SNVs), 3.85 M small indels, and 106,382 structural variations. In the association analysis with collected phenotypes, we found that the T allele of rs1549293 in KAT8 significantly correlated with waist circumference in northern Han males. Moreover, significant genetic diversity in MTHFR, TCN2, FADS1, and FADS2, which are associated with circulating folate, vitamin B12, or lipid metabolism, was observed between northerners and southerners. In particular, for the homocysteine-increasing allele of rs1801133 (MTHFR 677T), we hypothesize that there exists a "comfort" zone for a high frequency of 677T between latitudes 35 and 45 degrees north. Taken together, our results provide a high-quality northern Han reference genome and novel population-specific data sets of genetic variants for use in personalized and precision medicine.
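The abstract above reports an N50 scaffold size. For readers unfamiliar with the metric, the following minimal Python sketch computes it under the common definition: the length of the shortest scaffold in the smallest set of longest scaffolds that together cover at least half of the total assembly length. The function name `n50` and the toy lengths are hypothetical and unrelated to the NH1.0 data.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of scaffold (or contig) lengths.

    Scaffolds are visited from longest to shortest; the N50 is the length
    at which the running total first reaches half of the assembly size.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one scaffold length is required")

    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Compare 2 * running with total to stay in integer arithmetic.
        if 2 * running >= total:
            return length

    raise AssertionError("unreachable: running total always reaches the full length")


if __name__ == "__main__":
    toy_scaffolds = [2, 3, 4, 5, 6, 7, 8, 9, 10]  # total length 54
    print(n50(toy_scaffolds))  # 10 + 9 + 8 = 27 reaches half of 54, so N50 is 8
```

Because N50 depends only on the length distribution, it says nothing about assembly correctness; it is usually read alongside other quality measures.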