1.SFannotation: A Simple and Fast Protein Function Annotation System.
Genomics & Informatics 2014;12(2):76-78
Owing to the generation of vast amounts of sequencing data by using cost-effective, high-throughput sequencing technologies with improved computational approaches, many putative proteins have been discovered after assembly and structural annotation. Putative proteins are typically annotated using a functional annotation system that uses extant databases, but the expansive size of these databases often causes a bottleneck for rapid functional annotation. We developed SFannotation, a simple and fast functional annotation system that rapidly annotates putative proteins against four extant databases, Swiss-Prot, TIGRFAMs, Pfam, and the non-redundant sequence database, by using a best-hit approach with BLASTP and HMMSEARCH.
Computational Biology ; Databases, Protein ; Molecular Sequence Annotation
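The best-hit approach mentioned in the abstract above can be illustrated with a minimal sketch. It assumes BLASTP was run with the standard tabular output (`-outfmt 6`) and keeps, for each query protein, the hit with the highest bit score. The function name `best_hits`, the E-value cutoff of 1e-5, and the example rows are illustrative assumptions, not SFannotation's actual code or settings.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(handle, max_evalue=1e-5):
    """Return {query: (subject, evalue, bitscore)} with the top-scoring hit per query.

    max_evalue is a hypothetical cutoff chosen for illustration.
    """
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        rec = dict(zip(COLUMNS, row))
        evalue = float(rec["evalue"])
        bits = float(rec["bitscore"])
        if evalue > max_evalue:
            continue  # hit is too weak to use for annotation
        current = best.get(rec["qseqid"])
        if current is None or bits > current[2]:
            best[rec["qseqid"]] = (rec["sseqid"], evalue, bits)
    return best


# Made-up example rows: two hits for prot1, one weak hit for prot2.
example = "\n".join([
    "prot1\tsp|P12345|EXAMPLE_A\t98.0\t100\t2\t0\t1\t100\t1\t100\t1e-50\t200",
    "prot1\tsp|Q99999|EXAMPLE_B\t60.0\t90\t36\t0\t5\t94\t3\t92\t1e-20\t95",
    "prot2\tsp|P00001|EXAMPLE_C\t30.0\t50\t35\t0\t1\t50\t1\t50\t0.5\t25",
])
hits = best_hits(io.StringIO(example))
```

In this toy input, `prot1` keeps its higher-scoring hit and `prot2` receives no annotation because its only hit fails the assumed E-value cutoff.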
2.Evaluation of clustering algorithms for gene expression data using gene ontology annotations.
Chinese Medical Journal 2012;125(17):3048-3052
BACKGROUND: Clustering is a useful exploratory technique for interpreting gene expression data to reveal groups of genes sharing common functional attributes. Biologists frequently face the problem of choosing an appropriate algorithm. We aimed to provide a standalone, easily accessible and biologically oriented criterion for evaluating the clustering of expression data.
METHODS: An external criterion utilizing annotation-based similarities between genes is proposed in this work. Gene ontology information is employed as the annotation source. Six widely used clustering algorithms were compared over various types of gene expression data sets based on the proposed criterion.
RESULTS: The ranking of these algorithms given by the criterion coincides with common knowledge. Single-linkage performs significantly worse than the others, even worse than the random algorithm. Ward's method achieves the best performance in most cases.
CONCLUSIONS: The proposed criterion has a strong ability to distinguish among different clustering algorithms with different distance measurements. It is also demonstrated that analyzing the main contributors to the criterion may offer some guidelines for finding locally compact clusters. In addition, we suggest using Ward's algorithm for gene expression data analysis.
Algorithms ; Cluster Analysis ; Gene Expression Profiling ; Humans ; Molecular Sequence Annotation
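To make the idea of an annotation-based external criterion concrete, the sketch below scores a clustering by the mean pairwise similarity of gene ontology term sets among genes placed in the same cluster. The abstract does not specify the similarity measure, so Jaccard similarity is used here purely as a stand-in; the function names, gene names, and term assignments are all hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two annotation term sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def within_cluster_similarity(clusters, annotations):
    """Mean pairwise annotation similarity over all gene pairs sharing a cluster."""
    scores = []
    for genes in clusters:
        for g1, g2 in combinations(genes, 2):
            scores.append(jaccard(annotations[g1], annotations[g2]))
    return sum(scores) / len(scores) if scores else 0.0


# Toy annotations: hypothetical genes mapped to sets of GO term identifiers.
annotations = {
    "g1": {"GO:0006915", "GO:0008219"},
    "g2": {"GO:0006915"},
    "g3": {"GO:0007049"},
    "g4": {"GO:0007049", "GO:0051301"},
}

# A clustering that groups functionally similar genes, and one that does not.
coherent = within_cluster_similarity([["g1", "g2"], ["g3", "g4"]], annotations)
mixed = within_cluster_similarity([["g1", "g3"], ["g2", "g4"]], annotations)
```

Under this stand-in measure, a clustering algorithm whose output scores higher groups genes with more similar annotations, which is the kind of comparison the criterion is meant to support.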
3.Characterization and phylogenetic analysis of complete chloroplast genome of cultivated Qinan agarwood.
Qiao-Zhen LIU ; Jiang-Peng DAI ; Peng-Jian ZHU ; Yue-Xia LIN ; Xiao-Xia GAO ; Shuang ZHU
China Journal of Chinese Materia Medica 2023;48(20):5531-5539
"Tangjie" leaves of cultivated Qinan agarwood were used to obtain the complete chloroplast genome using high-throughput sequencing technology. Together with 12 chloroplast genomes of Aquilaria species downloaded from NCBI, this genome was analyzed using bioinformatics methods to determine its chloroplast genome characteristics and phylogenetic relationships. The results showed that the chloroplast genome of cultivated Qinan agarwood "Tangjie" was 174 909 bp long with a GC content of 36.7%. A total of 136 genes were annotated, including 90 protein-coding genes, 38 tRNA genes, and 8 rRNA genes. Sequence repeat analysis detected 80 simple sequence repeats (SSRs) and 124 long sequence repeats, with most SSRs composed of A and T bases. Codon preference analysis revealed that AUU was the most frequently used codon, and codons ending in A or U were preferred. Comparative analysis of Aquilaria chloroplast genomes showed relative conservation of the IR region boundaries and identified five highly variable regions: trnD-trnY, trnT-trnL, trnF-ndhJ, petA-cemA, and rpl32, which could serve as potential DNA barcodes specific to the Aquilaria genus. Selection pressure analysis indicated positive selection in the rbcL, rps11, and rpl32 genes. Phylogenetic analysis revealed that cultivated Qinan agarwood "Tangjie" and Aquilaria agallocha clustered together (100% support), supporting the Chinese origin of Qinan agarwood from Aquilaria agallocha. The chloroplast genome data obtained in this study provide a foundation for studying the genetic diversity of cultivated Qinan agarwood and for molecular identification of the Aquilaria genus.
Phylogeny ; Genome, Chloroplast ; Codon ; Molecular Sequence Annotation ; Thymelaeaceae/genetics*
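The GC content and codon usage statistics reported in the abstract above are simple to compute once a sequence is in hand. The following is a minimal sketch on a made-up coding sequence; the function names and the toy sequence are assumptions for illustration and do not reproduce the study's pipeline or data.

```python
from collections import Counter


def gc_content(seq):
    """Percentage of G and C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    gc = sum(1 for base in seq if base in "GC")
    acgt = sum(1 for base in seq if base in "ACGT")
    return 100.0 * gc / acgt if acgt else 0.0


def codon_counts(cds):
    """Count codons in a coding sequence, reported in RNA letters (T -> U)."""
    rna = cds.upper().replace("T", "U")
    usable = len(rna) - len(rna) % 3  # ignore a trailing partial codon
    return Counter(rna[i:i + 3] for i in range(0, usable, 3))


# Toy coding sequence (hypothetical): ATG ATT ATT GCT TAA
toy_cds = "ATGATTATTGCTTAA"
toy_gc = gc_content(toy_cds)
toy_codons = codon_counts(toy_cds)
most_used = toy_codons.most_common(1)[0][0]
```

On a real genome the same counting would be applied to every annotated protein-coding gene and the counts summed before ranking codons.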
4.Analysis of the chloroplast genome of Incarvillea younghusbandii Sprague.
Yaying ZHANG ; Wanyao JIAO ; Wenrui JIAO ; Tianle QIAO ; Zhiyang SU ; Shuo FENG
Chinese Journal of Biotechnology 2023;39(7):2954-2964
Incarvillea younghusbandii Sprague is a traditional tonic herb. Its roots are used as herbal medicine for nourishing and strengthening, as well as for treating postpartum milk deficiency and weakness. In this study, the chloroplast genome of I. younghusbandii was sequenced and assembled using high-throughput sequencing technology. The sequence characteristics, sequence repeats, codon usage bias, phylogenetic relationships and estimated divergence time of I. younghusbandii were analyzed. The 159 323 bp sequence contained a large single copy region (80 197 bp), a small single copy region (9 030 bp) and two inverted repeat sequences (35 048 bp each). It contained 120 genes, including 77 protein-coding genes, 8 ribosomal RNA genes and 35 transfer RNA genes. AAA was the most frequent codon in the chloroplast coding sequences of I. younghusbandii. A total of 42 simple sequence repeats were identified in the chloroplast genome. Phylogenetic analysis revealed that I. younghusbandii was most closely related to its taxonomically close relative Incarvillea compacta. The divergence between I. younghusbandii and I. compacta was dated to 4.66 million years ago. This study is significant for the scientific conservation and development of resources related to I. compacta. It also provides a basic genetic resource for subsequent species identification in the genus Incarvillea and for population genetic diversity studies of Bignoniaceae.
Phylogeny ; Molecular Sequence Annotation ; Genome, Chloroplast ; Sequence Analysis, DNA ; Whole Genome Sequencing
5.Progress in proteogenomics of prokaryotes.
Chengpu ZHANG ; Ping XU ; Yunping ZHU
Chinese Journal of Biotechnology 2014;30(7):1026-1035
With the rapid development of genome sequencing technologies, a large number of prokaryotic genomes have been sequenced in recent years. To further investigate the models and functions of genomes, algorithms for genome annotation based on sequence and homology features have been widely applied to newly sequenced genomes. However, gene annotations that use only genomic information are prone to errors, such as incorrect N-termini and pseudogenes. It is even harder to provide reasonable annotation results when the genome sequencing results are poor. Transcriptomics, based on technologies such as microarray and RNA-seq, and proteomics, based on MS/MS, have been widely used to identify gene products with high throughput and high sensitivity, providing powerful tools for the verification and correction of annotated genomes. Compared with transcriptomics, proteomics can generate the protein list for the expressed genes in samples or cells without confusion from non-coding RNA, making proteogenomics an important basis for genome annotation in prokaryotes. In this paper, we first describe the traditional genome annotation algorithms and point out their shortcomings. We then summarize the advantages of proteomics in genome annotation and review the progress of proteogenomics in prokaryotes. Finally, we discuss the challenges and strategies in data analysis and potential solutions for the development of proteogenomics.
Genomics ; Molecular Sequence Annotation ; Prokaryotic Cells/metabolism ; Proteomics ; Tandem Mass Spectrometry
6.Transcriptome Sequencing Analysis of Chrysomyia Megacephala Pupae in Different Growing Periods.
Qi Yan WANG ; Hong Ling ZHANG ; Zheng REN ; Yu Bo LIU ; Jing Yan JI ; Jiang HUANG
Journal of Forensic Medicine 2021;37(3):318-324
Objective To study the growth regulation, environmental adaptation and epigenetic regulation of Chrysomyia Megacephala pupae by obtaining transcriptome data of Chrysomyia Megacephala in different growing periods, and to lay the foundation for forensic application. Methods Chrysomyia Megacephala was cultivated and, after pupation, 3 pupae were collected every 24 h from pupation to emergence and stored at -80 ℃ for later use. High-throughput sequencing was performed on the Illumina Hiseq 4000 and Unigenes were obtained. The Unigenes were compared using the NCBI BLAST tool against databases such as NR, STRING, SWISS-PROT (including Pfam), GO, COG and KEGG to obtain the corresponding annotation information. The expression level of the Unigenes in six different growing periods was calculated by the FPKM method, and differentially expressed genes were screened according to the following criteria: the absolute value of the log2 fold change in FPKM between two growing periods must be larger than 1 (|log2FC|>1), and the false discovery rate must be less than 0.05. Results When the mean temperature was 25.6 ℃, Chrysomyia Megacephala emerged 6 d after pupation. A total of 43 408 Unigenes were obtained, with a mean length of 905 bp, of which 32 500, 18 720, 13 542, 9 191 and 18 720 were annotated by the NR, SWISS-PROT, Pfam, STRING and KEGG databases, respectively. In pairwise comparisons of pupae from two different growing periods, the number of differentially expressed genes ranged from 801 to 5 307, and the total number of differentially expressed genes was 45 676. Conclusion Gene expression in Chrysomyia Megacephala pupae differs between growing periods. The results provide a good foundation for further research on the transcriptome changes in each period of the pupae of sarcosaprophagous flies and provide the basis for exploring the genes associated with the growth of Chrysomyia Megacephala pupae.
Animals ; Epigenesis, Genetic ; Gene Expression Profiling ; High-Throughput Nucleotide Sequencing ; Molecular Sequence Annotation ; Pupa/genetics* ; Transcriptome
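The expression measure and screening thresholds described in the abstract above can be sketched as follows. The code uses the standard FPKM definition and the two stated cutoffs (absolute log2 fold change above 1, false discovery rate below 0.05). The function names, the pseudocount that guards against division by zero, and the example numbers are assumptions for illustration, not the study's actual software or data.

```python
import math


def fpkm(fragments, length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    return fragments * 1e9 / (length_bp * total_mapped_fragments)


def is_differential(fpkm_a, fpkm_b, fdr, pseudocount=0.01):
    """Apply the two thresholds: |log2 fold change| > 1 and FDR < 0.05.

    The pseudocount is a hypothetical choice that keeps the ratio defined
    when a Unigene has zero expression in one growing period.
    """
    log2_fc = math.log2((fpkm_b + pseudocount) / (fpkm_a + pseudocount))
    return abs(log2_fc) > 1 and fdr < 0.05


# Made-up example: 500 fragments on a 1 000 bp Unigene, 10 million mapped fragments.
example_fpkm = fpkm(500, 1000, 10_000_000)

# Made-up comparisons between two growing periods.
strong_change = is_differential(10.0, 25.0, fdr=0.01)   # passes both thresholds
small_change = is_differential(10.0, 15.0, fdr=0.01)    # fold change too small
not_significant = is_differential(10.0, 40.0, fdr=0.2)  # FDR too high
```

A Unigene is counted as differentially expressed only when both conditions hold, so a large fold change with a high false discovery rate is still excluded.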
7.AnsNGS: An Annotation System to Sequence Variations of Next Generation Sequencing Data for Disease-Related Phenotypes.
Young Ji NA ; Yonglae CHO ; Ju Han KIM
Healthcare Informatics Research 2013;19(1):50-55
OBJECTIVES: The use of next-generation sequencing (NGS) data to identify disease-causing genes provides a promising opportunity for the diagnosis of disease. Beyond previous efforts in NGS data alignment, variant detection, and visualization, developing a comprehensive annotation system supported by multiple layers of disease phenotype-related databases is essential for deciphering the human genome. METHODS: AnsNGS (Annotation system of sequence variations for next-generation sequencing data) is a tool for contextualizing variants related to diseases and examining their functional consequences. AnsNGS integrates a variety of annotation databases to attain multiple levels of annotation. RESULTS: AnsNGS assigns biological functions to variants and provides gene (or disease)-centric queries for finding disease-causing variants. AnsNGS also connects genes harbouring variants with the corresponding expression probes for downstream analysis using expression microarrays. Here, we demonstrate its ability to identify disease-related variants in the human genome. CONCLUSIONS: AnsNGS can give key insight into which of these variants are already known to be involved in a disease-related phenotype or are located in or near a known regulatory site. AnsNGS is available free of charge to academic users and can be obtained from http://snubi.org/software/AnsNGS/.
Fees and Charges ; Genome, Human ; Genomic Structural Variation ; High-Throughput Nucleotide Sequencing ; Humans ; Molecular Sequence Annotation ; Phenotype ; Sequence Analysis, DNA
8.Prediction and bioinformatics analysis of human gene expression profiling regulated by amifostine.
Bo YANG ; Li-Li CAI ; Xiao-Hua CHI ; Xue-Chun LU ; Feng ZHANG ; Shuai TUO ; Hong-Li ZHU ; Li-Hong LIU ; Jiang-Wei YAN ; Chao-Wei TUO
Journal of Experimental Hematology 2011;19(3):711-716
The objective of this study was to perform a bioinformatics analysis of the characteristics of gene expression profiling regulated by amifostine and to predict its novel potential biological functions, in order to provide a direction and study methods for further exploring the pharmacological actions of amifostine. Amifostine was used as a keyword to search free internet-based gene expression databases including GEO, the Affymetrix gene chip database, GenBank, SAGE, GeneCard, InterPro, ProtoNet, UniProt and BLOCKS, and the selected amifostine-regulated gene expression profiling data were subjected to validity testing, gene expression difference analysis, functional clustering and gene annotation. The results showed that only one dataset of gene expression profiling regulated by amifostine was retrieved from the GEO database (accession: GSE3212). Through validity testing and gene expression difference analysis, a significant difference (p < 0.01) was found in only 2.14% of the whole genome (460/192000). Gene annotation analysis showed that 139 out of 460 genes were known genes, of which 77 genes were up-regulated and 62 genes were down-regulated. Thirteen of the 139 genes were newly expressed following amifostine treatment of K562 cells, whereas the expression of 5 genes was completely inhibited. Functional clustering showed that the 139 genes were divided into 11 categories and that their biological functions were involved in hematopoietic and immunologic regulation, apoptosis and the cell cycle. It is concluded that bioinformatics methods can be applied to the analysis of gene expression profiling regulated by amifostine. Amifostine has a regulatory effect on human gene expression profiling, and this action is mainly seen in biological processes including hematopoiesis, immunologic regulation, apoptosis and the cell cycle. The effect of amifostine on human gene expression needs to be further verified under experimental conditions.
Amifostine/pharmacology ; Computational Biology ; Gene Expression/drug effects ; Gene Expression Profiling/methods ; Humans ; Microarray Analysis ; Molecular Sequence Annotation
9.CIRCpedia v2: An Updated Database for Comprehensive Circular RNA Annotation and Expression Comparison.
Rui DONG ; Xu-Kai MA ; Guo-Wei LI ; Li YANG
Genomics, Proteomics & Bioinformatics 2018;16(4):226-233
Circular RNAs (circRNAs) from back-splicing of exon(s) have recently been found to be broadly expressed in eukaryotes, in tissue- and species-specific manners. Although the functions of most circRNAs remain elusive, some circRNAs have been shown to function in gene expression regulation and are potentially related to diseases. Due to their stability, circRNAs can also be used as biomarkers for diagnosis. Profiling circRNAs by integrating their expression among different samples thus provides a molecular basis for further functional study of circRNAs and their potential application in the clinic. Here, we report CIRCpedia v2, an updated database for comprehensive circRNA annotation from over 180 RNA-seq datasets across six different species. This atlas allows users to search, browse, and download circRNAs with expression features in various cell types/tissues, including disease samples. In addition, the updated database incorporates conservation analysis of circRNAs between humans and mice. Finally, the web interface also contains computational tools to compare circRNA expression among samples. CIRCpedia v2 is accessible at http://www.picb.ac.cn/rnomics/circpedia.
Animals ; Databases, Genetic ; Gene Expression Regulation ; Humans ; Internet ; Mice ; Molecular Sequence Annotation ; RNA/genetics ; User-Computer Interface
10.Blood transcriptome resources of chinstrap (Pygoscelis antarcticus) and gentoo (Pygoscelis papua) penguins from the South Shetland Islands, Antarctica
Bo Mi KIM ; Jihye JEONG ; Euna JO ; Do Hwan AHN ; Jeong Hoon KIM ; Jae Sung RHEE ; Hyun PARK
Genomics & Informatics 2019;17(1):e5-
The chinstrap (Pygoscelis antarcticus) and gentoo (P. papua) penguins are distributed throughout Antarctica and the sub-Antarctic islands. In this study, high-quality de novo assemblies of blood transcriptomes from these penguins were generated using the Illumina MiSeq platform. A total of 22.2 and 21.8 raw reads were obtained from chinstrap and gentoo penguins, respectively. These reads were assembled using the Oases assembly platform and resulted in 26,036 and 21,854 contigs with N50 values of 929 and 933 base pairs, respectively. Functional gene annotations through pathway analyses of the Gene Ontology, EuKaryotic Orthologous Groups, and Kyoto Encyclopedia of Genes and Genomes (KEGG) databases were performed for each blood transcriptome, resulting in a similar compositional order between the two transcriptomes. Ortholog comparisons with previously published transcriptomes from the Adélie (P. adeliae) and emperor (Aptenodytes forsteri) penguins revealed that a high proportion of the four penguins’ transcriptomes had significant sequence homology. Because blood and tissues of penguins have been used to monitor pollution in Antarctica, immune parameters in blood could be important indicators for understanding the health status of penguins and other Antarctic animals. In the blood transcriptomes, KEGG analyses detected many essential genes involved in the major innate immunity pathways, which are key metabolic pathways for maintaining homeostasis against exogenous infections or toxins. Blood transcriptome studies such as this may be useful for checking the immune and health status of penguins without sacrifice.
Animals ; Base Pairing ; Gene Ontology ; Genes, Essential ; Genome ; Homeostasis ; Immunity, Innate ; Islands ; Metabolic Networks and Pathways ; Molecular Sequence Annotation ; Sequence Homology ; Spheniscidae ; Transcriptome
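The N50 values quoted for the two assemblies in the abstract above summarize contig length distributions: N50 is the contig length at which contigs of that length or longer contain at least half of the total assembled bases. A minimal sketch of the calculation follows; the function name and the toy contig lengths are illustrative assumptions and are unrelated to the penguin assemblies.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are visited from longest to shortest; the N50 is the length of
    the contig at which the running total first reaches half of all bases.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contig lengths given")


# Toy contig lengths (hypothetical): total is 54, so half is 27.
toy_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
toy_n50 = n50(toy_lengths)  # 10 + 9 + 8 = 27 reaches half, so N50 is 8
```

Because N50 depends on the total assembly size, two assemblies with similar N50 values, as reported here, can still differ in contig number and total length.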