1.Gene print-based cell subtypes annotation of human disease across heterogeneous datasets with gPRINT.
Ruojin YAN ; Chunmei FAN ; Shen GU ; Tingzhang WANG ; Zi YIN ; Xiao CHEN
Protein & Cell 2025;16(8):685-704
Identification of disease-specific cell subtypes (DSCSs) has profound implications for understanding disease mechanisms, preoperative diagnosis, and precision therapy. However, achieving unified annotation of DSCSs in heterogeneous single-cell datasets remains a challenge. In this study, we developed the gPRINT algorithm (generalized approach for cell subtype identification with single cell's voicePRINT). Inspired by the principles of speech recognition in noisy environments, gPRINT transforms gene position and gene expression information into voiceprints based on ordered and clustered gene expression phenomena, obtaining unique "gene print" patterns for each cell. Then, we integrated neural networks to mitigate the impact of background noise on cell identity label mapping. We demonstrated the reproducibility of gPRINT across different donors, single-cell sequencing platforms, and disease subtypes, and its utility for automatic cell subtype annotation across datasets. Moreover, gPRINT achieved higher annotation accuracy of 98.37% when externally validated based on the same tissue, surpassing other algorithms. Furthermore, this approach has been applied to fibrosis-associated diseases in multiple tissues throughout the body, as well as to the annotation of fibroblast subtypes in a single tissue, tendon, where fibrosis is prevalent. We successfully achieved automatic prediction of tendinopathy-specific cell subtypes, key targets, and related drugs. In summary, gPRINT provides an automated and unified approach for identifying DSCSs across datasets, facilitating the elucidation of specific cell subtypes under different disease states and providing a powerful tool for exploring therapeutic targets in diseases.
Humans; Algorithms; Single-Cell Analysis; Databases, Genetic; Molecular Sequence Annotation
2.Characterization and phylogenetic analysis of complete chloroplast genome of cultivated Qinan agarwood.
Qiao-Zhen LIU ; Jiang-Peng DAI ; Peng-Jian ZHU ; Yue-Xia LIN ; Xiao-Xia GAO ; Shuang ZHU
China Journal of Chinese Materia Medica 2023;48(20):5531-5539
"Tangjie" leaves of cultivated Qinan agarwood were used to obtain the complete chloroplast genome by high-throughput sequencing. Together with 12 chloroplast genomes of Aquilaria species downloaded from NCBI, bioinformatics methods were employed to determine the chloroplast genome characteristics and phylogenetic relationships. The results showed that the chloroplast genome of cultivated Qinan agarwood "Tangjie" was 174 909 bp long with a GC content of 36.7%. A total of 136 genes were annotated, including 90 protein-coding genes, 38 tRNA genes, and 8 rRNA genes. Sequence repeat analysis detected 80 simple sequence repeats (SSRs) and 124 long sequence repeats, with most SSRs composed of A and T bases. Codon preference analysis revealed that AUU was the most frequently used codon, and codons ending in A or U were preferred. Comparative analysis of Aquilaria chloroplast genomes showed relative conservation of the IR region boundaries and identified five highly variable regions: trnD-trnY, trnT-trnL, trnF-ndhJ, petA-cemA, and rpl32, which could serve as potential DNA barcodes specific to the Aquilaria genus. Selection pressure analysis indicated positive selection in the rbcL, rps11, and rpl32 genes. Phylogenetic analysis revealed that cultivated Qinan agarwood "Tangjie" and Aquilaria agallocha clustered together (100% support), supporting the Chinese origin of Qinan agarwood from Aquilaria agallocha. The chloroplast genome data obtained in this study provide a foundation for studying the genetic diversity of cultivated Qinan agarwood and for molecular identification of the Aquilaria genus.
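The GC content and codon-frequency figures reported in this abstract can be illustrated with a minimal Python sketch. The function names `gc_content` and `codon_counts` are hypothetical, and the abstract does not say which software the authors used, so this only shows the underlying arithmetic.

```python
from collections import Counter


def gc_content(seq: str) -> float:
    """Return the fraction of G and C bases in a nucleotide sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return (seq.count("G") + seq.count("C")) / len(seq)


def codon_counts(cds: str) -> Counter:
    """Count in-frame codons of a coding sequence, written as RNA (T -> U).

    A trailing partial codon, if any, is ignored.
    """
    cds = cds.upper().replace("T", "U")
    usable = len(cds) - len(cds) % 3
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))
```

Calling `codon_counts(cds).most_common(1)` on the concatenated coding sequences of a genome would give its most frequent codon, which is the kind of statistic behind the AUU result above.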
Phylogeny; Genome, Chloroplast; Codon; Molecular Sequence Annotation; Thymelaeaceae/genetics*
3.Analysis of the chloroplast genome of Incarvillea younghusbandii Sprague.
Yaying ZHANG ; Wanyao JIAO ; Wenrui JIAO ; Tianle QIAO ; Zhiyang SU ; Shuo FENG
Chinese Journal of Biotechnology 2023;39(7):2954-2964
Incarvillea younghusbandii Sprague is a traditional tonic herb. Its roots are used as herbal medicine for nourishing and strengthening, as well as for treating postpartum milk deficiency and weakness. In this study, the chloroplast genome of I. younghusbandii was sequenced and assembled using high-throughput sequencing technology. The sequence characteristics, sequence repeats, codon usage bias, phylogenetic relationships and estimated divergence time of I. younghusbandii were analyzed. The 159 323 bp sequence contained a large single copy (80 197 bp), a small single copy (9 030 bp) and two inverted repeat sequences (35 048 bp each). It contained 120 genes, including 77 protein-coding genes, 8 ribosomal RNA genes and 35 transfer RNA genes. AAA was the most frequent codon in the chloroplast coding sequence of I. younghusbandii. A total of 42 simple sequence repeats were identified in the chloroplast genome. Phylogenetic analysis revealed that I. younghusbandii was most closely related to Incarvillea compacta, its taxonomically close relative. The divergence between I. younghusbandii and I. compacta was dated to 4.66 million years ago. This study is significant for the scientific conservation and development of resources related to I. compacta. It also provides a basic genetic resource for subsequent species identification in the genus Incarvillea and for population genetic diversity studies of Bignoniaceae.
Phylogeny; Molecular Sequence Annotation; Genome, Chloroplast; Sequence Analysis, DNA; Whole Genome Sequencing
4.Transcriptome Sequencing Analysis of Chrysomyia Megacephala Pupae in Different Growing Periods.
Qi Yan WANG ; Hong Ling ZHANG ; Zheng REN ; Yu Bo LIU ; Jing Yan JI ; Jiang HUANG
Journal of Forensic Medicine 2021;37(3):318-324
Objective To study the growth regulation, environmental adaptation and epigenetic regulation of Chrysomyia Megacephala pupae by obtaining transcriptome data of Chrysomyia Megacephala in different growing periods, and to lay a foundation for forensic application. Methods Chrysomyia Megacephala was cultivated and, after pupation, 3 pupae were collected every 24 h from pupation to emergence and stored at -80 ℃ for later use. High-throughput sequencing was performed on the Illumina Hiseq 4000 and Unigenes were obtained. The Unigenes were compared with the NCBI BLAST tool against databases such as NR, STRING, SWISS-PROT (including Pfam), GO, COG and KEGG to obtain the corresponding annotation information. The expression level of the Unigenes in six different growing periods of Chrysomyia Megacephala was calculated by the FPKM method, and differentially expressed genes were screened by the following criteria: the absolute value of the log2 fold change in FPKM expression between two growing periods must be larger than 1 (|log2FC|>1), and the false discovery rate must be less than 0.05. Results At a mean temperature of 25.6 ℃, Chrysomyia Megacephala emerged 6 d after pupation. A total of 43 408 Unigenes were obtained, with a mean length of 905 bp, of which 32 500, 18 720, 13 542, 9 191 and 18 720 were annotated by the NR, SWISS-PROT, Pfam, STRING and KEGG databases, respectively. In the pairwise comparisons of pupae in two different growing periods, the number of differentially expressed genes ranged from 801 to 5 307, and the total number of differentially expressed genes was 45 676. Conclusion Gene expression in the transcriptome of Chrysomyia Megacephala pupae differs between growing periods. The results provide a good foundation for further research on transcriptome changes in each pupal period of sarcosaprophagous flies and a basis for exploring the genes associated with the growth of Chrysomyia Megacephala pupae.
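The screening criteria stated in the Methods (an absolute log2 fold change in FPKM greater than 1, and a false discovery rate below 0.05) can be sketched in Python as follows. The function names and the pseudocount added to each FPKM value are assumptions made here to avoid division by zero. The abstract does not say how zero expression values were handled or which software computed the FDR.

```python
import math


def log2_fold_change(fpkm_a: float, fpkm_b: float, pseudocount: float = 1.0) -> float:
    """log2 ratio of expression in period B over period A.

    The pseudocount is an assumption of this sketch, not part of the source.
    """
    return math.log2((fpkm_b + pseudocount) / (fpkm_a + pseudocount))


def is_differential(fpkm_a: float, fpkm_b: float, fdr: float,
                    lfc_cutoff: float = 1.0, fdr_cutoff: float = 0.05) -> bool:
    """Apply both thresholds from the abstract: |log2FC| > 1 and FDR < 0.05."""
    return abs(log2_fold_change(fpkm_a, fpkm_b)) > lfc_cutoff and fdr < fdr_cutoff
```

A gene passes only when both conditions hold, so a large fold change with a weak FDR is still rejected.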
Animals; Epigenesis, Genetic; Gene Expression Profiling; High-Throughput Nucleotide Sequencing; Molecular Sequence Annotation; Pupa/genetics*; Transcriptome
5.Blood transcriptome resources of chinstrap (Pygoscelis antarcticus) and gentoo (Pygoscelis papua) penguins from the South Shetland Islands, Antarctica
Bo Mi KIM ; Jihye JEONG ; Euna JO ; Do Hwan AHN ; Jeong Hoon KIM ; Jae Sung RHEE ; Hyun PARK
Genomics & Informatics 2019;17(1):e5-
The chinstrap (Pygoscelis antarcticus) and gentoo (P. papua) penguins are distributed throughout Antarctica and the sub-Antarctic islands. In this study, high-quality de novo assemblies of blood transcriptomes from these penguins were generated using the Illumina MiSeq platform. A total of 22.2 and 21.8 raw reads were obtained from chinstrap and gentoo penguins, respectively. These reads were assembled using the Oases assembly platform and resulted in 26,036 and 21,854 contigs with N50 values of 929 and 933 base pairs, respectively. Functional gene annotations through pathway analyses of the Gene Ontology, EuKaryotic Orthologous Groups, and Kyoto Encyclopedia of Genes and Genomes (KEGG) databases were performed for each blood transcriptome, resulting in a similar compositional order between the two transcriptomes. Ortholog comparisons with previously published transcriptomes from the Adélie (P. adeliae) and emperor (Aptenodytes forsteri) penguins revealed that a high proportion of the four penguins’ transcriptomes had significant sequence homology. Because blood and tissues of penguins have been used to monitor pollution in Antarctica, immune parameters in blood could be important indicators for understanding the health status of penguins and other Antarctic animals. In the blood transcriptomes, KEGG analyses detected many essential genes involved in the major innate immunity pathways, which are key metabolic pathways for maintaining homeostasis against exogenous infections or toxins. Blood transcriptome studies such as this may be useful for checking the immune and health status of penguins without sacrifice.
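The N50 values quoted for the two assemblies follow the usual definition: the contig length at which the running total of lengths, taken from longest to shortest, first reaches half of the total assembly length. A minimal Python sketch, with the hypothetical function name `n50`:

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are taken from longest to shortest until their cumulative
    length reaches at least half of the total assembly length.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contig lengths given")
```

Assembly tools differ slightly in how they handle ties and minimum contig lengths, so values reported by a given assembler may not match this sketch exactly.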
Animals; Base Pairing; Gene Ontology; Genes, Essential; Genome; Homeostasis; Immunity, Innate; Islands; Metabolic Networks and Pathways; Molecular Sequence Annotation; Sequence Homology; Spheniscidae; Transcriptome
6.CIRCpedia v2: An Updated Database for Comprehensive Circular RNA Annotation and Expression Comparison.
Rui DONG ; Xu-Kai MA ; Guo-Wei LI ; Li YANG
Genomics, Proteomics & Bioinformatics 2018;16(4):226-233
Circular RNAs (circRNAs) from back-splicing of exon(s) have been recently identified to be broadly expressed in eukaryotes, in tissue- and species-specific manners. Although functions of most circRNAs remain elusive, some circRNAs are shown to be functional in gene expression regulation and potentially relate to diseases. Due to their stability, circRNAs can also be used as biomarkers for diagnosis. Profiling circRNAs by integrating their expression among different samples thus provides molecular basis for further functional study of circRNAs and their potential application in clinic. Here, we report CIRCpedia v2, an updated database for comprehensive circRNA annotation from over 180 RNA-seq datasets across six different species. This atlas allows users to search, browse, and download circRNAs with expression features in various cell types/tissues, including disease samples. In addition, the updated database incorporates conservation analysis of circRNAs between humans and mice. Finally, the web interface also contains computational tools to compare circRNA expression among samples. CIRCpedia v2 is accessible at http://www.picb.ac.cn/rnomics/circpedia.
Animals; Databases, Genetic; Gene Expression Regulation; Humans; Internet; Mice; Molecular Sequence Annotation; RNA; genetics; User-Computer Interface
7.SPORTS1.0: A Tool for Annotating and Profiling Non-coding RNAs Optimized for rRNA- and tRNA-derived Small RNAs.
Junchao SHI ; Eun-A KO ; Kenton M SANDERS ; Qi CHEN ; Tong ZHOU
Genomics, Proteomics & Bioinformatics 2018;16(2):144-151
High-throughput RNA-seq has revolutionized the process of small RNA (sRNA) discovery, leading to a rapid expansion of sRNA categories. In addition to the previously well-characterized sRNAs such as microRNAs (miRNAs), piwi-interacting RNAs (piRNAs), and small nucleolar RNAs (snoRNAs), recent studies have spotlighted tRNA-derived sRNAs (tsRNAs) and rRNA-derived sRNAs (rsRNAs) as new categories of sRNAs that bear versatile functions. Since existing software and pipelines for sRNA annotation are mostly focused on analyzing miRNAs or piRNAs, here we developed the sRNA annotation pipeline optimized for rRNA- and tRNA-derived sRNAs (SPORTS1.0). SPORTS1.0 is optimized for analyzing tsRNAs and rsRNAs from sRNA-seq data, in addition to its capacity to annotate canonical sRNAs such as miRNAs and piRNAs. Moreover, SPORTS1.0 can predict potential RNA modification sites based on nucleotide mismatches within sRNAs. SPORTS1.0 is precompiled to annotate sRNAs for a wide range of 68 species across bacteria, yeast, plant, and animal kingdoms, while additional species for analyses could be readily added upon end users' input. For demonstration, by analyzing sRNA datasets using SPORTS1.0, we reveal that distinct signatures are present in tsRNAs and rsRNAs from different mouse cell types. We also find that, compared to other sRNA species, tsRNAs bear the highest mismatch rate, which is consistent with their highly modified nature. SPORTS1.0 is open-source software and can be publicly accessed at https://github.com/junchaoshi/sports1.0.
Animals; Gene Expression Profiling; High-Throughput Nucleotide Sequencing; Mice; MicroRNAs; chemistry; metabolism; Molecular Sequence Annotation; RNA, Ribosomal; chemistry; metabolism; RNA, Small Interfering; chemistry; metabolism; RNA, Small Untranslated; chemistry; metabolism; RNA, Transfer; chemistry; metabolism; Sequence Analysis, RNA; methods; Software
8.Assembling of an ammonium transporter gene in Salicornia europaea by expression pattern analysis of Unigene in transcriptome.
Xinlong XIAO ; Xuan ZHANG ; Xiaomeng WU ; Jinbiao MA ; Yin'an YAO
Chinese Journal of Biotechnology 2014;30(11):1763-1773
RNA-seq makes it possible to quickly obtain the whole transcriptome sequences of a species under different conditions. However, many Unigenes assembled from raw reads do not contain a complete open reading frame (ORF), and the transcriptome library also contains some redundancy. Some Unigenes in the library, although they belong to one transcript, cannot be assembled because they do not overlap. We found five incomplete Unigenes annotated as ammonium transporter (AMT) in the Salicornia europaea transcriptome, of which two (Uni4 and Uni5) had identical expression patterns across four transcriptomes, suggesting that the two Unigenes may come from one transcript. Analyzing the position of each Unigene on the transcript by NCBI blastx, we found that Uni4 and Uni5 were located at the 5' end and the 3' end, respectively, relative to the reference transcript, and that an unknown gap of 120 bp may exist in a hypothetical transcript to which both Uni4 and Uni5 belong. To verify this hypothesis, a single forward primer and a single reverse primer were designed on Uni4 and Uni5, respectively, and a fragment of about 800 bp was generated by PCR. The fragment was then sequenced and aligned with Uni4 and Uni5. Finally, we assembled a sequence of 1 667 bp, which contains a complete ORF (1 482 bp, coding 494 amino acids). Phylogenetic analysis showed that it belongs to the amt1 subfamily, and it was named Seamt1. Bioinformatics tools indicated that the SeAMT1 protein conformed to the AMT characteristics of other species. This work used clustering of expression patterns to find the Unigenes belonging to one transcript, and the feasibility of the method was validated with two other groups of Unigenes. This convenient method will benefit the extension and assembly of Unigenes in transcriptomes, and also helps to obtain complete ORFs and determine gene function.
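The idea of grouping Unigenes by matching expression patterns across transcriptomes can be sketched with a Pearson correlation between two expression vectors. This is only an illustration under assumptions: the abstract does not state which similarity measure or clustering method the authors used, the function name `pearson` is hypothetical, and a high correlation shows that two patterns have the same shape, not that the values are identical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length expression vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / (sd_x * sd_y)
```

Two Unigene fragments of the same transcript would be expected to give a correlation close to 1 across the sampled conditions, which makes them candidates for the PCR-based gap closing described above.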
Ammonium Compounds; Chenopodiaceae; genetics; Computational Biology; Gene Expression Profiling; Genes, Plant; Membrane Transport Proteins; genetics; Molecular Sequence Annotation; Open Reading Frames; Phylogeny; Plant Proteins; genetics; Transcriptome
9.SFannotation: A Simple and Fast Protein Function Annotation System.
Genomics & Informatics 2014;12(2):76-78
Owing to the generation of vast amounts of sequencing data by using cost-effective, high-throughput sequencing technologies with improved computational approaches, many putative proteins have been discovered after assembly and structural annotation. Putative proteins are typically annotated using a functional annotation system that uses extant databases, but the expansive size of these databases often causes a bottleneck for rapid functional annotation. We developed SFannotation, a simple and fast functional annotation system that rapidly annotates putative proteins against four extant databases, Swiss-Prot, TIGRFAMs, Pfam, and the non-redundant sequence database, by using a best-hit approach with BLASTP and HMMSEARCH.
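A best-hit approach of the kind mentioned in this abstract can be sketched in Python as keeping, for each query protein, the database hit with the highest bit score. The sketch assumes BLASTP tabular output with the standard 12 columns of `-outfmt 6` (query ID first, subject ID second, bit score last). The function name `best_hits` is hypothetical, and the abstract does not describe how SFannotation itself ranks or filters hits.

```python
def best_hits(blast_lines):
    """Map each query ID to its (subject ID, bit score) with the highest bit score.

    Expects tab-separated lines in the standard 12-column BLAST tabular layout;
    blank lines and comment lines starting with '#' are skipped.
    """
    best = {}
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        bitscore = float(fields[11])
        # Keep the first hit seen for a query, then replace it only on a strictly higher score.
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return best
```

Because the function accepts any iterable of lines, it can be fed an open file handle directly, for example `best_hits(open(path))` where `path` points to a tabular BLASTP result.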
Computational Biology; Databases, Protein; Molecular Sequence Annotation
10.Progress in proteogenomics of prokaryotes.
Chengpu ZHANG ; Ping XU ; Yunping ZHU
Chinese Journal of Biotechnology 2014;30(7):1026-1035
With the rapid development of genome sequencing technologies, a large number of prokaryote genomes have been sequenced in recent years. To further investigate the models and functions of genomes, genome annotation algorithms based on sequence and homology features have been widely applied to newly sequenced genomes. However, gene annotations that use only genomic information are prone to errors, such as incorrect N-termini and pseudogenes. It is even harder to obtain reasonable annotation results when the genome sequencing results are poor. Transcriptomics, based on technologies such as microarray and RNA-seq, and proteomics, based on MS/MS, have been widely used to identify gene products with high throughput and high sensitivity, providing powerful tools for verifying and correcting annotated genomes. Compared with transcriptomics, proteomics can generate the list of proteins for the genes expressed in samples or cells without any confusion from non-coding RNA, making proteogenomics an important basis for genome annotation in prokaryotes. In this paper, we first describe the traditional genome annotation algorithms and point out their shortcomings. We then summarize the advantages of proteomics in genome annotation and review the progress of proteogenomics in prokaryotes. Finally, we discuss the challenges in data analysis and potential solutions for the development of proteogenomics.
Genomics; Molecular Sequence Annotation; Prokaryotic Cells; metabolism; Proteomics; Tandem Mass Spectrometry
