1. Procleave: Predicting Protease-specific Substrate Cleavage Sites by Combining Sequence and Structural Information.
Fuyi LI ; Andre LEIER ; Quanzhong LIU ; Yanan WANG ; Dongxu XIANG ; Tatsuya AKUTSU ; Geoffrey I WEBB ; A Ian SMITH ; Tatiana MARQUEZ-LAGO ; Jian LI ; Jiangning SONG
Genomics, Proteomics & Bioinformatics 2020;18(1):52-64
Proteases are enzymes that cleave and hydrolyse the peptide bonds between two specific amino acid residues of target substrate proteins. Protease-controlled proteolysis plays a key role in the degradation and recycling of proteins, which is essential for various physiological processes. Thus, solving the substrate identification problem will have important implications for the precise understanding of functions and physiological roles of proteases, as well as for therapeutic target identification and pharmaceutical applicability. Consequently, there is a great demand for bioinformatics methods that can predict novel substrate cleavage events with high accuracy by utilizing both sequence and structural information. In this study, we present Procleave, a novel bioinformatics approach for predicting protease-specific substrates and specific cleavage sites by taking into account both their sequence and 3D structural information. Structural features of known cleavage sites were represented by discrete values using a LOWESS data-smoothing optimization method, which turned out to be critical for the performance of Procleave. The optimal approximations of all structural parameter values were encoded in a conditional random field (CRF) computational framework, alongside sequence and chemical group-based features. Here, we demonstrate the outstanding performance of Procleave through extensive benchmarking and independent tests. Procleave is capable of correctly identifying most cleavage sites in the case study. Importantly, when applied to the human structural proteome encompassing 17,628 protein structures, Procleave suggests a number of potential novel target substrates and their corresponding cleavage sites of different proteases. Procleave is implemented as a webserver and is freely accessible at http://procleave.erc.monash.edu/.
2. Comprehensive functional annotation of susceptibility variants identifies genetic heterogeneity between lung adenocarcinoma and squamous cell carcinoma.
Na QIN ; Yuancheng LI ; Cheng WANG ; Meng ZHU ; Juncheng DAI ; Tongtong HONG ; Demetrius ALBANES ; Stephen LAM ; Adonina TARDON ; Chu CHEN ; Gary GOODMAN ; Stig E BOJESEN ; Maria Teresa LANDI ; Mattias JOHANSSON ; Angela RISCH ; H-Erich WICHMANN ; Heike BICKEBOLLER ; Gadi RENNERT ; Susanne ARNOLD ; Paul BRENNAN ; John K FIELD ; Sanjay SHETE ; Loic LE MARCHAND ; Olle MELANDER ; Hans BRUNNSTROM ; Geoffrey LIU ; Rayjean J HUNG ; Angeline ANDREW ; Lambertus A KIEMENEY ; Shan ZIENOLDDINY ; Kjell GRANKVIST ; Mikael JOHANSSON ; Neil CAPORASO ; Penella WOLL ; Philip LAZARUS ; Matthew B SCHABATH ; Melinda C ALDRICH ; Victoria L STEVENS ; Guangfu JIN ; David C CHRISTIANI ; Zhibin HU ; Christopher I AMOS ; Hongxia MA ; Hongbing SHEN
Frontiers of Medicine 2021;15(2):275-291
Although genome-wide association studies have identified more than eighty genetic variants associated with non-small cell lung cancer (NSCLC) risk, the biological mechanisms of these variants remain largely unknown. By integrating large-scale genotype data from 15 581 lung adenocarcinoma (AD) cases, 8350 squamous cell carcinoma (SqCC) cases, and 27 355 controls with multiple transcriptome and epigenomic databases, we conducted histology-specific meta-analyses and functional annotations of both reported and novel susceptibility variants. We identified 3064 credible risk variants for NSCLC, which were overrepresented in enhancer-like and promoter-like histone modification peaks as well as DNase I hypersensitive sites. Transcription factor enrichment analysis revealed that USF1 was AD-specific while CREB1 was SqCC-specific. Functional annotation and gene-based analysis implicated 894 target genes, including 274 specific to AD and 123 specific to SqCC, which were overrepresented in somatic driver genes (ER = 1.95, P = 0.005). Pathway enrichment analysis and Gene-Set Enrichment Analysis revealed that AD genes were primarily involved in immune-related pathways, while SqCC genes were related to homologous recombination deficiency. Our results illustrate the molecular basis of both well-studied and new susceptibility loci of NSCLC, providing not only novel insights into the genetic heterogeneity between AD and SqCC but also a set of plausible gene targets for post-GWAS functional experiments.
Adenocarcinoma of Lung/genetics*; Carcinoma, Non-Small-Cell Lung/genetics*; Carcinoma, Squamous Cell/genetics*; Genetic Heterogeneity; Genetic Predisposition to Disease; Genome-Wide Association Study; Humans; Lung Neoplasms/genetics*; Polymorphism, Single Nucleotide