1.Analysis of unmapped regions associated with long deletions in Korean whole genome sequences based on short read data
Yuna LEE ; Kiejung PARK ; Insong KOH
Genomics & Informatics 2019;17(4):e40-
While studies aimed at detecting and analyzing indels or single nucleotide polymorphisms within human genomic sequences have been actively conducted, studies on detecting long insertions/deletions are not easy to orchestrate. For the last 10 years, the availability of long read data of human genomes from PacBio or Nanopore platforms has increased, which makes it easier to detect long insertions/deletions. However, because long read data have a critical disadvantage due to their relatively high cost, many next generation sequencing data are produced mainly by short read sequencing machines. Here, we constructed programs to detect so-called unmapped regions (UMRs, where no reads are mapped on the reference genome), scanned 40 Korean genomes to select UMR long deletion candidates, and compared the candidates with the long deletion break points within the genomes available from the 1000 Genomes Project (1KGP). An average of about 36,000 UMRs were found in the 40 Korean genomes tested, 284 UMRs were common across the 40 genomes, and a total of 37,943 UMRs were found. Compared with the 74,045 break points provided by the 1KGP, 30,698 UMRs overlapped. As the number of compared samples increased from 1 to 40, the number of UMRs that overlapped with the break points also increased. This eventually reached a peak of 80.9% of the total UMRs found in this study. As the total number of overlapped UMRs could probably grow to encompass 74,045 break points with the inclusion of more Korean genomes, this approach could be practically useful for studies on long deletions utilizing short read data.
Genome
;
Genome, Human
;
Humans
;
Nanopores
;
Polymorphism, Single Nucleotide
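The UMR idea in the abstract above (stretches of the reference with no mapped reads, compared against known deletion break points) can be illustrated with a minimal sketch. This is not the authors' program: the per-base depth list, the `min_len` threshold, and both function names are assumptions made for illustration only.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end), 0-based positions


def find_umrs(depth: List[int], min_len: int = 1) -> List[Interval]:
    """Return intervals where per-base read depth is zero (hypothetical helper)."""
    umrs: List[Interval] = []
    start = None
    for pos, d in enumerate(depth):
        if d == 0 and start is None:
            start = pos  # a zero-coverage run begins
        elif d != 0 and start is not None:
            if pos - start >= min_len:
                umrs.append((start, pos))
            start = None  # the run ended at a covered base
    # Close a run that extends to the end of the sequence.
    if start is not None and len(depth) - start >= min_len:
        umrs.append((start, len(depth)))
    return umrs


def overlaps_breakpoint(umr: Interval, breakpoints: List[int]) -> bool:
    """True if any break point position falls inside the UMR (assumed criterion)."""
    start, end = umr
    return any(start <= bp < end for bp in breakpoints)


depth = [3, 0, 0, 0, 5, 0, 2, 0, 0]
candidates = find_umrs(depth, min_len=2)
hits = [u for u in candidates if overlaps_breakpoint(u, [2])]
```

The overlap rule used here (a break point lying inside the interval) is one simple choice; the abstract does not state which criterion the study applied.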
2.Analysis of unmapped regions associated with long deletions in Korean whole genome sequences based on short read data
Yuna LEE ; Kiejung PARK ; Insong KOH
Genomics & Informatics 2019;17(4):e40-
While studies aimed at detecting and analyzing indels or single nucleotide polymorphisms within human genomic sequences have been actively conducted, studies on detecting long insertions/deletions are not easy to orchestrate. For the last 10 years, the availability of long read data of human genomes from PacBio or Nanopore platforms has increased, which makes it easier to detect long insertions/deletions. However, because long read data have a critical disadvantage due to their relatively high cost, many next generation sequencing data are produced mainly by short read sequencing machines. Here, we constructed programs to detect so-called unmapped regions (UMRs, where no reads are mapped on the reference genome), scanned 40 Korean genomes to select UMR long deletion candidates, and compared the candidates with the long deletion break points within the genomes available from the 1000 Genomes Project (1KGP). An average of about 36,000 UMRs were found in the 40 Korean genomes tested, 284 UMRs were common across the 40 genomes, and a total of 37,943 UMRs were found. Compared with the 74,045 break points provided by the 1KGP, 30,698 UMRs overlapped. As the number of compared samples increased from 1 to 40, the number of UMRs that overlapped with the break points also increased. This eventually reached a peak of 80.9% of the total UMRs found in this study. As the total number of overlapped UMRs could probably grow to encompass 74,045 break points with the inclusion of more Korean genomes, this approach could be practically useful for studies on long deletions utilizing short read data.
3.The Effect of Increasing Control-to-case Ratio on Statistical Power in a Simulated Case-control SNP Association Study.
Moonsu KANG ; Sunhee CHOI ; InSong KOH
Genomics & Informatics 2009;7(3):148-151
Generally, a larger sample size leads to greater statistical power to detect a significant difference. We may increase the sample size for both cases and controls in order to obtain greater power. However, increasing the number of cases is often not feasible for a variety of reasons. To examine the change in power as the ratio of controls to cases varies (1:1 to 4:1), we conducted association tests with simulated data generated by PLINK. The simulated data consist of 50 disease SNPs and 300 non-disease SNPs, and we computed power for the disease SNPs. Genetic Power Calculator was used to compute power while varying the ratio of controls to cases (1:1, 2:1, 3:1, 4:1). In this study, we show that gains in statistical power resulting from increasing the ratio of controls to cases are substantial for the simulated data. Similar results might be expected for real data.
Case-Control Studies
;
Polymorphism, Single Nucleotide
;
Sample Size
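The effect described in the abstract above can be illustrated with a rough normal-approximation power calculation for comparing an allele frequency between cases and controls. This is a sketch under stated assumptions, not the method used by PLINK or Genetic Power Calculator: the function name, the unpooled two-proportion z-test, and the example frequencies are all hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(p_case: float, p_control: float, n_case: int,
                 ratio: float, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided two-proportion z-test.

    n_case is the number of case observations; the number of control
    observations is ratio * n_case. The small probability of rejecting
    in the wrong direction is ignored.
    """
    n_control = ratio * n_case
    se = sqrt(p_case * (1 - p_case) / n_case
              + p_control * (1 - p_control) / n_control)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(abs(p_case - p_control) / se - z_crit)


# Power for control-to-case ratios 1:1 through 4:1 with a fixed number of cases.
powers = [approx_power(0.30, 0.25, 500, r) for r in (1, 2, 3, 4)]
```

With the number of cases held fixed, the standard error shrinks as controls are added, so the approximate power rises with the ratio, although each additional multiple of controls contributes less than the previous one.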
4.Brain Region-Dependent Alternative Splicing of Alzheimer Disease (AD)-Risk Genes Is Associated With Neuropathological Features in AD
Sara KIM ; Seonggyun HAN ; Soo-ah CHO ; Kwangsik NHO ; Insong KOH ; Younghee LEE
International Neurourology Journal 2022;26(Suppl 2):S126-136
Purpose:
Alzheimer disease (AD) is one of the most complex diseases and is characterized by AD-related neuropathological features, including accumulation of amyloid-β plaques and tau neurofibrillary tangles. Dysregulation of alternative splicing (AS) contributes to these features, and these features are heterogeneous across brain regions among AD patients, leading to different severity and progression rates; however, brain region-specific AS mechanisms remain unclear. Therefore, we aimed to systematically investigate AS in multiple brain regions of AD patients and how it affects clinical features.
Methods:
We analyzed RNA sequencing (RNA-Seq) data obtained from brain regions (frontal and temporal) of AD patients. Reads were mapped to the hg19 reference genome using the STAR aligner, and exon skipping (ES) rates were estimated as percent spliced in (PSI) by rMATS. We focused on AD-risk genes discovered by genome-wide association studies, and accordingly evaluated associations between PSI of skipped exons in AD-risk genes and Braak stage and plaque density mean (PM) for each brain region. We also integrated whole-genome sequencing data of the ascertained samples with RNA-Seq data to identify genetic regulators of feature-associated ES.
Results:
We identified 26 and 41 ES associated with Braak stage in frontal and temporal regions, respectively, and 10 and 50 ES associated with PM. Among those, 10 were frontal-specific (CLU and NTRK2), 65 temporal-specific (HIF1A and TRPC4AP), and 26 shared ES (APP) that accompanied functional Gene Ontology terms, including axonogenesis in shared-ES genes. We further identified genetic regulators that account for 44 ES (44% of the total). Finally, we present as a case study the systematic regulation of an ES in APP, which is important in AD pathogenesis.
Conclusions:
This study provides new insights into brain region-dependent AS regulation of the architecture of AD-risk genes that contributes to AD pathologies, ultimately allowing identification of a treatment target and region-specific biomarkers for AD.
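For readers unfamiliar with the PSI measure named in the Methods above, a minimal sketch of a length-normalized percent-spliced-in calculation follows. The function name and arguments are hypothetical, and this is only the commonly used ratio of inclusion to inclusion-plus-skipping read density; it does not reproduce the statistical model applied in the study.

```python
from typing import Optional


def percent_spliced_in(inc_reads: float, skip_reads: float,
                       inc_len: float = 1.0,
                       skip_len: float = 1.0) -> Optional[float]:
    """Length-normalized PSI for one exon-skipping event.

    inc_reads / skip_reads: reads supporting exon inclusion / skipping.
    inc_len / skip_len: effective lengths of the two isoform regions
    (set both to 1.0 for a plain read-count ratio).
    Returns None when no reads support either isoform.
    """
    if inc_len <= 0 or skip_len <= 0:
        raise ValueError("effective lengths must be positive")
    inc_density = inc_reads / inc_len
    skip_density = skip_reads / skip_len
    total = inc_density + skip_density
    if total == 0:
        return None
    return inc_density / total


psi_plain = percent_spliced_in(30, 10)
psi_normalized = percent_spliced_in(10, 10, inc_len=2.0, skip_len=1.0)
```

A PSI near 1 means the exon is almost always included, and a PSI near 0 means it is almost always skipped.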
5.MediScore: MEDLINE-based Interactive Scoring of Gene and Disease Associations.
Hye Young CHO ; Bermseok OH ; Jong Keuk LEE ; Kuchan KIMM ; InSong KOH
Genomics & Informatics 2004;2(3):131-133
MediScore is an information retrieval system that helps to search for the set of genes associated with a specific disease or the set of diseases associated with a specific gene. Despite recent improvements in natural language processing (NLP) and other text mining approaches for finding disease-associated genes, many false positive results arise from the diversity of exceptional cases as well as ambiguities in gene names. To overcome these weak points of current text mining approaches, MediScore introduces two features: statistical normalization based on the normal approximation to the binomial distribution, which corrects inaccurate scores caused by common words that do not represent genes, and interactive rescoring by the user to remove false positive results. Interactive rescoring includes individual alias scoring for each gene to remove false gene synonyms, reference to MEDLINE abstracts, and cross-referencing between OMIM and other related information.
Data Mining
;
Databases, Genetic
;
Information Systems
;
Natural Language Processing
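The normalization mentioned in the abstract above relies on the normal approximation to the binomial distribution. A minimal sketch of such a z-score is given here; how MediScore actually defines its counts and background rates is not stated in the abstract, so the function name and the interpretation of `k`, `n`, and `p` are assumptions.

```python
from math import sqrt


def binomial_z_score(k: int, n: int, p: float) -> float:
    """z-score of k successes in n trials under Binomial(n, p).

    Assumed interpretation: n abstracts mention the disease, k of them
    also mention the gene term, and p is the background rate at which
    that term appears in abstracts overall.
    """
    if n <= 0 or not 0.0 < p < 1.0:
        raise ValueError("need n > 0 and 0 < p < 1")
    mean = n * p
    sd = sqrt(n * p * (1 - p))
    return (k - mean) / sd


# A term that is simply common (observed count equals the expected count) scores 0.
z_common = binomial_z_score(50, 100, 0.5)
z_enriched = binomial_z_score(60, 100, 0.5)
```

A frequent word that co-occurs with the disease only as often as its background rate predicts receives a score near zero, which is the kind of correction the abstract describes.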
6.Comparative Statistic Module (CSM) for Significant Gene Selection.
Young Jin KIM ; Hyo Mi KIM ; Sang Bae KIM ; Chan PARK ; Kuchan KIMM ; InSong KOH
Genomics & Informatics 2004;2(4):180-183
Comparative Statistic Module (CSM) provides genomics researchers with a more reliable list of significant genes by offering the commonly selected genes and a method of choice, which is determined by ranking each statistical test according to the average ranking of common genes across five statistical methods: the t-test, Kruskal-Wallis (Wilcoxon signed rank) test, SAM, two sample multiple test, and Empirical Bayesian test. This statistical analysis module is implemented in Perl and R.
Genomics
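The selection logic summarized in the abstract above (genes picked by every method, then methods compared by the average rank they give those common genes) might be sketched as follows. This is an illustrative reading of the abstract, not the CSM implementation, which is in Perl and R; the function names, the `top_n` cutoff, and the example gene lists are hypothetical.

```python
from typing import Dict, List, Set


def common_genes(rankings: Dict[str, List[str]], top_n: int) -> Set[str]:
    """Genes that appear in the top_n of every method's ranked list.

    rankings must contain at least one method.
    """
    top_sets = [set(ranked[:top_n]) for ranked in rankings.values()]
    return set.intersection(*top_sets)


def mean_rank_of_common(rankings: Dict[str, List[str]],
                        top_n: int) -> Dict[str, float]:
    """Average 1-based rank each method assigns to the common genes."""
    common = common_genes(rankings, top_n)
    result: Dict[str, float] = {}
    for method, ranked in rankings.items():
        ranks = [ranked.index(gene) + 1 for gene in common]
        result[method] = sum(ranks) / len(ranks) if ranks else float("nan")
    return result


rankings = {
    "t_test": ["A", "B", "C", "D", "E"],
    "sam": ["A", "D", "B", "E", "C"],
}
shared = common_genes(rankings, top_n=3)
averages = mean_rank_of_common(rankings, top_n=3)
best_method = min(averages, key=averages.get)
```

Under this reading, the method with the lowest average rank places the commonly selected genes nearest the top of its list.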
7.HapAnalyzer: Minimum Haplotype Analysis System for Association Studies.
Ho Youl JUNG ; Jung Sun PARK ; Yun Ju PARK ; Young Jin KIM ; Kuchan KIMM ; In Song KOH
Genomics & Informatics 2004;2(2):107-109
SUMMARY: HapAnalyzer is an analysis system that provides the minimum set of analysis methods needed for SNP-based association studies. It consists of a Hardy-Weinberg equilibrium (HWE) test, linkage disequilibrium (LD) computation, haplotype reconstruction, and SNP (or haplotype)-phenotype association assessment. It is well suited to case-control association studies of unrelated individuals.
Case-Control Studies
;
Haplotypes*
;
Linkage Disequilibrium
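As an illustration of the first component listed in the abstract above, a minimal chi-square test for Hardy-Weinberg equilibrium at a biallelic SNP is sketched here. It is a textbook one-degree-of-freedom test, not HapAnalyzer's code; the function name and the example genotype counts are hypothetical.

```python
from math import erfc, sqrt
from typing import Tuple


def hwe_chi_square(n_aa: int, n_ab: int, n_bb: int) -> Tuple[float, float]:
    """Chi-square HWE test from genotype counts (AA, AB, BB).

    Returns (chi-square statistic, p-value with 1 degree of freedom).
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes supplied")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    # Skip classes with zero expected count (monomorphic SNP).
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of chi-square with 1 df: P(Z^2 > x) = erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value


chi2_eq, p_eq = hwe_chi_square(25, 50, 25)    # counts exactly in equilibrium
chi2_dev, p_dev = hwe_chi_square(50, 0, 50)   # no heterozygotes at all
```

For small samples or rare alleles an exact test is generally preferred over this approximation.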
8.Polymorphisms of ATF6B Are Potentially Associated With FEV1 Decline by Aspirin Provocation in Asthmatics.
Tae Joon PARK ; Jeong Hyun KIM ; Charisse F PASAJE ; Byung Lae PARK ; Joon Seol BAE ; Soo Taek UH ; Yong Hoon KIM ; Mi Kyeong KIM ; Inseon S CHOI ; Byoung Whui CHOI ; Hye Rim SHIN ; Jong Sook PARK ; Insong KOH ; Choon Sik PARK ; Hyoung Doo SHIN
Allergy, Asthma & Immunology Research 2014;6(2):142-148
PURPOSE: Endoplasmic reticulum (ER) stress has recently been observed to activate NF-kappaB and induce inflammatory responses such as asthma. Activating transcription factor 6beta (ATF6B) is known to regulate ATFalpha-mediated ER stress response. The aim of this study is to investigate the associations of ATF6B genetic variants with aspirin-exacerbated respiratory disease (AERD) and its major phenotype, % decline of FEV1 by aspirin provocation. METHODS: Four common single nucleotide polymorphisms (SNPs) of ATF6B were genotyped and statistically analyzed in 93 AERD patients and 96 aspirin-tolerant asthma (ATA) patients as controls. RESULTS: Logistic analysis revealed that 2 SNPs (rs2228628 and rs8111, P=0.008; corrected P=0.03) and 1 haplotype (ATF6B-ht4, P=0.005; corrected P=0.02) were significantly associated with % decline of FEV1 by aspirin provocation, whereas ATF6B polymorphisms and haplotypes were not associated with the risk of AERD. CONCLUSIONS: Although further functional and replication studies are needed, our preliminary findings suggest that ATF6B may be related to obstructive phenotypes in response to aspirin exposure in adult asthmatics.
Adult
;
Aspirin*
;
Asthma
;
Endoplasmic Reticulum
;
Haplotypes
;
Humans
;
Methods
;
NF-kappa B
;
Phenotype
;
Polymorphism, Single Nucleotide
;
Transcription Factors
9.RNA-Seq for Gene Expression Profiling of Human Necrotizing Enterocolitis: a Pilot Study.
Kyuwhan JUNG ; InSong KOH ; Jeong Hyun KIM ; Hyun Sub CHEONG ; Taejin PARK ; So Hyun NAM ; Soo Min JUNG ; Cherry Ann SIO ; Su Yeong KIM ; Euiseok JUNG ; Byoungkook LEE ; Hye Rim KIM ; Eun SHIN ; Sung Eun JUNG ; Chang Won CHOI ; Beyong Il KIM ; Eunyoung JUNG ; Hyoung Doo SHIN
Journal of Korean Medical Science 2017;32(5):817-824
Necrotizing enterocolitis (NEC) characterized by inflammatory intestinal necrosis is a major cause of mortality and morbidity in newborns. Deep RNA sequencing (RNA-Seq) has recently emerged as a powerful technology enabling better quantification of gene expression than microarrays with a lower background signal. A total of 10 transcriptomes from 5 pairs of NEC lesions and adjacent normal tissues obtained from preterm infants with NEC were analyzed. As a result, a total of 65 genes (57 down-regulated and 8 up-regulated) revealed significantly different expression levels in the NEC lesion compared to the adjacent normal region, based on a significance at fold change ≥ 1.5 and P ≤ 0.05. The most significant gene, DPF3 (P < 0.001), has recently been reported to have differential expressions in colon segments. Our gene ontology analysis between NEC lesion and adjacent normal tissues showed that down-regulated genes were included in nervous system development with the most significance (P = 9.3 × 10⁻⁷; P(corr) = 0.0003). In further pathway analysis using Pathway Express based on the Kyoto Encyclopedia of Genes and Genomes (KEGG) database, genes involved in thyroid cancer and axon guidance were predicted to be associated with different expression (P(corr) = 0.008 and 0.020, respectively). Although further replications using a larger sample size and functional evaluations are needed, our results suggest that altered gene expression and the genes' involved functional pathways and categories may provide insight into NEC development and aid in future research.
Axons
;
Colon
;
Enterocolitis, Necrotizing*
;
Gene Expression Profiling*
;
Gene Expression*
;
Gene Ontology
;
Genome
;
Humans*
;
Infant, Newborn
;
Infant, Premature
;
Mortality
;
Necrosis
;
Nervous System
;
Pilot Projects*
;
Sample Size
;
Sequence Analysis, RNA
;
Thyroid Neoplasms
;
Transcriptome
10.Interaction Effects of Lipoprotein Lipase Polymorphisms with Lifestyle on Lipid Levels in a Korean Population: A Cross-sectional Study.
Jung A PYUN ; Sunshin KIM ; Kyungchae PARK ; Inkyung BAIK ; Nam H CHO ; Insong KOH ; Jong Young LEE ; Yoon Shin CHO ; Young Jin KIM ; Min Jin GO ; Eugene SHIM ; Kyubum KWACK ; Chol SHIN
Genomics & Informatics 2012;10(2):88-98
Lipoprotein lipase (LPL) plays an essential role in the regulation of high-density lipoprotein cholesterol (HDLC) and triglyceride levels, which have been closely associated with cardiovascular diseases. Genetic studies in European populations have shown that LPL single-nucleotide polymorphisms (SNPs) are strongly associated with lipid levels. However, the influence of interactions between LPL SNPs and lifestyle factors has not been sufficiently studied. Here, we examine whether LPL polymorphisms, as well as their interactions with lifestyle factors, influence lipid concentrations in a Korean population. A two-stage association study was performed using genotype data for SNPs in the LPL gene, including the 3' flanking region, from 7,536 (stage 1) and 3,703 (stage 2) individuals. The association study showed that 15 SNPs and 4 haplotypes were strongly associated with HDLC (lowest p = 2.86 × 10⁻²²) and triglyceride levels (lowest p = 3.0 × 10⁻¹⁵). Interactions between LPL polymorphisms and lifestyle factors (lowest p = 9.6 × 10⁻⁴) were also observed for lipid concentrations. These findings suggest that there are interaction effects of LPL polymorphisms with lifestyle variables, including energy intake, fat intake, smoking, and alcohol consumption, as well as effects of LPL polymorphisms themselves, on lipid concentrations in a Korean population.
3' Flanking Region
;
Alcohol Drinking
;
Cardiovascular Diseases
;
Cholesterol
;
Cross-Sectional Studies
;
Energy Intake
;
Genotype
;
Haplotypes
;
Life Style
;
Lipoprotein Lipase
;
Lipoproteins
;
Polymorphism, Single Nucleotide
;
Smoke
;
Smoking