1.Improving spaCy dependency annotation and PoS tagging web service using independent NER services
Genomics & Informatics 2019;17(2):e21-
Dependency parsing is used as a component in many text analysis pipelines. However, its performance, especially in specialized domains, suffers from the presence of complex terminology. Our hypothesis is that including named entity annotations can improve the speed and quality of dependency parses. As part of BLAH5, we built a web service that delivers improved dependency parses by taking into account named entity annotations obtained from third-party services. Our evaluation shows improved results and better speed.
Natural Language Processing
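The abstract above does not describe how the entity annotations are fed to the parser. One plausible reading, shown here only as a hedged sketch, is that each annotated entity span is kept as a single token so that the parser sees fewer and simpler units. The function name `merge_entity_tokens`, the whitespace tokenizer, and the character-offset span format are all assumptions made for illustration, not the service's actual interface.

```python
def merge_entity_tokens(text, entity_spans):
    """Tokenize on whitespace, keeping each entity span as one token.

    entity_spans is a list of (start, end) character offsets, as an
    external NER service might return them (assumed format).
    """
    tokens = []
    pos = 0
    for start, end in sorted(entity_spans):
        if start < pos:
            # Skip spans that overlap one already consumed.
            continue
        # Ordinary whitespace tokens before the entity.
        tokens.extend(text[pos:start].split())
        # The whole entity becomes a single token.
        tokens.append(text[start:end])
        pos = end
    # Remaining text after the last entity.
    tokens.extend(text[pos:].split())
    return tokens


text = "Mutations in breast cancer type 1 susceptibility protein affect repair"
entity = "breast cancer type 1 susceptibility protein"
start = text.index(entity)
print(merge_entity_tokens(text, [(start, start + len(entity))]))
```

With the multi-word protein name collapsed into one token, a downstream parser has five units to attach instead of ten, which is consistent with the speed and quality hypothesis stated in the abstract.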
2.Developing JSequitur to Study the Hierarchical Structure of Biological Sequences in a Grammatical Inference Framework of String Compression Algorithms.
Bulgan GALBADRAKH ; Kyung Eun LEE ; Hyun Seok PARK
Genomics & Informatics 2012;10(4):266-270
Grammatical inference methods are expected to find grammatical structures hidden in biological sequences. One hopes that studies of grammar serve as an appropriate tool for theory formation. Thus, we have developed JSequitur for automatically generating the grammatical structure of biological sequences in an inference framework of string compression algorithms. Our original motivation was to find grammatical traits of several cancer genes that can be detected by string compression algorithms. In this research, we have not yet found any meaningful traits unique to the cancer genes, but we observed some interesting traits regarding the relationships among gene length, sequence similarity, the patterns of the generated grammar, and compression rate.
Genes, Neoplasm; Motivation; Natural Language Processing
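The idea of inferring a grammar through string compression can be illustrated with a short Python sketch. This is not JSequitur, nor the Sequitur algorithm itself: it is a simpler offline digram-replacement scheme (in the spirit of Re-Pair) that repeatedly replaces the most frequent adjacent pair of symbols with a new rule. The function names, the rule naming (`R1`, `R2`, ...), and the toy sequence are assumptions made for illustration.

```python
from collections import Counter


def infer_grammar(sequence, min_count=2):
    """Replace the most frequent adjacent pair with a rule until none repeats."""
    symbols = list(sequence)
    rules = {}
    next_id = 1
    while True:
        pairs = Counter(zip(symbols, symbols[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < min_count:
            break
        name = f"R{next_id}"
        next_id += 1
        rules[name] = pair
        # Rewrite left to right, replacing non-overlapping occurrences.
        rewritten = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                rewritten.append(name)
                i += 2
            else:
                rewritten.append(symbols[i])
                i += 1
        symbols = rewritten
    return symbols, rules


def expand(symbols, rules):
    """Expand rule names back into the original string."""
    parts = []
    for symbol in symbols:
        if symbol in rules:
            parts.append(expand(rules[symbol], rules))
        else:
            parts.append(symbol)
    return "".join(parts)


sequence = "ATGATGATGC"  # toy sequence, not real gene data
start_rule, rules = infer_grammar(sequence)
print(start_rule, rules)
# A crude compression rate: length of the start rule over the input length.
print(len(start_rule) / len(sequence))
```

Highly repetitive sequences produce deeper rule hierarchies and a shorter start rule, which is the kind of relationship between sequence similarity, grammar pattern, and compression rate that the abstract refers to.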
3.Korean Anaphora Recognition System to Develop Healthcare Dialogue-Type Agent.
Healthcare Informatics Research 2014;20(4):272-279
OBJECTIVES: Anaphora recognition is the process of identifying exactly which previously used noun a pronoun in a later sentence refers to. It is therefore an essential element of a dialogue agent system. In the current study, the merits of rule-based, machine learning-based, and semantic-based anaphora recognition systems were combined to design and realize a new hybrid anaphora recognition system with optimum capacity. METHODS: Anaphora recognition rules were encoded on the basis of the internal traits of referring expressions and adjacent contexts to realize a rule-based system and to serve as a baseline. A semantic database, related to predicate instances of sentences that include referring expressions, was constructed to identify semantic co-relationships between the referent candidates (to which semantic tags were attached) and the semantic information of predicates. This approach improves the anaphora recognition system by reducing the number of referent candidates. Additionally, to realize a machine learning-based system, an anaphora recognition model was developed on the basis of training data in which referring expressions and referents were marked. The three methods were then combined to develop a single new hybrid anaphora recognition system. RESULTS: The precision of the rule-based system was 54.9%, whereas the precision of the hybrid system was 63.7%, making it the most effective method. CONCLUSIONS: The hybrid method, developed by combining rule-based and machine learning-based methods, represents a new system with enhanced functional capabilities compared with the pre-existing individual methods.
Delivery of Health Care*; Natural Language Processing; Semantics
4.Towards cross-platform interoperability for machine-assisted text annotation
Richard ECKART DE CASTILHO ; Nancy IDE ; Jin Dong KIM ; Jan Christoph KLIE ; Keith SUDERMAN
Genomics & Informatics 2019;17(2):e19-
In this paper, we investigate cross-platform interoperability for natural language processing (NLP) and, in particular, annotation of textual resources, with an eye toward identifying the design elements of annotation models and processes that are particularly problematic for, or amenable to, enabling seamless communication across different platforms. The study is conducted in the context of a specific annotation methodology, namely machine-assisted interactive annotation (also known as human-in-the-loop annotation). This methodology requires the ability to freely combine resources from different document repositories, access a wide array of NLP tools that automatically annotate corpora for various linguistic phenomena, and use a sophisticated annotation editor that enables interactive manual annotation coupled with on-the-fly machine learning. We consider three independently developed platforms, each of which utilizes a different model for representing annotations over text, and each of which performs a different role in the process.
Linguistics; Machine Learning; Natural Language Processing
5.Survey on natural language processing in medical image analysis.
Zhengliang LIU ; Mengshen HE ; Zuowei JIANG ; Zihao WU ; Haixing DAI ; Lian ZHANG ; Siyi LUO ; Tianle HAN ; Xiang LI ; Xi JIANG ; Dajiang ZHU ; Xiaoyan CAI ; Bao GE ; Wei LIU ; Jun LIU ; Dinggang SHEN ; Tianming LIU
Journal of Central South University(Medical Sciences) 2022;47(8):981-993
Recent advances in natural language processing (NLP) and medical imaging have broadened the applicability of deep learning models. These developments have increased not only data understanding, but also knowledge of state-of-the-art architectures and their real-world potential. Medical imaging researchers have recognized the limitations of targeting images alone, as well as the importance of integrating multimodal inputs into medical image analysis. The lack of comprehensive surveys of the current literature, however, impedes the progress of this domain. This work reviews existing research perspectives, as well as the architectures, tasks, datasets, and performance measures examined in the present literature. We also briefly describe possible future directions in the field, aiming to provide researchers and healthcare professionals with a detailed summary of existing academic research and with rational insights to facilitate future research.
Humans; Natural Language Processing; Surveys and Questionnaires
6.Text Mining in Biomedical Domain with Emphasis on Document Clustering.
Healthcare Informatics Research 2017;23(3):141-146
OBJECTIVES: With the exponential increase in the number of articles published every year in the biomedical domain, there is a need to build automated systems to extract unknown information from the articles published. Text mining techniques enable the extraction of unknown knowledge from unstructured documents. METHODS: This paper reviews text mining processes in detail and the software tools available to carry out text mining. It also reviews the roles and applications of text mining in the biomedical domain. RESULTS: Text mining processes, such as search and retrieval of documents, pre-processing of documents, natural language processing, methods for text clustering, and methods for text classification are described in detail. CONCLUSIONS: Text mining techniques can facilitate the mining of vast amounts of knowledge on a given topic from published biomedical research articles and draw meaningful conclusions that are not possible otherwise.
Classification; Cluster Analysis*; Data Mining*; Mining; Natural Language Processing
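Document clustering, as surveyed in the abstract above, typically rests on a document representation and a similarity measure. The sketch below shows one common choice, TF-IDF weighting with cosine similarity, in plain Python. The review itself does not prescribe this particular weighting; the smoothed IDF formula, the whitespace tokenizer, and the example documents are assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        vectors.append({
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in term_freq.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


docs = [
    "gene expression in yeast",
    "yeast gene expression profile",
    "clinical discharge summary text",
]
vectors = tfidf_vectors(docs)
print(cosine(vectors[0], vectors[1]))  # documents sharing terms
print(cosine(vectors[0], vectors[2]))  # documents sharing no terms
```

A clustering method such as k-means or hierarchical agglomeration would then group documents whose vectors are close under this measure.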
7.Toward the Automatic Generation of the Entry Level CDA Documents.
Sungwon JUNG ; Seunghee KIM ; Sooyoung YOO ; Jinwook CHOI
Journal of Korean Society of Medical Informatics 2009;15(1):141-151
OBJECTIVE: CDA (Clinical Document Architecture) is a markup standard for clinical document exchange. In order to increase the semantic interoperability of document exchange, the clinical statements in the narrative blocks should be encoded with code values. Natural language processing (NLP) is required to transform the narrative blocks into the coded elements of level 3 CDA documents. In this paper, we evaluate the accuracy of NLP-based text mapping methods. METHODS: We analyzed about one thousand discharge summaries to understand their characteristics and focused on the syntactic patterns of the diagnostic sections in the discharge summaries. According to the patterns, different rules were applied for matching code values of Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT). RESULTS: The accuracy of matching was evaluated using five hundred discharge summaries. The precision was as follows: 86.5% for diagnosis, 61.8% for chief complaint, 62.7% for problem list, and 64.8% for discharge medication. CONCLUSION: The text processing method based on the pattern analysis of a clinical statement can be effectively used for generating CDA entries.
Diagnosis; Natural Language Processing; Semantics; Systematized Nomenclature of Medicine
8.PubMiner: Machine Learning-based Text Mining for Biomedical Information Analysis.
Jae Hong EOM ; Byoung Tak ZHANG
Genomics & Informatics 2004;2(2):99-106
In this paper, we introduce PubMiner, an intelligent machine learning-based text mining system for mining biological information from the literature. PubMiner employs natural language processing techniques and machine learning-based data mining techniques to mine useful biological information, such as protein-protein interactions, from the massive literature. The system recognizes biological terms such as genes, proteins, and enzymes and extracts their interactions described in the document through natural language processing. The extracted interactions are further analyzed with a set of features of each entity, collected from the related public databases, to infer more interactions from the original interactions. The interactions inferred by this analysis and the native interactions are provided to the user with links to the literature sources. The performance of entity and interaction extraction was tested with selected MEDLINE abstracts. The inference was evaluated using the protein interaction data of S. cerevisiae (baker's yeast) from MIPS and SGD.
Data Mining*; Mining; Natural Language Processing; Machine Learning
9.MediScore: MEDLINE-based Interactive Scoring of Gene and Disease Associations.
Hye Young CHO ; Bermseok OH ; Jong Keuk LEE ; Kuchan KIMM ; InSong KOH
Genomics & Informatics 2004;2(3):131-133
MediScore is an information retrieval system that helps to search for the set of genes associated with a specific disease or the set of diseases associated with a specific gene. Despite recent improvements in natural language processing (NLP) and other text mining approaches for finding disease-associated genes, many false positive results arise because of the diversity of exceptional cases as well as ambiguities in gene names. To overcome these weak points of current text mining approaches, MediScore introduces two features: statistical normalization based on the normal approximation to the binomial distribution, which corrects inaccurate scores caused by common words that do not represent genes, and interactive rescoring by the user to remove false positive results. Interactive rescoring includes individual alias scoring for each gene to remove false gene synonyms, reference to MEDLINE abstracts, and cross-referencing between OMIM and other related information.
Data Mining; Databases, Genetic; Information Systems; Natural Language Processing
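The abstract above does not give MediScore's scoring formula. As a hedged illustration of what a normalization based on the normal approximation to the binomial distribution can look like, the sketch below computes the textbook z-score z = (k - np) / sqrt(np(1 - p)). The interpretation of the parameters (k co-mentions of a term among n disease abstracts, with p the background rate of the term) and all the numbers are assumptions made for illustration.

```python
import math


def binomial_z_score(k, n, p):
    """Z-score of k successes in n trials under Binomial(n, p).

    Uses the normal approximation: mean n*p, variance n*p*(1-p).
    """
    if n <= 0 or not 0 < p < 1:
        raise ValueError("need n > 0 and 0 < p < 1")
    mean = n * p
    std_dev = math.sqrt(n * p * (1 - p))
    return (k - mean) / std_dev


# Hypothetical numbers: both terms appear in 30 of 1000 disease abstracts.
# A common word has a high background rate, a specific gene name a low one.
common_word = binomial_z_score(k=30, n=1000, p=0.02)
gene_name = binomial_z_score(k=30, n=1000, p=0.001)
print(common_word, gene_name)
```

With the same raw count, the term with the higher background rate receives a much lower z-score, which is how a normalization of this kind can demote common words that merely look like gene names.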
10.Automatic labeling and extraction of terms in natural language processing in acupuncture clinical literature.
Hua-Yun LIU ; Chen-Jing HAN ; Jie XIONG ; Hai-Yan LI ; Lei LEI ; Bao-Yan LIU
Chinese Acupuncture & Moxibustion 2022;42(3):327-331
The paper analyzes the specific characteristics of term recognition in acupuncture clinical literature and compares the advantages and disadvantages of three named entity recognition (NER) methods adopted in the field of traditional Chinese medicine. The authors consider that the bi-directional long short-term memory network with conditional random fields (BiLSTM-CRF) can capture context information and perform NER using fewer feature rules, and that this model is therefore suitable for term recognition in acupuncture clinical literature. Based on this model, it is proposed that the process of term recognition in acupuncture clinical literature should include four steps, i.e. literature preprocessing, sequence labeling, model training, and performance evaluation, which provides an approach to structuring the terminology in acupuncture clinical literature.
Acupuncture Therapy; Electronic Health Records; Natural Language Processing
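The sequence labeling step named in the abstract above is commonly implemented with per-token tags, and the output of a tagger such as BiLSTM-CRF must then be decoded into terms. The abstract does not state which tagging scheme was used; the sketch below assumes the widespread BIO scheme (B- begins a term, I- continues it, O is outside). The labels `ACUPOINT` and `SYMPTOM` and the example sentence are hypothetical.

```python
def bio_to_spans(tokens, tags):
    """Decode BIO tags into a list of (label, term text) pairs."""
    spans = []
    current_tokens, current_label = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new term starts; close any open one first.
            if current_tokens:
                spans.append((current_label, " ".join(current_tokens)))
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            # Continuation of the currently open term.
            current_tokens.append(token)
        else:
            # "O", or an I- tag with no matching open term: treat as outside.
            if current_tokens:
                spans.append((current_label, " ".join(current_tokens)))
            current_tokens, current_label = [], None
    if current_tokens:
        spans.append((current_label, " ".join(current_tokens)))
    return spans


tokens = ["needling", "at", "Zusanli", "ST36", "relieves", "knee", "pain"]
tags = ["O", "O", "B-ACUPOINT", "I-ACUPOINT", "O", "B-SYMPTOM", "I-SYMPTOM"]
print(bio_to_spans(tokens, tags))
```

Comparing the decoded spans against manually annotated spans is one way to carry out the evaluation step that the abstract lists last.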