1.An antibacterial peptides recognition method based on BERT and Text-CNN.
Xiaofang XU ; Chunde YANG ; Kunxian SHU ; Xinpu YUAN ; Mocheng LI ; Yunping ZHU ; Tao CHEN
Chinese Journal of Biotechnology 2023;39(4):1815-1824
Antimicrobial peptides (AMPs) are small-molecule peptides that are widely found in living organisms and have broad-spectrum antibacterial activity and immunomodulatory effects. Because resistance to them emerges more slowly and they have excellent clinical potential and a wide range of applications, AMPs are a strong alternative to conventional antibiotics. AMP recognition is an important direction in AMP research. Wet-lab experimental methods are costly, inefficient and slow, so they cannot meet the need for large-scale AMP recognition. Computer-aided identification methods are therefore an important supplement to AMP recognition approaches, and one of the key issues is how to improve their accuracy. Protein sequences can be approximated as a language composed of amino acids, so rich features may be extracted using natural language processing (NLP) techniques. In this paper, we combine the pre-trained model BERT with Text-CNN as the fine-tuning structure, both from the field of NLP, to model protein language, develop an open-source AMP recognition tool, and compare it with five other published tools. The experimental results show that optimizing the two-phase training approach brings an overall improvement in accuracy, sensitivity, specificity, and Matthews correlation coefficient, offering a novel approach for further research on AMP recognition.
Anti-Bacterial Agents/chemistry*
;
Amino Acid Sequence
;
Antimicrobial Cationic Peptides/chemistry*
;
Antimicrobial Peptides
;
Natural Language Processing
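The four evaluation measures named in the abstract above (accuracy, sensitivity, specificity, and Matthews correlation coefficient) can be computed from a binary confusion matrix. The following is a minimal sketch using the standard textbook definitions; the function names and the example counts are illustrative assumptions and are not taken from the paper or its tool.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels (1 = AMP, 0 = non-AMP)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def binary_metrics(tp, fp, tn, fn):
    """Return accuracy, sensitivity, specificity and MCC as a dict."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    # Sensitivity: fraction of true AMPs that were recognized.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # Specificity: fraction of non-AMPs that were correctly rejected.
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when any marginal is zero; 0.0 is a common convention.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }


if __name__ == "__main__":
    # Hypothetical labels, for illustration only.
    y_true = [1, 1, 1, 0, 0, 0, 1, 0]
    y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
    print(binary_metrics(*confusion_counts(y_true, y_pred)))
```

Reporting MCC alongside accuracy is useful when the positive and negative classes are unbalanced, because MCC uses all four cells of the confusion matrix.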
2.Automatic labeling and extraction of terms in natural language processing in acupuncture clinical literature.
Hua-Yun LIU ; Chen-Jing HAN ; Jie XIONG ; Hai-Yan LI ; Lei LEI ; Bao-Yan LIU
Chinese Acupuncture & Moxibustion 2022;42(3):327-331
The paper analyzes the specific characteristics of term recognition in acupuncture clinical literature and compares the advantages and disadvantages of three named entity recognition (NER) methods adopted in the field of traditional Chinese medicine. The bi-directional long short-term memory network with conditional random fields (BiLSTM-CRF) is considered able to capture context information and complete NER with fewer feature rules, which makes this model suitable for term recognition in acupuncture clinical literature. Based on this model, it is proposed that the process of term recognition in acupuncture clinical literature should include four aspects, i.e. literature preprocessing, sequence labeling, model training and effect evaluation, which provides an approach to the terminological structuring of acupuncture clinical literature.
Acupuncture Therapy
;
Electronic Health Records
;
Natural Language Processing
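The sequence labeling step mentioned in the abstract above assigns one tag per token, and terms are then read off from the tag sequence. The sketch below assumes the common BIO scheme (B- begins a term, I- continues it, O is outside any term); the abstract does not state which scheme the authors used, and the label names ACUPOINT and SYMPTOM are hypothetical.

```python
def extract_terms(tokens, tags):
    """Collect (term, label) pairs from tokens with BIO tags."""
    terms = []
    current_tokens, current_label = [], None

    def flush():
        # Close the term currently being built, if any.
        nonlocal current_tokens, current_label
        if current_tokens:
            terms.append((" ".join(current_tokens), current_label))
        current_tokens, current_label = [], None

    for token, tag in zip(tokens, tags):
        if tag == "O":
            flush()
            continue
        prefix, _, label = tag.partition("-")
        if prefix == "B" or label != current_label:
            # A new term starts (a stray I- tag is treated as a start).
            flush()
            current_tokens, current_label = [token], label
        else:
            current_tokens.append(token)
    flush()
    return terms


if __name__ == "__main__":
    # Hypothetical sentence and tags, for illustration only.
    tokens = ["Needling", "at", "Zusanli", "(ST36)", "relieved", "low", "back", "pain"]
    tags = ["O", "O", "B-ACUPOINT", "I-ACUPOINT", "O", "B-SYMPTOM", "I-SYMPTOM", "I-SYMPTOM"]
    print(extract_terms(tokens, tags))
```

In a BiLSTM-CRF pipeline, the tags would come from the trained model; this decoding step is the same regardless of which tagger produced them.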
3.Survey on natural language processing in medical image analysis.
Zhengliang LIU ; Mengshen HE ; Zuowei JIANG ; Zihao WU ; Haixing DAI ; Lian ZHANG ; Siyi LUO ; Tianle HAN ; Xiang LI ; Xi JIANG ; Dajiang ZHU ; Xiaoyan CAI ; Bao GE ; Wei LIU ; Jun LIU ; Dinggang SHEN ; Tianming LIU
Journal of Central South University(Medical Sciences) 2022;47(8):981-993
Recent advances in natural language processing (NLP) and medical imaging have broadened the applicability of deep learning models. These developments have increased not only data understanding, but also knowledge of state-of-the-art architectures and their real-world potential. Medical imaging researchers have recognized the limitations of targeting images alone, as well as the importance of integrating multimodal inputs into medical image analysis. The lack of comprehensive surveys of the current literature, however, impedes progress in this domain. This work reviews existing research perspectives, as well as the architectures, tasks, datasets, and performance measures examined in the present literature. We also briefly describe possible future directions in the field, aiming to provide researchers and healthcare professionals with a detailed summary of existing academic research and rational insights to facilitate future research.
Humans
;
Natural Language Processing
;
Surveys and Questionnaires
4.Artificial intelligence based Chinese clinical trials eligibility criteria classification.
Hui ZONG ; Zeyu ZHANG ; Jinxuan YANG ; Jianbo LEI ; Zuofeng LI ; Tianyong HAO ; Xiaoyan ZHANG
Journal of Biomedical Engineering 2021;38(1):105-110
Subject recruitment is a key component that affects the progress and results of clinical trials, and it is generally conducted according to eligibility criteria (comprising inclusion criteria and exclusion criteria). Semantic category analysis of eligibility criteria can help optimize clinical trial design and build automated patient recruitment systems. This study explored the automatic classification of Chinese eligibility criteria into semantic categories using artificial intelligence, through an academic shared task. We collected a total of 38 341 annotated eligibility criteria sentences and predefined 44 semantic categories. A total of 75 teams participated in the competition, of which 27 submitted system outputs. Based on the results, we found that most teams adopted mixed models. The mainstream solution was to apply pre-trained language models capable of providing rich semantic representations, combine them with neural network models, and fine-tune them for the classification task; classification performance could then be further improved by ensemble modeling. The best-performing system achieved a macro
Artificial Intelligence
;
China
;
Humans
;
Language
;
Natural Language Processing
;
Neural Networks, Computer
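The abstract above reports that ensemble modeling improved classification performance but does not say how the teams combined their models. One simple possibility is majority voting over the per-sentence predictions of several classifiers, sketched below; the voting rule and the category names are assumptions for illustration, not the method of any participating team.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine aligned label lists (one list per model) by majority vote.

    Ties go to the label that appears first among the votes, which is
    the order Counter.most_common uses for equal counts.
    """
    ensembled = []
    for votes in zip(*predictions_per_model):
        label, _count = Counter(votes).most_common(1)[0]
        ensembled.append(label)
    return ensembled


if __name__ == "__main__":
    # Hypothetical predictions from three models for three sentences.
    model_a = ["Age", "Disease", "Age"]
    model_b = ["Age", "Therapy", "Age"]
    model_c = ["Sex", "Disease", "Age"]
    print(majority_vote([model_a, model_b, model_c]))
```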
5.Health Information Technology Trends in Social Media: Using Twitter Data
Jisan LEE ; Jeongeun KIM ; Yeong Joo HONG ; Meihua PIAO ; Ahjung BYUN ; Healim SONG ; Hyeong Suk LEE
Healthcare Informatics Research 2019;25(2):99-105
OBJECTIVES: This study analyzed health technology trends and user sentiments using Twitter data in an attempt to examine the public's opinions and identify their needs. METHODS: Twitter data related to health technology, from January 2010 to October 2016, were collected. An ontology related to health technology was developed. Frequently occurring keywords were analyzed and visualized with the word cloud technique. The keywords were then reclassified and analyzed using the developed ontology and a sentiment dictionary. Python and the R program were used for crawling, natural language processing, and sentiment analysis. RESULTS: In the developed ontology, the keywords are divided into ‘health technology’ and ‘health information’. Under health technology, there are six subcategories, namely, health technology, wearable technology, biotechnology, mobile health, medical technology, and telemedicine. Under health information, there are four subcategories, namely, health information, privacy, clinical informatics, and consumer health informatics. The number of tweets about health technology has consistently increased since 2010; the number of posts in 2014 was double that in 2010, which was about 150 thousand posts. Posts about mHealth accounted for the majority, and the dominant words were ‘care’, ‘new’, ‘mental’, and ‘fitness’. Sentiment analysis by subcategory showed that most of the posts in nearly all subcategories had a positive tone with a positive score. CONCLUSIONS: Interest in mHealth has risen recently, and consequently, posts about mHealth were the most frequent. Examining social media users' responses to new health technology can be a useful method to understand the trends in rapidly evolving fields.
Biomedical Technology
;
Biotechnology
;
Boidae
;
Data Mining
;
Informatics
;
Medical Informatics
;
Methods
;
Natural Language Processing
;
Privacy
;
Public Opinion
;
Social Media
;
Telemedicine
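The keyword frequency analysis and dictionary-based sentiment scoring described in the abstract above can be illustrated with a small sketch. The stop-word list, the sentiment lexicon and the example posts below are all hypothetical; the study's actual ontology, dictionary and code are not given in the abstract.

```python
import re
from collections import Counter

# Hypothetical stop words and sentiment dictionary, for illustration only.
STOPWORDS = {"a", "an", "is", "my", "the", "for"}
SENTIMENT_LEXICON = {"great": 1, "helpful": 1, "broken": -1, "bad": -1}


def tokenize(text):
    """Lowercase the text and keep alphabetic words that are not stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def keyword_counts(posts):
    """Count how often each keyword occurs across all posts."""
    counts = Counter()
    for post in posts:
        counts.update(tokenize(post))
    return counts


def sentiment_score(post):
    """Sum the lexicon scores of a post's words; above 0 means a positive tone."""
    return sum(SENTIMENT_LEXICON.get(w, 0) for w in tokenize(post))


if __name__ == "__main__":
    posts = [
        "New fitness tracker is great",
        "Mental health care app is helpful",
        "My fitness app is broken",
    ]
    print(keyword_counts(posts).most_common(3))
    print([sentiment_score(p) for p in posts])
```

The counts from `keyword_counts` are the kind of input a word cloud would be drawn from, and grouping posts by ontology subcategory before calling `sentiment_score` would give per-subcategory sentiment.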
6.Improving spaCy dependency annotation and PoS tagging web service using independent NER services
Genomics & Informatics 2019;17(2):e21-
Dependency parsing is often used as a component in many text analysis pipelines. However, performance, especially in specialized domains, suffers from the presence of complex terminology. Our hypothesis is that including named entity annotations can improve the speed and quality of dependency parses. As part of BLAH5, we built a web service delivering improved dependency parses by taking into account named entity annotations obtained by third party services. Our evaluation shows improved results and better speed.
Natural Language Processing
7.Towards cross-platform interoperability for machine-assisted text annotation
Richard ECKART DE CASTILHO ; Nancy IDE ; Jin Dong KIM ; Jan Christoph KLIE ; Keith SUDERMAN
Genomics & Informatics 2019;17(2):e19-
In this paper, we investigate cross-platform interoperability for natural language processing (NLP) and, in particular, annotation of textual resources, with an eye toward identifying the design elements of annotation models and processes that are particularly problematic for, or amenable to, enabling seamless communication across different platforms. The study is conducted in the context of a specific annotation methodology, namely machine-assisted interactive annotation (also known as human-in-the-loop annotation). This methodology requires the ability to freely combine resources from different document repositories, access a wide array of NLP tools that automatically annotate corpora for various linguistic phenomena, and use a sophisticated annotation editor that enables interactive manual annotation coupled with on-the-fly machine learning. We consider three independently developed platforms, each of which utilizes a different model for representing annotations over text, and each of which performs a different role in the process.
Linguistics
;
Machine Learning
;
Natural Language Processing
8.OryzaGP: rice gene and protein dataset for named-entity recognition
Pierre LARMANDE ; Huy DO ; Yue WANG
Genomics & Informatics 2019;17(2):e17-
Text mining has become an important research method in biology; its original purpose was to extract biological entities, such as genes, proteins and phenotypic traits, in order to extend knowledge from scientific papers. However, few thorough studies on text mining and application development for plant molecular biology data have been performed, especially for rice, resulting in a lack of datasets available to solve named-entity recognition tasks for this species. Since few benchmarks are available for rice, we faced various difficulties in exploiting advanced machine learning methods for accurate analysis of the rice literature. To evaluate several approaches to automatically extracting information on gene/protein entities, we built a new dataset for rice as a benchmark. This dataset is composed of a set of titles and abstracts, extracted from scientific papers focusing on the rice species and downloaded from PubMed. During the 5th Biomedical Linked Annotation Hackathon, a portion of the dataset was uploaded to PubAnnotation for sharing. Our ultimate goal is to offer a shared task of rice gene/protein name recognition through the BioNLP Open Shared Tasks framework using the dataset, to facilitate an open comparison and evaluation of different approaches to the task.
Benchmarking
;
Biology
;
Data Mining
;
Dataset
;
Machine Learning
;
Methods
;
Molecular Biology
;
Natural Language Processing
;
Oryza
;
Plants
9.Resources for assigning MeSH IDs to Japanese medical terms
Genomics & Informatics 2019;17(2):e16-
Medical Subject Headings (MeSH), a medical thesaurus created by the National Library of Medicine (NLM), is a useful resource for natural language processing (NLP). In this article, the current status of the Japanese version of MeSH is reviewed. An online investigation found that Japanese-English dictionaries that assign MeSH information to applicable terms exist, but that they are difficult to access for NLP use because of license restrictions. Here, we investigate an open-source Japanese-English glossary as an alternative means of assigning MeSH IDs to Japanese terms, to obtain preliminary data for an NLP proof of concept.
Asian Continental Ancestry Group
;
Humans
;
Licensure
;
Medical Subject Headings
;
Methods
;
National Library of Medicine (U.S.)
;
Natural Language Processing
;
Vocabulary, Controlled
10.PharmacoNER Tagger: a deep learning-based tool for automatically finding chemicals and drugs in Spanish medical texts
Jordi ARMENGOL-ESTAPÉ ; Felipe SOARES ; Montserrat MARIMON ; Martin KRALLINGER
Genomics & Informatics 2019;17(2):e15-
Automatically detecting mentions of pharmaceutical drugs and chemical substances is key for the subsequent extraction of relations of chemicals with other biomedical entities such as genes, proteins, diseases, adverse reactions or symptoms. The identification of drug mentions is also a prerequisite for complex event types such as drug dosage recognition, duration of medical treatments or drug repurposing. Formally, this task is known as named entity recognition (NER), meaning automatically identifying mentions of predefined entities of interest in running text. In the domain of medical texts, for chemical entity recognition (CER), techniques based on hand-crafted rules and graph-based models can provide adequate performance. In recent years, the field of natural language processing has mainly pivoted to deep learning, and state-of-the-art results for most tasks involving natural language are usually obtained with artificial neural networks. Competitive resources for drug name recognition in English medical texts are already available and heavily used, while for other languages such as Spanish these tools, although clearly needed, were missing. In this work, we adapt an existing neural NER system, NeuroNER, to the particular domain of Spanish clinical case texts, and extend the neural network to be able to take into account additional features apart from the plain text. NeuroNER can be considered a competitive baseline system for Spanish drug and CER promoted by the Spanish national plan for the advancement of language technologies (Plan TL).
Drug Repositioning
;
Learning
;
Machine Learning
;
Natural Language Processing
;
Neural Networks (Computer)
;
Neurons
;
Running
