Development and validation of PhenoRAG: A visualization tool for automated human phenotype ontology term annotation based on large language models and retrieval-augmented generation technology.

Wei ZHONG; Yousheng YAN; Kai YANG; Yan LIU; Xinyu FU; Zhengyang YAO; Chenghong YIN

Return

Development and validation of PhenoRAG: A visualization tool for automated human phenotype ontology term annotation based on large language models and retrieval-augmented generation technology.

Author: Wei ZHONG ¹ ; Yousheng YAN ; Kai YANG ; Yan LIU ; Xinyu FU ; Zhengyang YAO ; Chenghong YIN
Author Information

1. Department of Prenatal Diagnosis, Beijing Obstetrics and Gynecology Hospital, Capital Medical University, Beijing Maternal and Child Health Care Hospital, Beijing 100026, China. yinchh@ccmu.edu.cn.
Publication Type:English Abstract
MeSH: Humans; Phenotype; Biological Ontologies; Language; Software; Large Language Models
From: Chinese Journal of Medical Genetics 2026;43(1):36-43
CountryChina
Language:Chinese
Abstract: OBJECTIVE:To develop a user-friendly visualization application for the automatic annotation of Human Phenotype Ontology (HPO) terms based on large language models and retrieval-augmented generation (RAG) technology, and to validate its performance in an authoritative case dataset.
METHODS:By integrating the domestic open-source large language model DeepSeek-V3 with RAG technology, an interactive web application was deployed on the Streamlit cloud platform. Using only the latest official HPO dataset as the data source, the lightweight sentence-embedding model BAAI/bge-small-en-v1.5 was employed to construct a FAISS vector index. During the online phase, a four-step closed-loop process is automatically completed: multilingual translation, phenotype phrase extraction, RAG candidate retrieval, term mapping, and official database validation. 121 English case reports publicly released by BMJ Case Reports and Oxford Medical Case Reports (with a gold-standard HPO set of 1 794 terms) were selected for application validation. Precision, recall, and F1 score were calculated and compared horizontally with traditional dictionary tools, standalone large language models, and the similar application "RAG-HPO". Finally, replace the model with the more advanced ChatGPT-5 and evaluate its performance on the newly extracted dataset.
RESULTS:An HPO term automatic annotation visualization application named PhenoRAG, based on large language models and RAG technology, was successfully developed. Users can access it directly via a web link. Across the 112 cases, a total of 2 150 HPO terms were generated; 2,064 (96.0%) were fully validated by the official database, with a hallucination rate of 1.3% and an HPO ID-name mismatch rate of 2.7%. After deduplication, 1,906 terms remained for testing. The overall precision was 63.65%, recall was 67.34%, and F1 was 65.44%, significantly outperforming traditional annotation tools (F1: 0.45-0.49, P < 0.001). Although PhenoRAG's F1 was lower than that of RAG-HPO (F1 = 0.78, P < 0.001), which relies on a manually constructed synonym database of 54 000 entries plus the HPO dataset, it requires no additional dictionary maintenance and can be used without any background in computer programming. Moreover, after switching to the GPT-5 model, PhenoRAG exhibited no hallucination rate on the new dataset, and its F1 score significantly increased (P = 0.038).
CONCLUSION:Without constructing a synonym database, the PhenoRAG achieved high-accuracy automatic mapping from clinical text to standard HPO terms. It features a low usage threshold, free access, and a Chinese-language interface, and can directly serve rare disease diagnosis, genetic counseling, and research scenarios in China and worldwide, warranting further clinical promotion and multicenter validation.