1.QingNangTCM: a parameter-efficient fine-tuning large language model for traditional Chinese medicine
Xuming TONG ; Liyan LIU ; Yanhong YUAN ; Xiaozheng DING ; Huiru JIA ; Xu YANG ; Sio Kei IM ; Mini Han WANG ; Zhang XIONH ; Yapeng WANG
Digital Chinese Medicine 2026;9(1):1-12
Objective:
To develop QingNangTCM, a specialized large language model (LLM) tailored for expert-level traditional Chinese medicine (TCM) question-answering and clinical reasoning, addressing the scarcity of domain-specific corpora and specialized alignment.
Methods:
We constructed QnTCM_Dataset, a corpus of 100 000 entries, by integrating data from ShenNong_TCM_Dataset and SymMap v2.0, and synthesizing additional samples via retrieval-augmented generation (RAG) and persona-driven generation. The dataset comprehensively covers diagnostic inquiries, prescriptions, and herbal knowledge. Utilizing P-Tuning v2, we fine-tuned the GLM-4-9B-Chat backbone to develop QingNangTCM. A multi-dimensional evaluation framework, assessing accuracy, coverage, consistency, safety, professionalism, and fluency, was established using metrics such as bilingual evaluation understudy (BLEU), recall-oriented understudy for gisting evaluation (ROUGE), metric for evaluation of translation with explicit ordering (METEOR), and LLM-as-a-Judge with expert review. Qualitative analysis was conducted across four simulated clinical scenarios: symptom analysis, disease treatment, herb inquiry, and failure cases. Baseline models included GLM-4-9B-Chat, DeepSeek-V2, HuatuoGPT-II (7B), and GLM-4-9B-Chat (freeze-tuning).
Results:
QingNangTCM achieved the highest scores in BLEU-1/2/3/4 (0.425/0.298/0.137/0.064), ROUGE-1/2 (0.368/0.157), and METEOR (0.218), demonstrating a balanced and superior normalized performance profile of 0.900 across the dimensions of accuracy, coverage, and consistency. Although its ROUGE-L score (0.299) was lower than that of HuatuoGPT-II (7B) (0.351), it significantly outperformed domain-specific models in expert-validated win rates for professionalism (86%) and safety (73%). Qualitative analysis confirmed that the model strictly adheres to the “symptom-syndrome-pathogenesis-treatment” reasoning chain, though occasional misclassifications and hallucinations persisted when dealing with rare medicinal materials and uncommon syndromes.
Conclusion:
Combining domain-specific corpus construction with parameter-efficient prompt tuning enhances the reasoning behavior and domain adaptation of LLMs for TCM-related tasks. This work provides a technical framework for the digital organization and intelligent utilization of TCM knowledge, with potential value for supporting diagnostic reasoning and medical education.
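The overlap metrics named in the Methods can be illustrated with a minimal sketch. It is not the authors' evaluation code: it covers only the clipped n-gram precision at the core of BLEU (no brevity penalty, no geometric averaging) and the longest-common-subsequence recall behind ROUGE-L. The function names and the toy sentences are hypothetical and are not drawn from QnTCM_Dataset.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n):
    """Modified n-gram precision, the core term of BLEU-n (no brevity penalty)."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return matched / total


def lcs_length(a, b):
    """Length of the longest common subsequence of two token sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_recall(candidate, reference):
    """ROUGE-L recall: LCS length divided by the reference length."""
    if not reference:
        return 0.0
    return lcs_length(candidate, reference) / len(reference)


# Hypothetical toy sentences for illustration only.
reference = "ginseng tonifies qi and strengthens the spleen".split()
candidate = "ginseng strengthens the spleen and tonifies qi".split()

p1 = clipped_precision(candidate, reference, 1)  # unigram overlap ignores word order
p2 = clipped_precision(candidate, reference, 2)  # bigram overlap is order-sensitive
rl = rouge_l_recall(candidate, reference)        # LCS rewards in-order matches only
print(p1, p2, round(rl, 3))
```

In this toy pair every word of the reference is present, so unigram precision is perfect, while the reordering lowers both the bigram precision and the ROUGE-L recall. This shows only that the metrics respond differently to word order; it does not explain any particular score reported above.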
2.NLUS-VQA: construction and evaluation of a visual question answering model for neonatal lung ultrasound diagnosis
Xuming TONG ; Jiangang CHEN ; Yiran WANG ; Xiqing ZHAO ; Yanhong YUAN ; Zishuo WANG ; Peng JIANG ; Qingyao XIONG ; Renxing LI ; Xueli WANG ; Jing LIU
Chinese Journal of Perinatal Medicine 2025;28(11):917-928
Objective:
To develop and evaluate a medical visual question answering (VQA) model for neonatal lung ultrasound (LUS) images to enhance intelligent auxiliary diagnosis of neonatal pulmonary diseases.
Methods:
Using data from neonates admitted to Beijing Obstetrics and Gynecology Hospital, Capital Medical University (January 2023 to December 2024), an image-question-answer dataset comprising 251 LUS images was constructed [43 pneumonia (17.1%), 42 neonatal respiratory distress syndrome (16.7%), 83 transient tachypnea (33.1%), and 83 normal (33.1%) images] with a four-tier medical question-answer framework. Building upon the Qwen2.5-VL-7B base model and integrating LoRA fine-tuning with chain-of-thought prompting, we developed the NLUS-VQA model to enhance visual-language semantic alignment and enable stepwise clinical reasoning, achieving efficient small-sample adaptation. Model performance was assessed through natural language generation metrics (BLEU-4, ROUGE-1/2/L), qualitative evaluation of characteristic recognition, and clinical consistency analysis.
Results:
(1) Quantitative evaluation demonstrated that NLUS-VQA achieved scores of 22.38 (BLEU-4), 48.26 (ROUGE-1), 22.40 (ROUGE-2), and 37.20 (ROUGE-L), representing significant improvements over baseline models. (2) Qualitatively, the model exhibited strong performance in identifying lung consolidation, confluent B-lines, and snowflake signs, with its chain-of-thought strategy enhancing clinical interpretability and answer accuracy. (3) Clinically, NLUS-VQA achieved a Cohen's kappa coefficient of 0.78 and diagnostic accuracy of 80.8% (21/26), indicating substantial agreement with clinical experts.
Conclusion:
The NLUS-VQA model demonstrates robust interpretability in recognizing key sonographic patterns (e.g., lung consolidation, confluent B-lines, and snowflake signs), providing a scalable framework for small-sample medical image analysis, though diagnostic performance on complex conditions remains limited by dataset scale and minority class representation.
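The agreement statistics in the Results can be reproduced in form with a short sketch. It assumes the standard definition of Cohen's kappa, (p_o - p_e) / (1 - p_e), and uses hypothetical toy labels. The per-case expert and model labels from the study are not given in the abstract, so the toy kappa is not the reported 0.78. Only the final arithmetic checks use figures stated above.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of cases where the prediction matches the reference label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same cases."""
    n = len(rater_a)
    # Observed agreement: share of cases with identical labels.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the two raters' marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    p_e = sum(count_a[label] * count_b[label] for label in labels) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy labels, NOT the study's 26 evaluation cases.
expert = ["pneumonia", "pneumonia", "rds", "rds",
          "tachypnea", "tachypnea", "normal", "normal"]
model = ["pneumonia", "rds", "rds", "rds",
         "tachypnea", "tachypnea", "normal", "tachypnea"]

toy_accuracy = accuracy(expert, model)
toy_kappa = cohens_kappa(expert, model)

# Arithmetic checks on figures that ARE stated in the abstract.
reported_accuracy_pct = round(100 * 21 / 26, 1)
class_counts = {"pneumonia": 43, "rds": 42, "tachypnea": 83, "normal": 83}
class_share_pct = {k: round(100 * v / 251, 1) for k, v in class_counts.items()}
print(toy_accuracy, round(toy_kappa, 3), reported_accuracy_pct, class_share_pct)
```

Kappa discounts the agreement two raters would reach by chance given their label frequencies, which is why it is usually reported alongside raw accuracy when class sizes are unequal, as they are in this dataset.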
