Pathogenesis Reasoning Chain-of-thought Supervision for Large Language Models: Syndrome Manifestation Recognition and Multidimensional Evaluation in Spleen-stomach Disorders
- VernacularTitle:基于病机推理思维链监督的大模型脾胃病证候识别及多维评价研究
- Author:
Shu-Han YANG
1
;
Yu-Xin HU
2
;
Xin-Yu YU
1
;
Yu-Ying TU
3
;
Yi-Chang ZANG
1
;
Pan-Fei LI
1
Author Information
- Publication Type:Journal Article
- Keywords: traditional Chinese medicine syndrome manifestation recognition; large language models; chain-of-thought; syndrome pattern evaluation; spleen-stomach disorders
- From: Progress in Biochemistry and Biophysics 2026;53(5):1240-1263
- CountryChina
- Language:English
- Abstract: ObjectiveThe essence of syndrome manifestation recognition in traditional Chinese medicine (TCM) is to infer the body’s latent pathogenesis state from clinical observational information, rather than to perform simple label matching. However, previous studies have largely modeled this task as syndrome pattern classification within a fixed label space, which does not adequately reflect the cognition process of TCM syndrome differentiation centered on pathogenesis reasoning, and is also insufficient to capture the openness, semantic variability, and cross-disease reusability of syndrome manifestation expression. This study aimed to investigate whether introducing pathogenesis reasoning chain-of-thought (PR-CoT) supervision into large language models (LLMs) could improve the quality and cognitive consistency of syndrome manifestation recognition and support cross-disease transfer. MethodsSyndrome manifestation recognition was formulated as a conditional generation task under the framework of clinical observational information (X)→pathogenesis structure (Z)→syndrome pattern output (Y), where Z serves as an explicit intermediate structural variable linking the clinical evidence and syndrome judgment. Within this framework, a PR-CoT-supervised dataset for syndrome manifestation recognition was constructed based on medical case records of spleen-stomach disorders. After preprocessing, information extraction, manual proofreading, and data cleaning, the dataset comprised 4 800 training cases, 400 development cases, and 400 test cases. Each sample was annotated with a structured PR-CoT consisting of three progressive levels: clinical information summarization, comprehensive pathogenesis analysis, and syndrome pattern output. Supervised fine-tuning was conducted on open-source LLMs, with an end-to-end model serving as the baseline. Qwen3-32B was used as the primary experimental model, and Qwen3-14B as the scale comparison model. A progressive multidimensional evaluation framework was further established, comprising a structural parsing level, a semantic similarity level, and an expert blind review level. At the structural parsing level, syndrome pattern expressions were decomposed into structural elements and evaluated using Precision, Recall, F1 score, and Jaccard similarity. At the semantic similarity level, independent LLMs scored the theoretical proximity between predicted and reference syndrome patterns. At the expert blind review level, three TCM experts independently evaluated model outputs on two dimensions: syndrome differentiation consistency and terminology standardization of syndrome patterns. In addition, zero-shot cross-disease transfer evaluation was conducted on gynecological and heart-system disorder test sets. ResultsAt the structural parsing level, PR-CoT supervision did not lead to a stable improvement in the element-wise overlap of syndrome pattern structural components. Compared with the corresponding baselines, neither Qwen3-32B nor Qwen3-14B showed consistent advantages in structural matching metrics after the introduction of PR-CoT supervision. In contrast, at the semantic similarity level, PR-CoT supervision produced stable positive gains across different model scales and evaluation systems. The average semantic score of Qwen3-32B increased from 6.425 8 in the baseline model to 6.585 0 after PR-CoT supervision, and that of Qwen3-14B increased from 5.870 0 to 5.964 2. At the expert blind review level, the overall score of Qwen3-32B (PR-CoT) was 7.026 0±0.107 7, higher than 6.416 3±0.288 9 for its baseline. In zero-shot cross-disease testing, the PR-CoT model still showed advantages in semantic evaluation and expert evaluation on both gynecological and heart-system disorder test sets, indicating a certain degree of transferability. ConclusionThe benefits of PR-CoT supervision are mainly reflected in TCM semantic consistency and clinical plausibility, rather than in improved hard matching of structural elements. These findings support understanding syndrome manifestation recognition as a process of generating and expressing latent pathogenesis structures, rather than as a classification task within a traditional fixed label space. By introducing pathogenesis reasoning as an explicit intermediate structure into the modeling process and combining it with a progressive multidimensional evaluation framework, this study provides a methodological pathway for intelligent TCM syndrome differentiation that integrates theoretical alignment, interpretability, and multi-level evaluation.
