Audio–Text Contrastive Representation Learning for Voice Assessment: Toward Assistive Clinical Applications
10.22469/jkslp.2026.37.1.33
- Author: Kwang Hyeon KIM¹; Yoonkyoung SO
Author Information
1. Clinical Research Support Center, Inje University Ilsan Paik Hospital, Goyang, Korea
- Publication Type: Original Article
- From: Journal of the Korean Society of Laryngology Phoniatrics and Logopedics 2026;37(1):33-42
- Country: Republic of Korea
- Language: Korean
- Abstract:
Background and Objectives: This study aims to develop and validate a bidirectional audio–text contrastive representation learning framework that improves alignment between linguistic content and speech production characteristics, and to explore its potential utility for future clinical voice assessment applications.
Materials and Method: A dual-encoder multimodal contrastive model was trained on 12,854 Korean speech–text pairs. Audio inputs were converted into 80-channel Mel spectrograms with SpecAugment, and text was processed using a KoBERT-based tokenizer. Joint embeddings were optimized with a bidirectional cosine-similarity InfoNCE loss over 100 epochs. Retrieval-based evaluation quantified alignment between paired inputs.
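The bidirectional cosine-similarity InfoNCE objective named above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy embeddings, and the temperature of 0.07 are assumptions made for illustration, since the abstract does not report these details. The sketch treats row i of the audio batch and row i of the text batch as the matched pair and averages the audio-to-text and text-to-audio cross-entropy terms.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def similarity_matrix(audio_emb, text_emb):
    """Cosine similarity between every audio embedding and every text embedding."""
    a = [l2_normalize(v) for v in audio_emb]
    t = [l2_normalize(v) for v in text_emb]
    return [[sum(x * y for x, y in zip(ai, tj)) for tj in t] for ai in a]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def bidirectional_info_nce(audio_emb, text_emb, temperature=0.07):
    """Average of audio-to-text and text-to-audio InfoNCE losses.

    temperature=0.07 is an assumed value, not one reported in the abstract.
    """
    sim = similarity_matrix(audio_emb, text_emb)
    logits_a2t = [[s / temperature for s in row] for row in sim]
    logits_t2a = [list(col) for col in zip(*logits_a2t)]  # transpose
    return 0.5 * (cross_entropy_diagonal(logits_a2t)
                  + cross_entropy_diagonal(logits_t2a))

# Toy batch: matched pairs point in the same direction.
audio = [[1.0, 0.0], [0.0, 1.0]]
text_aligned = [[2.0, 0.0], [0.0, 3.0]]
text_swapped = [[0.0, 3.0], [2.0, 0.0]]
loss_aligned = bidirectional_info_nce(audio, text_aligned)
loss_swapped = bidirectional_info_nce(audio, text_swapped)
```

In this toy batch the aligned pairs yield a loss near zero, while swapping the text rows yields a much larger loss, which is the behavior the contrastive objective relies on during training.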
Results: The dataset had an average utterance duration of 3.60±1.01 seconds and an average length of 4.99±1.73 words, with high variability in phonetic realizations reflected by an ASR baseline word error rate of 100.69% (p=0.0311 for utterance-length effects). Despite this heterogeneity, the proposed model achieved consistent multimodal alignment, with audio-to-text and text-to-audio retrieval Recall@10 of 0.523 and 0.520, respectively, and median ranks of 9.00 and 10.00.
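As a rough illustration of how Recall@K and median rank are commonly computed from a similarity matrix, the following sketch may help. It is not the authors' evaluation code: the function names and the toy similarity values are hypothetical, and ties are resolved in favor of the true match, which is an assumption. Row i is taken to be an audio query and column i its paired text, so the transpose gives the text-to-audio direction.

```python
import statistics

def ranks_of_true_match(sim):
    """1-based rank of the diagonal entry in each row (higher score is better).

    Ties are counted in favor of the true match (an assumed convention).
    """
    ranks = []
    for i, row in enumerate(sim):
        better = sum(1 for s in row if s > row[i])
        ranks.append(better + 1)
    return ranks

def recall_at_k(ranks, k):
    """Fraction of queries whose true match is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

def transpose(sim):
    """Swap query and candidate roles to evaluate the reverse direction."""
    return [list(col) for col in zip(*sim)]

# Hypothetical 3x3 audio-by-text similarity matrix (values are made up).
sim = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.5],
    [0.1, 0.7, 0.4],
]
a2t_ranks = ranks_of_true_match(sim)             # audio-to-text
t2a_ranks = ranks_of_true_match(transpose(sim))  # text-to-audio
a2t_recall_at_2 = recall_at_k(a2t_ranks, 2)
a2t_median_rank = statistics.median(a2t_ranks)
t2a_median_rank = statistics.median(t2a_ranks)
```

Under this convention, a Recall@10 of about 0.52 means that roughly half of the queries retrieve their paired item within the top ten candidates.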
Conclusion: The findings indicate that contrastive alignment can produce robust multimodal speech representations under diverse reading patterns. Although clinical validation using pathological speech remains necessary, this work establishes a proof-of-concept basis for future voice assessment applications leveraging retrieval-based embedding analysis.