Audio–Text Contrastive Representation Learning for Voice Assessment: Toward Assistive Clinical Applications
Kwang Hyeon KIM ; Yoonkyoung SO
Journal of the Korean Society of Laryngology Phoniatrics and Logopedics 2026;37(1):33-42
Background and Objectives:
This study aims to develop and validate a bidirectional audio–text contrastive representation learning framework that enhances alignment between linguistic content and speech production characteristics, thereby exploring its potential utility for future clinical voice assessment applications.
Materials and Method:
A dual-encoder multimodal contrastive model was trained using 12,854 Korean speech–text pairs. Audio inputs were converted into 80-channel Mel spectrograms with SpecAugment, and text was processed using a KoBERT-based tokenizer. Joint embeddings were optimized via a bidirectional cosine-similarity InfoNCE loss over 100 epochs. Retrieval-based evaluation quantified alignment performance between paired inputs.
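For readers unfamiliar with the objective, a bidirectional cosine-similarity InfoNCE loss can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the authors' implementation: the function names, the temperature value of 0.07, and the equal weighting of the two directions are assumptions, since the abstract does not report them.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products equal cosine similarity."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def log_softmax(logits):
    """Row-wise log-softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def bidirectional_info_nce(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch in which row i of each matrix is a true pair.

    temperature=0.07 is an assumed value, not one taken from the study.
    """
    a = l2_normalize(audio_emb)
    t = l2_normalize(text_emb)
    logits = a @ t.T / temperature           # (batch, batch) scaled cosine similarities
    idx = np.arange(len(a))                  # the matching pair sits on the diagonal
    loss_a2t = -log_softmax(logits)[idx, idx].mean()    # audio -> text direction
    loss_t2a = -log_softmax(logits.T)[idx, idx].mean()  # text -> audio direction
    return 0.5 * (loss_a2t + loss_t2a)       # equal weighting is an assumption
```

The loss is small when each audio embedding is most similar to its own transcript and large when pairs are mismatched; when all similarities are equal it reduces to the logarithm of the batch size.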
Results:
The dataset exhibited an average utterance duration of 3.60±1.01 seconds and an average length of 4.99±1.73 words, with high variability in phonetic realizations reflected by an ASR baseline word error rate of 100.69% (p=0.0311 for utterance-length effects). Despite such heterogeneity, the proposed model achieved consistent multimodal alignment, with audio-to-text and text-to-audio retrieval Recall@10 of 0.523 and 0.520, respectively, and median ranks of 9.00 and 10.00.
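The retrieval metrics reported above can be computed from paired embeddings roughly as follows. This is a minimal sketch under stated assumptions rather than the study's evaluation code: the function name `retrieval_metrics` is hypothetical, row i of the query and gallery matrices is assumed to be a true pair, and ties are resolved optimistically (only strictly higher similarities push the true match down).

```python
import numpy as np

def retrieval_metrics(query_emb, gallery_emb, k=10):
    """Recall@k and median rank for paired embeddings (row i matches row i)."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    sim = q @ g.T                              # cosine similarity, queries x gallery
    true_sim = np.diag(sim)[:, None]           # similarity to the correct item
    # Rank of the correct item: 1 + number of gallery items scored strictly higher.
    ranks = 1 + (sim > true_sim).sum(axis=1)
    return {
        "recall_at_k": float((ranks <= k).mean()),
        "median_rank": float(np.median(ranks)),
    }
```

Calling the function once with audio embeddings as queries and once with text embeddings as queries gives the audio-to-text and text-to-audio figures separately. A Recall@10 near 0.52 means the correct counterpart appears among the top ten candidates for roughly half of the queries.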
Conclusion:
The findings indicate that contrastive alignment can produce robust multimodal speech representations under diverse reading patterns. Although clinical validation using pathological speech remains necessary, this work establishes a proof-of-concept basis for future voice assessment applications leveraging retrieval-based embedding analysis.
