Audio–Text Contrastive Representation Learning for Voice Assessment: Toward Assistive Clinical Applications

Kwang Hyeon KIM; Yoonkyoung SO

Authors: Kwang Hyeon KIM¹; Yoonkyoung SO
Author Information

1. Clinical Research Support Center, Inje University Ilsan Paik Hospital, Goyang, Korea
Publication Type: Original Article
From: Journal of the Korean Society of Laryngology Phoniatrics and Logopedics 2026;37(1):33-42
Country: Republic of Korea
Language: Korean
Abstract:
Background and Objectives: This study aimed to develop and validate a bidirectional audio–text contrastive representation learning framework that strengthens the alignment between linguistic content and speech production characteristics, and to explore its potential utility for future clinical voice assessment applications.
Materials and Method: A dual-encoder multimodal contrastive model was trained on 12,854 Korean speech–text pairs. Audio inputs were converted into 80-channel Mel spectrograms and augmented with SpecAugment, and text was tokenized with a KoBERT-based tokenizer. Joint embeddings were optimized with a bidirectional cosine-similarity InfoNCE loss over 100 epochs. Alignment between paired inputs was quantified by retrieval-based evaluation.
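As an illustration of the bidirectional cosine-similarity InfoNCE loss named in the Materials and Method, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes that the i-th audio embedding and the i-th text embedding in a batch form the positive pair, that the two retrieval directions are averaged with equal weight, and that the temperature is 0.07; none of these settings is reported in the abstract, and all function names are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so that dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_similarity_matrix(audio_emb, text_emb):
    """Entry [i][j] is the cosine similarity of audio item i and text item j."""
    audio = [l2_normalize(v) for v in audio_emb]
    text = [l2_normalize(v) for v in text_emb]
    return [[sum(a * t for a, t in zip(ai, tj)) for tj in text] for ai in audio]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class of row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the row maximum for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def bidirectional_info_nce(audio_emb, text_emb, temperature=0.07):
    """Average of the audio-to-text and text-to-audio InfoNCE terms.

    temperature=0.07 and the equal weighting are assumptions for illustration.
    """
    sim = cosine_similarity_matrix(audio_emb, text_emb)
    audio_to_text = [[s / temperature for s in row] for row in sim]
    text_to_audio = [list(col) for col in zip(*audio_to_text)]  # transpose
    return 0.5 * (diagonal_cross_entropy(audio_to_text)
                  + diagonal_cross_entropy(text_to_audio))


if __name__ == "__main__":
    audio = [[1.0, 0.0], [0.0, 1.0]]
    matched_text = [[1.0, 0.0], [0.0, 1.0]]
    swapped_text = [[0.0, 1.0], [1.0, 0.0]]
    print(bidirectional_info_nce(audio, matched_text))   # small: pairs are aligned
    print(bidirectional_info_nce(audio, swapped_text))   # large: pairs are misaligned
```

The loss is close to zero when each audio embedding is most similar to its own transcript and grows when the pairing is broken, which is the property that training over many epochs exploits.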
Results: The dataset had a mean utterance duration of 3.60±1.01 seconds and a mean length of 4.99±1.73 words, with high variability in phonetic realizations reflected by an ASR baseline word error rate of 100.69% (p=0.0311 for utterance-length effects). Despite this heterogeneity, the proposed model achieved consistent multimodal alignment, with audio-to-text and text-to-audio retrieval Recall@10 of 0.523 and 0.520, respectively, and median ranks of 9.00 and 10.00.
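To make the reported retrieval metrics concrete, the sketch below shows one common way to compute Recall@K and median rank from a similarity matrix. It is an illustration under stated assumptions rather than the evaluation code used in the study: it assumes that row i and column i of the matrix belong to the same speech–text pair, and that ties are resolved optimistically (only strictly higher scores push the true match down). The function names and the toy similarity values are hypothetical.

```python
from statistics import median


def rank_of_match(similarities, true_index):
    """1-based rank of the true item: one plus the count of strictly higher scores."""
    target = similarities[true_index]
    return 1 + sum(1 for s in similarities if s > target)


def retrieval_metrics(sim, k=10):
    """Return (Recall@k, median rank) for queries along the rows of sim."""
    ranks = [rank_of_match(row, i) for i, row in enumerate(sim)]
    recall_at_k = sum(1 for r in ranks if r <= k) / len(ranks)
    return recall_at_k, median(ranks)


def transpose(matrix):
    """Swap rows and columns so the other modality becomes the query side."""
    return [list(col) for col in zip(*matrix)]


if __name__ == "__main__":
    # Toy 3x3 matrix: rows are audio queries, columns are text candidates.
    sim = [
        [0.9, 0.1, 0.2],
        [0.8, 0.3, 0.5],
        [0.1, 0.2, 0.7],
    ]
    print(retrieval_metrics(sim, k=1))             # audio-to-text
    print(retrieval_metrics(transpose(sim), k=1))  # text-to-audio
```

Under this definition, a Recall@10 of about 0.52 means that the correct counterpart appears among the ten highest-scoring candidates for roughly half of the queries, and a median rank of 9 or 10 means that half of the queries place the correct counterpart at that position or better.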
Conclusion: The findings indicate that contrastive alignment can produce robust multimodal speech representations under diverse reading patterns. Although clinical validation with pathological speech remains necessary, this work provides a proof-of-concept basis for future voice assessment applications that use retrieval-based embedding analysis.