Small bowel video keyframe retrieval based on multi-modal contrastive learning.
10.7507/1001-5515.202304021
- Author:
Xing WU (1); Guoyin YANG (1); Jingwen LI (1); Jian ZHANG (2); Qun SUN (3); Xianhua HAN (4); Quan QIAN (1); Yanwei CHEN (5)
- Author Information:
1. School of Computer Engineering and Science, Shanghai University, Shanghai 200444, P. R. China.
2. Medical College of Shanghai University, Shanghai University, Shanghai 200444, P. R. China.
3. Gastroenterology, Shanghai Sixth People's Hospital Affiliated to Shanghai Jiaotong University, Shanghai 200233, P. R. China.
4. Faculty of Science, Yamaguchi University, Yamaguchi-Ken 753-8511, Japan.
5. College of Information Science and Engineering, Ritsumeikan University, Shiga-Ken 525-8577, Japan.
- Publication Type: Journal Article
- Keywords:
Contrastive learning;
Multi-modal learning;
Video keyframe retrieval
- MeSH:
Humans;
Video Recording;
Intestine, Small/diagnostic imaging*;
Machine Learning;
Image Processing, Computer-Assisted/methods*;
Algorithms
- From:
Journal of Biomedical Engineering
2025;42(2):334-342
- Country: China
- Language: Chinese
- Abstract:
Retrieving the keyframes most relevant to a text query from labeled small intestine videos allows pathological regions to be located efficiently and accurately. However, training directly on raw video data is extremely slow, while learning visual representations from image-text datasets leads to computational inconsistency. To address this challenge, a small bowel video keyframe retrieval framework based on multi-modal contrastive learning (KRCL) is proposed. The framework makes full use of the textual information in video category labels to learn video features closely related to the text, and it models temporal information within a pretrained image-text model. It transfers knowledge learned by image-text multi-modal models to the video domain, enabling interaction among medical video, image, and text data. Experiments on the Hyper-Kvasir dataset for gastrointestinal disease detection and the Microsoft Research video-to-text (MSR-VTT) retrieval dataset demonstrate the effectiveness and robustness of KRCL, which achieves state-of-the-art performance on nearly all evaluation metrics.
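The abstract does not give the KRCL architecture or loss, so the following is only a generic sketch of text-to-keyframe retrieval with a contrastive objective, in the style of common image-text models. It assumes that frame and text embeddings already exist as plain lists of floats. The function names (`retrieve_keyframes`, `contrastive_loss`), the temperature value of 0.07, and the toy vectors are illustrative assumptions, not details taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve_keyframes(frame_embeddings, text_embedding, top_k=3):
    """Rank frames by similarity to the text embedding; return (index, score) pairs."""
    scores = [cosine(f, text_embedding) for f in frame_embeddings]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in ranked[:top_k]]


def _mean_diagonal_cross_entropy(sim):
    """Mean of -log softmax(row)[i] over rows i, computed with a stable log-sum-exp."""
    losses = []
    for i, row in enumerate(sim):
        m = max(row)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        losses.append(log_sum_exp - row[i])
    return sum(losses) / len(losses)


def contrastive_loss(video_embeddings, text_embeddings, temperature=0.07):
    """Symmetric InfoNCE-style loss; pair i in each list is treated as the positive match.

    The temperature is an assumed, commonly used value, not one reported in the abstract.
    """
    sim = [[cosine(v, t) / temperature for t in text_embeddings]
           for v in video_embeddings]
    sim_transposed = [list(col) for col in zip(*sim)]
    return 0.5 * (_mean_diagonal_cross_entropy(sim)
                  + _mean_diagonal_cross_entropy(sim_transposed))


# Toy example: three 2-D "frame" embeddings and one "text" embedding.
frames = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
query = [1.0, 0.0]
top = retrieve_keyframes(frames, query, top_k=2)
print([index for index, _ in top])
```

In this sketch, retrieval is a nearest-neighbor search in a shared embedding space, and the loss pulls matched video-text pairs together while pushing mismatched pairs apart. How KRCL models temporal information inside the pretrained image-text model is not specified in the abstract and is not represented here.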