Abstract: Large-scale image-text pre-trained models have shown promising transferability to various downstream tasks. Video-text retrieval benefits from it by transferring pre-trained CLIP to ...
Abstract: Foundational vision-language models (VLMs) like CLIP are redefining the vision domain with their exceptional generalization capabilities. Prompt-based learning methods adapt pre-trained VLMs ...