Improving Zero-Shot Classification in Vision-Language Models by Bridging the Modality Gap
A comparison of existing methods for reducing the modality gap in pre-trained vision-language models with new approaches aimed at improving performance on downstream tasks (such as zero-shot classification).
Requirements
- M.Sc. in Machine Learning, Data Science, Computer Science, Mathematics, Telecommunications, or similar
- Knowledge of Python, with a focus on deep learning frameworks, particularly PyTorch
- Software development skills
- Basic concepts of image processing and natural language processing
- Basic concepts of data science, including data analysis, data processing, and machine learning
- Basic concepts of linear algebra and statistics
Description
The emergence of large-scale pre-trained Vision-Language Models (VLMs) such as CLIP has revolutionized multimodal representation learning: these models excel at bridging the semantic gap between images and text. Despite these advances, a critical issue known as the Modality Gap, most prominently observed in CLIP, remains only partially understood. The gap refers to the misalignment between image and text representations: even for semantically matching pairs, the embeddings of the two modalities occupy distinct regions of the shared embedding space. Recent works have explored how reducing this gap during fine-tuning can improve the zero-shot classification performance of VLMs. However, there is still no widely accepted method for closing it, and the path to new methodologies remains open. Moreover, VLMs have been adapted to tasks beyond classification, where the impact of Modality Gap reduction is underexplored.
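As a concrete illustration of the phenomenon, a common proxy for the Modality Gap is the Euclidean distance between the centroids of the L2-normalized image and text embeddings. The sketch below is not a method from the thesis; it only shows how such a measurement could be set up, assuming the Hugging Face transformers implementation of CLIP, the openai/clip-vit-base-patch32 checkpoint, and dummy inputs (in practice the embeddings would come from a captioned or labeled dataset).

```python
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load a pre-trained CLIP model; the checkpoint name is an illustrative choice,
# any CLIP variant with a shared image-text embedding space works the same way.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

# Dummy paired inputs; in practice these would come from a benchmark dataset.
texts = ["a photo of a dog", "a photo of a cat"]
images = [Image.fromarray(np.uint8(np.random.rand(224, 224, 3) * 255)) for _ in texts]

inputs = processor(text=texts, images=images, return_tensors="pt", padding=True)

with torch.no_grad():
    image_embeds = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_embeds = model.get_text_features(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
    )

# L2-normalize so both modalities lie on the unit hypersphere, as in CLIP's loss.
image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)

# One common proxy for the Modality Gap: distance between the modality centroids.
gap = (image_embeds.mean(dim=0) - text_embeds.mean(dim=0)).norm().item()
print(f"Modality gap (centroid distance): {gap:.4f}")
```

With real image-caption pairs, this centroid distance is typically clearly nonzero for off-the-shelf CLIP, which is the effect the thesis aims to reduce and relate to downstream performance.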
This thesis aims to explore new methods to address the Modality Gap, comparing them to existing methodologies on downstream benchmarks (e.g., classification datasets).
The main activities of the thesis include:
- Review existing pre-trained VLMs (e.g., CLIP, ALIGN, FLAVA) and methodologies for addressing the Modality Gap.
- Set up an appropriate testing framework, including classification datasets (a minimal zero-shot evaluation sketch is given after this list).
- Develop new methods to reduce the Modality Gap.
- Compare the new methods with existing ones in terms of classification performance and feature quality.
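To make the testing framework concrete, the snippet below sketches standard CLIP zero-shot classification: each class name is wrapped in a prompt, both prompts and images are embedded, and each image is assigned the class whose text embedding is most similar. The class names, prompt template, and checkpoint are placeholders; the actual benchmarks and prompts would be chosen during the thesis.

```python
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

# Placeholder class names and prompt template; in practice these come from the
# benchmark datasets (e.g., CIFAR-100, ImageNet) and a prompt-engineering choice.
class_names = ["dog", "cat", "car"]
prompts = [f"a photo of a {c}" for c in class_names]

@torch.no_grad()
def zero_shot_predict(pil_images):
    """Return the predicted class index for each PIL image."""
    inputs = processor(text=prompts, images=pil_images, return_tensors="pt", padding=True)
    image_embeds = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_embeds = model.get_text_features(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
    )
    image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
    text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
    # Cosine similarity between each image and each class prompt.
    logits = image_embeds @ text_embeds.T
    return logits.argmax(dim=-1)

# Example usage with a dummy image; in the evaluation framework, accuracy would
# be computed over the full test split of each classification dataset.
dummy = Image.fromarray(np.uint8(np.random.rand(224, 224, 3) * 255))
print(zero_shot_predict([dummy]))
```

The same embedding pipeline can be reused both before and after applying a gap-reduction method, so that changes in classification accuracy and in feature quality (e.g., the centroid-distance proxy above) can be compared on equal footing.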