Improving Zero-Shot Classification in Vision-Language Models by Bridging the Modality Gap
A comparison of existing methods for reducing the modality gap in pre-trained vision-language models with new approaches aimed at improving performance on downstream tasks (such as zero-shot classification).
Requirements
- M.Sc. in Machine Learning, Data Science, Computer Science, Mathematics, Telecommunications, or similar
- Knowledge of Python, with a focus on deep learning frameworks, particularly PyTorch
- Software development skills
- Basic concepts of image processing and natural language processing
- Basic concepts of data science, including data analysis, data processing, and machine learning
- Basic concepts of linear algebra and statistics
Description
The emergence of large-scale pre-trained Vision-Language Models (VLMs) such as CLIP has revolutionized multimodal representation learning: these models excel at bridging the semantic gap between images and text. Despite these advances, a critical issue known as the Modality Gap, most prominently observed in CLIP, remains only partially understood. The gap refers to the misalignment between image and text representations: even for semantically matching pairs, the embeddings of the two modalities occupy distinct regions of the shared embedding space. Recent works have explored how reducing this gap during fine-tuning can improve the zero-shot classification performance of VLMs. However, there is still no widely accepted method for closing it, and the path to new methodologies remains open. Moreover, VLMs have been adapted to tasks beyond classification, where the impact of Modality Gap reduction is underexplored.
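As a concrete illustration of the phenomenon, a common proxy for the Modality Gap is the Euclidean distance between the centroids of the L2-normalized image and text embeddings. The sketch below is not a method from the thesis; it only shows how such a measurement could be set up, assuming the Hugging Face transformers implementation of CLIP, the openai/clip-vit-base-patch32 checkpoint, and dummy inputs (in practice the embeddings would come from a captioned or labeled dataset).

```python
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load a pre-trained CLIP model; the checkpoint name is an illustrative choice,
# any CLIP variant with a shared image-text embedding space works the same way.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

# Dummy paired inputs; in practice these would come from a benchmark dataset.
texts = ["a photo of a dog", "a photo of a cat"]
images = [Image.fromarray(np.uint8(np.random.rand(224, 224, 3) * 255)) for _ in texts]

inputs = processor(text=texts, images=images, return_tensors="pt", padding=True)

with torch.no_grad():
    image_embeds = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_embeds = model.get_text_features(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
    )

# L2-normalize so both modalities lie on the unit hypersphere, as in CLIP's loss.
image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)

# One common proxy for the Modality Gap: distance between the modality centroids.
gap = (image_embeds.mean(dim=0) - text_embeds.mean(dim=0)).norm().item()
print(f"Modality gap (centroid distance): {gap:.4f}")
```

With real image-caption pairs, this centroid distance is typically clearly nonzero for off-the-shelf CLIP, which is the effect the thesis aims to reduce and relate to downstream performance.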
This thesis aims to explore new methods to address the Modality Gap, comparing them to existing methodologies on downstream benchmarks (e.g., classification datasets).
The main activities of the thesis include:
- Review existing pre-trained VLMs (e.g., CLIP, ALIGN, FLAVA) and methodologies for addressing the Modality Gap.
- Set up an appropriate testing framework, including classification datasets (a minimal zero-shot evaluation sketch is given after this list).
- Develop new methods to reduce the Modality Gap.
- Compare the new methods with existing ones in terms of classification performance and feature quality.
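To make the testing framework concrete, the snippet below sketches standard CLIP zero-shot classification: each class name is wrapped in a prompt, both prompts and images are embedded, and each image is assigned the class whose text embedding is most similar. The class names, prompt template, and checkpoint are placeholders; the actual benchmarks and prompts would be chosen during the thesis.

```python
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

# Placeholder class names and prompt template; in practice these come from the
# benchmark datasets (e.g., CIFAR-100, ImageNet) and a prompt-engineering choice.
class_names = ["dog", "cat", "car"]
prompts = [f"a photo of a {c}" for c in class_names]

@torch.no_grad()
def zero_shot_predict(pil_images):
    """Return the predicted class index for each PIL image."""
    inputs = processor(text=prompts, images=pil_images, return_tensors="pt", padding=True)
    image_embeds = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_embeds = model.get_text_features(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
    )
    image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
    text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
    # Cosine similarity between each image and each class prompt.
    logits = image_embeds @ text_embeds.T
    return logits.argmax(dim=-1)

# Example usage with a dummy image; in the evaluation framework, accuracy would
# be computed over the full test split of each classification dataset.
dummy = Image.fromarray(np.uint8(np.random.rand(224, 224, 3) * 255))
print(zero_shot_predict([dummy]))
```

The same embedding pipeline can be reused both before and after applying a gap-reduction method, so that changes in classification accuracy and in feature quality (e.g., the centroid-distance proxy above) can be compared on equal footing.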