Modality Gap in Vision-Language Models (VLMs)

Literature review and benchmarking of Vision-Language Models (VLMs)

Requirements

  • M.Sc. in Machine Learning, Data Science, Computer Science, Mathematics, Telecommunications, or similar
  • Knowledge of Python
  • Software development skills
  • Basic concepts of image processing and language processing
  • Basic concepts of data science: data analysis, data processing, and machine learning

Description

The emergence of large-scale pretrained Vision-Language Models (VLMs) such as CLIP has revolutionized multimodal representation learning. These models excel at bridging the semantic gap between images and text. Despite these advances, a critical issue known as the Modality Gap, observed primarily in CLIP, remains largely unexplored in other VLM architectures. The gap refers to the misalignment between image and text representations: the two modalities' embeddings occupy largely separate regions of the shared embedding space, even for semantically matching image-text pairs. Although CLIP has brought this phenomenon to light, there is no widely accepted measure or standard benchmark dataset for its evaluation, though datasets such as MS COCO or Flickr30k can provide useful insights. This thesis aims to explore and address the Modality Gap in modern VLM architectures by devising methods for its quantification and visualization. The project comprises a review of large pretrained VLMs, with a focus on recent developments concerning the Modality Gap, and will encompass the following activities:

  (i) Dataset and model identification: identify relevant benchmark datasets and pretrained VLMs, such as CLIP, ALBEF, Florence, FLAVA, and others.
  (ii) Data analysis and pre-processing: pre-process the datasets to prepare them for Modality-Gap evaluation.
  (iii) Evaluation: implement evaluation metrics and visualization methods to quantify the Modality Gap across different VLM architectures, using frameworks such as PyTorch or Keras (a minimal example of one such metric is sketched below).
  (iv) Result analysis and visualization.
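
As a concrete starting point for activity (iii), the sketch below computes one simple gap measure used in the literature: the Euclidean distance between the centroids of L2-normalized image and text embeddings. It assumes the Hugging Face transformers implementation of CLIP; the checkpoint name and the `pairs` list of (PIL image, caption) tuples are placeholders to be replaced with the models and datasets selected in activities (i) and (ii).

    # Sketch: quantify the modality gap of a CLIP checkpoint on image-caption pairs.
    # Assumptions: `transformers` and `Pillow` are installed, and `pairs` is a small
    # list of (PIL.Image, caption) tuples loaded beforehand, e.g. from MS COCO.
    import torch
    from transformers import CLIPModel, CLIPProcessor

    model_name = "openai/clip-vit-base-patch32"   # placeholder; any CLIP checkpoint
    model = CLIPModel.from_pretrained(model_name).eval()
    processor = CLIPProcessor.from_pretrained(model_name)

    def modality_gap(pairs):
        images, captions = zip(*pairs)
        inputs = processor(text=list(captions), images=list(images),
                           return_tensors="pt", padding=True)
        with torch.no_grad():
            img = model.get_image_features(pixel_values=inputs["pixel_values"])
            txt = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
        # L2-normalize so both modalities lie on the unit hypersphere
        img = img / img.norm(dim=-1, keepdim=True)
        txt = txt / txt.norm(dim=-1, keepdim=True)
        # Gap = Euclidean distance between the two modality centroids
        return (img.mean(dim=0) - txt.mean(dim=0)).norm().item()

The same normalized embeddings can then be projected to 2D (e.g. with PCA or UMAP) to visualize how far apart the image and text clusters lie, which covers the visualization part of activities (iii) and (iv).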

Contacts