Investigating the Modality Gap in Multimodal Large Language Models
This thesis investigates the modality gap in Multimodal Large Language Models (MLLMs) on a chosen multimodal task.
Requirements
- M.Sc. in Machine Learning, Data Science, Computer Science, Mathematics, Telecommunications, or similar
- Knowledge of Python, with a focus on deep learning frameworks, particularly PyTorch
- Software development skills
- Basic concepts of image processing and natural language processing
- Basic concepts of data science, including data analysis, preprocessing, and machine learning
- Basic concepts of linear algebra and statistics
Description
Multimodal Large Language Models (MLLMs) extend traditional Large Language Models (LLMs) by enabling generative and reasoning tasks that require multimodal inputs, such as multimodal classification and Visual Question Answering (VQA). These models are typically built by integrating three components: a modality encoder that converts non-textual inputs (e.g., images) into token representations, a large language model that generates textual outputs, and a connector module that aligns the two modalities.
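To make this layout concrete, below is a minimal PyTorch sketch of the three-component design. The module choices, dimensions, and names (e.g., `Connector`, the placeholder patch encoder) are illustrative assumptions, not the architecture of LLaVA, Qwen2.5-VL, or any specific model.

```python
import torch
import torch.nn as nn

class Connector(nn.Module):
    """Projects visual tokens into the LLM's embedding space (here a simple MLP)."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        return self.proj(visual_tokens)

# Hypothetical stand-ins for a pretrained vision encoder and LLM embedding size.
vision_encoder = nn.Linear(3 * 16 * 16, 768)   # placeholder patch encoder
connector = Connector(vision_dim=768, llm_dim=4096)

patches = torch.randn(1, 196, 3 * 16 * 16)     # one image as 196 flattened patches
visual_tokens = vision_encoder(patches)        # (1, 196, 768): modality-encoder output
llm_inputs = connector(visual_tokens)          # (1, 196, 4096): tokens aligned to the LLM
# llm_inputs would then be concatenated with text embeddings and fed to the LLM.
```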
Despite this modular architecture, effectively bridging different modalities remains challenging. Like CLIP-style models, MLLMs may exhibit a modality gap, i.e., a semantic misalignment between visual and textual representations. While prior work has shown that reducing this gap improves zero-shot classification and cross-modal retrieval, its impact on Multimodal Large Language Models, particularly on image-to-text integration, remains largely unexplored.
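One common way to quantify the gap in CLIP-like embedding spaces is the Euclidean distance between the centroids of L2-normalized image and text embeddings. A minimal sketch, with random tensors standing in for real encoder outputs:

```python
import torch
import torch.nn.functional as F

def modality_gap(image_embs: torch.Tensor, text_embs: torch.Tensor) -> float:
    """Distance between modality centroids on the unit hypersphere."""
    img = F.normalize(image_embs, dim=-1)
    txt = F.normalize(text_embs, dim=-1)
    return (img.mean(dim=0) - txt.mean(dim=0)).norm().item()

# Toy data: 512 image and 512 text embeddings of dimension 768.
image_embs = torch.randn(512, 768)
text_embs = torch.randn(512, 768) + 0.5  # shifted to mimic a gap
print(f"modality gap: {modality_gap(image_embs, text_embs):.3f}")
```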
The objective of this thesis is to investigate whether and how the modality gap affects the performance of MLLMs when integrating visual and textual modalities, and to analyze its role in limiting model capabilities across downstream tasks.
The main activities of the thesis include:
- Reviewing existing literature on Multimodal Large Language Models, with a focus on architectures such as LLaVA and Qwen2.5-VL.
- Identifying a target task or application domain for evaluating MLLMs.
- Collecting and analyzing relevant datasets to study the presence and characteristics of the modality gap in MLLMs.
- Applying and evaluating techniques aimed at mitigating the identified challenges, including modality gap reduction strategies (a simple baseline is sketched after this list).
- Analyzing, visualizing, and summarizing experimental results.
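As an illustration of one such reduction strategy, the sketch below translates the normalized image and text embeddings toward each other along the vector between their centroids, a simple baseline studied for CLIP-like spaces. The function name and the interpolation parameter `lam` are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def shift_along_gap(image_embs: torch.Tensor, text_embs: torch.Tensor, lam: float = 0.5):
    """Move each modality toward the other along the centroid-to-centroid vector.
    lam = 0 leaves the normalized embeddings unchanged; lam = 1 merges the centroids."""
    img = F.normalize(image_embs, dim=-1)
    txt = F.normalize(text_embs, dim=-1)
    gap = img.mean(dim=0) - txt.mean(dim=0)
    return img - (lam / 2) * gap, txt + (lam / 2) * gap
```

Whether such embedding-level interventions translate into downstream gains for MLLMs is precisely the kind of question this thesis is meant to examine.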