Enhancing Multimodal RAG Systems through Cross-Modal Retrieval and Reranking
This thesis involves the implementation of a cross-modal retrieval and reranking pipeline.
Requirements
- M.Sc. in Machine Learning, Data Science, Computer Science, Mathematics, Telecommunications, or similar
- Knowledge of Python, with a focus on deep learning frameworks, particularly PyTorch
- Software development skills
- Basic concepts of image processing and natural language processing
- Basic concepts of data science, including data analysis, data processing, and machine learning
- Basic concepts of linear algebra and statistics
Description
Cross-modal retrieval is the task of retrieving relevant information across different modalities, such as retrieving images based on text queries or vice versa. This is a key challenge in multimodal AI, requiring efficient and accurate retrieval mechanisms.
In natural language processing (NLP), a common approach to improving retrieval performance is a two-step pipeline consisting of retrieval followed by reranking. Rerankers encode the query and each candidate item jointly, which makes them more effective than standalone retrievers; however, because a joint score depends on the specific query-candidate pair, scores cannot be precomputed, and applying the reranker to an entire downstream dataset is computationally impractical. To balance efficiency and effectiveness, the reranker is typically applied on top of an encoder, refining only the top-N most similar candidates retrieved by the encoder.
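As a minimal sketch of this two-stage pipeline (illustrative only, not a deliverable of the thesis), the snippet below uses placeholder callables `encode_query`, `encode_item`, and `joint_score` standing in for the two bi-encoder branches and the joint reranker:

```python
import torch
import torch.nn.functional as F

def retrieve_then_rerank(query, candidates, encode_query, encode_item,
                         joint_score, top_n=100):
    # Stage 1: embed query and candidates independently (bi-encoder).
    # Candidate embeddings depend only on the corpus, so in practice
    # they would be precomputed once and cached.
    q = F.normalize(encode_query(query), dim=-1)
    c = F.normalize(torch.stack([encode_item(x) for x in candidates]), dim=-1)
    sims = c @ q  # cosine similarity of the query to every candidate
    top_n = min(top_n, len(candidates))
    _, top_idx = sims.topk(top_n)

    # Stage 2: run the expensive joint scorer only on the top-N shortlist.
    rerank = torch.tensor([joint_score(query, candidates[i]) for i in top_idx])
    order = rerank.argsort(descending=True)
    return [candidates[top_idx[i]] for i in order]
```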
This approach can be adapted to cross-modal retrieval by using a dual-branch Vision-Language Model (VLM) such as CLIP as the encoder, and joint multimodal models such as UNITER and OSCAR as rerankers.
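For the encoder stage, CLIP's two branches embed text and images into a shared space, so retrieval reduces to a similarity search over precomputed embeddings. A sketch using the Hugging Face transformers implementation of CLIP (the checkpoint name is one public example; any dual-branch CLIP variant works the same way):

```python
import torch
from transformers import CLIPModel, CLIPProcessor

# Public CLIP checkpoint; other CLIP variants expose the same interface.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def embed_texts(texts):
    # Tokenize the captions and run only the text branch.
    inputs = processor(text=texts, return_tensors="pt", padding=True)
    return model.get_text_features(**inputs)

@torch.no_grad()
def embed_images(images):
    # `images` is a list of PIL images; run only the vision branch.
    inputs = processor(images=images, return_tensors="pt")
    return model.get_image_features(**inputs)
```

In the pipeline sketch above, these two functions would play the roles of `encode_query` and `encode_item`, while a joint multimodal model such as UNITER or OSCAR would replace `joint_score` with a forward pass over each text-image pair in the shortlist.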
The objective of this thesis is to explore the use of multimodal encoders as rerankers for cross-modal retrieval, aiming to improve performance compared to encoder-only solutions.
The main activities of this thesis include:
- Exploring the literature on dual-branch VLM encoders (e.g., CLIP) and joint multimodal encoders such as UNITER, FLAVA, and ALBEF.
- Developing an experimental setup for cross-modal retrieval and reranking.
- Collecting datasets for cross-modal retrieval (e.g., MSCOCO, Flickr30k) and optionally for Visual Question Answering (VQA).
- Comparing encoder-only retrieval with encoder + reranking approaches on the cross-modal retrieval task, and optionally on VQA, using standard retrieval metrics such as Recall@K (see the sketch after this list).
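Recall@K, the fraction of queries whose relevant item appears among the top K results, is the standard metric on MSCOCO and Flickr30k. The helper below is a simplified sketch assuming a single relevant item per query; the actual benchmarks pair each image with several captions, which requires a small generalization:

```python
import torch

def recall_at_k(ranked_indices, gold_indices, k=5):
    # ranked_indices: (num_queries, num_items) tensor of item indices,
    #   best first, as produced by either pipeline under comparison.
    # gold_indices: (num_queries,) index of the relevant item per query.
    hits = (ranked_indices[:, :k] == gold_indices.unsqueeze(1)).any(dim=1)
    return hits.float().mean().item()
```

Comparing the two systems then amounts to computing Recall@1/5/10 over the same query set, once from the CLIP similarities alone and once after reranking the top-N shortlist with the joint model.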