Efficient Adaptation of Vision–Language Models for Artistic Metadata Generation
This thesis specializes Vision–Language Models (VLMs) for multimodal artistic metadata generation through parameter-efficient fine-tuning and knowledge distillation.
Requirements
- M.Sc. in Machine Learning, Data Science, Computer Science, Artificial Intelligence, Mathematics, or similar
- Knowledge of Python, with a focus on deep learning frameworks, particularly PyTorch
- Software development and experimentation skills
- Basic concepts of natural language processing
- Basic concepts of multimodal learning and Transformer architectures
- Basic concepts of linear algebra, probability, and statistics
Description
Automatic understanding and cataloguing of artworks is a critical task in digital humanities, museums, and cultural heritage archives. Multimodal datasets combining painting images with textual descriptions and structured metadata enable the development of intelligent systems capable of assisting curators and researchers.
Recent Vision–Language Models (VLMs) have demonstrated strong multimodal reasoning abilities across tasks such as visual question answering, captioning, and image–text retrieval. However, large-scale VLMs are computationally expensive and difficult to deploy in domain-specific or resource-constrained settings.
Two promising strategies to adapt these models efficiently are parameter-efficient fine-tuning and knowledge distillation.
Low-Rank Adaptation (LoRA) fine-tunes large Transformer models by injecting small trainable low-rank matrices into selected layers while keeping the pretrained weights frozen, drastically reducing the number of trainable parameters and the training cost.
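As an illustration, the sketch below applies the LoRA update to a single linear layer; the class name, rank r, and scaling factor alpha are illustrative choices, not part of the thesis specification.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update:
    y = W x + (alpha / r) * B A x. Only A and B are trained."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze the pretrained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```

In practice, a library such as Hugging Face peft would inject equivalent adapters into the attention projections of the VLM automatically; the manual version above only shows where the trainable parameters live.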
Knowledge distillation, by contrast, transfers knowledge from a large “teacher” model to a smaller “student” model by training the latter to mimic the teacher’s predictions, yielding compact models with competitive performance.
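A minimal sketch of the standard distillation objective, assuming classification-style logits (for a generative VLM, token-level logits would first be flattened over the sequence); the temperature T and mixing weight alpha are illustrative hyperparameters:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      T: float = 2.0,
                      alpha: float = 0.5) -> torch.Tensor:
    """Mix a soft-target KL term (mimic the teacher) with hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # T^2 keeps gradient magnitudes comparable (Hinton et al., 2015)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```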
The objective of this thesis is to compare these two adaptation paradigms for artistic metadata generation, where the system receives an artwork image and textual queries (e.g., “Who is the main author of this painting?”) and must produce accurate metadata annotations.
Experimental Setup
The thesis will develop and evaluate three main experimental configurations:
- Zero-Shot Evaluation (a minimal querying sketch follows this list)
  - Large VLM baseline
  - Small VLM baseline
- LoRA Fine-Tuning
  - Apply LoRA adapters to the small VLM
- Knowledge Distillation
  - Use the large VLM as teacher and distill knowledge into the small VLM
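To make the zero-shot configuration concrete, the following sketch queries a small open VLM with an artwork image and a metadata question via Hugging Face Transformers; the checkpoint, prompt format, and file path are placeholders, not the models selected for the thesis.

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Checkpoint is a placeholder: any open VLM with a text-generation head would do.
model_id = "Salesforce/blip2-opt-2.7b"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = Blip2Processor.from_pretrained(model_id)
model = Blip2ForConditionalGeneration.from_pretrained(model_id).to(device)

image = Image.open("painting.jpg")  # placeholder path to an artwork image
prompt = "Question: Who is the main author of this painting? Answer:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(device)
generated = model.generate(**inputs, max_new_tokens=30)
print(processor.batch_decode(generated, skip_special_tokens=True)[0].strip())
```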
For the evaluation, we will use a dataset of multimodal art records combining painting images, textual descriptions, and structured metadata such as author, artistic movement, period, and technique.
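The field names below are an assumption inferred from the metadata listed above; the sketch shows how one record could be flattened into question–answer supervision pairs usable for both fine-tuning and evaluation.

```python
from dataclasses import dataclass

@dataclass
class ArtRecord:
    """Illustrative record schema; actual dataset fields may differ."""
    image_path: str
    description: str
    author: str
    movement: str
    period: str
    technique: str

def to_qa_pairs(rec: ArtRecord) -> list[tuple[str, str]]:
    """Turn one record into (question, answer) pairs for VQA-style training."""
    return [
        ("Who is the main author of this painting?", rec.author),
        ("Which artistic movement does this painting belong to?", rec.movement),
        ("In which period was this painting created?", rec.period),
        ("What technique was used for this painting?", rec.technique),
    ]
```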
Main Activities
The main activities of the thesis include:
- Reviewing the literature on Vision–Language Models, parameter-efficient fine-tuning, and knowledge distillation
- Analyzing multimodal datasets for artistic and cultural heritage metadata
- Designing experimental pipelines for zero-shot, LoRA fine-tuned, and distilled models
- Implementing training and evaluation workflows
- Evaluating models on visual question answering and metadata generation tasks (a minimal accuracy metric is sketched after this list)
- Analyzing, visualizing, and summarizing experimental results in terms of accuracy, efficiency, and scalability
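As one concrete example of the accuracy analysis, a normalized exact-match metric, common in VQA-style evaluation, could look like this; the normalization rules are an assumption, and fuzzier variants exist.

```python
def exact_match(pred: str, gold: str) -> bool:
    """Case- and whitespace-insensitive exact match; normalization is illustrative."""
    normalize = lambda s: " ".join(s.lower().strip().split())
    return normalize(pred) == normalize(gold)

def accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that exactly match their reference after normalization."""
    return sum(exact_match(p, g) for p, g in zip(predictions, references)) / len(references)
```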