Efficient Adaptation of Vision–Language Models for Artistic Metadata Generation
This thesis specializes Vision–Language Models (VLMs) for multimodal artistic metadata generation through parameter-efficient fine-tuning and knowledge distillation.
Requirements
- M.Sc. in Machine Learning, Data Science, Computer Science, Artificial Intelligence, Mathematics, or similar
- Knowledge of Python, with a focus on deep learning frameworks, particularly PyTorch
- Software development and experimentation skills
- Basic concepts of natural language processing
- Basic concepts of multimodal learning and Transformer architectures
- Basic concepts of linear algebra, probability, and statistics
Description
Automatic understanding and cataloguing of artworks is a critical task in digital humanities, museums, and cultural heritage archives. Multimodal datasets combining painting images with textual descriptions and structured metadata enable the development of intelligent systems capable of assisting curators and researchers.
Recent Vision–Language Models (VLMs) have demonstrated strong multimodal reasoning abilities across tasks such as visual question answering, captioning, and image–text retrieval. However, large-scale VLMs are computationally expensive and difficult to deploy in domain-specific or resource-constrained settings.
Two promising strategies to adapt these models efficiently are parameter-efficient fine-tuning and knowledge distillation.
Low-Rank Adaptation (LoRA) fine-tunes large Transformer models by injecting small trainable low-rank matrices into selected layers while keeping the pretrained weights frozen, drastically reducing the number of trainable parameters and the training cost.
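As an illustration, the sketch below applies the LoRA update to a single linear layer; the class name, rank r, and scaling factor alpha are illustrative choices, not part of the thesis specification.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update:
    y = W x + (alpha / r) * B A x. Only A and B are trained."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze the pretrained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```

In practice, a library such as Hugging Face peft would inject equivalent adapters into the attention projections of the VLM automatically; the manual version above only shows where the trainable parameters live.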
Knowledge distillation, by contrast, transfers knowledge from a large “teacher” model to a smaller “student” model by training the latter to mimic the teacher’s predictions, yielding compact models with competitive performance.
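A minimal sketch of the standard distillation objective, assuming classification-style logits (for a generative VLM, token-level logits would first be flattened over the sequence); the temperature T and mixing weight alpha are illustrative hyperparameters:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      T: float = 2.0,
                      alpha: float = 0.5) -> torch.Tensor:
    """Mix a soft-target KL term (mimic the teacher) with hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # T^2 keeps gradient magnitudes comparable (Hinton et al., 2015)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```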
The objective of this thesis is to compare these two adaptation paradigms for artistic metadata generation, where the system receives an artwork image and textual queries (e.g., “Who is the main author of this painting?”) and must produce accurate metadata annotations.
Experimental Setup
The thesis will develop and evaluate three main experimental configurations:
- Zero-Shot Evaluation (a minimal querying sketch follows this list)
  - Large VLM baseline
  - Small VLM baseline
- LoRA Fine-Tuning
  - Apply LoRA adapters to the small VLM
- Knowledge Distillation
  - Use the large VLM as teacher and distill knowledge into the small VLM
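To make the zero-shot configuration concrete, the following sketch queries a small open VLM with an artwork image and a metadata question via Hugging Face Transformers; the checkpoint, prompt format, and file path are placeholders, not the models selected for the thesis.

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Checkpoint is a placeholder: any open VLM with a text-generation head would do.
model_id = "Salesforce/blip2-opt-2.7b"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = Blip2Processor.from_pretrained(model_id)
model = Blip2ForConditionalGeneration.from_pretrained(model_id).to(device)

image = Image.open("painting.jpg")  # placeholder path to an artwork image
prompt = "Question: Who is the main author of this painting? Answer:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(device)
generated = model.generate(**inputs, max_new_tokens=30)
print(processor.batch_decode(generated, skip_special_tokens=True)[0].strip())
```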
For the evaluation, we will use a dataset of multimodal art records combining painting images, textual descriptions, and structured metadata such as author, artistic movement, period, and technique.
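The field names below are an assumption inferred from the metadata listed above; the sketch shows how one record could be flattened into question–answer supervision pairs usable for both fine-tuning and evaluation.

```python
from dataclasses import dataclass

@dataclass
class ArtRecord:
    """Illustrative record schema; actual dataset fields may differ."""
    image_path: str
    description: str
    author: str
    movement: str
    period: str
    technique: str

def to_qa_pairs(rec: ArtRecord) -> list[tuple[str, str]]:
    """Turn one record into (question, answer) pairs for VQA-style training."""
    return [
        ("Who is the main author of this painting?", rec.author),
        ("Which artistic movement does this painting belong to?", rec.movement),
        ("In which period was this painting created?", rec.period),
        ("What technique was used for this painting?", rec.technique),
    ]
```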
Main Activities
The main activities of the thesis include:
- Reviewing the literature on Vision–Language Models, parameter-efficient fine-tuning, and knowledge distillation
- Analyzing multimodal datasets for artistic and cultural heritage metadata
- Designing experimental pipelines for zero-shot, LoRA fine-tuned, and distilled models
- Implementing training and evaluation workflows
- Evaluating models on visual question answering and metadata generation tasks (a minimal accuracy metric is sketched after this list)
- Analyzing, visualizing, and summarizing experimental results in terms of accuracy, efficiency, and scalability
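As one concrete example of the accuracy analysis, a normalized exact-match metric, common in VQA-style evaluation, could look like this; the normalization rules are an assumption, and fuzzier variants exist.

```python
def exact_match(pred: str, gold: str) -> bool:
    """Case- and whitespace-insensitive exact match; normalization is illustrative."""
    normalize = lambda s: " ".join(s.lower().strip().split())
    return normalize(pred) == normalize(gold)

def accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that exactly match their reference after normalization."""
    return sum(exact_match(p, g) for p, g in zip(predictions, references)) / len(references)
```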