
Sandro Pezzelle
Assistant Professor, University of Amsterdam, ILLC, Faculty of Science
Slides available here!
Title: Language-and-Vision Models: From Image-Language Alignment to Storytelling and Narration
Abstract: Multimodal models that combine language and vision have seen rapid progress, driven by advances in large-scale pretraining and transformer-based architectures. Traditionally, research in language-and-vision has focused on single-image inputs and direct alignment between visual and textual modalities. Recent advances in model capabilities open the door to more complex and communicatively naturalistic scenarios, such as generating language that interprets, abstracts, or enriches visual input, rather than merely describing it.
This tutorial provides an accessible yet in-depth overview of language-and-vision models, ranging from traditional modular pipelines to the latest end-to-end pretrained vision-language models (VLMs). We will introduce foundational concepts and architectures, then shift focus to recent approaches that generate complex, context-sensitive outputs, such as visual storytelling and event narration.
Special emphasis will be placed on evaluation: How do we assess outputs when there is no single ground truth? What metrics can capture narrative quality, coherence, or relevance to visual context? We will survey recent benchmarks, discuss open challenges, and present best practices for both automatic and human-centered evaluation.
Short bio: Sandro’s research focuses on the development, evaluation, and mechanistic analysis of AI models for natural language processing (NLP). Since his PhD at the University of Trento, he has been recognized as an expert in multimodal NLP, contributing to the creation of vision-language benchmarks and metrics (for image captioning, visual question answering, and visual dialogue), to model development, and to the in-depth analysis of human-like language and cognitive abilities. He is a faculty member of the European Laboratory for Learning and Intelligent Systems (ELLIS), a member of the Center for Explainable, Responsible, and Theory-Driven Artificial Intelligence (CERTAIN), and a board member of the ACL Special Interest Group on Computational Semantics (SIGSEM).