Publications
* denotes equal contribution
An up-to-date list is available on Google Scholar.
2025
- [arXiv] Image Intrinsic Scale Assessment: Bridging the Gap Between Quality and Resolution. V. Hosu*, L. Agnolucci*, D. Iso, and D. Saupe. 2025
Image Quality Assessment (IQA) measures and predicts perceived image quality by human observers. Although recent studies have highlighted the critical influence that variations in the scale of an image have on its perceived quality, this relationship has not been systematically quantified. To bridge this gap, we introduce the Image Intrinsic Scale (IIS), defined as the largest scale where an image exhibits its highest perceived quality. We also present the Image Intrinsic Scale Assessment (IISA) task, which involves subjectively measuring and predicting the IIS based on human judgments. We develop a subjective annotation methodology and create the IISA-DB dataset, comprising 785 image-IIS pairs annotated by experts in a rigorously controlled crowdsourcing study. Furthermore, we propose WIISA (Weak-labeling for Image Intrinsic Scale Assessment), a strategy that leverages how the IIS of an image varies with downscaling to generate weak labels. Experiments show that applying WIISA during the training of several IQA methods adapted for IISA consistently improves the performance compared to using only ground-truth labels. We will release the code, dataset, and pre-trained models upon acceptance.
@article{hosu2025image,
  title   = {Image Intrinsic Scale Assessment: Bridging the Gap Between Quality and Resolution},
  author  = {Hosu*, V. and Agnolucci*, L. and Iso, D. and Saupe, D.},
  year    = {2025},
}
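As a rough illustration of the weak-labeling idea in the entry above: the sketch below assumes the IIS is expressed as a relative scale in (0, 1] and that downscaling an image by a factor r shifts its intrinsic scale to roughly min(1, IIS / r). This relation is only my reading of the abstract, not the paper's exact rule, and the function name is hypothetical.

```python
# Hypothetical WIISA-style weak-label generation (assumed relation, not the authors' exact rule).
from PIL import Image


def weak_labels(image: Image.Image, iis: float, factors=(0.9, 0.8, 0.7)):
    """Yield (downscaled image, weak IIS label) pairs from one expert-annotated image."""
    w, h = image.size
    for r in factors:
        downscaled = image.resize((round(w * r), round(h * r)), Image.LANCZOS)
        yield downscaled, min(1.0, iis / r)  # assumption: downscaling by r rescales the IIS by 1/r
```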
- [ICLR] Cross the Gap: Exposing the Intra-modal Misalignment in CLIP via Modality Inversion. M. Mistretta*, A. Baldrati*, L. Agnolucci*, M. Bertini, and A. D. Bagdanov. In The Thirteenth International Conference on Learning Representations, 2025
Pre-trained multi-modal Vision-Language Models like CLIP are widely used off-the-shelf for a variety of applications. In this paper, we show that the common practice of individually exploiting the text or image encoders of these powerful multi-modal models is highly suboptimal for intra-modal tasks like image-to-image retrieval. We argue that this is inherently due to the CLIP-style inter-modal contrastive loss that does not enforce any intra-modal constraints, leading to what we call intra-modal misalignment. To demonstrate this, we leverage two optimization-based modality inversion techniques that map representations from their input modality to the complementary one without any need for auxiliary data or additional trained adapters. We empirically show that, in the intra-modal tasks of image-to-image and text-to-text retrieval, approaching these tasks inter-modally significantly improves performance with respect to intra-modal baselines on more than fifteen datasets. Additionally, we demonstrate that approaching a native inter-modal task (e.g. zero-shot image classification) intra-modally decreases performance, further validating our findings. Finally, we show that incorporating an intra-modal term in the pre-training objective or narrowing the modality gap between the text and image feature embedding spaces helps reduce the intra-modal misalignment. The code is publicly available at: https://github.com/miccunifi/Cross-the-Gap.
@inproceedings{mistretta2025cross,
  title     = {{Cross the Gap: Exposing the Intra-modal Misalignment in CLIP via Modality Inversion}},
  author    = {Mistretta*, M. and Baldrati*, A. and Agnolucci*, L. and Bertini, M. and Bagdanov, A. D.},
  booktitle = {The Thirteenth International Conference on Learning Representations},
  year      = {2025},
}
2024
- [ECCV Workshop] AIM 2024 Challenge on UHD Blind Photo Quality Assessment. V. Hosu, M. V. Conde, L. Agnolucci, N. Barman, S. Zadtootaghaj, and R. Timofte. In European Conference on Computer Vision, 2024
We introduce the AIM 2024 UHD-IQA Challenge, a competition to advance the No-Reference Image Quality Assessment (NR-IQA) task for modern, high-resolution photos. The challenge is based on the recently released UHD-IQA Benchmark Database, which comprises 6,073 UHD-1 (4K) images annotated with perceptual quality ratings from expert raters. Unlike previous NR-IQA datasets, UHD-IQA focuses on highly aesthetic photos of superior technical quality, reflecting the ever-increasing standards of digital photography. This challenge aims to develop efficient and effective NR-IQA models. Participants are tasked with creating novel architectures and training strategies to achieve high predictive performance on UHD-1 images within a computational budget of 50G MACs. This enables model deployment on edge devices and scalable processing of extensive image collections. Winners are determined based on a combination of performance metrics, including correlation measures (SRCC, PLCC, KRCC), absolute error metrics (MAE, RMSE), and computational efficiency (G MACs). To excel in this challenge, participants leverage techniques like knowledge distillation, low-precision inference, and multi-scale training. By pushing the boundaries of NR-IQA for high-resolution photos, the UHD-IQA Challenge aims to stimulate the development of practical models that can keep pace with the rapidly evolving landscape of digital photography. The innovative solutions emerging from this competition will have implications for various applications, from photo curation and enhancement to image compression.
@inproceedings{hosu2024aim,
  title     = {AIM 2024 Challenge on UHD Blind Photo Quality Assessment},
  author    = {Hosu, V. and Conde, M. V. and Agnolucci, L. and Barman, N. and Zadtootaghaj, S. and Timofte, R.},
  booktitle = {European Conference on Computer Vision},
  year      = {2024},
}
- [ECCV Workshop] UHD-IQA Benchmark Database: Pushing the Boundaries of Blind Photo Quality Assessment. V. Hosu, L. Agnolucci, O. Wiedemann, D. Iso, and D. Saupe. In European Conference on Computer Vision, 2024
We introduce a novel Image Quality Assessment (IQA) dataset comprising 6073 UHD-1 (4K) images, annotated at a fixed width of 3840 pixels. Contrary to existing No-Reference (NR) IQA datasets, ours focuses on highly aesthetic photos of high technical quality, filling a gap in the literature. The images, carefully curated to exclude synthetic content, are sufficiently diverse to train general NR-IQA models. The dataset is annotated with perceptual quality ratings obtained through a crowdsourcing study. Ten expert raters, comprising photographers and graphics artists, assessed each image at least twice in multiple sessions spanning several days, resulting in highly reliable labels. Annotators were rigorously selected based on several metrics, including self-consistency, to ensure their reliability. The dataset includes rich metadata with user and machine-generated tags from over 5,000 categories and popularity indicators such as favorites, likes, downloads, and views. With its unique characteristics, such as its focus on high-quality images, reliable crowdsourced annotations, and high annotation resolution, our dataset opens up new opportunities for advancing perceptual image quality assessment research and developing practical NR-IQA models that apply to modern photos. Our dataset is available at https://database.mmsp-kn.de/uhd-iqa-benchmark-database.html.
@inproceedings{hosu2024uhdiqa,
  title     = {UHD-IQA Benchmark Database: Pushing the Boundaries of Blind Photo Quality Assessment},
  author    = {Hosu, V. and Agnolucci, L. and Wiedemann, O. and Iso, D. and Saupe, D.},
  booktitle = {European Conference on Computer Vision},
  year      = {2024},
}
- [arXiv] iSEARLE: Improving Textual Inversion for Zero-Shot Composed Image Retrieval. L. Agnolucci*, A. Baldrati*, M. Bertini, and A. Del Bimbo. 2024
Given a query consisting of a reference image and a relative caption, Composed Image Retrieval (CIR) aims to retrieve target images visually similar to the reference one while incorporating the changes specified in the relative caption. The reliance of supervised methods on labor-intensive manually labeled datasets hinders their broad applicability. In this work, we introduce a new task, Zero-Shot CIR (ZS-CIR), that addresses CIR without the need for a labeled training dataset. We propose an approach named iSEARLE (improved zero-Shot composEd imAge Retrieval with textuaL invErsion) that involves mapping the visual information of the reference image into a pseudo-word token in CLIP token embedding space and combining it with the relative caption. To foster research on ZS-CIR, we present an open-domain benchmarking dataset named CIRCO (Composed Image Retrieval on Common Objects in context), the first CIR dataset where each query is labeled with multiple ground truths and a semantic categorization. The experimental results illustrate that iSEARLE obtains state-of-the-art performance on three different CIR datasets – FashionIQ, CIRR, and the proposed CIRCO – and two additional evaluation settings, namely domain conversion and object composition. The dataset, the code, and the model are publicly available at https://github.com/miccunifi/SEARLE.
@article{agnolucci2024isearle,
  title   = {iSEARLE: Improving Textual Inversion for Zero-Shot Composed Image Retrieval},
  author  = {Agnolucci*, L. and Baldrati*, A. and Bertini, M. and Del Bimbo, A.},
  year    = {2024},
}
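A minimal, self-contained sketch of the retrieval step described in the abstract above. CLIP features are stubbed with random tensors so the snippet runs standalone, the textual inversion network `phi` is a generic MLP, and the step that encodes "a photo of $ that {relative caption}" with the pseudo-token is only indicated in a comment; none of this reproduces the released implementation at https://github.com/miccunifi/SEARLE.

```python
# Sketch of a SEARLE/iSEARLE-style zero-shot composed retrieval step (illustrative only).
import torch
import torch.nn.functional as F

d = 512                                           # assumed CLIP feature / token-embedding size

# Textual inversion network: maps a CLIP image feature to a pseudo-word token embedding ($).
phi = torch.nn.Sequential(torch.nn.Linear(d, d), torch.nn.GELU(), torch.nn.Linear(d, d))

reference_feat = F.normalize(torch.randn(1, d), dim=-1)     # stand-in for the reference image feature
index_feats = F.normalize(torch.randn(1000, d), dim=-1)     # stand-ins for candidate image features

pseudo_token = phi(reference_feat)                          # "$": the reference image as a token embedding

# In the actual pipeline, "a photo of $ that {relative caption}" is run through CLIP's text
# encoder with $'s embedding replaced by `pseudo_token`; the resulting text feature is the query.
query_feat = F.normalize(pseudo_token, dim=-1)              # placeholder for that composed query feature

retrieved = (query_feat @ index_feats.T).topk(k=10, dim=-1).indices   # top-10 candidates by cosine similarity
```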
- [arXiv] Quality-Aware Image-Text Alignment for Opinion-Unaware Image Quality Assessment. L. Agnolucci, L. Galteri, and M. Bertini. 2024
No-Reference Image Quality Assessment (NR-IQA) focuses on designing methods to measure image quality in alignment with human perception when a high-quality reference image is unavailable. Most state-of-the-art NR-IQA approaches are opinion-aware, i.e. they require human annotations for training. This dependency limits their scalability and broad applicability. To overcome this limitation, we propose QualiCLIP (Quality-aware CLIP), a CLIP-based self-supervised opinion-unaware approach that does not require human opinions. In particular, we introduce a quality-aware image-text alignment strategy to make CLIP generate quality-aware image representations. Starting from pristine images, we synthetically degrade them with increasing levels of intensity. Then, we train CLIP to rank these degraded images based on their similarity to quality-related antonym text prompts. At the same time, we force CLIP to generate consistent representations for images with similar content and the same level of degradation. Our experiments show that the proposed method improves over existing opinion-unaware approaches across multiple datasets with diverse distortion types. Moreover, despite not requiring human annotations, QualiCLIP achieves excellent performance against supervised opinion-aware methods in cross-dataset experiments, thus demonstrating remarkable generalization capabilities. The code and the model are publicly available at https://github.com/miccunifi/QualiCLIP.
@article{agnolucci2024qualityaware,
  title   = {Quality-Aware Image-Text Alignment for Opinion-Unaware Image Quality Assessment},
  author  = {Agnolucci, L. and Galteri, L. and Bertini, M.},
  year    = {2024},
}
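For context, the snippet below sketches the generic antonym-prompt scoring step that the entry above builds on, using an off-the-shelf CLIP backbone via open_clip. It does not reproduce QualiCLIP's training procedure or released weights, and the prompt pair is illustrative.

```python
# Generic CLIP antonym-prompt quality scoring (illustrative; not the QualiCLIP model itself).
import torch
import torch.nn.functional as F
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

prompts = tokenizer(["Good photo.", "Bad photo."])           # quality-related antonym prompts


@torch.no_grad()
def quality_score(path: str) -> float:
    image = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    img_feat = F.normalize(model.encode_image(image), dim=-1)
    txt_feat = F.normalize(model.encode_text(prompts), dim=-1)
    probs = (100.0 * img_feat @ txt_feat.T).softmax(dim=-1)   # scaled cosine similarities -> probabilities
    return probs[0, 0].item()                                 # probability of "Good photo." as the score
```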
- [WACV, Oral] ARNIQA: Learning Distortion Manifold for Image Quality Assessment. L. Agnolucci, L. Galteri, M. Bertini, and A. Del Bimbo. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024
No-Reference Image Quality Assessment (NR-IQA) aims to develop methods to measure image quality in alignment with human perception without the need for a high-quality reference image. In this work, we propose a self-supervised approach named ARNIQA (leArning distoRtion maNifold for Image Quality Assessment) for modeling the image distortion manifold to obtain quality representations in an intrinsic manner. First, we introduce an image degradation model that randomly composes ordered sequences of consecutively applied distortions. In this way, we can synthetically degrade images with a large variety of degradation patterns. Second, we propose to train our model by maximizing the similarity between the representations of patches of different images distorted equally, despite varying content. Therefore, images degraded in the same manner correspond to neighboring positions within the distortion manifold. Finally, we map the image representations to the quality scores with a simple linear regressor, thus without fine-tuning the encoder weights. The experiments show that our approach achieves state-of-the-art performance on several datasets. In addition, ARNIQA demonstrates improved data efficiency, generalization capabilities, and robustness compared to competing methods. The code and the model are publicly available at https://github.com/miccunifi/ARNIQA.
@inproceedings{agnolucci2024arniqa,
  title     = {ARNIQA: Learning Distortion Manifold for Image Quality Assessment},
  author    = {Agnolucci, L. and Galteri, L. and Bertini, M. and Del Bimbo, A.},
  booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision},
  year      = {2024},
}
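A compact sketch of the kind of objective the abstract above describes, under my simplified reading: patches of different images degraded with the same distortion sequence act as positive pairs, and an InfoNCE-style loss pulls their representations together. The quality scores are then obtained with a simple linear regressor on the frozen features (e.g. ridge regression). The loss below is not the paper's exact formulation.

```python
# Simplified equal-distortion contrastive objective in the spirit of ARNIQA (not the paper's exact loss).
import torch
import torch.nn.functional as F


def equal_distortion_loss(z_a: torch.Tensor, z_b: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """z_a[i], z_b[i]: embeddings of two patches from *different* images degraded identically.
    Matching rows are positives; all other rows in the batch act as negatives."""
    z_a, z_b = F.normalize(z_a, dim=-1), F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.T / tau                     # pairwise cosine similarities
    targets = torch.arange(z_a.size(0))            # positives lie on the diagonal
    return F.cross_entropy(logits, targets)


loss = equal_distortion_loss(torch.randn(32, 128), torch.randn(32, 128))
```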
- [WACV] Reference-based Restoration of Digitized Analog Videotapes. L. Agnolucci, L. Galteri, M. Bertini, and A. Del Bimbo. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024
Analog magnetic tapes have been the main video data storage device for several decades. Videos stored on analog videotapes exhibit unique degradation patterns caused by tape aging and reader device malfunctioning that are different from those observed in film and digital video restoration tasks. In this work, we present a reference-based approach for the resToration of digitized Analog videotaPEs (TAPE). We leverage CLIP for zero-shot artifact detection to identify the cleanest frames of each video through textual prompts describing different artifacts. Then, we select the clean frames most similar to the input ones and employ them as references. We design a transformer-based Swin-UNet network that exploits both neighboring and reference frames via our Multi-Reference Spatial Feature Fusion (MRSFF) blocks. MRSFF blocks rely on cross-attention and attention pooling to take advantage of the most useful parts of each reference frame. To address the absence of ground truth in real-world videos, we create a synthetic dataset of videos exhibiting artifacts that closely resemble those commonly found in analog videotapes. Both quantitative and qualitative experiments show the effectiveness of our approach compared to other state-of-the-art methods. The code, the model, and the synthetic dataset are publicly available at https://github.com/miccunifi/TAPE.
@inproceedings{agnolucci2024reference,
  title     = {Reference-based Restoration of Digitized Analog Videotapes},
  author    = {Agnolucci, L. and Galteri, L. and Bertini, M. and Del Bimbo, A.},
  booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision},
  year      = {2024},
}
2023
- [ICCV] Zero-Shot Composed Image Retrieval with Textual Inversion. A. Baldrati*, L. Agnolucci*, M. Bertini, and A. Del Bimbo. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023
Composed Image Retrieval (CIR) aims to retrieve a target image based on a query composed of a reference image and a relative caption that describes the difference between the two images. The high effort and cost required for labeling datasets for CIR hamper the widespread usage of existing methods, as they rely on supervised learning. In this work, we propose a new task, Zero-Shot CIR (ZS-CIR), that aims to address CIR without requiring a labeled training dataset. Our approach, named zero-Shot composEd imAge Retrieval with textuaL invErsion (SEARLE), maps the visual features of the reference image into a pseudo-word token in CLIP token embedding space and integrates it with the relative caption. To support research on ZS-CIR, we introduce an open-domain benchmarking dataset named Composed Image Retrieval on Common Objects in context (CIRCO), which is the first dataset for CIR containing multiple ground truths for each query. The experiments show that SEARLE exhibits better performance than the baselines on the two main datasets for CIR tasks, FashionIQ and CIRR, and on the proposed CIRCO. The dataset, the code and the model are publicly available at https://github.com/miccunifi/SEARLE.
@inproceedings{baldrati2023zero,
  title     = {Zero-Shot Composed Image Retrieval with Textual Inversion},
  author    = {Baldrati*, A. and Agnolucci*, L. and Bertini, M. and Del Bimbo, A.},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  pages     = {15338--15347},
  year      = {2023},
}
- [ICCV Workshop] ECO: Ensembling Context Optimization for Vision-Language Models. L. Agnolucci*, A. Baldrati*, F. Todino, F. Becattini, M. Bertini, and A. Del Bimbo. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023
Image recognition has recently witnessed a paradigm shift, where vision-language models are now used to perform few-shot classification based on textual prompts. Among these, the CLIP model has shown remarkable capabilities for zero-shot transfer by matching an image and a custom textual prompt in its latent space. This has paved the way for several works that focus on engineering or learning textual contexts for maximizing CLIP’s classification capabilities. In this paper, we follow this trend by learning an ensemble of prompts for image classification. We show that learning diverse and possibly shorter contexts considerably and consistently improves the results compared to relying on a single trainable prompt. In particular, we report better few-shot capabilities with no additional cost at inference time. We demonstrate the capabilities of our approach on 11 different benchmarks.
@inproceedings{agnolucci2023eco,
  title     = {ECO: Ensembling Context Optimization for Vision-Language Models},
  author    = {Agnolucci*, L. and Baldrati*, A. and Todino, F. and Becattini, F. and Bertini, M. and Del Bimbo, A.},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages     = {2811--2815},
  year      = {2023},
}
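A toy sketch of classification with an ensemble of prompts, roughly in the spirit of the entry above. The learned contexts are represented only by already-encoded class text embeddings with illustrative shapes, and averaging them per class is one plausible way to combine the ensemble; this is not the ECO implementation.

```python
# Toy prompt-ensemble classification (illustrative shapes; not the ECO implementation).
import torch
import torch.nn.functional as F

n_prompts, n_classes, d = 4, 11, 512
# text_feats[p, c]: CLIP text embedding of class c under the p-th learned context
text_feats = F.normalize(torch.randn(n_prompts, n_classes, d), dim=-1)
image_feats = F.normalize(torch.randn(8, d), dim=-1)            # CLIP image embeddings of a batch

class_protos = F.normalize(text_feats.mean(dim=0), dim=-1)      # combine the prompt ensemble per class
logits = 100.0 * image_feats @ class_protos.T                   # (8, n_classes) classification logits
predictions = logits.argmax(dim=-1)
```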
- [ICCV Workshop] Mapping Memes to Words for Multimodal Hateful Meme Classification. G. Burbi, A. Baldrati, L. Agnolucci, M. Bertini, and A. Del Bimbo. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023
Multimodal image-text memes are prevalent on the internet, serving as a unique form of communication that combines visual and textual elements to convey humor, ideas, or emotions. However, some memes take a malicious turn, promoting hateful content and perpetuating discrimination. Detecting hateful memes within this multimodal context is a challenging task that requires understanding the intertwined meaning of text and images. In this work, we address this issue by proposing a novel approach named ISSUES for multimodal hateful meme classification. ISSUES leverages a pre-trained CLIP vision-language model and the textual inversion technique to effectively capture the multimodal semantic content of the memes. The experiments show that our method achieves state-of-the-art results on the Hateful Memes Challenge and HarMeme datasets. The code and the pre-trained models are publicly available at https://github.com/miccunifi/ISSUES.
@inproceedings{burbi2023mapping,
  title     = {Mapping Memes to Words for Multimodal Hateful Meme Classification},
  author    = {Burbi, G. and Baldrati, A. and Agnolucci, L. and Bertini, M. and Del Bimbo, A.},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages     = {2832--2836},
  year      = {2023},
}
- [ACM MM, Demo] Zero-Shot Image Retrieval with Human Feedback. L. Agnolucci, A. Baldrati, M. Bertini, and A. Del Bimbo. In Proceedings of the 31st ACM International Conference on Multimedia, 2023
Composed image retrieval extends traditional content-based image retrieval (CBIR) by combining a query image with additional descriptive text to express user intent and specify supplementary requests related to the visual attributes of the query image. This approach holds significant potential for e-commerce applications, such as interactive multimodal searches and chatbots. In our demo, we present an interactive composed image retrieval system based on the SEARLE approach, which tackles this task in a zero-shot manner efficiently and effectively. The demo allows users to perform image retrieval, iteratively refining the results using textual feedback.
@inproceedings{agnolucci2023zero,
  title     = {Zero-Shot Image Retrieval with Human Feedback},
  author    = {Agnolucci, L. and Baldrati, A. and Bertini, M. and Del Bimbo, A.},
  booktitle = {Proceedings of the 31st ACM International Conference on Multimedia},
  pages     = {9417--9419},
  year      = {2023},
}
- [IEEE TMM] Perceptual Quality Improvement in Videoconferencing using Keyframes-based GAN. L. Agnolucci, L. Galteri, M. Bertini, and A. Del Bimbo. IEEE Transactions on Multimedia, 2023
Over the last few years, videoconferencing has taken on a fundamental role in interpersonal relations, both for personal and business purposes. Lossy video compression algorithms are the enabling technology for videoconferencing, as they reduce the bandwidth required for real-time video streaming. However, lossy video compression decreases the perceived visual quality. Thus, many techniques for reducing compression artifacts and improving video visual quality have been proposed in recent years. In this work, we propose a novel GAN-based method for compression artifacts reduction in videoconferencing. Given that, in this context, the speaker is typically in front of the camera and remains the same for the entire duration of the transmission, we can maintain a set of reference keyframes of the person from the higher-quality I-frames that are transmitted within the video stream and exploit them to guide the visual quality improvement; a novel aspect of this approach is the update policy that maintains and updates a compact and effective set of reference keyframes. First, we extract multi-scale features from the compressed and reference frames. Then, our architecture combines these features in a progressive manner according to facial landmarks. This allows the restoration of the high-frequency details lost after the video compression. Experiments show that the proposed approach improves visual quality and generates photo-realistic results even with high compression rates. Code and pre-trained networks are publicly available at https://github.com/LorenzoAgnolucci/Keyframes-GAN.
@article{agnolucci2023perceptual,
  title     = {Perceptual Quality Improvement in Videoconferencing using Keyframes-based GAN},
  author    = {Agnolucci, L. and Galteri, L. and Bertini, M. and Del Bimbo, A.},
  journal   = {IEEE Transactions on Multimedia},
  year      = {2023},
  publisher = {IEEE},
}
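The keyframe update policy mentioned in the abstract above is only characterized at a high level there; the sketch below is a purely hypothetical example of such a policy (keep the K highest-scoring I-frames according to any stand-in quality estimate), not the paper's actual criterion.

```python
# Hypothetical reference-keyframe pool with a keep-the-best update policy (not the paper's rule).
from dataclasses import dataclass, field


@dataclass
class KeyframePool:
    capacity: int = 5
    frames: list = field(default_factory=list)        # (quality_estimate, frame) pairs

    def update(self, frame, quality: float) -> None:
        """Insert an incoming I-frame and keep only the `capacity` highest-quality references."""
        self.frames.append((quality, frame))
        self.frames.sort(key=lambda item: item[0], reverse=True)
        del self.frames[self.capacity:]

    def references(self) -> list:
        return [frame for _, frame in self.frames]
```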
2022
- [ACM MM, Demo] Restoration of Analog Videos Using Swin-UNet. L. Agnolucci, L. Galteri, M. Bertini, and A. Del Bimbo. In Proceedings of the 30th ACM International Conference on Multimedia, 2022
In this paper we present a system to restore analog videos of historical archives. These videos often contain severe visual degradation due to the deterioration of their tape supports, which requires costly and slow manual interventions to recover the original content. The proposed system uses a multi-frame approach and is also able to deal with severe tape mistracking, which results in completely scrambled frames. Tests on real-world videos from a major historical video archive show the effectiveness of our demo system.
@inproceedings{agnolucci2022restoration,
  title     = {Restoration of Analog Videos Using Swin-UNet},
  author    = {Agnolucci, L. and Galteri, L. and Bertini, M. and Del Bimbo, A.},
  booktitle = {Proceedings of the 30th ACM International Conference on Multimedia},
  pages     = {6985--6987},
  year      = {2022},
}