
Composed Image Retrieval on Real-life Images

Composed Image Retrieval (or Image Retrieval conditioned on Language Feedback) is a relatively new retrieval task in which the input query consists of an image and a short textual description of how to modify that image.

For humans, the advantage of a bi-modal query is clear: some concepts and attributes are more succinctly described visually, others through language. By cross-referencing the two modalities, a reference image can capture the general gist of a scene, while the text can specify finer details.

A major challenge of this task is the inherent ambiguity in deciding which information is important (typically one object of interest in the scene) and which can be ignored (e.g., the background and other irrelevant objects).
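To make the query format concrete, here is a minimal sketch of the task's input and output, not the CIRPLANT method. It assumes the reference image, the modification sentence, and every candidate image have already been embedded into a shared vector space by some encoder. The names `ComposedQuery`, `compose`, and `rank_candidates` are hypothetical, and the element-wise sum is only a placeholder for a learned composition function.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ComposedQuery:
    """A bi-modal query: reference image plus modification sentence.

    Both fields hold precomputed embeddings (an assumption for this sketch).
    """
    reference_image: List[float]
    modification_text: List[float]


def compose(query: ComposedQuery) -> List[float]:
    # Placeholder composition: element-wise sum of the two embeddings.
    # A real model would learn this function; this is NOT CIRPLANT.
    return [i + t for i, t in zip(query.reference_image, query.modification_text)]


def cosine(a: List[float], b: List[float]) -> float:
    # Cosine similarity, returning 0.0 if either vector has zero norm.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_candidates(query: ComposedQuery, candidates: Dict[str, List[float]]) -> List[str]:
    # Return candidate names sorted from most to least similar to the composed query.
    composed = compose(query)
    return sorted(candidates, key=lambda name: cosine(composed, candidates[name]), reverse=True)


# Toy embeddings, invented purely for illustration.
query = ComposedQuery(reference_image=[1.0, 0.0, 0.0], modification_text=[0.0, 1.0, 0.0])
candidates = {
    "unrelated": [0.0, 0.0, 1.0],
    "reference_like": [1.0, 0.0, 0.0],
    "target": [1.0, 1.0, 0.0],
}
print(rank_candidates(query, candidates))  # ['target', 'reference_like', 'unrelated']
```

In this toy example, the candidate that matches both the reference image and the requested modification ranks above the one that merely resembles the reference image.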

Here, we extend the task of composed image retrieval by introducing the Composed Image Retrieval on Real-life images (CIRR) dataset - the first dataset of open-domain, real-life images with human-generated modification sentences.

Demo image from CIRR data

Concurrently, we release the code and pre-trained models for our method, Composed Image Retrieval using Pretrained LANguage Transformers (CIRPLANT). Together with the dataset, we believe this work will inspire further research on this task at a finer-grained level.

Read more in our published paper.

View our 5-minute video.

You are currently viewing the Project homepage.

CIRR Dataset

Click to download our dataset.

We do not publish the ground truth for the test split of CIRR. Instead, we host an evaluation server for anyone who wishes to report results on the test split.

Note that the ground truth for the validation split is available as usual and can be used for development.


Click to access our codebase.

Our code is in PyTorch, and is based on PyTorch Lightning.

To encourage continuing research in this task, we will additionally provide an implementation of TIRG that is compatible with our codebase (coming soon).




Please cite our paper if it helps your research:

  author    = {Zheyuan Liu and
               Cristian Rodriguez and
               Damien Teney and
               Stephen Gould},
  title     = {Image Retrieval on Real-life Images with Pre-trained Vision-and-Language Models},
  booktitle = {ICCV},
  year      = {2021}


If you have any questions regarding our dataset, model, or publication, please create an issue in the project repository, or email