Visual Definition Modeling: Challenging Vision & Language Models to Define Words and Objects

Proceedings of the 36th AAAI Conference on Artificial Intelligence, AAAI 2022, ISSN: 2159-5399, Vol: 36, Issue: 10, Page: 11267-11275
2022

Conference Paper Description

Architectures that model language and vision together have received much attention in recent years. Nonetheless, most tasks in this field focus on end-to-end applications and offer little insight into whether models capture the underlying semantics of visual objects or words. In this paper we draw on the established Definition Modeling paradigm and enhance it by grounding, for the first time, textual definitions in visual representations. We name this new task Visual Definition Modeling and put forward DEMETER and DIONYSUS, two benchmarks where, given an image as context, models have to generate a textual definition for a target that is either i) a word that describes the image, or ii) an object patch therein. To measure the difficulty of our tasks, we fine-tuned six different baselines and analyzed their performance, which shows that a text-only encoder-decoder model is more effective than models pretrained to handle inputs of both modalities concurrently. This demonstrates the complexity of our benchmarks and encourages more research on text generation conditioned on multimodal inputs. The datasets for both benchmarks, as well as the code to reproduce our models, are available at https://github.com/SapienzaNLP/visual-definitionmodeling.
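To make the task format in the abstract concrete, the following is a minimal Python sketch of how a single Visual Definition Modeling instance could be represented: an image as context, a target that is either a word or an object patch, and a textual definition to generate. All class names, field names, and the bounding-box layout are hypothetical and chosen for illustration only; they are not taken from the released datasets or code.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class WordTarget:
    """Case i): the target is a word that describes the image."""
    word: str


@dataclass(frozen=True)
class PatchTarget:
    """Case ii): the target is an object patch inside the image."""
    # Hypothetical layout: (x_min, y_min, x_max, y_max) in pixels.
    box: Tuple[int, int, int, int]


@dataclass(frozen=True)
class VDMExample:
    """One instance: image context, a target, and the definition to generate."""
    image_path: str
    target: Union[WordTarget, PatchTarget]
    definition: str


def describe_input(example: VDMExample) -> str:
    """Render the model input (image plus target) as a readable string."""
    if isinstance(example.target, WordTarget):
        return f"image={example.image_path} | target word={example.target.word}"
    x0, y0, x1, y1 = example.target.box
    return f"image={example.image_path} | target patch=({x0}, {y0}, {x1}, {y1})"
```

In this sketch the two target types share one container, so a single generation model could be evaluated on both settings by switching only the `target` field. The actual data format used by the benchmarks may differ.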

Bibliographic Details

Bianca Scarlini; Roberto Navigli; Tommaso Pasini

Association for the Advancement of Artificial Intelligence (AAAI)

Computer Science
