Position-Guided Text Prompt for Vision-Language Pre-Training

Citation Data: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, ISSN: 1063-6919, Vol: 2023-June, Page: 23242-23251

Publication Year: 2023

Citations: 21
Usage: 12
Captures: 75
Mentions: 0
Social Media: 0


Metrics Details

Citations: 21
- Citation Indexes: 21
Usage: 12
- Downloads: 10
- Abstract Views: 2
Captures: 75
- Readers: 75

Conference Paper Description

Vision-Language Pre-Training (VLP) has shown promising capabilities to align image and text pairs, facilitating a broad variety of cross-modal learning tasks. However, we observe that VLP models often lack the visual grounding/localization capability that is critical for many downstream tasks such as visual reasoning. In this work, we propose a novel Position-guided Text Prompt (PTP) paradigm to enhance the visual grounding ability of cross-modal models trained with VLP. Specifically, in the VLP phase, PTP divides the image into N × N blocks and identifies the objects in each block using an object detector widely used in VLP. It then reformulates the visual grounding task as a fill-in-the-blank problem given a PTP, encouraging the model to predict the objects in given blocks or regress the blocks of a given object, e.g., filling '[P]' or '[O]' in the PTP 'The block [P] has a [O]'. This mechanism improves the visual grounding capability of VLP models and thus helps them better handle various downstream tasks. By introducing PTP into several state-of-the-art VLP frameworks, we observe consistent and significant improvements across representative cross-modal learning model architectures and several benchmarks, e.g., zero-shot Flickr30K Retrieval (+4.8 in average recall@1) for the ViLT [16] baseline, and COCO Captioning (+5.3 in CIDEr) for the state-of-the-art BLIP [19] baseline. Moreover, PTP achieves results comparable to object-detector-based methods [8, 23, 45] with much faster inference, since PTP discards its object detector at inference time while the latter cannot.
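
The prompt construction described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. Several details are assumptions made only for illustration: blocks are numbered in row-major order starting at 0, an object is assigned to the block containing the center of its bounding box, and detections arrive as hypothetical `(label, (x0, y0, x1, y1))` tuples. The function names `block_index`, `ptp_examples`, and `mask_prompt` are likewise hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixels


def block_index(cx: float, cy: float, width: int, height: int, n: int) -> int:
    """Return the row-major index of the block in an n x n grid that
    contains the point (cx, cy). Numbering from 0 is an assumption."""
    col = min(int(cx * n / width), n - 1)   # clamp points on the right edge
    row = min(int(cy * n / height), n - 1)  # clamp points on the bottom edge
    return row * n + col


def ptp_examples(detections: List[Tuple[str, Box]], width: int, height: int,
                 n: int = 3) -> List[Tuple[int, str]]:
    """Map each detection to a (block, label) pair.
    Assumption: the box center decides which block the object belongs to."""
    pairs = []
    for label, (x0, y0, x1, y1) in detections:
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
        pairs.append((block_index(cx, cy, width, height, n), label))
    return pairs


def mask_prompt(block: int, label: str, target: str) -> Tuple[str, str]:
    """Build a fill-in-the-blank prompt and its answer.
    target='P' hides the block, target='O' hides the object."""
    if target == "P":
        return f"The block [P] has a {label}", str(block)
    if target == "O":
        return f"The block {block} has a [O]", label
    raise ValueError("target must be 'P' or 'O'")


if __name__ == "__main__":
    dets = [("dog", (0, 0, 100, 100)), ("ball", (500, 500, 600, 600))]
    for block, label in ptp_examples(dets, width=600, height=600, n=3):
        print(mask_prompt(block, label, "P"))
        print(mask_prompt(block, label, "O"))
```

In this sketch the detector output is needed only to build the training prompts, which is consistent with the abstract's statement that PTP discards the object detector at inference time.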

Bibliographic Details

DOI: 10.1109/cvpr52729.2023.02226

REPOSITORY URL: https://ink.library.smu.edu.sg/sis_research/9021

URL ID: http://www.scopus.com/inward/record.url?partnerID=HzOxMe3b&scp=85168724998&origin=inward; http://dx.doi.org/10.1109/cvpr52729.2023.02226; https://ieeexplore.ieee.org/document/10204271/; https://ink.library.smu.edu.sg/sis_research/9021; https://ink.library.smu.edu.sg/cgi/viewcontent.cgi?article=10024&context=sis_research

AUTHOR(S)

Alex Jinpeng WANG; Pan ZHOU; Mike Zheng SHOU; YAN Shuicheng

PUBLISHER(S)

Institute of Electrical and Electronics Engineers (IEEE)

TAG(S)

Computer Science
