PlumX Metrics

The Storyteller: Computer Vision Driven Context and Content Generation System

SSRN, ISSN: 1556-5068
2023
  • Citations: 0
  • Usage: 158
  • Captures: 0
  • Mentions: 0
  • Social Media: 0

Metrics Details

  • Usage: 158
    • Abstract Views: 123
    • Downloads: 35

Article Description

Equipping machines with the human capability of detecting, understanding, and contextualizing objects in the real world has long been a goal of computer science. Alongside other important open challenges in computer vision, image captioning with context and content is a significant research problem. In our research, we attempted to develop a human-like storytelling system that captions images with attention to content, context, syntax, and knowledge. Our methodology combines Capsule Networks for image encoding, Knowledge Graphs for content and context awareness, and Transformer Neural Networks for decoding. Spatial, geometrical, and orientational details are extracted by the Capsule Networks during feature extraction. The corpus is passed through the Knowledge Graph to equip our model with content, context, and semantics. The decoding phase combines the Knowledge Graph and the Transformer Neural Network for knowledge-driven captioning. Dynamic multi-headed attention in the decoder is used for memory optimization. Our model is trained on MSCOCO and tested on MSCOCO, Flickr16K, and Google Images. The results show good content and context understanding, with B4: 71.93, M: 39.14, C: 136.53, and R: 94.32. The placement of adverbs and adjectives within the generated sentences, in accordance with the objects' geometrical and semantic relationships, is a notable strength. The primary outcome of our research is the generation of autonomous story-type captions for real-world images.
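The decoder described in the abstract relies on multi-headed attention. As a rough illustration only (this is not the authors' implementation, and their memory-optimized "dynamic" variant is not reproduced here), a minimal scaled dot-product multi-head self-attention can be sketched in NumPy; all function names, shapes, and random weights below are assumptions for demonstration:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, num_heads, rng):
    """Illustrative multi-head self-attention over x of shape (seq_len, d_model).

    Random projection matrices stand in for learned parameters; a trained
    model would use fitted weights instead.
    """
    seq_len, d_model = x.shape
    assert d_model % num_heads == 0
    d_head = d_model // num_heads
    # Query, key, value, and output projections (randomly initialized).
    w_q, w_k, w_v, w_o = (
        rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
        for _ in range(4)
    )
    # Project and split into heads: (num_heads, seq_len, d_head).
    q = (x @ w_q).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    k = (x @ w_k).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    v = (x @ w_v).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    # Scaled dot-product attention per head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)
    # Merge heads back to (seq_len, d_model) and apply output projection.
    out = (weights @ v).transpose(1, 0, 2).reshape(seq_len, d_model)
    return out @ w_o

rng = np.random.default_rng(0)
tokens = rng.standard_normal((7, 64))          # 7 token embeddings, d_model = 64
attended = multi_head_attention(tokens, num_heads=8, rng=rng)
```

The output has the same shape as the input, which is what lets such blocks be stacked in a Transformer decoder.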

Bibliographic Details

Anwar ul Haque; Sayeed Ghani; Muhammad Saeed; Hardy Schloer

Elsevier BV

Multidisciplinary; Capsule Networks; Image Captioning; Knowledge Graphs; Transformer Neural Networks; Context-aware Captioning
