PlumX Metrics

Audio content analysis

Semantic Multimedia and Ontologies: Theory and Applications, pp. 123-162
2008
  • Citations: 4
  • Usage: 0
  • Captures: 51
  • Mentions: 0
  • Social Media: 0

Metrics Details

  • Citations: 4
    • Citation Indexes: 4
  • Captures: 51

Book Chapter Description

Since the introduction of digital audio more than 30 years ago, computers and signal processors have been capable of storing, modifying and transmitting sound signals. Before the advent of the Internet, compression technologies and digital telephony, such systems were aimed at the highest possible reproduction quality from physical media, or were confined to very specialised voice recognition or security systems. The first widespread techniques aimed at extracting semantics from audio were automatic speech recognition (ASR) systems. In recent years, large-scale online distribution of high-quality audio has become a reality, widening the range of sounds to be analysed to music and any other kind of sound, and moving computers to the centre of the user side of the audio distribution chain. This shift has mainly been driven by advances in audio compression algorithms, especially the enormously successful MP3, and in network technologies.

Audio content analysis (ACA), i.e. the automatic extraction of semantic information from sounds, arose naturally from the need to manage growing collections of data efficiently and to enhance man-machine communication. ACA typically delivers a set of numerical measures computed from the audio signal, called audio features, that offer a compact and representative description. Such measures are usually called low-level features to denote that they represent a low level of abstraction. Although the classification is not strict, low-level features can be taken to be the measures most directly tied to the shape of the signal in the time or spectral domain, which are applicable to almost any kind of audio (two classic examples are sketched below). Mid- and high-level features provide information that is more easily processed and used by humans, such as phonemes, words or prosody in the case of speech, or melody, harmony and structure in the case of music. To ensure interoperability, both low- and mid-level features can be conveyed as metadata in standardised syntactic formalisations, the most important of which is the audio part of the MPEG-7 standard (ISO/IEC 2002), based on the XML mark-up language.

As a demanding pattern recognition problem, most ACA systems are still in the development and testing stage, with the exception of speech recognition systems. However, the advent of collaborative filtering methods and of semantic web technologies in recent years makes it possible to envisage effective multimedia information retrieval systems that combine social and cultural metadata (i.e. the context) with the signal-related features (the content). Fixed taxonomies are evolving into dynamic ontologies that can encompass metadata from very heterogeneous sources, and syntactic languages such as XML are evolving into semantic languages such as OWL (Web Ontology Language). Sound data plays a crucial role in this paradigm shift, since it underlies both the most natural form of human communication (speech) and the most powerful digital entertainment industry (music). Several online services based solely on cultural or manually annotated metadata, which will be mentioned later in the chapter, have enjoyed huge success to date. The current challenge is to combine that information with the features delivered by ACA in such a way that both robustness and usability are enhanced.
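To make the notion of low-level features concrete, the following minimal Python sketch computes two classic examples frame by frame: the zero-crossing rate, tied to the shape of the signal in the time domain, and the spectral centroid, tied to its spectral shape. The chapter prescribes no implementation; the function name, frame length and hop size here are illustrative assumptions.

    import numpy as np

    def low_level_features(signal, sr, frame_len=1024, hop=512):
        """Frame-wise zero-crossing rate and spectral centroid (Hz).

        Illustrative only: the name and parameters are assumptions,
        not definitions from the chapter.
        """
        freqs = np.fft.rfftfreq(frame_len, d=1.0 / sr)
        window = np.hanning(frame_len)
        zcrs, centroids = [], []
        for start in range(0, len(signal) - frame_len + 1, hop):
            frame = signal[start:start + frame_len]
            # Zero-crossing rate: fraction of adjacent sample pairs whose sign
            # differs (a time-domain measure of how noisy or high-pitched the
            # frame is).
            zcrs.append(np.mean(np.signbit(frame[1:]) != np.signbit(frame[:-1])))
            # Spectral centroid: magnitude-weighted mean frequency of the frame
            # (a spectral-domain measure of the brightness of the sound).
            mag = np.abs(np.fft.rfft(frame * window))
            centroids.append(np.sum(freqs * mag) / (np.sum(mag) + 1e-12))
        return np.array(zcrs), np.array(centroids)

    # A pure 440 Hz tone gives a low, steady ZCR and a centroid near 440 Hz;
    # white noise gives a high ZCR and a much higher, unstable centroid.
    sr = 16000
    t = np.arange(sr) / sr
    zcr_tone, cen_tone = low_level_features(np.sin(2 * np.pi * 440 * t), sr)
    zcr_noise, cen_noise = low_level_features(np.random.randn(sr), sr)

Both measures are cheap to compute and meaningful for speech, music or noise alike, which is precisely what makes them low-level in the sense used above.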
The role of ACA within the emerging semantic technologies is thus twofold. On the one hand, it in itself provides a powerful set of applications, such as speech recognition, speaker segmentation or music analysis, that solve specific needs of semantic access. On the other hand, it constitutes the basis of the bottom-up approach to overcoming the semantic gap by defining a mapping between physical features and ontological knowledge representations. The latter aspect has only recently been addressed, and little work has been done with generalised audio in this context. One of the few works that follow an ontological approach to generalised audio (i.e. speech, music or noise) is that of Nakatani and Okuno (1998), in which an ontology is used to integrate different systems for stream segregation. More specific cases, such as recommendation systems based on music ontologies, have gained more attention and will be briefly addressed in the corresponding sections.

The present chapter provides an extensive insight into ACA techniques and their state of the art, and presents several recent systems as illustrations. After a brief general overview (Section 5.2), the chapter follows the blocks in Fig. 5.1. The audio classification and segmentation stage (Section 5.3) recognises the different audio types contained in a general audio signal and their temporal borders; the subsequent analysis techniques are then chosen according to the detected content type (a toy version of this dispatch is sketched after this overview). For a speech signal, speaker segmentation or spoken content indexing can be applied. Speaker segmentation (Section 5.4) identifies speaker change points and speaker identities. Spoken content indexing and spoken document retrieval (Section 5.5) extract the text, or even sub-word units, from speech signals and use these metadata for retrieval tasks. Music content analysis techniques (Section 5.6) are applied to music signals. The chapter concludes (Section 5.7) with a summary and an outlook on further research directions in the field of audio content analysis.
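As a toy illustration of how low-level features could drive this classification and dispatch, the sketch below labels fixed-length segments as speech or music from the variability of the zero-crossing rate, exploiting the fact that speech alternates voiced and unvoiced sounds while sustained music is more stationary. This is a hypothetical single-feature heuristic, not the chapter's method; real systems of the kind discussed in Section 5.3 use many features and trained statistical models.

    def classify_segments(signal, sr, seg_sec=1.0, zcr_std_threshold=0.05):
        """Toy speech/music discriminator over fixed-length segments.

        Reuses low_level_features() from the previous sketch; the single
        feature and the threshold are illustrative assumptions.
        """
        seg_len = int(seg_sec * sr)
        labels = []
        for start in range(0, len(signal) - seg_len + 1, seg_len):
            zcr, _ = low_level_features(signal[start:start + seg_len], sr)
            # Speech alternates voiced (low ZCR) and unvoiced (high ZCR)
            # sounds, so its frame-wise ZCR fluctuates; music is steadier.
            labels.append("speech" if np.std(zcr) > zcr_std_threshold
                          else "music")
        return labels

    # Segments labelled "speech" would then be routed to speaker segmentation
    # or spoken content indexing, and "music" segments to music content
    # analysis, mirroring the dispatch of Fig. 5.1.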
