PlumX Metrics
Embed PlumX Metrics

Speech Recognition Datasets for Congolese Languages

2023
  • 0
    Citations
  • 350
    Usage
  • 0
    Captures
  • 0
    Mentions
  • 0
    Social Media
Metric Options:   Counts1 Year3 Year

Metrics Details

Dataset Description

This dataset contains two new benchmark corpora designed for low-resource languages spoken in the Democratic Republic of the Congo: The Lingala Read Speech Corpus LRSC, with 4.3 hours of labelled audio, and the Congolese Speech Radio Corpus CSRC, which offers 741 hours of unlabeled audio spanning four significant low-resource languages of the region (Lingala, Tshiluba, Kikongo and Congolese Swahili). Collecting speech and audio for this dataset involved two sets of processes: (1) for LRSC, 32 Congolese adult participants were instructed to sit in a relaxed manner within centimetres of an audio recording device or smartphone and read from the text utterances; (2) for CSRC, recording from the archives of a broadcast station were pre-processed and curated. Congolese languages tend to fall into the “low-resource” category, which, in contrast to “high-resource” languages, has fewer datasets accessible, limiting the development of Conversational Artificial Intelligence. This results in cr...

Bibliographic Details

Provide Feedback

Have ideas for a new metric? Would you like to see something else here?Let us know