Speech Recognition Datasets for Congolese Languages

Publication Year: 2023

Citations: 0
Usage: 350
Captures: 0
Mentions: 0
Social Media: 0

Metrics Details

Usage: 350
- Views: 329
- Downloads: 21

Dataset Description

This dataset contains two new benchmark corpora designed for low-resource languages spoken in the Democratic Republic of the Congo: the Lingala Read Speech Corpus (LRSC), with 4.3 hours of labelled audio, and the Congolese Speech Radio Corpus (CSRC), which offers 741 hours of unlabelled audio spanning four significant low-resource languages of the region (Lingala, Tshiluba, Kikongo and Congolese Swahili). Speech and audio were collected through two processes: (1) for LRSC, 32 Congolese adult participants were instructed to sit in a relaxed manner within centimetres of an audio recording device or smartphone and read from the text utterances; (2) for CSRC, recordings from the archives of a broadcast station were pre-processed and curated. Congolese languages tend to fall into the “low-resource” category: in contrast to “high-resource” languages, they have fewer accessible datasets, which limits the development of Conversational Artificial Intelligence. This results in cr...

Bibliographic Details

DOI: 10.17632/28x8tc9n9k.1

URL ID: https://data.mendeley.com/datasets/28x8tc9n9k; http://dx.doi.org/10.17632/28x8tc9n9k.1; https://dx.doi.org/10.17632/28x8tc9n9k.1; https://data.mendeley.com/datasets/28x8tc9n9k/1

AUTHOR(S)

Kimanuka, Ussen; wa Maina, Ciira; Büyük, Osman

PUBLISHER(S)

Mendeley Data
