Refined OpenSLR-52 Corpus

This repository contains a refined set of transcriptions of a speech corpus for Sinhala.

Our goal is to provide a clean speech corpus for Sinhala that is readily usable with ASR toolkits.

We started from the open-source Sinhala speech corpus titled "Large Sinhala ASR training data set", published by Google. The complete dataset is available here.

By going through the transcriptions, we have identified two kinds of issues:

Issues related to textual characters
Issues related to linguistics

We addressed these issues through a systematic process of filtering and correction. The tasks carried out for each kind of issue are listed below.

Issues related to textual characters

Removing punctuation marks (e.g. . , / " " : -)
Excluding utterances which contain English characters
Replacing numerical characters with their textual forms
Removing unnecessarily applied non-printable characters (e.g. ZWJ, ZWNJ, ZWSP)
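
The character-level steps above can be sketched in Python as follows. This is only a minimal illustration, not the scripts used for this corpus: the function names are hypothetical, the punctuation set is limited to the marks given as examples, and the rule for deciding that a joiner is "unnecessary" (no al-lakuna, U+0DCA, on either side) is an assumed heuristic. Replacing numbers with their Sinhala textual forms is not attempted here; utterances containing digits are only flagged.

```python
import re

# Punctuation marks given as examples in the list above (straight and curly quotes).
PUNCTUATION = re.compile(r'[.,/"\u201c\u201d:\-]')
ZWSP = "\u200b"
# ZWJ (U+200D) or ZWNJ (U+200C) with no al-lakuna (U+0DCA) on either side.
# Assumption: joiners that form Sinhala conjuncts sit next to al-lakuna.
STRAY_JOINER = re.compile("(?<!\u0dca)[\u200c\u200d](?!\u0dca)")
LATIN = re.compile(r"[A-Za-z]")
DIGIT = re.compile(r"[0-9]")


def clean_characters(text):
    """Return the cleaned text, or None if the utterance should be excluded."""
    if LATIN.search(text):
        return None  # utterances with English characters are excluded
    text = text.replace(ZWSP, "")
    text = STRAY_JOINER.sub("", text)
    # Replace punctuation with a space so that "a-b" does not become "ab".
    text = PUNCTUATION.sub(" ", text)
    return " ".join(text.split())  # collapse repeated whitespace


def needs_number_verbalisation(text):
    """Flag utterances whose digits still have to be written out in words."""
    return DIGIT.search(text) is not None
```

Replacing punctuation with a space rather than deleting it is a design choice of this sketch; it avoids merging two words that were separated only by a hyphen or a slash.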

Issues related to linguistics

Applying spelling corrections on misspelled words
Applying consistent spacing between words, in line with Sinhala grammar
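
One simple way to apply such corrections is with lookup tables, sketched below. This is an assumed approach for illustration only: the function `apply_corrections` and both tables are hypothetical, and the placeholder tokens in the usage example are not real corpus data. A misspelled word maps to its corrected form (which may contain a space, if one word should be written as two), and a pair of adjacent words can be marked as one that should be written together.

```python
def apply_corrections(text, word_fixes, joins):
    """Apply word-level spelling fixes, then merge word pairs listed in `joins`.

    word_fixes: dict mapping a misspelled word to its corrected form.
    joins: set of (left, right) word pairs that should be written as one word.
    """
    # Re-split after the fixes so that a fix containing a space yields two words.
    fixed = " ".join(word_fixes.get(w, w) for w in text.split()).split()
    merged = []
    for word in fixed:
        if merged and (merged[-1], word) in joins:
            merged[-1] = merged[-1] + word  # remove the space between the pair
        else:
            merged.append(word)
    return " ".join(merged)


# Placeholder tokens only, to show the mechanics.
print(apply_corrections("aa bx cc", {"bx": "bb"}, {("bb", "cc")}))
```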

While applying the refinements, we excluded certain utterances because the speaker pronounced words incorrectly, following misspellings in the original transcription (i.e. the prompt). The table below lists the corpus statistics after each refinement stage, along with the corresponding transcription files.

Version identifier	Description	Total utterances	Unique utterances	Unique words
V0	Original, unmodified	185,293	102,576	69,581
V1	Corrected textual characters	178,409	98,435	63,376
V2	Applied linguistic corrections on V1	178,096	98,127	57,029
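
The three statistics in the table can be computed from a list of transcriptions roughly as shown below. The exact counting rules used for the table are not stated here, so this sketch assumes that words are separated by whitespace and that two utterances are the same when their whitespace-normalized text is identical. The function name `corpus_stats` is hypothetical.

```python
def corpus_stats(transcriptions):
    """Count total utterances, unique utterances and unique words."""
    # Normalize whitespace so that spacing differences do not create duplicates.
    utterances = [" ".join(t.split()) for t in transcriptions]
    words = {word for utterance in utterances for word in utterance.split()}
    return {
        "total_utterances": len(utterances),
        "unique_utterances": len(set(utterances)),
        "unique_words": len(words),
    }


# Placeholder tokens only, to show the shape of the result.
print(corpus_stats(["aa bb", "aa  bb", "bb cc"]))
```

Taking the totals from the table, 185,293 - 178,409 = 6,884 utterances were dropped between V0 and V1, and 178,409 - 178,096 = 313 more between V1 and V2.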

In addition to the above refinements, we have tagged each speaker with his/her gender, as that information is required by ASR toolkits like Kaldi. The speaker-to-gender mapping is contained in this file.
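
Kaldi reads speaker genders from a `spk2gender` file with one `<speaker-id> <m|f>` pair per line. The sketch below converts a CSV mapping into that format. The column layout is an assumption (speaker ID in the first column, a gender label starting with "m" or "f" in the second), and the function name `to_spk2gender` is hypothetical; check the actual mapping file before relying on it.

```python
import csv
import io


def to_spk2gender(csv_text):
    """Convert CSV text of (speaker, gender) rows into Kaldi spk2gender lines."""
    mapping = {}
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        speaker = row[0].strip()
        gender = row[1].strip().lower()
        # Keep only rows whose label starts with "m" or "f"; this also skips
        # a header row such as "speaker,gender".
        if gender[:1] in ("m", "f"):
            mapping[speaker] = gender[0]
    # Kaldi expects its data files to be sorted by key.
    return "".join(f"{spk} {g}\n" for spk, g in sorted(mapping.items()))


print(to_spk2gender("speaker,gender\nb2,Female\na1,male\n"), end="")
```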

We have developed Python scripts to automate most of the steps of the refinement process. Those scripts can be accessed in this repository.

