webrice-is/spjallromur

Spjallromur - Icelandic Conversational Speech

About the Spjallrómur corpus

Spjallromur is an open-source conversational speech corpus for speech technology development. The corpus is 21 hours and 20 minutes long, with 54 total conversations, and 102 speakers. The data was collected over one year (September 2020 - September 2021) by Reykjavík University.

This is a revised version of Spjallromur. The original version, which did not include accurate timestamps, can be found on Clarin. For this revision, word-level timestamps were recalculated by automatically aligning the audio files to the transcripts. The revision also adds a script that transforms the audio files and transcripts into conventional short-segment ASR training data with defined training, development, and test sets.

Spjallrómur was collected using a custom-made online chatting platform called Spjall, which is Icelandic for "chat". Each speaker used their own microphone and device (some microphones picked up background noise from neighboring speakers). The audio from each speaker's microphone was saved to a separate WAVE file. There are two speakers per conversation. The speaker set contains both native and non-native Icelandic speakers. All speakers are adults. Due to some network lag, there is sometimes a small difference in the length of the two audio files within a conversation. As there was a limited number of participants, some speakers may appear in more than one conversation. The dataset was primarily created for automatic speech recognition, but due to its conversational nature it can also be used for other speech technology fields such as speaker identification, speaker diarization, and conversational language modeling.
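Because the two audio files of a conversation can differ slightly in length, it can be useful to measure that difference before combining them. The following is a minimal sketch using only the Python standard library; the function names are illustrative and are not part of the corpus scripts.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a WAVE file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def duration_gap_seconds(path_a: str, path_b: str) -> float:
    """Return the absolute length difference between two recordings."""
    return abs(wav_duration_seconds(path_a) - wav_duration_seconds(path_b))
```

For example, calling duration_gap_seconds on the a_ and b_ files of one conversation folder gives the length mismatch for that conversation.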

The transcripts were created by manually correcting ASR output. Personally identifiable information has been redacted in the audio with a 400 Hz beep and replaced with <BLEEP> in the transcript. Partial words are marked with [HIK: ..].
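When preparing text for ASR scoring or language modeling, these markers are often stripped or normalized first. The sketch below assumes that a partial-word marker has the shape "[HIK: fragment]" and that <BLEEP> appears as a standalone token; both are assumptions drawn from the description above, so check them against the actual transcripts. The function name and the example sentence are made up for illustration.

```python
import re

BLEEP = "<BLEEP>"
# Assumed marker shape: "[HIK: <fragment>]" with no nested brackets.
HIK_RE = re.compile(r"\[HIK:\s*[^\]]*\]")


def normalize_transcript(text: str, keep_bleep: bool = False) -> str:
    """Drop partial-word markers and, optionally, redaction tokens."""
    text = HIK_RE.sub(" ", text)
    if not keep_bleep:
        text = text.replace(BLEEP, " ")
    # Collapse the whitespace left behind by the removals.
    return " ".join(text.split())


print(normalize_transcript("ég [HIK: fó] fór til <BLEEP> í gær"))
```

Whether to keep <BLEEP> depends on the task: an acoustic model may benefit from an explicit token for the beep, while a language model usually does not.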

  • The full conversations contain 18 hrs 20 mins across 46 conversations and 92 speakers.
  • The half conversations contain 2 hrs 42 mins across 7 conversations.
  • The unaligned data contains 1 hr 16 mins across 3 recordings.

We were unable to align three recordings. The unaligned data contains one conversation between two speakers and one half conversation; the other half of that conversation was moved from full conversations in the original to half conversations in this revision. The original transcripts are still included in the folder because the text data can be useful.

To evaluate the new alignments, we manually reviewed ~300 segments. Details on this are found in the file evaluation_of_alignment.md.

The structure of the corpus

. - readme.txt
. - metadata.tsv
. - data/
        . - half_conversations/
                . - 2a139f9b/
                        . - a_2a139f9b_20-29_m.json
                        . - a_2a139f9b_20-29_m.wav
        . - full_conversations/
                . - 0f2c315c/
                        . - a_0f2c315c_30-39_f.json
                        . - a_0f2c315c_30-39_f.wav
                        . - b_0f2c315c_30-39_f.json
                        . - b_0f2c315c_30-39_f.wav
        . - unaligned/
        . - splits/

The file names are structured as <spk-id>_<unique-key>_<age>_<gender>.wav. Each audio file is a 16-bit, 16 kHz, single-channel WAVE file.
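The naming scheme makes it possible to recover basic metadata from a path alone. The sketch below assumes that the four fields never contain an underscore, which holds for the example names in the tree above; the RecordingName type and parse_recording_name function are illustrative names, not part of the corpus scripts.

```python
import re
from pathlib import Path
from typing import NamedTuple


class RecordingName(NamedTuple):
    speaker_id: str  # "a" or "b" in the examples above
    key: str         # conversation key, e.g. "0f2c315c"
    age: str         # age bracket, e.g. "30-39"
    gender: str      # e.g. "f" or "m"


# Assumed pattern: four underscore-separated fields before the extension.
_NAME_RE = re.compile(
    r"^(?P<spk>[^_]+)_(?P<key>[^_]+)_(?P<age>[^_]+)_(?P<gender>[^_]+)$"
)


def parse_recording_name(path: str) -> RecordingName:
    """Split a corpus file name into its four fields."""
    match = _NAME_RE.match(Path(path).stem)
    if match is None:
        raise ValueError(f"unexpected file name: {path}")
    return RecordingName(match["spk"], match["key"], match["age"], match["gender"])


print(parse_recording_name("data/full_conversations/0f2c315c/a_0f2c315c_30-39_f.wav"))
```

The same function works for the .json transcripts, since only the file stem is parsed.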

Prepare audio

For ease of use, one can run the script prepare_audio.py to convert the audio files to the ASR and diarization formats. The same steps are used in the recipes described below.

ASR

The data was re-aligned in this revision to support training and evaluating ASR systems on conversational data. The script run_asr_recipe.py converts the corpus into short audio segments and splits them into train, dev, and test sets.
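To illustrate how word-level timestamps can be turned into short segments, the sketch below groups consecutive words and starts a new segment at long pauses or when a segment would grow too long. This is a generic approach and not necessarily what run_asr_recipe.py does; the (word, start, end) tuple layout, the function name, and both thresholds are assumptions made for the example.

```python
from typing import List, Tuple

Word = Tuple[str, float, float]      # (word, start_seconds, end_seconds)
Segment = Tuple[float, float, str]   # (start_seconds, end_seconds, text)


def group_words(words: List[Word], max_gap: float = 0.5,
                max_len: float = 30.0) -> List[Segment]:
    """Group time-ordered words into segments, splitting at pauses."""
    groups: List[List[Word]] = []
    current: List[Word] = []
    for word in words:
        _, start, end = word
        if current:
            pause = start - current[-1][2]
            length = end - current[0][1]
            # Close the segment on a long pause or when it would grow too long.
            if pause > max_gap or length > max_len:
                groups.append(current)
                current = []
        current.append(word)
    if current:
        groups.append(current)
    return [(g[0][1], g[-1][2], " ".join(w for w, _, _ in g)) for g in groups]
```

Each resulting (start, end, text) triple can then be used to cut a clip from the speaker's audio file.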

The sets are as follows:

train

  • Number of segments 5701
  • Duration 22.56 hours
  • 21 female, 58 male, and 2 other

test

  • Number of segments 286
  • Duration ~1 hour
  • 5 female and 5 male

dev

  • Number of segments 334
  • Duration ~1 hour
  • 4 female and 4 male

Note that there is no speaker overlap between the sets: speakers in the test set are not in the training set, and the same applies to the dev set.
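A speaker-disjoint split can be checked with a few set operations. The sketch below assumes that a set of speaker identifiers has already been collected for each split; how those identifiers are obtained (for example from metadata.tsv) is not shown, and the function name and example identifiers are hypothetical.

```python
from itertools import combinations
from typing import Dict, Set, Tuple


def find_speaker_overlap(
    splits: Dict[str, Set[str]]
) -> Dict[Tuple[str, str], Set[str]]:
    """Return the speakers shared by each pair of splits, if any."""
    overlaps = {}
    for (name_a, spk_a), (name_b, spk_b) in combinations(splits.items(), 2):
        shared = spk_a & spk_b
        if shared:
            overlaps[(name_a, name_b)] = shared
    return overlaps


# Hypothetical speaker identifiers, for illustration only.
example = {
    "train": {"spk1", "spk2"},
    "dev": {"spk3"},
    "test": {"spk4"},
}
print(find_speaker_overlap(example))  # an empty dict means the splits are disjoint
```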

The recipe also has scripts for running an ASR experiment. The results of this experiment are in results/asr/whisper-large-icelandic-30k-steps-1000h-ct2.txt

Diarization

For diarization, we provide a script that converts the corpus to a diarization-friendly format. The script run_diarization_recipe.py combines the full transcripts of the two speakers in each "full conversation". It outputs a JSON file and RTTM file.
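For readers unfamiliar with RTTM, each line describes one speaker turn with ten space-separated fields: the record type, file ID, channel, onset, duration, and speaker name, with <NA> in the unused positions. The sketch below formats a list of turns this way. It is not the code used by run_diarization_recipe.py, and the (start, end, speaker) input layout and the function name are assumptions made for the example.

```python
from typing import Iterable, List, Tuple

Turn = Tuple[float, float, str]  # (start_seconds, end_seconds, speaker_label)


def to_rttm_lines(file_id: str, turns: Iterable[Turn]) -> List[str]:
    """Format speaker turns as RTTM SPEAKER lines, ordered by start time."""
    lines = []
    for start, end, speaker in sorted(turns):
        duration = end - start
        lines.append(
            f"SPEAKER {file_id} 1 {start:.3f} {duration:.3f} "
            f"<NA> <NA> {speaker} <NA> <NA>"
        )
    return lines


for line in to_rttm_lines("0f2c315c", [(1.5, 3.0, "b"), (0.0, 1.25, "a")]):
    print(line)
```

Merging the turns of both speakers under one file ID, as done here, is what lets a diarization scorer treat the two recordings as a single conversation.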

The recipe also provides steps to replicate a diarization experiment using pyannote. The results of the experiment are in results/diarize/diarization_results.md

Authors

Reykjavík University

  • Carlos Daniel Hernández Mena
  • Judy Y Fong
  • Staffan Hedström
  • Ólafur Helgi Jónsson
  • Lára Margrét H. Hólmfriðardóttir
  • Sunneva Þorsteinsdóttir
  • Málfriður Anna Eiríksdóttir
  • Eydís Huld Magnúsdóttir
  • Ragnheiður Þórhallsdóttir
  • Jon Gudnason - [email protected]

Tiro ehf.

  • David Erik Mollberg
  • Luke James O'Brien

Acknowledgements

Special thanks to the other members of the Language and Voice Lab (https://lvl.ru.is), the student employees, Róbert Kjaran, and Magnús Teitsson.

This project was funded by the Language Technology Programme for Icelandic 2019-2023. The programme, which is managed and coordinated by Almannarómur, is funded by the Icelandic Ministry of Education, Science, and Culture.

This project was funded in part by the Icelandic Directorate of Labour's student summer job program in 2021.

Citations

@misc{fong-spjallromur,
title={Spjallromur - Icelandic Conversational Speech},
author={Fong, Judy Y and Hedstr{\"o}m, Staffan and J{\'o}nsson, {\'O}lafur
Helgi and H{\'o}lmfri{\dh}ard{\'o}ttir, L{\'a}ra Margr{\'e}t H. and
{\TH}orsteinsd{\'o}ttir, Sunneva and Eir{\'{\i}}ksd{\'o}ttir, M{\'a}lfri{\dh}ur
Anna and Mollberg, David Erik and Magn{\'u}sd{\'o}ttir, Eyd{\'{\i}}s Huld and
{\TH}{\'o}rhallsd{\'o}ttir, Ragnhei{\dh}ur and Gudnason, Jon},
url={},
note={{CLARIN}-{IS}},
copyright={Creative Commons - Attribution 4.0 International ({CC} {BY} 4.0)},
year={2022}
}

License

This dataset is released under a Creative Commons Attribution 4.0 International (CC BY 4.0) license. (https://creativecommons.org/licenses/by/4.0/)
