incromin-test-calls

This repository contains the published parts of the test cross-lingual calls collected and annotated as part of the InCroMin project (an FSTP under the EU project UTTER).

Filename Conventions

The data are located in directories named after the language combination used in the call. For instance, uk_hu is a call between a Ukrainian speaker and a Hungarian speaker, while pt-BR_cs is a call between a speaker of Brazilian Portuguese and a Czech speaker.

There are at least two speakers in each call directory. We also have a call between four speakers.

The order of the languages in the directory name preferably reflects how much each language was spoken in the call, with the more used languages coming first.
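
As a minimal sketch of the directory naming described above, the following splits a directory name into its language codes. The function name `split_call_dir` is hypothetical, and the sketch assumes that underscores only ever separate languages while a hyphen (as in pt-BR) belongs to the code itself.

```python
def split_call_dir(name: str) -> list[str]:
    """Split a call directory name such as 'pt-BR_cs' into language codes."""
    # Assumption: "_" separates languages; "-" is part of a code (pt-BR).
    return [code for code in name.split("_") if code]


print(split_call_dir("uk_hu"))     # ['uk', 'hu']
print(split_call_dir("pt-BR_cs"))  # ['pt-BR', 'cs']
```

The order of the returned codes simply mirrors the directory name, so the first code is the language that was preferably judged to be used most.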

Within each directory, the following filenames and filename suffixes are used.

Note that some files can be missing due to a pending deidentification check, lack of resources for manual annotation, insufficient consent from the original speakers, or accidental data loss.

The speakers are identified by an initial. No two speakers in the same call have the same initial. The same initial across different calls does not, of course, mean the same speaker.

In the example below, we use speaker A speaking primarily Armenian (language code hy) and B speaking primarily Czech (cs). Occasionally speakers use English, too, but that should only be an exception.

Audio files

A_hy.mp3            ... the sound of speaker A in mp3; with silence when speaker
                        B was speaking
B_cs.mp3            ... the corresponding sound of speaker B
A_hy-B_cs.mp3       ... rarely, the channels were joined into one track
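
The audio names above can be decoded mechanically. The sketch below is one possible way to do it, not part of the corpus tooling: `parse_audio_name` is a hypothetical helper, and it assumes that each initial is a single letter, so that a hyphen followed by a letter and an underscore starts a new speaker while the hyphen inside a code such as pt-BR is left alone.

```python
import re
from pathlib import Path


def parse_audio_name(filename: str) -> list[tuple[str, str]]:
    """Return (initial, language) pairs for an audio file name."""
    stem = Path(filename).stem  # drop the .mp3 extension
    # Split only on "-" directly followed by "<initial>_" (assumed format).
    parts = re.split(r"-(?=[A-Za-z]_)", stem)
    tracks = []
    for part in parts:
        initial, _, lang = part.partition("_")
        tracks.append((initial, lang))
    return tracks


print(parse_audio_name("A_hy.mp3"))       # [('A', 'hy')]
print(parse_audio_name("A_hy-B_cs.mp3"))  # [('A', 'hy'), ('B', 'cs')]
```

A result with more than one pair indicates a file in which the channels were joined into one track.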

Timestamped txt files

Transcripts and translations are timestamped (except a few cases where the timestamps could not be recovered easily) and follow this convention:

Each file has 3 tab-delimited columns:
- columns 1-2: beginning and end time of the segment, as decimal numbers in seconds
- column 3: text

This format matches the Audacity label track format, so these files can be opened and inspected in Audacity alongside the corresponding audio track.

Whenever a file contains timestamps in this format, its name must end with .tt.txt.
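
To illustrate the three-column layout, here is a minimal parsing sketch. The names `Segment` and `parse_tt` are hypothetical, the sample text is made up for illustration, and the sketch assumes that every non-empty line has all three columns; a malformed line raises `ValueError`.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    start: float  # beginning of the segment, in seconds
    end: float    # end of the segment, in seconds
    text: str


def parse_tt(content: str) -> list[Segment]:
    """Parse the content of a .tt.txt file into segments."""
    segments = []
    for line in content.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Split on the first two tabs only, so the text column is kept whole.
        start, end, text = line.split("\t", 2)
        segments.append(Segment(float(start), float(end), text))
    return segments


sample = "0.00\t2.35\tGood morning.\n2.35\t4.10\tHow are you?\n"  # made-up data
for seg in parse_tt(sample):
    print(f"{seg.start:.2f}-{seg.end:.2f}: {seg.text}")
```

To process a real file, its content could be read with `Path(...).read_text(encoding="utf-8")`; the UTF-8 encoding is an assumption and is not stated in this README.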

The corpus contains these types of timestamped files:

A_hy_whisperASR.tt.txt ... fully automatic timestamped output of Whisper ASR
                           (in some directories, this is still called A_hy.txt)
A_hy_whisperASR_corrected.tt.txt ... manually corrected version of
                                     A_hy_whisperASR.tt.txt
A_en_whisperST.tt.txt  ... fully automatic timestamped output of Whisper direct
                           sound translation from the original language to English
                           (in some directories, this is still called A_en.txt)
A_en_whisperST_corrected.tt.txt ... manually corrected version of
                                    A_en_whisperST.tt.txt
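
The four timestamped file types above follow a regular pattern, which the following sketch decodes. `TT_NAME` and `parse_tt_name` are hypothetical names introduced here; the sketch assumes the pattern initial, language code, system name, optional `_corrected`, then `.tt.txt`, and it deliberately does not recognise the older names such as A_hy.txt.

```python
import re
from typing import Optional

# Assumed pattern: <initial>_<lang>_<whisperASR|whisperST>[_corrected].tt.txt
TT_NAME = re.compile(
    r"^(?P<speaker>[^_]+)_(?P<lang>[^_]+)_"
    r"(?P<kind>whisperASR|whisperST)"
    r"(?P<corrected>_corrected)?\.tt\.txt$"
)


def parse_tt_name(filename: str) -> Optional[dict]:
    """Decode a timestamped file name, or return None if it does not match."""
    m = TT_NAME.match(filename)
    if m is None:
        return None
    return {
        "speaker": m["speaker"],
        "lang": m["lang"],
        "kind": m["kind"],  # whisperASR = transcript, whisperST = translation
        "corrected": m["corrected"] is not None,
    }


print(parse_tt_name("A_hy_whisperASR.tt.txt"))
print(parse_tt_name("A_en_whisperST_corrected.tt.txt"))
```

Returning `None` for unmatched names keeps the older, differently named files visible to the caller instead of silently misclassifying them.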

Non-timestamped txt files

When the timestamps are not available or not relevant, the filenames end with just .txt, not .tt.txt.

We have these non-timestamped files:

A_has_seen_hy.txt   ... the live transcript that was shown to A in Armenian
