Resource: Donate Speech: Selected dataset
This subcorpus is part of the "Donate Speech Corpus, 1.0".
Shortname: puhelahjat-selected
Metadata: http://urn.fi/urn:nbn:fi:lb-2022060127
License for research use: see LICENSE_Academic_research.txt or http://urn.fi/urn:nbn:fi:lb-2022020223
License for commercial use: The specific terms and conditions of use must be agreed separately with commercial users. For a recap of the general terms of commercial use, see LICENSE_Commercial_use.txt.
NB. This resource contains personal data. You must comply with the data protection terms and conditions when processing the personal data. See the above-mentioned licenses for details.

Resource (Finnish title): Lahjoita puhetta: Valikoitu aineisto
This corpus is part of the resource "Lahjoita puhetta -aineisto, versio 1.0".
Shortname: puhelahjat-annotated
Metadata: http://urn.fi/urn:nbn:fi:lb-2022060127
License for research use: see LICENSE_Academic_research.txt or http://urn.fi/urn:nbn:fi:lb-2022020221
License for commercial use: The specific terms of commercial use must be agreed separately with the users concerned. For the general terms of commercial use, see LICENSE_Commercial_use.txt.
NB. The resource contains personal data. The data protection terms of the resource must be followed when processing the personal data; see the above-mentioned licenses.

---

CONTENT INFORMATION

This directory is part of a collection of Kaldi-style files for the Lahjoita Puhetta data release. For more information, see the paper at https://arxiv.org/abs/2203.12906 and http://urn.fi/urn:nbn:fi:lb-2022060127.

The collection includes the following folders:
- train-100h: Donate Speech Corpus: Training data (100h) (http://urn.fi/urn:nbn:fi:lb-2022060123)
- dev: Donate Speech Corpus: Development data (10h) (http://urn.fi/urn:nbn:fi:lb-2022060121)
- test: Donate Speech Corpus: Test data (10h) (http://urn.fi/urn:nbn:fi:lb-2022060122)
- test-mtr: Donate Speech Corpus: Multi-transcriber test data (1h) (http://urn.fi/urn:nbn:fi:lb-2022060124)
- test-mtr-s: Donate Speech Corpus: Test data from multi-transcriber speakers (10h) (http://urn.fi/urn:nbn:fi:lb-2022060125)

Each folder has the following structure:
- audio/ folder contains the recordings
  - audio has been converted to 22500Hz, 16-bit, mono, flac format
  - silences have been trimmed from the beginnings and ends of the recordings using the SoX command:
    "sox input.flac output.flac silence 1 0.05 0.5% reverse silence 1 0.05 0.5% reverse"
- wav.scp: example of a Kaldi-style wav.scp file that lists the paths to the audio files, with some commands used by Kaldi
- text-unfiltered: the transcripts as they were received from the transcribers, except linebreaks are replaced with ".pause"
- text: transcripts after filtering out some mistakes, numerals and other rubbish
- spk2utt: maps client number (assumed to be one speaker) to utterances (=recordings)
- utt2spk: maps utterances to speaker
- utt2{age,gender,dialect,topic,native,device,dur}: maps utterance to metadata, if the user has given this information about his or her background
  - age bracket
  - gender is either "m" (male), "f" (female), "other" or "I do not want to tell"
  - dialect classes are regions of Finland
  - the prompts (what the user is asked to speak about) for each topic are specified in the files in the folder "prompts_by_theme"
  - native language information was given into a free-form field by the user, so it can be messy, and it is left as it was written by the user, only lowercased
    - multiple answers are separated with a semicolon
  - device refers to the recording device, either a smart phone or a computer
  - dur is the duration of the recording in seconds

The folder eval-scripts/ includes scripts that can be used to calculate the word/character error rate for the test sets. See eval-scripts/readme.txt for more information.
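
EXAMPLE: READING THE MAPPING FILES

The utt2spk and utt2* files described above can be read with a few lines of Python. The sketch below is not part of the release. It assumes the usual Kaldi convention of one entry per line, with the utterance id first and the value after the first run of whitespace, and that the duration file is named utt2dur as the utt2{...} pattern suggests. The function names and the ids in the demo are hypothetical. Splitting only on the first whitespace keeps free-form values such as the native language field intact.

```python
from collections import defaultdict


def parse_kaldi_map(lines):
    """Parse 'key value' lines into a dict.

    Assumes the usual Kaldi layout: the key (utterance id) comes first and
    everything after the first run of whitespace is the value. Blank lines
    are skipped; a key with no value maps to an empty string.
    """
    table = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        parts = line.split(maxsplit=1)
        table[parts[0]] = parts[1] if len(parts) > 1 else ""
    return table


def read_kaldi_map(path):
    """Read a mapping file such as utt2spk or utt2dur from disk (UTF-8 assumed)."""
    with open(path, encoding="utf-8") as handle:
        return parse_kaldi_map(handle)


def hours_per_speaker(utt2spk, utt2dur):
    """Sum recording durations (seconds in utt2dur) into hours per speaker."""
    totals = defaultdict(float)
    for utt, spk in utt2spk.items():
        if utt in utt2dur:  # metadata may be missing for some utterances
            totals[spk] += float(utt2dur[utt]) / 3600.0
    return dict(totals)


if __name__ == "__main__":
    # Hypothetical in-memory sample; real ids in the release will differ.
    utt2spk = parse_kaldi_map(["utt-a client-1", "utt-b client-1", "utt-c client-2"])
    utt2dur = parse_kaldi_map(["utt-a 12.5", "utt-b 30.0", "utt-c 7.25"])
    for speaker, hours in sorted(hours_per_speaker(utt2spk, utt2dur).items()):
        print(speaker, round(hours, 4))
```

With the release unpacked, a call such as read_kaldi_map("dev/utt2spk") would load one of the real files, assuming the folder layout listed above.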