Resource: Donate Speech: Selected dataset
This subcorpus is part of the "Donate Speech Corpus, 1.0".
Shortname: puhelahjat-selected
Metadata: http://urn.fi/urn:nbn:fi:lb-2022060127

License for Research Use: See LICENSE_Academic_research.txt or
http://urn.fi/urn:nbn:fi:lb-2022020223

License for commercial use: The specific terms and conditions of use must be separately agreed with commercial users. For a recap of the general terms of commercial use, see LICENSE_Commercial_use.txt.

NB. This resource contains personal data. You must comply with the
data protection terms and conditions when processing the personal
data. See the above-mentioned licenses for details.


Aineisto: Lahjoita puhetta: Valikoitu aineisto
Tämä korpus on osa aineistoa: "Lahjoita puhetta -aineisto, versio 1.0".
Lyhytnimi: puhelahjat-annotated
Kuvailutiedot: http://urn.fi/urn:nbn:fi:lb-2022060127

Lisenssi tutkimuskäyttöön: ks LICENSE_Academic_research.txt tai
http://urn.fi/urn:nbn:fi:lb-2022020221

Lisenssi kaupalliseen käyttöön: Kaupallisen käytön tarkemmista ehdoista on sovittava käyttäjätahojen kanssa erikseen. Ks. kaupallista käyttöä koskevat yleiset ehdot: LICENSE_Commercial_use.txt.

Huom. Aineisto sisältää henkilötietoja. Henkilötietojen käsittelyssä
on noudatettava aineiston tietosuojaehtoja, ks. em. lisenssit.


---
CONTENT INFORMATION

This directory is part of a collection of Kaldi-style files for the Lahjoita Puhetta data release. For more information, see the paper at https://arxiv.org/abs/2203.12906 and http://urn.fi/urn:nbn:fi:lb-2022060127.

The collection includes the following folders:
- train-100h: Donate Speech Corpus: Training data (100h) (http://urn.fi/urn:nbn:fi:lb-2022060123)
- dev: Donate Speech Corpus: Development data (10h) (http://urn.fi/urn:nbn:fi:lb-2022060121)
- test: Donate Speech Corpus: Test data (10h) (http://urn.fi/urn:nbn:fi:lb-2022060122)
- test-mtr: Donate Speech Corpus: Multi-transcriber test data (1h) (http://urn.fi/urn:nbn:fi:lb-2022060124)
- test-mtr-s: Donate Speech Corpus: Test data from multi-transcriber speakers (10h) (http://urn.fi/urn:nbn:fi:lb-2022060125)

Each folder has the following structure:
- audio/ folder contains the recordings
    - audio has been converted to 22500Hz, 16-bit, mono, flac format
    - silences have been trimmed from the beginnings and ends of the recordings using the SoX command:
        "sox input.flac output.flac silence 1 0.05 0.5% reverse silence 1 0.05 0.5% reverse"
- wav.scp: example of a Kaldi-style wav.scp file that lists the paths to the audio files, with some commands used by Kaldi
- text-unfiltered: the transcripts as received from the transcribers, except that line breaks are replaced with ".pause"
- text: the transcripts after filtering out some mistakes, numerals and other noise
- spk2utt: maps client number (assumed to be one speaker) to utterances (=recordings)
- utt2spk: maps utterances to speaker
- utt2{age,gender,dialect,topic,native,device,dur}: map each utterance to metadata, where the user has provided this background information
  - age is given as an age bracket
  - gender is either "m" (male), "f" (female), "other" or "I do not want to tell"
  - dialect classes are regions of Finland
  - topic: the prompts (what the user is asked to speak about) for each topic are specified in the files in the folder "prompts_by_theme"
  - native language was entered by the user in a free-form field, so it can be messy; it is left as written by the user, only lowercased
    - multiple answers are separated with a semicolon
  - device refers to the recording device, either a smartphone or a computer
  - dur is the duration of the recording in seconds
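
The mapping files listed above can be read with a few lines of Python. The following is a minimal sketch, not part of the release. It assumes the usual Kaldi convention of one entry per line, with the utterance ID first and the value after the first run of whitespace. The function names and the utterance and speaker IDs in the usage note are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def parse_kaldi_map(lines: Iterable[str]) -> Dict[str, str]:
    """Parse '<key> <value ...>' lines into a dict (assumed Kaldi-style layout)."""
    table: Dict[str, str] = {}
    for line in lines:
        parts = line.strip().split(None, 1)  # split on the first whitespace run only
        if not parts:
            continue  # skip empty lines
        key = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        table[key] = value
    return table


def spk2utt_from_utt2spk(utt2spk: Dict[str, str]) -> Dict[str, List[str]]:
    """Invert an utterance-to-speaker map into a speaker-to-utterances map."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for utt, spk in utt2spk.items():
        grouped[spk].append(utt)
    return {spk: sorted(utts) for spk, utts in grouped.items()}


def hours_per_speaker(utt2spk: Dict[str, str], utt2dur: Dict[str, str]) -> Dict[str, float]:
    """Sum the recording durations (given in seconds) per speaker, in hours."""
    totals: Dict[str, float] = defaultdict(float)
    for utt, spk in utt2spk.items():
        if utt in utt2dur:  # utterances without a duration entry are ignored
            totals[spk] += float(utt2dur[utt])
    return {spk: secs / 3600.0 for spk, secs in totals.items()}
```

For example, with open("train-100h/utt2spk", encoding="utf-8") as f: utt2spk = parse_kaldi_map(f) loads the speaker map of the training set, and the same function can be used for utt2dur and the other utt2* files. Values that contain spaces, such as free-form native language answers, are kept whole because only the first whitespace run is used as a separator.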

The folder eval-scripts/ includes scripts that can be used to calculate the word/character error rate for the test sets. See eval-scripts/readme.txt for more information.
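
For orientation, word and character error rates are commonly defined as the edit distance between reference and hypothesis divided by the reference length. The following is a minimal sketch of that general definition only. It is not the implementation in eval-scripts/, and it applies no text normalization, so its numbers may differ from those of the provided scripts. The function names are hypothetical, and in this sketch the character error rate counts spaces as characters.

```python
from typing import Iterable, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance (substitutions, insertions, deletions all cost 1)."""
    prev = list(range(len(hyp) + 1))  # distance from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance to an empty hypothesis prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def error_rate(refs: Iterable[str], hyps: Iterable[str], unit: str = "word") -> float:
    """Corpus-level error rate: total edits divided by total reference length."""
    errors = 0
    total = 0
    for ref, hyp in zip(refs, hyps):
        if unit == "word":
            r, h = ref.split(), hyp.split()
        else:  # character level, spaces included
            r, h = list(ref), list(hyp)
        errors += edit_distance(r, h)
        total += len(r)
    if total == 0:
        raise ValueError("reference is empty")
    return errors / total
```

Calling error_rate(refs, hyps) gives a word-level rate and error_rate(refs, hyps, unit="char") a character-level rate, where refs and hyps are lists of transcripts in the same utterance order.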