=====================================================
Aalto University DSP Course Conversation Corpus 2013-
=====================================================


1. INTRODUCTION

This version of the DSPCON corpus contains transcribed recordings of Finnish conversations 
by Digital Signal Processing course students at Aalto University, Finland, from the years 
2013 to 2015. The intention has been to use the data to build better models for automatic 
speech recognition of conversational Finnish.

160 different male students and 21 female students had conversations in pairs,
recorded their own conversations, and transcribed at least 20 utterances each.
In total they contributed 3926 utterances, adding up to 7.4 hours of audio.

The audio was recorded using Logitech USB headsets (models PC 960 and H390). In
2013 and 2014, Labtec headsets were also used.


2. DIRECTORY STRUCTURE

The data collected each year is organized into its own directory. The recordings
and transcripts from each student are in the <year>/students/<student> directory,
where <student> is a student ID of the format dsp<year>__<speaker> and <speaker>
is a speaker ID. Male speakers have speaker IDs dspmXXX and female speakers have
IDs dspfXXX, where XXX is a running number.
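
As a minimal sketch of the naming scheme above, the following Python snippet
splits a student directory name into its year, speaker ID and speaker sex. The
four-digit year, the number of digits in XXX, and the sample ID used below are
assumptions made for illustration; they are not stated in this README.

```python
import re
from typing import NamedTuple

# Assumed shape: dsp<year>__<speaker>, with <speaker> being dspm<digits>
# (male) or dspf<digits> (female). The four-digit year is an assumption.
STUDENT_ID = re.compile(
    r"^dsp(?P<year>\d{4})__(?P<speaker>dsp(?P<sex>[mf])\d+)$"
)


class StudentId(NamedTuple):
    year: int
    speaker: str
    sex: str  # "m" or "f"


def parse_student_id(name: str) -> StudentId:
    """Split a student directory name into year, speaker ID and sex."""
    match = STUDENT_ID.match(name)
    if match is None:
        raise ValueError(f"not a student ID: {name!r}")
    return StudentId(int(match["year"]), match["speaker"], match["sex"])


# Hypothetical ID, used only to show the call.
print(parse_student_id("dsp2013__dspm001"))
```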

The alignments directory contains forced-alignments, i.e. timestamps assigned to
each phoneme in each transcribed word. These have been created using AaltoASR,
the Aalto University speech recognizer, and saved in AaltoASR .phn file format.
Rough word-level segmentations have been deduced from the forced-alignments and
saved in Praat TextGrid format. Note that these correspond to the most probable
word and phoneme segments as given by the acoustic model that was used - not
necessarily the linguistically exact segments. The empty phoneme intervals
that appear at word boundaries were inserted by the automatic aligner and may not
correspond to actual pauses in the speech signal.


3. TRANSCRIPTION

Corrections and updates have been made to the original transcripts created by
the students. There are two kinds of transcripts: verbatim.trn contains exact
phonetic transcripts suitable for acoustic model training and normalized.trn
contains "normalized" transcripts suitable for evaluation. The normalized
transcripts contain alternations for different pronunciations of the same word.
Normalization is incomplete and has only been done for certain recordings.

Two garbage tokens have been used, [laugh] to denote laughter and [reject] to
denote other noise that cannot be transcribed. Interrupted words are written
with a trailing minus sign (-).
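
A minimal Python sketch of how these conventions might be handled when
preparing text, for example for language model training. Dropping the garbage
tokens and interrupted words is one possible design choice, not something this
corpus prescribes, and the sample sentence is made up for illustration.

```python
# The two garbage tokens named in this section.
GARBAGE_TOKENS = {"[laugh]", "[reject]"}


def is_interrupted(token: str) -> bool:
    """True for an interrupted word, i.e. a word ending in a minus sign."""
    return len(token) > 1 and token.endswith("-")


def lexical_tokens(words):
    """Keep only complete words; drop garbage tokens and interrupted words."""
    return [
        word
        for word in words
        if word not in GARBAGE_TOKENS and not is_interrupted(word)
    ]


# Made-up utterance, only to show the call.
print(lexical_tokens("no [laugh] mä ol- olin [reject]".split()))
```

For acoustic model training the garbage tokens would more likely be kept and
mapped to dedicated noise models instead of being removed.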


4. FILE FORMATS

Audio files have been saved in Microsoft WAVE format. Sample format is 44 kHz
16-bit PCM.
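
The format of an individual recording can be checked with the Python standard
library. This is only a sketch: the exact sample rate in Hz is not stated
above, so the function simply reports what the file header says, and the file
path is whatever recording the caller passes in.

```python
import wave


def wav_format(path):
    """Read basic format information from a WAVE file header."""
    with wave.open(path, "rb") as wav:
        sample_rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "bits_per_sample": 8 * wav.getsampwidth(),
            "sample_rate_hz": sample_rate,
            "duration_s": wav.getnframes() / sample_rate,
        }
```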

Transcripts have been saved in the trn format specified in the NIST Scoring
Toolkit. Each line contains a word sequence, followed by an utterance ID enclosed
in parentheses. Transcript alternations are used in normalized transcripts to
allow alternative pronunciations. Alternative forms are separated by a slash
sign (/) and enclosed in curly brackets. An at sign (@) represents an empty
word; when scoring a text, a missing word is not counted as an error if @ is
specified as one of its alternatives. Example:

{ mm / @ } { en minä / emmä / en mä / emminä } { tiä / tiiä / tiädä / tiedä }
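
The alternation syntax in the example above can be expanded into plain word
sequences with a short Python sketch. It assumes that braces and slashes are
always separated from words by whitespace, as in the example, and that groups
are not nested. The utterance ID in the usage line is hypothetical.

```python
import re
from itertools import product

# A trn line: word sequence, then an utterance ID in parentheses.
TRN_LINE = re.compile(r"^(?P<text>.*)\((?P<utt_id>[^()\s]+)\)\s*$")


def split_trn_line(line):
    """Return (text, utterance_id) for one line of a trn file."""
    match = TRN_LINE.match(line)
    if match is None:
        raise ValueError(f"not a trn line: {line!r}")
    return match["text"].strip(), match["utt_id"]


def parse_alternations(text):
    """Return a list of slots; each slot lists its alternative word tuples."""
    slots, group, current = [], None, []
    for token in text.split():
        if token == "{":
            if group is not None:
                raise ValueError("nested alternation")
            group, current = [], []
        elif token == "/" and group is not None:
            group.append(tuple(current))
            current = []
        elif token == "}" and group is not None:
            group.append(tuple(current))
            slots.append(group)
            group = None
        elif group is not None:
            if token != "@":  # @ is the empty word
                current.append(token)
        else:
            slots.append([(token,)])  # plain word outside any group
    if group is not None:
        raise ValueError("unclosed alternation")
    return slots


def expand(text):
    """List every plain word sequence that the alternations allow."""
    return [
        " ".join(word for alt in combo for word in alt)
        for combo in product(*parse_alternations(text))
    ]


text, utt_id = split_trn_line("{ mm / @ } { emmä / en mä } tiiä (utt_001)")
print(utt_id, expand(text))
```

Expanding every combination grows quickly with the number of groups, so for
scoring it is usually better to leave the alternations to the scoring tool.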