Samples of Spoken Finnish, VRT version
Suomen kielen näytteitä, VRT-versio

Short name: skn-vrt
URN: http://urn.fi/urn:nbn:fi:lb-2021112221
License: CC-BY
Licensor: Institute for the Languages of Finland
Distributor: The Language Bank of Finland / FIN-CLARIN


Description

This package contains the transcript data for the Samples of Spoken
Finnish in the VRT (VeRticalized Text) format as used in the Language
Bank of Finland. The data corresponds to that in Korp, except that
obsolete LAT links have been removed. Please see also
http://urn.fi/urn:nbn:fi:lb-201407141 for more information on the
Samples of Spoken Finnish corpus in general.

The data has been automatically annotated using an old version of the
Turku Dependency Parser Pipeline (TDPP) from Turku NLP, based on
manually added standard Finnish word forms of the original dialect
words.

The directory "vrt" contains the data split into 99 VRT files so that
each original sample is in its own file. The file name contains the
number of the sample and the parish; e.g., SKN01a_Suomussalmi.vrt.

The VRT files contain XML-style tags for nested structural markup
(texts, paragraphs and sentences) and associated annotations (metadata)
as attributes. Each token is on its own line, attributes separated by
TAB characters. In addition, the files contain XML-style comment lines
at the beginning (and end).

The first comment line lists the names of the token (positional)
attributes (as used internally in Korp) in the order they are listed
for each token:

- word: standard Finnish word form
- original: original dialectal form (detailed transcription)
- normalized: rough dialectal form without diacritics
- comment: note on the word
- id: the number of the token in the sentence
- ref: the number of the token in the sentence as used for dependency
  heads
- lemma: base form of the standard Finnish word form
- lemmacomp: base form with compound boundaries marked with a "|"
- pos: part of speech
- msd: morphological analysis (morpho-syntactic description)
- dephead: dependency head number, referring to attribute ref (0 if no
  head)
- deprel: dependency relation
- nertag: name tag
- nerbio: "B": begins a name; "I": within a name; "O": outside a name

Note that the base form, part of speech, morphological analysis,
dependency relations and name information have been added by programs
and not manually corrected, so they contain errors. See also
https://www.kielipankki.fi/tuki/korp-tdt/ for some information (in
Finnish) on these annotations produced by TDPP.

A missing value for a token attribute is indicated by an underscore
("_").

Each VRT file contains a single text element (structure) with the
following attributes:

- name: name of the file
- title: title of the sample leaflet
- editor: editor of the sample
- parish: dialect parish
- dialect_group: dialect group
- dialect_region: dialect region
- date: year of publication of the sample leaflet

The attributes datefrom, dateto, timefrom, timeto and _geo_parish are
used internally in Korp.
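
As an illustration of the token line layout described above (TAB-separated
attributes in a fixed order, "_" for a missing value), the following
Python sketch turns one token line into a dictionary. It is only a
minimal example, not part of the package: the function names are
hypothetical, the sample line is invented rather than taken from the
corpus, and the check that token lines never start with "<" is an
assumption about the files.

```python
# Token attribute names, in the order given in the list above.
TOKEN_ATTRS = [
    "word", "original", "normalized", "comment", "id", "ref", "lemma",
    "lemmacomp", "pos", "msd", "dephead", "deprel", "nertag", "nerbio",
]


def is_token_line(line):
    """Return True for non-empty lines that are not tags or comments.

    Assumption: structural tags and comment lines start with "<",
    and token lines never do.
    """
    return bool(line.strip()) and not line.startswith("<")


def parse_token_line(line):
    """Split one token line into a dict; "_" (missing value) becomes None."""
    values = line.rstrip("\r\n").split("\t")
    if len(values) != len(TOKEN_ATTRS):
        raise ValueError(
            f"expected {len(TOKEN_ATTRS)} fields, got {len(values)}"
        )
    return {
        name: (None if value == "_" else value)
        for name, value in zip(TOKEN_ATTRS, values)
    }


# Invented sample line (not taken from the corpus), for illustration only.
SAMPLE_LINE = "talossa\ttalos\ttalos\t_\t3\t3\ttalo\ttalo\tN\t_\t2\tnmod\t_\tO"

token = parse_token_line(SAMPLE_LINE)
print(token["word"], token["lemma"], token["comment"])
```

A stricter reader could take the attribute names from the first comment
line of each file instead of hard-coding them; the exact layout of that
comment line is not described here, so the sketch does not attempt it.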

Each paragraph element corresponds to one turn of either an interviewer
or interviewee, with the following attributes:

- id: paragraph number
- speaker: speaker initials
- sex: "M" (male), "N" (female) or "NA" (not known)
- role: "haastateltava" (interviewee) or "muu" (usually interviewer)

Each sentence element has the following attributes:

- id: sentence number
- origid: sentence identifier in the original data
- beg: sentence begin time in the original recording
- duration: duration of the sentence in the recording

In addition, tokens recognized as name, numeral or temporal expressions
are enclosed in ne elements with the following attributes:

- name: name (or numeral or temporal expression)
- fulltype: full expression type
- ex: category: "ENAMEX" (name), "NUMEX" (number), "TIMEX" (time)
- type: main type of expression
- subtype: subtype of expression
- placename: name if it is a place name, empty otherwise
- placename_source: "ner" if the name is a placename, empty otherwise
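
The nesting of paragraph, sentence and ne elements described above could
be read along the following lines. This Python sketch is only an
illustration under stated assumptions: the tags are assumed to be named
"text", "paragraph", "sentence" and "ne" after the elements, each tag is
assumed to be on its own line, and attribute values are assumed to be
double-quoted. The function name and the sample data are hypothetical;
the sample is invented and its token lines are shortened to two fields.

```python
import re

# Assumption: attribute values are double-quoted, e.g. speaker="AB".
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')
OPEN_TAG_RE = re.compile(r'<(\w+)((?:\s+\w+="[^"]*")*)\s*>')


def iter_sentences(lines):
    """Yield (paragraph_attrs, sentence_attrs, words) for each sentence."""
    paragraph, sentence, words = {}, None, []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue
        if line.startswith("<"):
            # Comment line, opening tag or closing tag.
            if line == "</sentence>" and sentence is not None:
                yield paragraph, sentence, words
                sentence = None
            else:
                match = OPEN_TAG_RE.fullmatch(line)
                if match:
                    name = match.group(1)
                    attrs = dict(ATTR_RE.findall(match.group(2)))
                    if name == "paragraph":
                        paragraph = attrs
                    elif name == "sentence":
                        sentence, words = attrs, []
            continue
        if sentence is not None:
            # The first TAB-separated field is the standard Finnish word form.
            words.append(line.split("\t")[0])


# Invented sample (not taken from the corpus), for illustration only.
SAMPLE = """\
<!-- a comment line -->
<text name="example">
<paragraph id="1" speaker="AB" sex="N" role="haastateltava">
<sentence id="1" origid="s1" beg="0.0" duration="1.2">
minä\tmie
olen\toon
</sentence>
</paragraph>
<paragraph id="2" speaker="CD" sex="M" role="muu">
<sentence id="2" origid="s2" beg="1.2" duration="0.4">
<ne name="Helsinki" ex="ENAMEX" type="LOC">
Helsinki\tHelsinki
</ne>
</sentence>
</paragraph>
</text>
"""

for par, sent, words in iter_sentences(SAMPLE.splitlines()):
    print(par["speaker"], par["role"], sent["id"], " ".join(words))
```

The sketch skips ne tags and keeps only the word forms, so tokens inside
a name expression still end up in their sentence. Collecting the ne
attributes as well would need a small stack of open elements.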