NAME: Relative frequencies of part-of-speech n-grams in native and translated Finnish literary prose

LICENSE: This corpus is licensed under CC-BY. For more information, see http://urn.fi/urn:nbn:fi:lb-2018120301.

These files contain data from the MA thesis study "Then shall I know fully: Relative frequencies of part-of-speech n-grams in native and translated Finnish literary prose" by Matias Tamminen (2018), University of Helsinki.

The files are named as follows:

  [corpus]sum.props[n][differentiator1][differentiator2].rel.tsv

where
  [corpus] is the sub-corpus from which the data is drawn
  [n] is the value of n in the n-gram
  [differentiator1] and [differentiator2] are possible differentiators of the frequencies.

The differentiators include
  end2, which means that the file contains only n-grams with end value 2, i.e. grams where the last member ends a sentence
  o, which means that the file contains frequencies differentiated by origin, i.e. source language.

The sub-corpora are
  ceal, which includes the Finnish part of the corpus Classics of English and American Literature translated by Kersti Juva, English-Finnish parallel corpus
  ska, which includes the native Finnish literary prose part of the Corpus of Translated Finnish
  kkamulti, which contains the multi source language part of the translated Finnish literary prose sub-corpus of the Corpus of Translated Finnish, as well as the books KKAru001 and KKAru002 from the Russian-Finnish part (KKAru) of the same sub-corpus
  kkaen, which includes the English-Finnish part of the translated Finnish literary prose sub-corpus of the Corpus of Translated Finnish.

The workflow from the raw corpus files to these files in Mylly 3.12.5 is as follows:

1) The individual books are parsed with the "Parse text with UD2-Finnish model" tool of Mylly.
2) All the parsed books of the same sub-corpus are summed together with the "Sum of relations" tool with the tag field parameter kMsentence.
  2.1) In ska and kkaen, the summation is done in two parts because the sub-corpora contain more books than the tool can process simultaneously.
  2.2) In kkamulti, the summation is first done per source language, after which a tiny relation with the origin parameter is created with the "Make tiny relation" tool, and the sums and the tiny relations are joined with the "Join relations" tool. The files created are then summed together with the "Sum of relations" tool.
3) The n-grams are calculated from these sum files with the "N-grams" tool. The tool is run separately for each n value from 1 to 4.
4) The absolute frequencies are calculated from the n-gram files with the "Keep/count selected attributes" tool. The kept attributes are w[1-n]upostag as well as end and w[1-n]origin where applicable.
5) The absolute frequencies are normalized with the "Extend with proportions" tool. In the files where the absolute frequencies are differentiated by the parameter origin, the relative frequencies are grouped by w1origin.

The files include the parameters
  wMcount, which is the relative frequency of the n-gram in question
  cMcount, which is the absolute frequency of the n-gram in question
  w[1-n]postag, which is the part-of-speech category of the member of the gram in question
  end, which is the end value of the gram in question (always 2 when included)
  w[1-n]origin, which is the source language of the translated word in question.

The source languages marked by origin are
  de for German
  ee for Estonian
  es for Spanish
  fr for French
  ma for Hungarian
  ne for Dutch
  no for Norwegian
  ru for Russian
  sv for Swedish.
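
The normalization in step 5 can be illustrated with a short Python sketch. It is not the implementation of the "Extend with proportions" tool. It assumes that a relative frequency (wMcount) is the absolute frequency (cMcount) divided by the sum of cMcount over all rows, or over all rows sharing the same w1origin when the frequencies are grouped by origin. The function name add_proportions and the toy rows are invented for illustration and are not taken from the corpus.

```python
from collections import defaultdict


def add_proportions(rows, count_key="cMcount", group_key=None, out_key="wMcount"):
    """Return copies of rows with out_key = count / total count of the row's group.

    Assumption: a relative frequency is a plain proportion of the group total.
    With group_key=None, all rows form a single group.
    """
    # First pass: sum the absolute frequencies per group.
    totals = defaultdict(int)
    for row in rows:
        group = row[group_key] if group_key else None
        totals[group] += int(row[count_key])

    # Second pass: divide each count by its group total.
    result = []
    for row in rows:
        group = row[group_key] if group_key else None
        new_row = dict(row)  # leave the input rows unmodified
        new_row[out_key] = int(row[count_key]) / totals[group]
        result.append(new_row)
    return result


# Invented toy bigram counts, for illustration only (not corpus data).
toy_rows = [
    {"w1upostag": "NOUN", "w2upostag": "VERB", "w1origin": "de", "cMcount": "3"},
    {"w1upostag": "ADJ", "w2upostag": "NOUN", "w1origin": "de", "cMcount": "1"},
    {"w1upostag": "NOUN", "w2upostag": "VERB", "w1origin": "sv", "cMcount": "2"},
    {"w1upostag": "ADJ", "w2upostag": "NOUN", "w1origin": "sv", "cMcount": "2"},
]

# Proportions over all rows (as in files without the o differentiator).
ungrouped = add_proportions(toy_rows)

# Proportions within each source language (as in files with the o differentiator).
by_origin = add_proportions(toy_rows, group_key="w1origin")
```

With the toy rows, the NOUN VERB bigram from de gets 3/8 = 0.375 when ungrouped and 3/4 = 0.75 when grouped by w1origin. Rows in this dictionary form could be read from one of the .tsv files with csv.DictReader(f, delimiter="\t") from the Python standard library, assuming the files have a header row with the parameter names; the column names in a given file should be checked before relying on the keys used here.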