Psycholinguistic Descriptives

Introduction

This material comprises a dataset of word frequencies from six different corpora and a simple query tool for extracting commonly used psycholinguistic descriptives for given words. The word frequency tables have been filtered to better reflect actual word frequencies.

For more and up-to-date information on the dataset, see: http://urn.fi/urn:nbn:fi:lb-2018081601

This dataset is licensed under the Creative Commons Attribution 4.0 International License: https://creativecommons.org/licenses/by/4.0/

Requirements

The word frequency data is formatted as .csv tables and as such can be used with any program. The query tool requires Python 3 as well as the modules:
- Pandas (https://pandas.pydata.org/, version 0.20 or newer)
- FinnSyll (https://pypi.org/project/FinnSyll/)
Both are available through the 'pip' Python package manager (https://pip.pypa.io/en/stable/installing/).

Query tool

The query tool can be used to obtain descriptives for a list of words. At this time the descriptives include:
1. surface or lemma frequencies:
   - corpus-specific relative frequencies
   - relative frequency in the total sum of the chosen corpora
   - average relative frequency in the chosen corpora
2. syllable information:
   - identity
   - count
   - frequencies
   - average frequency
3. letter 2-gram and 3-gram average frequency
4. orthographic neighbour information (words within a Hamming distance of 1):
   - identity
   - count
Help on how to use the query tool can be found in the program's --help message.

Frequency tables

Frequency tables are provided separately for lemmas and surface forms; both have been composed with the same methods. Tokens with the same written form are considered distinct if they do not share the same part-of-speech tag. Tokens and part-of-speech tags have been extracted from texts parsed with the Finnish Dependency Parser (http://turkunlp.github.io/Finnish-dep-parser/). Tokens have been filtered (see below) to make the frequency values better reflect actual word frequencies.

The filtered frequency tables for the surface forms were used to calculate the letter 2-gram and 3-gram frequencies as well as the syllable frequencies. These frequencies were first calculated and normalized per corpus and then averaged across corpora to reduce the effect of different corpus sizes.

The corpora used in making the word frequency tables:
- The Suomi24 Corpus (S24): http://urn.fi/urn:nbn:fi:lb-2017021630
- Newspaper and Periodical Corpus of the National Library of Finland (KLK, only from 1980 onwards): http://urn.fi/urn:nbn:fi:lb-2016050302
- Finnish Magazines and Newspapers from the 1990s and 2000s (LEHDET): http://urn.fi/urn:nbn:fi:lb-2017091901
- Finnish Wikipedia 2017 (WIKI): http://urn.fi/urn:nbn:fi:lb-2018060401
- Finnish OpenSubtitles 2017 (OPENSUB): http://urn.fi/urn:nbn:fi:lb-2018060403
- Data retrieved from the Reddit r/Suomi website (REDDIT): https://old.reddit.com/r/Suomi/

Token filtering (a sketch of these steps follows the list):
1. Remove tokens longer than 30 characters.
2. Remove tokens categorized as punctuation, symbols or foreign words.
3. Change all tokens to lowercase.
4. Per corpus, remove tokens whose occurrence falls below a limit. The limit is decided manually, but corresponds to a relative frequency of approximately 0.01 tokens per million for each corpus.
5. Remove tokens that contain characters outside the regex set [0-9abcdefghijklmnopqrsštuvwxyzžåäö\-\'\:\.].
6. Remove tokens where "special" characters (regex: [0-9\-\'\:\.]) make up more than 75% of all characters.
7. Remove tokens that are present in only a single corpus.
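The sketch below illustrates how steps 1 and 3-7 could be implemented in Python with Pandas. It is only a minimal illustration, not the pipeline used to build the released tables: the per-corpus input is assumed to be a raw count table with 'token' and 'count' columns, the minimum-count limits are assumed to be given per corpus, and step 2 is omitted because it requires the part-of-speech output of the parser.

    # Minimal, assumption-laden sketch of the token filtering steps above.
    # Column names ('token', 'count') and the helper structure are illustrative
    # assumptions, not taken from the released data or the query tool.
    import re
    import pandas as pd

    ALLOWED = re.compile(r"^[0-9abcdefghijklmnopqrsštuvwxyzžåäö\-\'\:\.]+$")
    SPECIAL = re.compile(r"[0-9\-\'\:\.]")

    def filter_corpus(counts: pd.DataFrame, min_count: int) -> pd.DataFrame:
        """Apply steps 1 and 3-6 to one corpus table with 'token' and 'count' columns."""
        df = counts.copy()
        df["token"] = df["token"].str.lower()                        # step 3
        df = df.groupby("token", as_index=False)["count"].sum()      # merge case variants
        df = df[df["token"].str.len() <= 30]                         # step 1
        df = df[df["count"] >= min_count]                            # step 4 (manually chosen limit)
        df = df[df["token"].map(lambda t: bool(ALLOWED.match(t)))]   # step 5
        special_share = df["token"].map(lambda t: len(SPECIAL.findall(t)) / len(t))
        df = df[special_share <= 0.75]                               # step 6
        return df

    def filter_corpora(corpora: dict, min_counts: dict) -> dict:
        """Filter each corpus, then drop tokens found in only one corpus (step 7)."""
        filtered = {name: filter_corpus(df, min_counts[name]) for name, df in corpora.items()}
        presence = pd.concat([f["token"].drop_duplicates() for f in filtered.values()]).value_counts()
        shared = set(presence[presence >= 2].index)
        return {name: f[f["token"].isin(shared)] for name, f in filtered.items()}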
Filtering results

Surface forms:

Corpus     N tokens (millions)           Unique tokens
           pre       post                pre         post
S24        2278.5    2088.4  (-8.3%)     43539346     983682  (-97.7%)
KLK         122.0     101.4  (-16.9%)     6299766    1032223  (-83.6%)
LEHDET      136.3     108.2  (-20.6%)     8580856    1066805  (-87.6%)
WIKI         83.3      61.4  (-26.3%)     4044413     856182  (-78.8%)
REDDIT       38.2      30.2  (-21.0%)     1966899     512325  (-74.0%)
OPENSUB     267.6     196.4  (-26.6%)     3430478     664655  (-80.6%)
TOTAL      2926.0    2586.1  (-11.6%)    56292881    1539918  (-97.3%)

Lemma forms:

Corpus     N tokens (millions)           Unique tokens
           pre       post                pre         post
S24        2278.4    2119.3  (-7.0%)     31964747     408915  (-98.7%)
KLK         121.9     103.2  (-15.3%)     4077418     475322  (-88.3%)
LEHDET      136.3     110.1  (-19.2%)     6184952     501829  (-91.9%)
WIKI         83.3      62.7  (-24.8%)     2468355     443610  (-82.0%)
REDDIT       38.2      30.7  (-19.6%)     1070158     215858  (-79.8%)
OPENSUB     267.6     198.7  (-25.8%)     1692284     287327  (-83.0%)
TOTAL      2925.8    2624.6  (-10.3%)    41938288     747720  (-98.2%)

Known issues

The S24, KLK and LEHDET corpora were parsed with an older version of the Turku Dependency Parser than the WIKI, REDDIT and OPENSUB corpora. Because of this, the part-of-speech tags show a few clear discrepancies. For example, the lemma 'ensimmäinen' is considered a numeral by the older version, while the newer version tags it (correctly) as an adjective. If part-of-speech tags are not relevant, the POS class information can be ignored in the query tool with the '-pc IGNORE' argument; this collapses all instances of identical written forms (see the sketch at the end of this document).

For further questions about the data gathering, the filtering or the query tool, email: tatu.huovilainen@helsinki.fi
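As a rough illustration of that collapse (not the query tool's actual implementation), the following Python sketch sums the counts of identical written forms in a frequency table. The file name and the column names ('token', 'pos', 'count') are assumptions for illustration only; check the headers of the released .csv tables before adapting it.

    # Conceptual sketch of ignoring the POS class: sum the frequencies of
    # identical written forms. File and column names are hypothetical.
    import pandas as pd

    freqs = pd.read_csv("surface_frequencies.csv")   # hypothetical file name

    # Collapse rows that differ only in their POS tag, so that e.g. the two
    # 'ensimmäinen' entries (numeral vs. adjective) become a single row.
    collapsed = freqs.groupby("token", as_index=False)["count"].sum()

    # Recompute a relative frequency (tokens per million) over the collapsed table.
    collapsed["per_million"] = collapsed["count"] / collapsed["count"].sum() * 1e6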