OPUSPARCUS 1.0
Mathias Creutz, 4 April 2018

1. General

This archive contains the first release of Opusparcus, a paraphrase corpus for six European languages: German, English, Finnish, French, Russian, and Swedish. The paraphrases are extracted from the OpenSubtitles2016 corpus, which contains subtitles from movies and TV shows.

The construction of the Opusparcus corpus is described in the following conference paper:

Mathias Creutz (2018). Open Subtitles Paraphrase Corpus for Six Languages. In Proceedings of the 11th edition of the Language Resources and Evaluation Conference (LREC 2018), 7-12 May, Miyazaki, Japan.

Please cite the above paper in any work that utilizes any part of the Opusparcus corpus.

The data in Opusparcus has been extracted from OpenSubtitles2016 (http://opus.nlpl.eu/OpenSubtitles2016.php), which is in turn based on data from http://www.opensubtitles.org/.

2. License

Opusparcus 1.0 is licensed under the latest version of the Creative Commons CC BY-NC (Attribution-NonCommercial) license. You may copy and redistribute the material in any medium or format, and you may remix, transform, and build upon the material. However, you must give appropriate credit (see the citation instructions above), provide a link to the license, and indicate if changes were made. You may not use the material for commercial purposes. Read more at: http://urn.fi/urn:nbn:fi:lb-2018021221

3. Data sets

Opusparcus contains training, development and test sets for six European languages: German (de), English (en), Finnish (fi), French (fr), Russian (ru), and Swedish (sv).

The training sets are orders of magnitude larger than the development and test sets. They consist of lists of automatically ranked sentence pairs, where a higher rank means a higher probability that the two sentences are paraphrases. The development and test sets consist exclusively of sentence pairs that have been annotated manually, which guarantees the high quality of these sets. However, quality comes at the expense of quantity, so the development and test sets are smaller than the training sets.

The development sets can be used to refine whatever training algorithms one might want to devise. The test sets should be used in final evaluations only. It is important to keep the test sets aside while developing new methods, since otherwise the test sets may affect the design or parameter optimization of the methods under development.

4. File formats

4.1 Training sets

The training sets are stored as compressed bzip2 files in the train directories of each language. The text is encoded in UTF-8. Each file consists of lines that contain the following seven tab-separated fields:

Field 1: Sentence pair ID. The ID consists of the language ID, followed by a hyphen, followed by the letter N (for traiN), followed by a number.

Field 2: First sentence of the sentence pair. The sentence has been tokenized and normalized to some degree; for instance, punctuation at the end of the sentence other than a question mark (?) has been converted to a single period.

Field 3: Second sentence of the sentence pair. The sentence has been tokenized in the same way as the first sentence. These two sentences are potential paraphrases. The lower the sentence pair number, and thus the earlier the pair appears in the file, the more likely it is that the two sentences are paraphrases. See Table 2 in the LREC 2018 paper for an assessment of the quality of the paraphrase candidates.

Field 4: Ranking score of the sentence pair. This value corresponds to Equation 5 in the LREC paper.
It is the sum of pointwise mutual information (PMI) values of the sentence pair, accumulated over all pivot language corpora. This is the score that was used to rank the sentence pairs and order them in the file: highest score first (most likely to be paraphrases), lowest score last (least likely to be paraphrases).

Field 5: Expected number of times the two sentences in the sentence pair would be aligned with each other when translated to the pivot languages and back. This is the joint probability in Equation 2 of the LREC paper, multiplied by the total number of sentence pairs in the corpus; that is, instead of a probability, this field indicates an expected frequency. This value can be used as an alternative ranking criterion, as reported in the paper. If used for ranking, a higher value indicates a higher likelihood that the sentences are paraphrases.

Field 6: Number of pivot languages in which the two sentences in this sentence pair have a translation in common. The highest possible value is 5, which means that these sentences have a common translation in all other languages. For instance, when the value is 5 for the English sentence pair "Get up , Sam ." vs. "On your feet , Sam .", this means that in all five other languages (de, fi, fr, ru, sv) there exists a single sentence that translates to both of these English sentences. The lowest possible value is 1, which means that only one other language directly supports the hypothesis that these two sentences are paraphrases.

Field 7: (Adjusted) edit distance between the two sentences in this sentence pair. The edit distance is not used in the ranking of the sentence pairs, but it can be used as a filter for finding "more interesting" paraphrases, which differ from each other by more than just one or a few characters. The adjusted edit distance is computed without taking into account the "tails" of the longer of the two sentences. For instance, the adjusted edit distance between "Frankfurt , Germany ." and "Oh , Frankfurt , Germany ." is zero, because the shorter sentence fits within the longer one without any modifications.
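As an illustration, the following Python sketch reads a training file and uses fields 4 and 7 as described above: it keeps only the highest-ranked pairs and skips pairs whose surface forms are nearly identical. The file path and the two thresholds are placeholders chosen for this example only; substitute the actual bzip2 file name found in the train directory of the language you are interested in, and whatever limits suit your application.

  import bz2

  TRAIN_FILE = "en/train/en-train.txt.bz2"  # placeholder path, adjust to the actual file

  def read_training_pairs(path, max_pairs=100000, min_edit_distance=10):
      """Yield (pair_id, sent1, sent2) tuples from a training file.

      The file is ordered by the ranking score (field 4), so reading only
      the first max_pairs lines keeps the pairs most likely to be
      paraphrases. The adjusted edit distance (field 7) is used to skip
      near-identical sentence pairs; the threshold is arbitrary.
      """
      with bz2.open(path, mode="rt", encoding="utf-8") as f:
          for i, line in enumerate(f):
              if i >= max_pairs:
                  break
              fields = line.rstrip("\n").split("\t")
              if len(fields) != 7:
                  continue  # skip malformed lines, if any
              pair_id, sent1, sent2, pmi_sum, exp_freq, n_pivots, edit_dist = fields
              if float(edit_dist) < min_edit_distance:
                  continue  # too similar on the surface to be "interesting"
              yield pair_id, sent1, sent2

  for pair_id, sent1, sent2 in read_training_pairs(TRAIN_FILE):
      print("\t".join((pair_id, sent1, sent2)))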
4.2 Development and test sets

The development and test sets are stored as uncompressed plain text files in the dev and test directories of each language, respectively. The format of the dev and test files is the same. The text is encoded in UTF-8. Each file consists of lines that contain the following four tab-separated fields:

Field 1: Sentence pair ID. The ID consists of the language ID, followed by a hyphen, followed by the letter D or T (D for Development, T for Test), followed by a number.

Field 2: First sentence of the sentence pair. The sentence has been tokenized and normalized to some degree; for instance, punctuation at the end of the sentence other than a question mark (?) has been converted to a single period.

Field 3: Second sentence of the sentence pair. The sentence has been tokenized in the same way as the first sentence. These two sentences are potential paraphrases. Whether or not they are paraphrases is indicated in the fourth field.

Field 4: Average score given by two independent annotators. The scores given by an individual annotator are:

  4) Good example of paraphrases = Dark green button in the annotation tool
  3) Mostly good example of paraphrases = Light green button in the annotation tool
  2) Mostly bad example of paraphrases = Yellow button in the annotation tool
  1) Bad example of paraphrases = Red button in the annotation tool

See the LREC paper for a more extensive description of the annotation procedure and guidelines. If the two annotators fully agreed on the category, the value in this field is 4.0, 3.0, 2.0 or 1.0. If the two annotators chose adjacent categories, the value is 3.5, 2.5 or 1.5. For instance, a value of 2.5 means that one annotator gave a score of 3 ("mostly good"), indicating a possible paraphrase pair, whereas the other annotator scored it as a 2 ("mostly bad"), that is, an unlikely paraphrase pair. If the annotators disagreed by more than one category, the sentence pair was discarded and does not appear in the development or test set files.
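The average score in field 4 can be turned into labels for classification or evaluation experiments. The Python sketch below shows one possible way to do this; the file path is a placeholder, and the thresholds (3.0 and above treated as paraphrases, 2.0 and below as non-paraphrases, 2.5 left out) are an illustrative choice rather than something mandated by the corpus itself.

  DEV_FILE = "en/dev/en-dev.txt"  # placeholder path, adjust to the actual file

  positives, negatives = [], []
  with open(DEV_FILE, encoding="utf-8") as f:
      for line in f:
          pair_id, sent1, sent2, score = line.rstrip("\n").split("\t")
          score = float(score)
          if score >= 3.0:      # 3.0, 3.5 or 4.0: (mostly) good paraphrases
              positives.append((sent1, sent2))
          elif score <= 2.0:    # 1.0, 1.5 or 2.0: (mostly) bad paraphrases
              negatives.append((sent1, sent2))
          # a score of 2.5 means the annotators leaned in opposite
          # directions; such pairs are simply skipped in this sketch

  print(len(positives), "positive and", len(negatives), "negative pairs")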