Parallel Sentence Aligned Corpus of Finnish and Easy-to-read Finnish from the Yle News Archive 2014-2020, source
Lausetasolla kohdistettu suomi–selkosuomi-rinnakkaiskorpus Ylen suomenkielisestä uutisarkistosta 2014-2020, lähdeaineisto

Shortname: ylenews-fi-2014-2020-selko-par-sent-src
Metadata: http://urn.fi/urn:nbn:fi:lb-2024011703
Rightholder: Yleisradio
License: CLARIN ACA +NC +OTHER v2.1

The complete license is available at http://urn.fi/urn:nbn:fi:lb-2022050901
A copy of the license is included in LICENSE.txt. The license details may be subject to change, so before downloading the resource, please refer to the latest version of the license at the above link.

CORPUS DESCRIPTION

This is a parallel corpus created from Yle news articles published in 2014-2020 by aligning the standard Finnish versions with the easy-language versions. The dataset, created by Anna Dmitrieva and available in CSV format, is aligned at the sentence level. It is based on the two parallel document-level datasets of Yle news articles available on Kielipankki (http://urn.fi/urn:nbn:fi:lb-2022111625 and http://urn.fi/urn:nbn:fi:lb-2024011701), also created by Anna Dmitrieva. The dataset spans the period from September 2014 to December 2020.

This dataset consists of the following parts:

1) Sentence alignments: parallel documents from regular and Easy Finnish Yle news articles aligned sentence by sentence. Only the "positive" documents were taken from the 2019-2020 dataset (http://urn.fi/urn:nbn:fi:lb-2022111625). All but 50 documents were aligned automatically with Vecalign (https://github.com/thompsonb/vecalign) using LASER embeddings (https://github.com/facebookresearch/LASER). Each file has the following columns:

1.1) pair_id: an id consisting of three parts separated by double underscores: the id of the regular document, the id of the Easy Finnish document (written with a single underscore), and the sentence pair number.
1.2) regular_string: a sentence from the regular Finnish article.
1.3) selko_string: the corresponding sentence from the Easy Finnish article.
1.4) score: the confidence score given by Vecalign. The lower the score, the more similar the sentences. The "good" pairs are estimated to have a score of 0.65 or below; however, the score is not definitive proof that the sentences in a pair truly match in meaning. A score of zero is assigned when a sentence has no pair. The scores for all non-zero sentence pairs in manually aligned documents are set to 0.(3).

2) Golden sentence alignments: 50 documents aligned manually by a human assessor (text). Also available in the ladder format (indexes).

List of included materials:

- CSV: an archive of .csv files with sentence alignments (each file is an alignment between a pair of documents);
- golden_sentence_alignments_TEXT: human-made alignments of 50 texts (available in XLSX as well as in CSV);
- golden_sentence_alignments_INDEXES: human-made alignments in ladder format (just indexes of sentences).

For further information, please contact fin-clarin@helsinki.fi.
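
EXAMPLE: READING AN ALIGNMENT FILE (ILLUSTRATIVE SKETCH)

The column layout described under "Sentence alignments" could be read roughly as in the Python sketch below. It is not part of the corpus tooling. It assumes that each .csv file is comma-delimited with a header row naming the four columns, that scores are written as plain decimals, and that a pair_id splits into exactly three parts on double underscores. The sample rows, the ids in them, and the function names are hypothetical and only illustrate the shape of the data; check the actual files before relying on any of these assumptions.

```python
import csv
import io

# Threshold for "good" pairs quoted in the corpus description.
GOOD_SCORE_THRESHOLD = 0.65


def parse_pair_id(pair_id):
    """Split a pair_id on double underscores into its three parts.

    Assumes the form <regular id>__<Easy Finnish id>__<pair number>.
    """
    parts = pair_id.split("__")
    if len(parts) != 3:
        raise ValueError(f"unexpected pair_id format: {pair_id!r}")
    regular_id, selko_id, pair_number = parts
    return regular_id, selko_id, int(pair_number)


def is_good_pair(score, threshold=GOOD_SCORE_THRESHOLD):
    """Return True for a paired sentence whose score is at or below the threshold.

    A score of zero marks a sentence without a pair, so it is excluded.
    """
    return 0.0 < score <= threshold


def read_good_pairs(csv_file):
    """Collect (pair_id, regular_string, selko_string, score) for good pairs."""
    good = []
    for row in csv.DictReader(csv_file):  # assumes a header row
        score = float(row["score"])
        if is_good_pair(score):
            good.append(
                (row["pair_id"], row["regular_string"], row["selko_string"], score)
            )
    return good


# Hypothetical sample data; real ids and sentences will look different.
SAMPLE_CSV = (
    "pair_id,regular_string,selko_string,score\n"
    "REG1__SELKO_1__0,Regular sentence A.,Easy sentence A.,0.42\n"
    "REG1__SELKO_1__1,Regular sentence B.,,0\n"
    "REG1__SELKO_1__2,Regular sentence C.,Easy sentence C.,0.91\n"
)

good_pairs = read_good_pairs(io.StringIO(SAMPLE_CSV))
for pair_id, regular, selko, score in good_pairs:
    print(parse_pair_id(pair_id), regular, selko, score)
```

With a real file, the io.StringIO object would be replaced by a file opened with open(path, newline="", encoding="utf-8"); the encoding is also an assumption. Note that pairs from the manually aligned documents, whose score is set to 0.(3), fall below the 0.65 threshold and would be kept by this filter.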