HFST-SweNER - Swedish named-entity recognizer using HFST Pmatch
===============================================================

Version 0.9.3 (beta)


Introduction
------------

This package contains a beta version of HFST-SweNER, a rule-based
named-entity recognizer (NER) system for Swedish, implemented using a
pipeline of HFST Pmatch (pattern matching) finite-state transducers
(FSTs). The recognizer is based on (was converted from) the original
implementation in Flex and Perl, developed by Dimitrios Kokkinakis at
the University of Gothenburg. The recognizer was converted to use HFST
Pmatch at the University of Helsinki.

Please note that this is a beta release and the recognizer has a few
known bugs and deficiencies; see "Known bugs and deficiencies" below.


Package contents
----------------

This package contains the following files and directories:

  configure          - A script for configuring HFST-SweNER
  Makefile.in        - A template for a makefile for (re)compiling
                       HFST-SweNER
  README             - This file
  INSTALL            - Generic configuration and installation
                       instructions
  pmatch/            - Precompiled Pmatch FSTs
  scripts/           - Auxiliary scripts for compiling and running the
                       recognizer and for processing and comparing its
                       output
  src/               - Pmatch source files
  src/flex/          - Original Flex source files, slightly modified
                       before conversion
  src/gazetteer/     - Original gazetteer (name database) files
  src/gazetteer-pm/  - Gazetteer files converted for HFST Pmatch, only
                       needed at compile time
  hfst-bin/          - Pre-built, statically linked binaries of HFST
                       Pmatch and other required HFST tools
  build-aux/         - Auxiliary scripts for configuring and installing
                       the system


Prerequisites
-------------

The makefile and scripts for compiling and running HFST-SweNER require
a Linux or a similar Unix-type system with several GNU tools.

For compiling and running the NER pipeline, you need a recent version
of the HFST Pmatch tool, as packaged in HFST version 3.8.2 or newer,
and a few other HFST tools.
This package comes bundled with pre-built, mostly statically linked
binaries of the required tools for 64-bit x86 GNU/Linux, but they
might not work on older systems. Alternatively, and for other
platforms, you can download HFST from SourceForge:

    http://sourceforge.net/projects/hfst/files/hfst/

Or, to compile the latest revision of HFST yourself, check it out from
the Subversion repository, configure, compile and install it:

    svn checkout svn://svn.code.sf.net/p/hfst/code/trunk/hfst3
    cd hfst3
    ./configure [options]
    scripts/generate-cc-files.sh
    make && make install

The hfst-swener script (alias runNer-pm) for running HFST-SweNER
requires Bash, iconv and Perl 5.x.

The makefile for compiling HFST-SweNER requires GNU Make, GNU M4 and
Perl 5.x.

The auxiliary scripts require Python 2.6.x or 2.7.x.


Installation
------------

HFST-SweNER has an Autoconf-based configuration and installation. The
file `INSTALL' contains generic configuration and installation
instructions.

To configure the system, execute

    ./configure [options] [HFSTDIR=DIR]

in the top directory.

If you build HFST-SweNER on a 64-bit x86 GNU/Linux system, the
configuration uses the packaged HFST binaries by default. To use an
existing HFST installation instead, specify the option
`--without-bundled-hfst-tools'. To use HFST binaries in a directory
DIR not in `$PATH', specify `HFSTDIR=DIR' on the command line.

Perhaps the most relevant of the standard `configure' options is
`--prefix=DIR' for specifying DIR as the directory prefix under which
to install HFST-SweNER (default: `/usr/local'). For more information
about `configure' options, please run `./configure --help' or refer to
the generic instructions in the file `INSTALL'.

After configuring the system, run

    make

If you have not made any changes to the Pmatch source files, `make'
only generates some scripts with configuration information added.

To check that the system works as expected, run

    make check

This currently runs only a very simple test.
To install the system in the installation directory, run

    make install


Running
-------

The whole HFST-SweNER pipeline can be run with the Bash script
`hfst-swener' in directory `scripts'. The basic usage of the script
is:

    hfst-swener [options] [input files] [> output]

For a more detailed usage and a description of options, run

    hfst-swener --help

If input files are not specified, the script reads from the standard
input. The default input character encoding is UTF-8; another encoding
can be specified with option `--input-encoding'. The encodings
supported are those of the `iconv' program.

The script uses by default the HFST Pmatch found in the HFST binary
directory specified for `configure' (or found in `$PATH'). To use HFST
Pmatch residing elsewhere, you can either specify the option
`--progdir=HFSTDIR' where HFSTDIR is the directory, or set the value
of the environment variable `NER_BINDIR_PMATCH' to HFSTDIR.

By default, hfst-swener writes its output to standard output. If the
option `--output-to-file' is specified, hfst-swener writes the output
for input file FILE to a file named `FILE.ner-pm' in the current
directory, unless otherwise specified with options `--output-name' and
`--output-dir'.

The output contains named entities marked with XML-style tags of the
same kind as the original implementation. The output character
encoding is UTF-8.

The script can optionally generate intermediate files for the output
of each recognizer in the pipeline; see options `--tee', `--names',
`--name-options', `--diff', `--diff-only', `--clean-diff'.

If you are short of memory (less than 4 GiB), you can specify the
option `--all-tempfiles', so that the recognizer and correction filter
of each recognition stage are run separately with a temporary file in
between, not piped. If you have plenty of memory (24 GiB or more) and
can run at least 600 processes simultaneously, you can specify the
option `--no-tempfiles' to use pipelines also between recognition
stages.
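
The memory guidance above can be sketched as a small shell function.
This is only an illustration: the function name `choose_tempfile_option'
and its two integer arguments (memory in GiB and the number of
processes you may run) are hypothetical and not part of HFST-SweNER.
The thresholds are the ones stated above; an empty result means that
the script's default behaviour is kept.

```shell
# Hypothetical helper, not part of HFST-SweNER: print the temporary-file
# option suggested by the README for the given resources.
#   $1 = available memory in whole GiB
#   $2 = number of processes that can be run simultaneously
choose_tempfile_option () {
    if [ "$1" -lt 4 ]; then
        # Short of memory: temporary files within each recognition stage
        echo "--all-tempfiles"
    elif [ "$1" -ge 24 ] && [ "$2" -ge 600 ]; then
        # Plenty of memory and processes: pipelines between stages as well
        echo "--no-tempfiles"
    else
        # Otherwise keep the default behaviour (no extra option)
        echo ""
    fi
}
```

Assuming the function has been defined in the current shell and
`input.txt' is a placeholder input file, it could be used as follows:

    hfst-swener $(choose_tempfile_option 2 1024) input.txt > output.xml
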
With the option `--flex', the script can also be used to run the
original implementation of the Swedish NER system.


Recompiling
-----------

If you modify the Pmatch source files, they need to be recompiled for
the changes to take effect. You can recompile only the changed files
by running `make pmatch' in the top directory.

With the default settings, HFST-SweNER Pmatch FSTs compile relatively
fast: the slowest ones may take 10 minutes or more, depending on the
speed of the computer.


Gazetteers
----------

The gazetteer source files are in the directory `src/gazetteer'. They
are in the format used in the original Perl implementation of the
gazetteer lookup. If you modify the gazetteers, you need to recompile
the gazetteer lookup FSTs by running `make pmatch' for the changes to
take effect (see above).

If you want to use gazetteer files residing elsewhere, you can
override the makefile variable `GAZETTEER_SRCDIR'. For example:

    make pmatch GAZETTEER_SRCDIR=~/ner/gazetteer

The gazetteer files should nevertheless be named `nameDb1.txt',
`nameDb2.txt' and `nameDb3.txt' for one-, two- and three-word names,
respectively.

The files in the directory `src/gazetteer-pm' are intermediate
gazetteer files for HFST Pmatch. They are only used at compile time,
not at run time.


Known bugs and deficiencies
---------------------------

In general, most of the names recognized by the original
implementation are also recognized by this implementation, but not all
of them.

There is little documentation.


Contact information
-------------------

If you have questions, comments or bug reports, please contact the
authors by email:

  Krister Lindén, NER project leader, krister.linden@helsinki.fi
  Jyrki Niemi, NER developer and packager, jyrki.niemi@helsinki.fi
  Sam Hardwick, HFST Pmatch developer, sam.hardwick@iki.fi