Poznan Linguistic Meeting :: PLM2006

The Regensburg Parallel Corpus: a bootstrap multilingual parallel corpus of Slavonic

Ruprecht von Waldenfels (University of Regensburg)

The Regensburg Parallel Corpus is a parallel corpus of Slavonic languages, compiled with the help of a framework that aims to keep human intervention to a minimum when building a parallel corpus of arbitrary (Slavonic) languages.

The workload involved in manually aligning a parallel corpus is immense, and many automatic alignment programs require data preprocessing and parameter tuning; this is presumably one reason why few parallel corpora have been compiled. The main objectives during the development of the corpus were consequently ease of maintenance and ease of augmentation, both with new texts and with new languages. More specifically, care was taken to require as little manual preprocessing as possible, thereby encouraging users to help augment the corpus according to their own needs.

All alignment tasks are performed fully automatically with the help of a modified version of Robert Moore's Bilingual-Sentence-Aligner (Moore 2002), an automatic aligner that uses both sentence length and lexical correspondences as heuristics. Since the morphology of heavily inflecting languages such as the Slavonic languages introduces much more noise than is the case for languages such as German or English, lemmatized variants are used in the alignment process. Informal inspection suggests that this greatly improves the quality of the alignment; however, the improvement has yet to be shown to be statistically significant.
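The sparsity argument behind the use of lemmatized variants can be illustrated with a small sketch. It is not part of the corpus framework or of Moore's aligner; the function names `lemmatize` and `type_count` and the tiny Polish lemma table are assumptions made purely for illustration. It shows how replacing inflected forms with lemmata reduces the number of distinct word types that a lexical heuristic has to match.

```python
def lemmatize(tokens, lemma_table):
    """Replace each token by its lemma if the (hypothetical) table knows it.

    Unknown tokens are kept in lowercased form, so the function never fails
    on words that are missing from the table.
    """
    return [lemma_table.get(tok.lower(), tok.lower()) for tok in tokens]


def type_count(sentences):
    """Number of distinct word types across a list of tokenized sentences."""
    return len({tok for sent in sentences for tok in sent})


# Illustrative data only: three short Polish sentences and a toy lemma table.
LEMMA_TABLE = {
    "czytam": "czytać", "czyta": "czytać", "czytał": "czytać",
    "książkę": "książka", "książki": "książka", "książką": "książka",
}
SENTENCES = [
    ["czytam", "książkę"],
    ["czyta", "książki"],
    ["czytał", "książką"],
]

surface_types = type_count(SENTENCES)
lemma_types = type_count([lemmatize(s, LEMMA_TABLE) for s in SENTENCES])
```

In this toy data, six surface forms collapse into two lemmata, so any co-occurrence statistic over word types has far more evidence per type.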

Since taggers, lemmatizers or other mark-up utilities are readily available only for some languages, the corpus accommodates different levels of linguistic annotation for individual languages. No annotation is necessary for inclusion, while additional annotation can be added at later stages.

Preprocessing is kept as language-independent as possible. Sophisticated means of sentence splitting, for example, are not used by default. This is a compromise in quality in some respects, but makes the addition of new languages very simple.
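A minimal sketch of what such an unsophisticated, language-independent sentence splitter might look like is given below. The actual splitting rules used for the corpus are not described here, so the heuristic (break after sentence-final punctuation when the next chunk starts with an uppercase letter or a digit) and the names `split_sentences` and `_starts_sentence` are assumptions.

```python
import re

# Candidate boundary: whitespace preceded by sentence-final punctuation.
_BOUNDARY = re.compile(r"(?<=[.!?…])\s+")


def _starts_sentence(chunk):
    """True if the first letter or digit in the chunk is uppercase or a digit."""
    for ch in chunk:
        if ch.isalpha():
            return ch.isupper()  # works for Latin and Cyrillic alike
        if ch.isdigit():
            return True
    return False


def split_sentences(text):
    """Split text into sentences without any language-specific resources."""
    sentences = []
    for chunk in _BOUNDARY.split(text.strip()):
        if not chunk:
            continue
        if sentences and not _starts_sentence(chunk):
            # Lowercase continuation: treat the break as spurious and re-join.
            sentences[-1] += " " + chunk
        else:
            sentences.append(chunk)
    return sentences
```

The quality compromise is visible with abbreviations: a string such as "Dr. Smith came." is split after "Dr." because no abbreviation list is consulted.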

The system is designed in such a way that the minimum preprocessing necessary is the conversion of texts to Unicode text files and, if a lemmatizer is available for the language in question, the lemmatization of a word list. Given such files, a number of centralized scripts perform sentence splitting, conversion to XML and inclusion of lemmata at a basic, context-free level. The data is then aligned to all other translations of the same text. The resulting storage format is XML with stand-off alignment annotation, which is then converted to ParaConc (Barlow 2002) and IMS Corpus Workbench (Christ 1994) format for viewing and querying. Additional interfaces are easy to implement.
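The storage idea can be sketched as follows. The corpus's actual XML schema is not specified in this abstract, so the element and attribute names (`text`, `s`, `w`, `lemma`, `alignment`, `link`, `src`, `tgt`) and both function names are hypothetical. The sketch shows context-free lemma lookup from a word list and an alignment kept in a separate document that refers to sentences only by their identifiers.

```python
import xml.etree.ElementTree as ET


def build_text_xml(sentences, lemma_table, lang):
    """Wrap pre-split sentences in XML and attach lemmata by word-list lookup.

    The lookup is context-free: each word form maps to at most one lemma,
    regardless of the sentence it occurs in.
    """
    root = ET.Element("text", {"lang": lang})
    for i, sentence in enumerate(sentences, start=1):
        s = ET.SubElement(root, "s", {"id": f"{lang}.s{i}"})
        for token in sentence.split():
            w = ET.SubElement(s, "w")
            w.text = token
            lemma = lemma_table.get(token.strip(".,;:!?").lower())
            if lemma is not None:
                w.set("lemma", lemma)  # no annotation if the form is unknown
    return root


def build_alignment_xml(links, src_lang, tgt_lang):
    """Stand-off alignment: links reference sentence ids, not the text itself."""
    root = ET.Element("alignment", {"source": src_lang, "target": tgt_lang})
    for src_ids, tgt_ids in links:
        ET.SubElement(root, "link",
                      {"src": " ".join(src_ids), "tgt": " ".join(tgt_ids)})
    return root


# Illustrative data: one Polish sentence aligned to two German sentences.
pl = build_text_xml(["Czytam książkę."],
                    {"czytam": "czytać", "książkę": "książka"}, "pl")
alignment = build_alignment_xml([(["pl.s1"], ["de.s1", "de.s2"])], "pl", "de")
```

Keeping the links outside the text files means a text is stored once and can be aligned to any number of translations without being modified.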

At the moment, the corpus is composed of English, German, Polish, Russian, Slovak, Serbian, Croatian and Ukrainian post-war belletristic texts, the largest being the Russian subcorpus with a size of 2 million tokens.

In my talk, I present the corpus conception and architecture, and report on experimental statistical evaluation of the influence of lemmatization on the alignment algorithm.

References

Barlow, M. (2002). ParaConc: Concordance software for multilingual parallel corpora. Language Resources for Translation Work and Research: 20-24.

Christ, O. (1994). The IMS Corpus Workbench Technical Manual. Institut für maschinelle Sprachverarbeitung, Universität Stuttgart.

Moore, R. C. (2002). Fast and accurate sentence alignment of bilingual corpora. In: AMTA '02: Proceedings of the 5th Conference of the Association for Machine Translation in the Americas on Machine Translation: From Research to Real Users, London: 135-144.