Greg Krynicki's PhD


Abstract

My PhD presents different methods of parallel corpus collection, preprocessing and alignment. It is shown how these methods may be applied to texts including Polish. Improvements to the existing algorithms are suggested in order to obtain better alignment results and to deal with phenomena specific to Polish. Corpora including Polish, which is a strongly inflected language, may require stemming/lemmatisation prior to alignment if either of the two methods, Melamed 1997 or Moore 2002, is used for alignment. It was shown that the results of alignment by the algorithm presented in Halácsy et al. 2005 may be worse on stemmed/lemmatised data than on raw data.
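As an illustration of the kind of preprocessing meant here, the following minimal Python sketch maps inflected Polish forms to lemmas before alignment; the tiny lookup table is hypothetical and stands in for a real Polish morphological analyser or lemmatiser.

    # Hypothetical lemma lookup; a real run would use a Polish morphological
    # analyser -- the table below is only an illustration.
    LEMMAS = {"psa": "pies", "psami": "pies", "domu": "dom", "domach": "dom"}

    def lemmatise(tokens):
        """Replace each inflected form with its lemma, or keep it unchanged."""
        return [LEMMAS.get(tok.lower(), tok.lower()) for tok in tokens]

    print(lemmatise("Nie widziałem psa w domu".split()))
    # -> ['nie', 'widziałem', 'pies', 'w', 'dom']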

The result of the research is the English-Polish Parallel Corpus (EPPC), which includes aligned, tokenised, stemmed, lemmatised and annotated texts. It is the first annotated and aligned bilingual corpus of this size that includes Polish.

EPPC consists of clean and noisy texts, and of exact and liberal translations. The clean parts are mainly the literary sections, while the noisy ones are the EURLex resources; conversely, the liberal translations are generally the literary sections of EPPC and the exact translations are the EURLex sections. Tests show that most available sentence aligners (GMA, moore, hunalign) handle noise in the text much better than liberal translation.

One of the most popular methods of length-based sentence alignment, presented in Church – Gale 1991b, also produced relatively good alignment of clean texts that had been manually pre-edited for paragraph boundaries. Church and Gale’s algorithm, however, was shown to be inadequate for the alignment of liberal translations without manual pre-alignment at the paragraph level. Two runs of Church and Gale’s algorithm, first for the automatic alignment of paragraphs and then for the automatic alignment of sentences, did not improve the alignment result much. Therefore, where the texts to be aligned are not clean enough to assume their alignment at the paragraph level, the Church – Gale 1991b method should not be applied, even when applied twice, for both paragraph and sentence alignment.
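For reference, a minimal Python sketch of the length-based idea behind Church – Gale 1991b is given below: a normal model of the character-length ratio of mutual translations combined with a dynamic programme over sentence boundaries. The constants are those reported in the original paper; everything else (names, priors expressed directly as probabilities) is my own illustrative simplification, not the implementation used in the thesis.

    import math

    # Constants from Gale & Church: expected target/source length ratio c
    # and variance s2 of that ratio per source character.
    C, S2 = 1.0, 6.8
    # Prior probabilities of the alignment patterns (n source : m target sentences).
    PRIORS = {(1, 1): 0.89, (1, 0): 0.0099, (0, 1): 0.0099,
              (2, 1): 0.089, (1, 2): 0.089, (2, 2): 0.011}

    def match_cost(x, y):
        """-log probability that chunks of x and y characters are mutual translations."""
        if x == 0 and y == 0:
            return 0.0
        mean = (x + y / C) / 2.0
        delta = (C * x - y) / math.sqrt(S2 * mean)
        # two-tailed probability of a length deviation at least this large
        p = max(math.erfc(abs(delta) / math.sqrt(2)), 1e-300)
        return -math.log(p)

    def align(src_lens, tgt_lens):
        """Dynamic programme over sentence boundaries; src_lens and tgt_lens are
        the character lengths of the source and target sentences. Returns a list
        of beads such as (1, 1), (2, 1), (0, 1), ..."""
        INF = float("inf")
        n, m = len(src_lens), len(tgt_lens)
        cost = [[INF] * (m + 1) for _ in range(n + 1)]
        back = [[None] * (m + 1) for _ in range(n + 1)]
        cost[0][0] = 0.0
        for i in range(n + 1):
            for j in range(m + 1):
                if cost[i][j] == INF:
                    continue
                for (di, dj), prior in PRIORS.items():
                    ni, nj = i + di, j + dj
                    if ni > n or nj > m:
                        continue
                    c = (cost[i][j] - math.log(prior)
                         + match_cost(sum(src_lens[i:ni]), sum(tgt_lens[j:nj])))
                    if c < cost[ni][nj]:
                        cost[ni][nj], back[ni][nj] = c, (di, dj)
        # trace the cheapest path back from the end of both texts
        beads, i, j = [], n, m
        while i > 0 or j > 0:
            di, dj = back[i][j]
            beads.append((di, dj))
            i, j = i - di, j - dj
        return list(reversed(beads))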

The conclusion that can be drawn from this study is similar to the conclusions of two previous studies comparing a similar set of aligners for other pairs of languages (Rosen 2005 and Singh – Husain 2005): the choice of the alignment method should be informed by the properties of the texts to be aligned and by the intended application of the results. In the general classification, hunalign (Halácsy et al. 2005) wins on recall and F-measure, so it should be chosen when the balance between precision and recall is important. Hunalign does not benefit from morphologically analysed data, so it would be preferred in situations where monolingual morphological tools are not available. Other advantages of hunalign include its flexibility, multiple options, speed and the fact that it does not require a large amount of input bitext.

The second most successful algorithm in the general F-measure classification is the GMA aligner (Melamed 1997). Owing to its geometric mapping of alignments using cognates and other lexical clues, GMA turns out to be relatively successful at aligning liberal translations. Nevertheless, GMA is strongly language-dependent: its success rate depends on the size of the provided dictionary, the stop lists, the presence of morphological analysis and the typological closeness of the aligned languages.
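The cognate clue mentioned above is typically based on orthographic similarity; one common measure is the longest common subsequence ratio (LCSR). The Python sketch below is only an illustration of that measure, not GMA's own code, and the example word pairs are mine.

    def lcs_len(a, b):
        """Length of the longest common subsequence of two strings (simple DP)."""
        dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i, ca in enumerate(a):
            for j, cb in enumerate(b):
                dp[i + 1][j + 1] = dp[i][j] + 1 if ca == cb else max(dp[i][j + 1], dp[i + 1][j])
        return dp[-1][-1]

    def lcsr(a, b):
        """Longest common subsequence ratio: LCS length over the longer word's length."""
        return lcs_len(a, b) / max(len(a), len(b)) if a and b else 0.0

    # A high LCSR marks likely English-Polish cognate candidates.
    print(lcsr("parliament", "parlament"))   # 0.9
    print(lcsr("system", "systemu"))         # ~0.86
    print(lcsr("dog", "pies"))               # 0.0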

In the general classification, moore (Moore 2002) turned out to be the least successful of the three algorithms. Additionally, the fact that moore produces only 1-1 alignments may be a serious disadvantage wherever parallel texts are scarce or the recall measure of alignment quality is of special importance: for example, in legal texts the context of an aligned sentence pair may be crucial to the interpretation of the sentences, and with 1-1 alignments some of the neighbouring sentences of the aligned pair may be unavailable. Moreover, moore requires an input bitext of at least 10 000 sentence pairs, which may be a barrier when only small amounts of bitext are available. Technical drawbacks of the moore implementation include its very high computational cost, poor handling of Polish characters and no possibility of changing the output format. In cases, however, where resources are abundant and the precision of alignment is the key factor (e.g. for the extraction of bilingual lexicons, for translation memory databases or for EBMT databases), moore seems to be the best choice, especially when the corpus to be aligned is morphologically analysed. Its additional advantage is language independence: no external dictionary and no list of stopwords is required.

In general, the factors that should influence the choice of the alignment method include:

  • whether the translation is literal or free (e.g. the periphrastic treatment of the original in Peter Pan);
  • whether the translation corpus is clean or noisy (errors of formatting, tokenisation);
  • whether the text structure of the original is well preserved in the target text (e.g. paragraph and section boundaries are preserved);
  • whether the translation is complete in terms of chapters, sections and paragraphs;
  • the typological differences between the aligned languages;
  • the size of the corpus at our disposal;
  • the error rates we are ready to accept in our application, in terms of precision, recall and variously combined F-measures;
  • the amount of human pre- and post-processing we can afford.

11 Oct 2006