Preliminaries on Old Bulgarian corpus building

Tsvetana Dimitrova (Norwegian University of Science and Technology, Trondheim)

In this pilot study, we report on recent attempts at processing diachronic data from Old Bulgarian texts. We describe the process of corpus building, the annotation strategy, and the application of extraction rules to corpus data, based on a text sample from Codex Marianus (Jagic, 1883/1960) and Codex Suprasliensis (Zaimov & Capaldo, 1982; Vulchanov, 2005). The preliminary stage in creating a corpus of Old Bulgarian manuscripts includes morphological annotation and first attempts at extracting nominal phrases (including prepositional, quantificational, and adjectival phrases) using the CLaRK system (Simov et al. 2001).

Owing to a number of empirical problems with the raw data, such as defining clause and phrase boundaries and the basic word form (lemma), as well as variation in orthography and errors, we decided to start the process of linguistic annotation with POS-tagging of nominal constituents, which mainly involves specifying grammatical information about a word token on the basis of a predefined tag-set.

A possible strategy for semi-automatic tagging is normalization of the texts, but this strategy has not been followed so far, and the annotation has been applied manually. However, the process of manual annotation has led us to consider building a small lexicon consisting of invariable word forms, such as particles and prepositions, as well as some relatively stable word forms. As a first attempt, it will be possible to build such a lexicon for a single text.

The process of manual POS-tagging of NP-internal constituents has also led us to formulate the basic pattern of NP-internal structure in the sample text (an excerpt from Codex Marianus). We see this stage as facilitating the semi-automatic annotation of complete NPs, based on the concord relations displayed NP-internally. Observations on shallow-parsed data from another text (Codex Suprasliensis), consisting of manually extracted nominal phrases (Vulchanov, 2005), have validated most of the NP-internal structure proposed (Dimitrova-Vulchanova, M. & V. Vulchanov 2003; Dimitrova-Vulchanova, M. & V. Vulchanov, in press) for the sample text from Codex Marianus, which is marked up in XML.
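One way to picture concord-based grouping is to attach adjacent adjectival tokens to a noun head only when they share its case, number, and gender. The sketch below rests on assumptions: the `Token` fields, the feature values, and the adjacency requirement are chosen for illustration and are not taken from the annotation scheme described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    form: str    # placeholder word form
    pos: str     # illustrative labels: "N" noun, "A" adjectival, "V" verb
    case: str
    number: str
    gender: str


# Features assumed to take part in NP-internal concord.
CONCORD_FEATURES = ("case", "number", "gender")


def agrees(a, b):
    """True when two tokens share every concord feature."""
    return all(getattr(a, f) == getattr(b, f) for f in CONCORD_FEATURES)


def np_members(tokens, head_index):
    """Indices of the head plus adjacent agreeing adjectival tokens."""
    head = tokens[head_index]
    members = [head_index]
    i = head_index - 1
    while i >= 0 and tokens[i].pos == "A" and agrees(tokens[i], head):
        members.insert(0, i)
        i -= 1
    j = head_index + 1
    while j < len(tokens) and tokens[j].pos == "A" and agrees(tokens[j], head):
        members.append(j)
        j += 1
    return members


tokens = [
    Token("w1", "A", "nom", "sg", "m"),
    Token("w2", "N", "nom", "sg", "m"),
    Token("w3", "A", "gen", "pl", "f"),  # adjacent but not in concord
]
span = np_members(tokens, head_index=1)
```

Here the third token is excluded because it does not agree with the head, even though it is adjacent to it.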

The next step was to formulate regular grammars for extracting the NPs, using the CLaRK system (Simov et al. 2001). We applied grammar rules to a selection of corpus data and compared the extracted phrase units with a selection of manually extracted phrase units. For example, we applied to the text of (Mar, Mk. 2) a grammar for a nominal phrase with a noun as its lexical head and an adjectival element in pre- or post-position. The grammar extracted all 20 NPs that had also been extracted manually: 19 with an adjectival element in post-position and 1 with the adjectival element in pre-position. The current grammar has left out the nominal phrase (1) with a discontinuous adjectival element.
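The kind of rule described above can be approximated outside CLaRK with an ordinary regular expression over a sequence of POS tags. The sketch below uses Python's `re` module as a stand-in and does not reproduce CLaRK's grammar syntax. The single-letter tags, the placeholder tokens, and the helper names are all assumptions made for illustration.

```python
import re

# One character per token: N = noun, A = adjectival, X = anything else.
# The pattern matches a noun head with at least one adjacent adjectival
# element before it, after it, or both.
NP_PATTERN = re.compile(r"A+NA*|NA+")


def extract_nps(tagged):
    """Return (start, end) token spans of contiguous adjective-noun groups."""
    tags = "".join(tag if tag in ("N", "A") else "X" for _, tag in tagged)
    return [(m.start(), m.end()) for m in NP_PATTERN.finditer(tags)]


def compare(extracted, manual):
    """Compare automatically extracted spans with manually extracted ones."""
    extracted, manual = set(extracted), set(manual)
    return {
        "matched": len(extracted & manual),
        "missed": sorted(manual - extracted),
        "spurious": sorted(extracted - manual),
    }


# Placeholder tokens; the last noun and adjective are separated by a verb,
# which imitates a discontinuous adjectival element.
tagged = [("w1", "N"), ("w2", "A"), ("w3", "V"), ("w4", "A"),
          ("w5", "N"), ("w6", "N"), ("w7", "V"), ("w8", "A")]
found = extract_nps(tagged)
report = compare(found, manual=[(0, 2), (3, 5), (5, 8)])
```

Because the pattern only matches contiguous tag sequences, the span with intervening material is reported as missed. A discontinuous adjectival element would escape a purely regular rule of this shape in the same way.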

In this paper, we first explain why the corpus-based approach is relevant for diachronic data, and discuss some theoretical assumptions about such data. We then introduce the sample corpus data and discuss the principles of annotation in comparison with other POS-annotation schemes for historical language data (for example, the Tycho Brahe Parsed Corpus of Historical Portuguese). Finally, we explain the principles of applying grammar rules and discuss the resulting extracted NP structures in comparison with NPs manually extracted from the same text.

References

Dimitrova-Vulchanova, M. & V. Vulchanov 2003. - Observations on the ordering of modifiers in nominal expressions in some Old Bulgarian texts with reference to Greek. In: Bogdanova, S. (ed.) Slavia Orthodoxa. Ezik i kultura. Sofia.

Dimitrova-Vulchanova, M. & V. Vulchanov, in press. - An article evolving: the case of Old Bulgarian. In: Anderson, S. & D. Jonas (eds.) DIGS 8 Volume

Jagic, 1883/1960. - Jagic, V. Quattuor evangeliorum versionis palaeoslovenicae codex Marianus glagoliticus. Berlin - St. Petersburg/Graz.

Simov et al., 2001. - Kiril Simov, Zdravko Peev, Milen Kouylekov, Alexander Simov, Marin Dimitrov, Atanas Kiryakov. 2001. CLaRK - an XML-based System for Corpora Development. In: Proc. of the Corpus Linguistics 2001 Conference, pp. 558-560.

Vulchanov, 2005. - A Specialized Corpus of Old Bulgarian Nominal Expressions from Codex Suprasliensis (Valentin Vulchanov). At: www.hf.ntnu.no/hf/adm/forskning/prosjekter/balkansim/databases.html

Zaimov & Capaldo, 1982. - Zaimov, J. & M. Capaldo (eds.). Suprasâlski ili Retkov Sbornik. Sofia: Izdatelstvo na BAN.