Development of rules of generation of nominal word forms for new-written variants of the Karelian language
Journal’s Subject Headings: Philology
About the authors:
I. P. Novak Institute of Linguistics, Literature and History, Karelian Research Centre of the Russian Academy of Sciences, Petrozavodsk, Russian Federation, [email protected]
N. B. Krizhanovskaya Institute of Applied Mathematical Research, Karelian Research Centre of the Russian Academy of Sciences, Petrozavodsk, Russian Federation, [email protected]
T. P. Boyko Institute of Linguistics, Literature and History, Karelian Research Centre of the Russian Academy of Sciences, Petrozavodsk, Russian Federation, [email protected]
N. A. Pellinen Institute of Linguistics, Literature and History, Karelian Research Centre of the Russian Academy of Sciences, Petrozavodsk, Russian Federation, [email protected]
ABSTRACT
Introduction: linking the words of texts (tokens) to the meanings of lemmas in the dictionary of the VepKar corpus significantly facilitates further work on the semantic markup of texts. In 2019, inflectional rules were developed for the Vepsian subcorpus of VepKar. On the basis of these rules, a function was added to the corpus that generates a complete paradigm from a set of basic word forms. When creating dictionary entries in the three Karelian subcorpora, VepKar editors have to enter a large number of word forms (about 30 for nominals and 150 for verbs). The development of an algorithm and a computer program for generating word forms of the Karelian language was therefore timely.
Objective: to show how a list of stems of the nominal parts of speech in two new-written dialects of the Karelian language can be used to create rules for the automatic generation of word forms.
Research materials: lemmas and word forms from the Open corpus of the Vepsian and Karelian languages, the Corpus of Border Karelia, and the electronic version of the Dictionary of the Karelian language.
Results and novelty of the research: grammatical patterns were drawn from theoretical sources studied over many years and were also identified experimentally. On this basis, a list of stems and pseudo-stems was compiled for the nominal parts of speech, a system of rules for generating word forms was developed, and the corresponding computer program was written and tested. The scientific novelty of the study lies in the first attempt to develop uniform rules for the automatic generation of word forms for two dialects of the Karelian language.
Key words: Karelian language, new-written language, corpus linguistics, morphology, nominal inflection, generation of word forms.
Acknowledgements: the study was carried out under the state order of the Karelian Research Centre of the Russian Academy of Sciences. The section «Development of the program of generation» was written by N. B. Krizhanovskaya within the framework of project No. 18-012-00117 of the Russian Foundation for Basic Research.
For citation: Novak I. P., Krizhanovskaya N. B., Boiko T. P., Pellinen N. A. Development of rules of generation of nominal word forms for new-written variants of the Karelian language // Vestnik ugrovedenia = Bulletin of Ugric Studies. 2020; 10 (4): 679–691.