Estonian Academy Publishers
The Yearbook of the Estonian Mother Tongue Society
Impact Factor (2022): 0.3
Research article
Keelekasutusreeglite tuletamine ja veatuvastus määrsõna sisaldavate sõnaliigijärjendite näitel [Deriving language usage rules and detecting errors: the case of adverb-containing POS-grams]; pp. 9–34
DOI: http://doi.org/10.3176/esa69.01

Authors
Kais Allkivi-Metsoja, Pille Eslon, Jaagup Kippar
Abstract

The article introduces a software tool that detects regularities and errors in Estonian texts based on the usage contexts of POS-grams, i.e., sequences of part-of-speech (POS) tags. It converts each sentence to a POS string and extracts trigrams, i.e., three-word sequences. It then calculates the probabilities of the various preceding and subsequent contexts, each of which is either a certain POS or the beginning or end of a sentence. Error detection relies on comparison with a statistical language model.
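The context statistics described above can be sketched roughly as follows. This is a minimal illustration and not the authors' tool: it assumes the sentences are already POS-tagged, it reads "probability of a context" as the relative frequency of a context given the trigram, and the function name `context_probabilities`, the boundary markers `<s>` and `</s>`, and the tag labels are all hypothetical.

```python
from collections import Counter, defaultdict

START, END = "<s>", "</s>"  # hypothetical sentence-boundary markers


def context_probabilities(tagged_sentences, n=3):
    """Estimate relative frequencies of the contexts around each POS-gram.

    Each sentence is a list of POS tags. A context is the tag directly
    before or after the POS-gram, or a sentence boundary marker.
    Returns two dicts: gram -> {preceding context: probability} and
    gram -> {following context: probability}.
    """
    before = defaultdict(Counter)
    after = defaultdict(Counter)
    for tags in tagged_sentences:
        padded = [START] + list(tags) + [END]
        # i is the index of the first tag of the n-gram in the padded list
        for i in range(1, len(padded) - n):
            gram = tuple(padded[i:i + n])
            before[gram][padded[i - 1]] += 1
            after[gram][padded[i + n]] += 1

    def normalise(table):
        # Turn raw counts into relative frequencies per POS-gram
        return {
            gram: {ctx: c / sum(counts.values()) for ctx, c in counts.items()}
            for gram, counts in table.items()
        }

    return normalise(before), normalise(after)


# Toy input with invented tag sequences (not taken from any corpus)
corpus = [
    ["PRON", "VERB", "ADV", "NOUN"],
    ["ADV", "VERB", "PRON"],
    ["PRON", "VERB", "ADV"],
]
before, after = context_probabilities(corpus)
print(before[("PRON", "VERB", "ADV")])
print(after[("PRON", "VERB", "ADV")])
```

In this toy input the trigram PRON VERB ADV always starts a sentence, and it is followed by a noun in one case and by the sentence end in the other, so the two following contexts each get half of the probability mass.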

In this paper, we focus on the contexts of adverb-containing POS-grams, which are prone to word order errors. Our aim is two-fold: 1) using the Estonian Reference Corpus, we build a language model and analyse it to describe the POS-grams that are preferably used in the context of sentence onset or ending; 2) we evaluate the error detection performance of the tool on the EstGEC-L2 test corpus, consisting of error-annotated sentences from second language learner writings. The cut-off value for defining rare contexts is set to 5%.
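A rough sketch of how the 5% cut-off might be applied to flag rare contexts is given below. It is an assumption-laden illustration rather than the published implementation: the function `flag_rare_contexts`, the boundary markers, the strict `<` comparison, the choice to skip trigrams missing from the model, and all probabilities in the toy model are invented for the example.

```python
START, END = "<s>", "</s>"  # hypothetical sentence-boundary markers
CUTOFF = 0.05  # the 5% cut-off for rare contexts stated in the abstract


def flag_rare_contexts(tags, before, after, n=3, cutoff=CUTOFF):
    """Return (token index, gram, side, context, probability) tuples for
    every POS-gram context whose model probability is below the cut-off.

    `before` and `after` map a POS-gram to {context: probability}.
    POS-grams absent from the model are skipped (an assumption of this
    sketch; a real tool may treat them differently).
    """
    padded = [START] + list(tags) + [END]
    flags = []
    for i in range(1, len(padded) - n):
        gram = tuple(padded[i:i + n])
        sides = (
            ("before", before, padded[i - 1]),
            ("after", after, padded[i + n]),
        )
        for side, table, ctx in sides:
            if gram not in table:
                continue
            p = table[gram].get(ctx, 0.0)
            if p < cutoff:
                # i - 1 is the index of the gram's first tag in `tags`
                flags.append((i - 1, gram, side, ctx, p))
    return flags


# Toy model with invented probabilities (not values from the paper)
gram = ("ADV", "PRON", "VERB")
toy_before = {gram: {"CCONJ": 0.60, "VERB": 0.37, START: 0.03}}
toy_after = {gram: {"NOUN": 0.70, END: 0.30}}

# A sentence-initial adverb + pronoun + verb sequence, the kind of
# pattern that can signal a V2 word order violation
flags = flag_rare_contexts(["ADV", "PRON", "VERB", "NOUN"], toy_before, toy_after)
print(flags)
```

With these made-up numbers only the sentence-initial use of the trigram falls below the cut-off, so it is the single context reported.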

We find that the POS-grams commonly used in sentence onsets are lexicogrammatically more stereotypical, while those preferred at the end of a sentence show more variation. POS-gram analysis also proves useful for pointing out word order errors, unnecessary and missing words, and occasionally word choice and spelling errors (if POS detection is affected). Most frequently, the detected errors violate the V2 word order at the beginning of a sentence or clause. Other word order errors occur mainly at the end of a sentence or clause.

 

We introduce new software that makes it possible to derive Estonian usage rules and detect grammatical errors based on the contexts in which POS-grams occur. The aim of the article is accordingly two-fold. Relying on a statistical language model built from the Estonian Reference Corpus and focusing on adverb-containing three-word sequences, we 1) give an overview of the POS combinations that are preferred at the beginning and end of a sentence; 2) describe the error sites that surface, through improbable usage contexts of POS-grams, in texts written by learners of Estonian as a second language. The errors found are most often related to V2 word order at the beginning of a sentence or clause and to the placement of adverbials at the end of a sentence; the chosen method also helps to discover missing and superfluous words. The combined analysis of norm-conforming and atypical language use provides material both for refining automatic error detection and for suggesting error corrections.

References

Alam, Jahangir Md., Naushad UzZaman, Mumit Khan 2007. N-gram based statistical grammar checker for Bangla and English. – Proceedings of 9th International Conference on Computer and Information Technology, 3–6.

Allkivi-Metsoja, Kais, Jaagup Kippar 2023. Spelling correction for Estonian learner language. – Proceedings of the 24th Nordic Conference on Computational Linguistics, 782–788.

Aulamo, Mikko 2019. Using POS n-grams to detect grammatical errors in Finnish text. Master's thesis. University of Helsinki.

Brett, David, Antonio Pinna 2015. Patterns, fixedness and variability: using PoS-grams to find phraseologies in the language of travel journalism. – Procedia – Social and Behavioral Sciences 198 (2015), 52–57. https://doi.org/10.1016/j.sbspro.2015.07.418

Bryant, Christopher, Mariano Felice, Ted Briscoe 2017. Automatic annotation and evaluation of error types for grammatical error correction. – Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, 793–805. https://doi.org/10.18653/v1/P17-1074

Cappelle, Bert, Natalia Grabar 2016. Towards an n-grammar of English. – Applied Construction Grammar. Ed. by Sabine De Knop, Gaëtanelle Gilquin. De Gruyter Mouton, 271–302. https://doi.org/10.1515/9783110458268-011

De Cock, Sylvie, Sylviane Granger 2021. Stance in press releases versus business news: A lexical bundle approach. – Text and Talk 41 (5–6), 691–713. https://doi.org/10.1515/text-2020-0040

Eslon, Pille, Kais Allkivi-Metsoja 2018. Teksti keelekasutusmustrid ja lingvistiline klasteranalüüs. – Lähivõrdlusi 28. Lähivertailuja 28. Editor-in-chief Annekatrin Kaivapalu. Tallinn: Eesti Rakenduslingvistika Ühing, 21–46. http://dx.doi.org/10.5128/LV28.01

EstGEC-L2 2023 = Estonian L2 Grammatical Error Correction Corpus (EstGEC-L2). GitHub. https://github.com/tlu-dt-nlp/EstGEC-L2-Corpus

EstSpacy 2021 = SpaCy pipelines for Estonian language. GitHub. https://github.com/EstSyntax/EstSpaCy

Jackendoff, Ray 2017. In defense of theory. – Cognitive Science 41 (S2), 185–212. https://doi.org/10.1111/cogs.12324

Kapusta, Jozef, Martin Drlik, Michal Munk 2021. Using of n-grams from morphological tags for fake news classification. – PeerJ Comput. Sci. 7 (624). https://doi.org/10.7717/peerj-cs.624

Luhtaru, Agnes, Mark Fišel, Elizaveta Korotkova 2024. No error left behind: Multilingual grammatical error correction with pre-trained translation models. – Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics, 1209–1222.

Qi, Peng, Yuhao Zhang, Yuhui Zhang, Jason Bolton, Christopher D. Manning 2020. Stanza: A Python natural language processing toolkit for many human languages. – Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 101–108. https://doi.org/10.18653/v1/2020.acl-demos.14

Sirts, Kairit, Kairit Peekman 2020. Evaluating sentence segmentation and word tokenization systems on Estonian web texts. – Human Language Technologies – The Baltic Perspective, 174–181. https://doi.org/10.3233/FAIA200620

Sõnaliigijärjendite leidja [POS-gram finder] 2024. GitHub. https://github.com/tlu-dt-nlp/POSgram-contexts/blob/main/posgram_finder_demo_et.ipynb

Sõnaliigijärjenditel põhinev veatuvastus [POS-gram-based error detection] 2024. GitHub. https://github.com/tlu-dt-nlp/POSgram-errors/blob/main/error_finder_demo_et.ipynb

Wu, Jian-cheng, Jim Chang, Jason S. Chang 2013. Correcting serial grammatical errors based on n-grams and syntax. – Computational Linguistics and Chinese Language Processing 18 (4), 31–44.
