EAP - Trames Publications

PUBLISHED SINCE 1997

TRAMES. A Journal of the Humanities and Social Sciences

ISSN 1736-7514 (Electronic)
ISSN 1406-0922 (Print)

Open Access Journal

CiteScore: 0.8

Impact Factor (2022): 0.2

PROCESSING NATURAL MALAY TEXTS: A DATA-DRIVEN APPROACH; pp. 90–103

DOI: 10.3176/tr.2010.1.06

Author

Zuraidah Mohd Don

Abstract

This research represents the first attempt to produce a working system for the automatic processing of texts in Bahasa Melayu ('Malay'). At the heart of the system is an integrated relational lexical database called MALEX, which draws on experience of working on English and other languages but is specifically tailored to the conditions of Malay. The development of the database has been entirely data-driven from the beginning, and is based on the analysis of a corpus of naturally produced Malay texts. In designing procedures that access the database, properties of the text are consistently and rigorously distinguished from properties of the lexicon and of the grammar. The system is currently used to provide information for a range of applications: grammatical tagging, stemming and lemmatisation, parsing, and the generation of phonological representations. It is hoped and intended that the design features of MALEX will be transferable and will provide a model for the development of working systems for other Asian languages.

