sarikaya06@interspeech_2006@ISCA


Maximum entropy modeling for diacritization of Arabic text

Authors: Ruhi Sarikaya ; Ossama Emam ; Imed Zitouni ; Yuqing Gao

We propose a novel modeling framework for automatic diacritization of Arabic text. The framework is based on Markov modeling, where each grapheme is modeled as a state emitting a diacritic (or none) from the diacritic space. This space is defined exactly by 13 diacritics and a null diacritic, and it covers all the diacritics used in any Arabic text. The state emission probabilities are estimated using maximum entropy (MaxEnt) models. The diacritization process is formulated as a search problem in which a given sentence is assigned its most likely diacritization. We also propose a diacritization parse tree (DPT) for Arabic that allows joint representation of diacritics, graphemes, words, word contexts, morphologically analyzed units, syntactic and semantic parse trees, part-of-speech tags, and possibly other information sources. The features used to train the MaxEnt models are obtained from the DPT. In our evaluation, the proposed framework obtained a 7.8% diacritization error rate (DER) and a 17.3% word diacritization error rate (WDER) on dialectal Arabic data.
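The formulation above (one state per grapheme, MaxEnt emission probabilities, and a search for the most likely diacritization) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the four-label toy inventory `DIACRITICS` (the paper uses 13 diacritics plus a null diacritic), the hand-set `WEIGHTS` (a real model would learn them from data), the simple context features in `features` (the paper derives its features from the DPT), and the choice of a first-order Viterbi search in `diacritize`.

```python
import math

# Toy label set (hypothetical); "" stands for the null diacritic.
DIACRITICS = ["", "a", "i", "u"]
START = "<s>"  # hypothetical marker for "no previous diacritic"

# Hand-set (feature, diacritic) weights, purely illustrative.
WEIGHTS = {
    ("cur=k", "a"): 2.0,
    ("cur=t", "a"): 1.0,
    ("cur=b", ""): 1.5,
    ("prev_d=a", "a"): 0.5,
    ("prev_d=a", "i"): -1.0,
}

def features(graphemes, i, prev_d):
    """Simple context features for position i (stand-in for DPT features)."""
    feats = ["cur=" + graphemes[i], "prev_d=" + prev_d]
    if i > 0:
        feats.append("prev_g=" + graphemes[i - 1])
    return feats

def maxent_probs(feats):
    """MaxEnt (softmax) distribution over diacritics given active features."""
    scores = [sum(WEIGHTS.get((f, d), 0.0) for f in feats) for d in DIACRITICS]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return {d: e / z for d, e in zip(DIACRITICS, exps)}

def diacritize(graphemes):
    """Viterbi search for the most likely diacritic sequence."""
    if not graphemes:
        return []
    # best[d] = (log-probability of the best path ending in d, that path)
    best = {START: (0.0, [])}
    for i in range(len(graphemes)):
        new_best = {}
        for prev_d, (logp, path) in best.items():
            for d, p in maxent_probs(features(graphemes, i, prev_d)).items():
                cand = logp + math.log(p)
                if d not in new_best or cand > new_best[d][0]:
                    new_best[d] = (cand, path + [d])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]
```

Calling `diacritize(list("ktb"))` returns one label per input grapheme. Because each emission depends only on the previous diacritic in this sketch, the dynamic program is exact; richer features spanning longer contexts would call for a different search strategy.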