Corpus Creation and Analysis for Named Entity Recognition in Telugu-English Code-Mixed Social Media Data

Authors: Vamshi Krishna Srirangam, Appidi Abhinav Reddy, Vinay Singh, Manish Shrivastava

Named Entity Recognition (NER) is an important task in Natural Language Processing (NLP) and a subtask of Information Extraction. In this paper we present our work on NER in Telugu-English code-mixed social media data. Code-mixing, a progeny of multilingualism, is a way in which multilingual people express themselves on social media by using linguistic units from different languages within a sentence or speech context. Entity extraction from social media data such as tweets (Twitter) is generally difficult because of its informal nature; code-mixed data further complicates the problem because it is informal, unstructured and incomplete. We present a Telugu-English code-mixed corpus with the corresponding named entity tags. The named entities used to tag the data are Person (‘Per’), Organization (‘Org’) and Location (‘Loc’). We experimented with three machine learning models on our corpus, Conditional Random Fields (CRFs), Decision Trees and BiLSTMs, which achieved F1-scores of 0.96, 0.94 and 0.95, respectively.
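To make the tagging scheme and the reported metric concrete, the following is a minimal sketch of token-level F1 computed per entity tag. It is an illustration only, not the authors' evaluation code: the example sentence, the `Other` label for non-entity tokens, the function name `per_tag_f1`, and the choice of token-level (rather than entity-level or averaged) scoring are all assumptions that the abstract does not specify.

```python
from collections import Counter

# The three tag names come from the abstract; "Other" for non-entity
# tokens is an assumption made for this illustration.
ENTITY_TAGS = ("Per", "Org", "Loc")


def per_tag_f1(gold, pred, tags=ENTITY_TAGS):
    """Return a dict mapping each tag to its token-level F1 score."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            if g in tags:
                tp[g] += 1  # correct entity token
        else:
            if p in tags:
                fp[p] += 1  # predicted an entity tag that is wrong
            if g in tags:
                fn[g] += 1  # missed a gold entity token
    scores = {}
    for t in tags:
        precision = tp[t] / (tp[t] + fp[t]) if tp[t] + fp[t] else 0.0
        recall = tp[t] / (tp[t] + fn[t]) if tp[t] + fn[t] else 0.0
        denom = precision + recall
        scores[t] = 2 * precision * recall / denom if denom else 0.0
    return scores


# Hypothetical code-mixed sentence, invented for illustration only.
tokens = ["Sachin", "Hyderabad", "lo", "BCCI", "meeting", "attend", "chesadu"]
gold = ["Per", "Loc", "Other", "Org", "Other", "Other", "Other"]
pred = ["Per", "Loc", "Other", "Other", "Other", "Loc", "Other"]

scores = per_tag_f1(gold, pred)
for tag in ENTITY_TAGS:
    print(f"{tag}: F1 = {scores[tag]:.2f}")
```

In this toy example the hypothetical tagger finds the person and location but misses the organization and mislabels one ordinary token as a location, so the three per-tag scores differ. A single corpus-level figure such as those in the abstract would come from aggregating over all tags and sentences, by a method the abstract does not state.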

Subject: ACL.2019 - Student Research Workshop

P19-2025@ACL
