Identification of Marathi and Sanskrit Compound and Non-compound Word using Genetic Algorithm

Text-based language identification is the task of automatically recognizing the language of a given text document. Languages within the same family are harder to distinguish from one another than languages from different families. This paper investigates the performance of statistical measures for text-based language identification, with emphasis on five languages used in India and written in the Devanagari script: Marathi, Hindi, Sanskrit, Bhojpuri and Nepali. The proposed system uses n-grams as features for classification. Language identification is an important pre-processing step in many Natural Language Processing (NLP) tasks. In a multilingual society like India, automatic language identification has wide scope, since it is a fundamental step in bridging the digital divide between the Indian masses and the world.

Keywords - Devanagari Script, Multilingual Computing, Wiener filter, Curvelet transform, Genetic algorithm
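
To illustrate how n-grams can serve as classification features, the following is a minimal sketch and not the system evaluated in this paper. It assumes character trigrams, cosine similarity between n-gram frequency profiles, and three toy one-sentence training samples; the function names (`char_ngrams`, `cosine`, `identify`) and the sample sentences are hypothetical choices made only for this example.

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Count overlapping character n-grams, padding with spaces at both ends."""
    padded = f" {text.strip()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count profiles."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def identify(text, profiles, n=3):
    """Return the language whose profile is most similar to the text."""
    query = char_ngrams(text, n)
    return max(profiles, key=lambda lang: cosine(query, profiles[lang]))


# Toy training samples (assumption: one short sentence per language,
# far too little data for a real system).
samples = {
    "Marathi": "हे एक पुस्तक आहे",
    "Hindi": "यह एक किताब है",
    "Sanskrit": "एतत् पुस्तकम् अस्ति",
}
profiles = {lang: char_ngrams(text) for lang, text in samples.items()}

print(identify("हे पुस्तक आहे", profiles))
```

A usable classifier would need much larger training corpora per language, and closely related languages such as Hindi and Bhojpuri would likely require longer n-grams or additional statistical measures to separate reliably.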