Syntax Parser on Kannada Texts With CYK Algorithm and Probabilistic Context Free Grammar
Little work has applied recent Natural Language Processing (NLP) techniques to syntactic and semantic parsing of many Indian languages. In this paper we present a framework for a statistical syntax parser for Kannada, one of the South Indian languages. As a preliminary step, we generated a Kannada Treebank dataset from 1000 annotated sentences (tagged with part-of-speech labels) by applying the Cocke-Younger-Kasami (CYK) parsing algorithm. The CYK algorithm requires its input grammar to be in Chomsky Normal Form (CNF), so in our proposed method the Context Free Grammar (CFG) written for Kannada is transformed into a CNF grammar. A Treebank is a collection of parsed, that is, syntactically structured, texts. The Treebank generated in the preliminary step contains 1000 syntactically structured sentences and serves as the training data for the syntax parser model. The test set consists of 150 unannotated sentences. While training the parser and extracting the grammar from the Treebank, a Probabilistic Context Free Grammar (PCFG) is also incorporated. The resulting parser takes raw Kannada sentences as input, uses the CNF grammar derived from the training Treebank, applies the CYK parsing algorithm to the input, and outputs the most probable parse tree for each sentence. The model was tested and evaluated against a gold-standard Treebank dataset. The results are encouraging: the overall precision, recall and F1-score achieved are 74.2%, 79.4% and 75.3% respectively.
Keywords - Natural Language Processing, Syntactical Parser, Chomsky Normal Form, Cocke-Younger-Kasami, Probabilistic Context Free Grammar.
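
To make the parsing step concrete, the following is a minimal sketch of probabilistic CYK over a CNF grammar. It is not the grammar or implementation used in this work: the rule tables `BINARY_RULES` and `LEXICAL_RULES`, their probabilities, the transliterated toy tokens, and the function names `cyk_parse` and `build_tree` are all assumptions chosen for illustration.

```python
# Hypothetical toy PCFG in Chomsky Normal Form (illustrative only,
# not the grammar extracted from the Kannada Treebank).
# Binary rules: (parent, (left_child, right_child)) -> probability
BINARY_RULES = {
    ("S", ("NP", "VP")): 1.0,
    ("VP", ("NP", "V")): 1.0,
}
# Lexical rules: (tag, word) -> probability
LEXICAL_RULES = {
    ("NP", "ramanu"): 0.5,
    ("NP", "pustakavannu"): 0.5,
    ("V", "odidanu"): 1.0,
}


def build_tree(best, i, j, label):
    """Follow backpointers to rebuild the best subtree for label over words[i:j]."""
    _, back = best[i][j][label]
    if isinstance(back, str):  # lexical rule: backpointer is the word itself
        return (label, back)
    k, left, right = back
    return (label, build_tree(best, i, k, left), build_tree(best, k, j, right))


def cyk_parse(words, start="S"):
    """Return (probability, tree) of the most probable parse, or None."""
    n = len(words)
    # best[i][j][A] = (probability, backpointer) of the best A spanning words[i:j]
    best = [[dict() for _ in range(n + 1)] for _ in range(n + 1)]

    # Fill spans of length 1 from the lexical rules.
    for i, word in enumerate(words):
        for (tag, rule_word), prob in LEXICAL_RULES.items():
            if rule_word == word and prob > best[i][i + 1].get(tag, (0.0, None))[0]:
                best[i][i + 1][tag] = (prob, word)

    # Combine shorter spans into longer ones, keeping the best score per label.
    for span in range(2, n + 1):
        for i in range(0, n - span + 1):
            j = i + span
            for k in range(i + 1, j):  # split point
                for (parent, (left, right)), rule_prob in BINARY_RULES.items():
                    if left in best[i][k] and right in best[k][j]:
                        prob = rule_prob * best[i][k][left][0] * best[k][j][right][0]
                        if prob > best[i][j].get(parent, (0.0, None))[0]:
                            best[i][j][parent] = (prob, (k, left, right))

    if start not in best[0][n]:
        return None  # the toy grammar does not derive this sentence
    return best[0][n][start][0], build_tree(best, 0, n, start)


result = cyk_parse(["ramanu", "pustakavannu", "odidanu"])
```

With this toy grammar, `result` holds the probability 0.25 together with the nested tuple `("S", ("NP", "ramanu"), ("VP", ("NP", "pustakavannu"), ("V", "odidanu")))`. Keeping only the highest-scoring entry per label and span is what selects the most probable tree when a sentence has several derivations.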