<strong>Paper Title</strong><br>

Efficiently Distributed Representation of Words and Phrases using Negative Sampling for Regional Languages<br>

<br>


<strong>Abstract</strong><br>

This paper addresses multilingual word semantic similarity: measuring the semantic similarity of word pairs within each of five languages (English, Hindi, Kannada, German, and Italian). We propose a computationally efficient technique that measures word-pair similarity with a neural network model, trained on high-quality datasets totaling about 90 GB across the five languages. We apply negative sampling to improve the model's accuracy, and we also propose a phrase detection technique for the same purpose. The results show that using statistical knowledge from a text corpus, in the form of word embeddings, gives very high accuracy.

Keywords - Word Embeddings, Negative Sampling, Phrase Detection.
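
As a rough illustration of the negative sampling idea named above, the following minimal sketch computes the loss for a single (center, context) word pair in the commonly used skip-gram formulation. It is an assumption that the paper's model follows this formulation; the function names (`sigmoid`, `dot`, `negative_sampling_loss`) and the toy vectors are hypothetical and chosen only for illustration.

```python
import math


def sigmoid(x):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    """Dot product of two equal-length vectors given as lists of floats."""
    return sum(a * b for a, b in zip(u, v))


def negative_sampling_loss(center, context, negatives):
    """Loss for one observed (center, context) pair plus k sampled negatives.

    The observed pair is pushed toward a high dot product, and each
    sampled negative word is pushed toward a low one.
    """
    loss = -math.log(sigmoid(dot(center, context)))
    for neg in negatives:
        loss -= math.log(sigmoid(-dot(center, neg)))
    return loss


# Toy 2-dimensional vectors (illustrative values, not from the paper).
center = [1.0, 0.0]
aligned = negative_sampling_loss(center, [1.0, 0.0], [[-1.0, 0.0]])
opposed = negative_sampling_loss(center, [-1.0, 0.0], [[1.0, 0.0]])
print(aligned < opposed)
```

In this sketch, only the observed pair and a small number of sampled words contribute to each update, rather than the whole vocabulary, which is what makes the approach computationally cheap.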