Study and Analysis of the Rare Word Handling Techniques for Medical Text Data
Today, most data on the internet is unstructured, in domains such as finance, social media, medicine, computer science, artificial intelligence, business, education and training. Methods available for mining text data include text summarization, feature extraction, classification, clustering, topic tracking, association rule mining and trend finding. Text classification is used to categorise data into pre-defined classes. The pre-processing steps, which form part of the classification process, include tokenization, stop-word removal, stemming, word vector representation (binary bag-of-words, term frequency (TF), TF-IDF, etc.), feature selection and feature extraction. Text data processing faces challenges such as synonyms, polysemous and homonymous words, common words and low-frequency words. This paper focuses on one of these issues: rare terms/features (low-frequency terms). Experiments show that low-frequency terms are more informative, or carry domain-specific knowledge. Classifier accuracy increases compared to all existing methods.
Keywords - Data Representation, Feature Extraction, Text Classification, Low Frequency Words, Rare Words, TF-IDF, Feature Weighting, Text Pre-processing
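
As an illustration of the representation steps named in the abstract (tokenization, stop-word removal, TF and TF-IDF weighting), the following minimal Python sketch shows how a rare term can receive a higher weight than a common one. It is not the method proposed in this paper. The toy corpus, the stop-word list, the function names and the unsmoothed weighting formula tf * log(N / df) are all assumptions chosen for illustration, and stemming is omitted.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus and stop-word list, for illustration only.
DOCS = [
    "the patient has a fever and a cough",
    "the patient has a cough",
    "the patient presents with sarcoidosis and a fever",
]
STOP_WORDS = {"the", "a", "and", "has", "with"}


def tokenize(text):
    """Lowercase, keep alphabetic tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def tf_idf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


weights = tf_idf(DOCS)
# Print the third document's terms from highest to lowest weight.
for term, weight in sorted(weights[2].items(), key=lambda kv: -kv[1]):
    print(f"{term}: {weight:.3f}")
```

In this toy corpus, "sarcoidosis" occurs in only one document and so is weighted above "fever", which occurs in two, while "patient" occurs in every document and gets a weight of zero. This is the sense in which a low-frequency term can be the more informative feature.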