Everything you need to know about text parsing

Posted on Aug 22, 2014

What is text parsing?

In a single line: text parsing is the process of simplifying text data into meaningful entities.

Parsing, in general, is the complete process of analysing a text object (sentences, paragraphs, a corpus etc.) using the rules of a formal grammar, syntactical data patterns, or both, and breaking it down into entities, actors and chunks that can be used for data-driven actions. It is a common practice in Natural Language Processing and is widely used nowadays in computing domains such as Information Retrieval, Automated Chat Bots and Data Mining.

Why is text parsing needed?

Textual data is growing rapidly - social media conversations, news articles, customer reviews, product descriptions, company reports, financial reports, intellectual property documents, speech transcripts and user search queries are all examples. The majority of this data is highly unstructured and stored in formats that cannot be analysed directly. One cannot simply use it as-is for research, decision making, analysis or predicting outcomes. To extract information and insights from this data, it needs to be parsed first. Hence text parsing is an important step when dealing with unstructured data.

Approaches for text parsing

At a broad level, text parsing is generally performed using two approaches. One is data layout driven, the other is grammar driven. For better accuracy, a mix of both is recommended. Text parsing is performed in a hierarchical manner: it starts with the removal of unhelpful context from the data, followed by analysis of the remaining chunks using different text mining techniques.

Techniques used for text parsing

Lexicon level

At the lexicon (word) level, parsing is generally performed using keyword matching after the text has been preprocessed. Preprocessing includes techniques such as word tokenization, stopword and punctuation removal, stemming and lemmatization.
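
The lexicon-level steps above can be sketched in a few lines of Python. This is only a minimal illustration using the standard library: the stopword list, the `crude_stem` suffix stripper and the function names are assumptions made up for this example, and a real project would more likely rely on a proper stemmer or lemmatizer from an NLP library.

```python
import re

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"a", "an", "the", "is", "was", "and", "of", "to", "in", "it"}


def tokenize(text):
    """Lowercase the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def crude_stem(token):
    """Strip a few common suffixes. A toy stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, remove stopwords, then stem what is left."""
    return [crude_stem(t) for t in tokenize(text) if t not in STOPWORDS]


def match_keywords(text, keywords):
    """Return the keywords whose stem appears among the stems of the text."""
    stems = set(preprocess(text))
    return sorted(k for k in keywords if crude_stem(k.lower()) in stems)


review = "Customers reported delays; the order was delayed twice."
print(preprocess(review))
print(match_keywords(review, ["delay", "order", "refund"]))  # ['delay', 'order']
```

Because both the text and the keywords go through the same stemming step, "delays" and "delayed" both match the keyword "delay". The suffix stripper is crude (it would not unify "package" and "packages", for instance), which is exactly why a real stemmer or lemmatizer is preferred in practice.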

Context level

N-grams are informative units of text and can be used for determining the context. Regular expressions are widely used to find and capture insights using patterns, for example date matching, number capturing, or particular formats in the text. Case and type checking of keywords can also be applied to understand the text better and parse it efficiently.
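
As a rough sketch of these context-level ideas, the snippet below builds n-grams from a token list and uses regular expressions to pull out dates, numbers and simple case-based cues. The patterns are assumptions chosen for illustration (ISO-style `YYYY-MM-DD` dates, plain integers and decimals); real data will usually need patterns tuned to its own formats.

```python
import re


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Assumed formats for this sketch: ISO-style dates, integers and decimals.
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
NUMBER_PATTERN = re.compile(r"\b\d+(?:\.\d+)?\b")
CAPITALIZED_PATTERN = re.compile(r"\b[A-Z][a-z]+\b")
ALL_CAPS_PATTERN = re.compile(r"\b[A-Z]{2,}\b")


def extract_patterns(text):
    """Capture dates, numbers and case-based cues from raw text."""
    dates = DATE_PATTERN.findall(text)
    # Blank out the dates first so their digits are not counted as numbers.
    without_dates = DATE_PATTERN.sub(" ", text)
    return {
        "dates": dates,
        "numbers": NUMBER_PATTERN.findall(without_dates),
        "capitalized": CAPITALIZED_PATTERN.findall(text),
        "all_caps": ALL_CAPS_PATTERN.findall(text),
    }


print(ngrams(["text", "parsing", "is", "useful"], 2))
print(extract_patterns("Invoice 4821 was issued on 2014-08-22 for 199.50 USD."))
```

Here the all-caps check picks out "USD" as a likely code or abbreviation, which is the kind of hint case checking can give before any deeper analysis.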

Grammar level

Using parts of speech as text features really helps in separating relevant information from noise. Context-free grammatical relations among the words of a sentence also indicate the syntactic nature of the text, which in fact can be used for parsing purposes. Some examples are negations, intensifiers, modifiers, nominal subjects (nsubj), complements etc. Named Entity Recognition and Topic Modelling are other techniques that can be used to mine predictors from text, which in turn can be used to parse a text object into useful information.
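
To make the idea of negations and intensifiers concrete, here is a very rough sketch. It is not a grammar or dependency parse: it simply assumes that a negation or intensifier modifies the next ordinary word, and the two word lists are small hypothetical examples. A real pipeline would take these relations from a part-of-speech tagger or dependency parser instead.

```python
# Hypothetical word lists, kept tiny for illustration.
NEGATIONS = {"not", "no", "never"}
INTENSIFIERS = {"very", "really", "extremely"}


def tag_modifiers(tokens):
    """Attach negation/intensifier flags to the word that directly follows them.

    Returns (word, negated, intensified) tuples for modified words only.
    """
    results = []
    negated = False
    intensified = False
    for token in tokens:
        word = token.lower()
        if word in NEGATIONS:
            negated = True
        elif word in INTENSIFIERS:
            intensified = True
        else:
            if negated or intensified:
                results.append((word, negated, intensified))
            # Flags only apply to the next ordinary word, so reset them here.
            negated = False
            intensified = False
    return results


sentence = "The battery is not very good but the screen is really sharp"
print(tag_modifiers(sentence.split()))
# [('good', True, True), ('sharp', False, True)]
```

Even this crude rule shows why grammar-level cues matter: "good" on its own would look positive to a keyword matcher, while the flags reveal that it was negated.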

Share your views in the comments if some points are missing and do share your experiences with text parsing.