Email spam identification, categorization of news articles, and the organization of web pages by search engines are modern, real-world examples of document classification. It is a technique for systematically assigning a text document to one of a fixed set of categories; in other words, document classification is the process of tagging a text document. The technique is especially helpful when the amount of data is very large, for organizing, information filtering, and storage purposes.
In this article, we will discuss an approach to implementing an end-to-end document classification pipeline using Apache Spark, with Scala as the core programming language. Apache Spark is a strong choice when dealing with a large volume and variety of data. Apache Spark's machine learning library, MLlib, is scalable, easy to deploy, and can be up to a hundred times faster than MapReduce operations.
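To give a feel for what such a pipeline can look like, here is a minimal sketch in Scala using the DataFrame-based `spark.ml` API of MLlib. It is not the pipeline from the full article: the toy sentences, the column names, the object name `DocumentClassificationSketch`, the choice of TF-IDF features, and the choice of logistic regression as the classifier are all assumptions made for illustration.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
import org.apache.spark.sql.SparkSession

// Hypothetical example object; not taken from the original article.
object DocumentClassificationSketch {
  def main(args: Array[String]): Unit = {
    // Local session for experimentation; a real deployment would target a cluster.
    val spark = SparkSession.builder()
      .appName("DocumentClassificationSketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Made-up toy training data: label 1.0 = spam, 0.0 = not spam.
    val training = Seq(
      ("win a free prize now", 1.0),
      ("meeting moved to monday morning", 0.0),
      ("claim your free reward today", 1.0),
      ("please review the attached report", 0.0)
    ).toDF("text", "label")

    // Split each document into lowercase, whitespace-separated words.
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    // Hash words into a fixed-size term-frequency vector.
    val hashingTF = new HashingTF()
      .setInputCol("words")
      .setOutputCol("rawFeatures")
      .setNumFeatures(1 << 12)
    // Down-weight terms that appear in many documents.
    val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
    // Assumed classifier; reads the default "features" and "label" columns.
    val classifier = new LogisticRegression().setMaxIter(20)

    // Chain the stages so fitting and prediction run end to end.
    val pipeline = new Pipeline()
      .setStages(Array[PipelineStage](tokenizer, hashingTF, idf, classifier))
    val model = pipeline.fit(training)

    // Classify a document the model has not seen.
    val unseen = Seq("free prize waiting for you").toDF("text")
    model.transform(unseen).select("text", "prediction").show(false)

    spark.stop()
  }
}
```

Packaging the steps as a single `Pipeline` means the same tokenization and feature extraction are applied at training time and at prediction time, which is the main reason to prefer it over calling each stage by hand.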
Read the complete blog on Analytics India Magazine here.