Getting Started with Data Science - Data Munging

Posted on Dec 30, 2015


In the last article of the Getting Started series, we discussed the meaning of the term data science and its components. In this article, we will look at data munging, why it matters, and how it is performed.

In any data analysis, data preparation accounts for the majority of the effort: around 70% of the time goes into curating and fixing the dataset. It is necessary to get acquainted with every dimension of the available data before any analysis. Data munging is a crucial component of data science that covers all the activities of exploring, tweaking, and customizing the dataset according to the problem statement. Let's look at these activities:

Feature Exploration: Every data problem starts with an understanding of the available data. In feature exploration, features are studied and classified as continuous or discrete, target or input, and dependent or independent. This step also explores the relationships among the features and how strong those relationships are.
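
As a rough illustration of this step, the sketch below labels each feature as discrete or continuous and measures the strength of one relationship with the Pearson correlation coefficient. The toy dataset, the `feature_kind` heuristic (a feature with at most two distinct values counts as discrete), and all names are assumptions made up for this example, not a prescribed method.

```python
from math import sqrt

# Hypothetical toy dataset: each key is a feature, each list holds its values.
data = {
    "age": [23, 35, 41, 29, 52],
    "income": [30.0, 52.5, 61.0, 40.0, 80.5],  # in thousands
    "owns_car": [0, 1, 1, 0, 1],               # treated as the target here
}

def feature_kind(values, max_levels=2):
    """Label a feature 'discrete' if it has few distinct values, else 'continuous'."""
    return "discrete" if len(set(values)) <= max_levels else "continuous"

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Classify every feature, then check how strongly two inputs move together.
kinds = {name: feature_kind(values) for name, values in data.items()}
age_income_corr = pearson(data["age"], data["income"])

print(kinds)
print(round(age_income_corr, 3))
```

A coefficient close to 1 or -1 suggests a strong linear relationship between two features, while a value near 0 suggests little linear relationship.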

Feature Engineering: Generally, a dataset comprises a mixture of features. It might contain irrelevant variables, which are of no use and can produce noisy results in the analysis. The features may be skewed and may not share the same data types (integers, text, numbers, booleans, etc.). There may also be latent, complex correlations among the variables due to interdependencies.

Feature engineering is an important component of data munging in which the features of a dataset are refined, cleaned, and fixed using domain knowledge, statistics, and meta-analysis. Building on feature exploration, it makes the available data more useful for analysis. Raw features are converted into simpler, less skewed features by means of transformations such as logarithms or polynomials. This helps to simplify complex dependencies and reduce skewness. New features can also be derived via feature engineering.
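
To make the logarithm idea concrete, here is a minimal sketch that compares the skewness of a feature before and after a log transform. The `incomes` values are invented for illustration, and `skewness` is a small helper written for this example (population skewness), not a library function.

```python
from math import log1p

def skewness(values):
    """Population skewness: third central moment over the cubed standard deviation."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical right-skewed feature: most values are small, a few are very large.
incomes = [20, 22, 25, 27, 30, 35, 40, 60, 150, 400]

# log1p computes log(1 + v), which also works when a value is zero.
log_incomes = [log1p(v) for v in incomes]

raw_skew = skewness(incomes)
log_skew = skewness(log_incomes)

print(round(raw_skew, 2), round(log_skew, 2))
```

On this toy data the transform reduces the skewness but does not remove it entirely, so it is worth checking the result rather than assuming the transformed feature is symmetric.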

Missing Values Detection: Every observation in a dataset contributes to the patterns the dataset exhibits. Missing records can introduce bias and reduce how representative the dataset is, so missing data has a significant effect on the conclusions drawn from an analysis. An important part of data munging is missing values treatment, in which missing data is identified and then handled. Depending on how much data is missing, the affected records can be deleted, imputed using a measure of central tendency (mean, weighted mean, or median), or predicted using regression or clustering techniques.
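
The two simplest treatments, deletion and imputation with a central tendency, can be sketched as follows. The `heights` list and the choice of `None` to mark a missing entry are assumptions for this example; real datasets may encode missing values differently.

```python
from statistics import mean, median

# Hypothetical feature with missing entries encoded as None.
heights = [170.0, None, 165.0, 180.0, None, 175.0]

# Identify the missing entries and measure how much of the feature is missing.
observed = [h for h in heights if h is not None]
missing_ratio = (len(heights) - len(observed)) / len(heights)

# Option 1: delete the incomplete observations.
dropped = list(observed)

# Option 2: impute each gap with a central tendency of the observed values.
mean_imputed = [h if h is not None else mean(observed) for h in heights]
median_imputed = [h if h is not None else median(observed) for h in heights]

print(round(missing_ratio, 2))
print(mean_imputed)
```

Deletion shrinks the dataset, while imputation keeps its size but can understate the real variability of the feature, so the share of missing data is usually what guides the choice.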

Noise Detection and Removal: Different types of noise can be present in a dataset. Any unwanted information that is not relevant to the problem statement is called noise. For example, in a dataset of tweets, stopwords (commonly used words) such as “a, the, of” are noise and need to be removed. In this step, all the noise in the main dataset is identified and then either converted into useful entities or removed completely.
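
The stopword example could look something like the sketch below. The `STOPWORDS` set is a tiny illustrative list and the tweet is invented; a real project would typically use a much longer, language-specific list and a proper tokenizer.

```python
# Small illustrative stopword set, assumed for this example only.
STOPWORDS = {"a", "an", "the", "of", "is", "in"}

def remove_stopwords(text):
    """Lowercase the text, split it on whitespace, and drop the stopwords."""
    return [word for word in text.lower().split() if word not in STOPWORDS]

tweet = "The future of data science is in the cleaning of a dataset"
tokens = remove_stopwords(tweet)

print(tokens)
```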

Outlier Detection: An outlier is an observation that is distant from the other observations. An outlier may be due to variability in the measurement, or it may indicate experimental error. As with missing values treatment, it is important to detect outliers and then remove or impute them so that they do not distort the analysis.
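
The article does not prescribe a particular detection rule, so the sketch below assumes one common choice: the interquartile range (IQR) rule, which flags values more than 1.5 × IQR outside the first or third quartile. The `readings` data and the 1.5 multiplier are assumptions for this example.

```python
from statistics import median, quantiles

# Hypothetical measurements with one value far from the rest.
readings = [10, 12, 11, 13, 12, 11, 14, 95]

# Quartiles of the data; quantiles(..., n=4) returns [Q1, Q2, Q3].
q1, _, q3 = quantiles(readings, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

def is_outlier(value):
    """True if the value lies outside the IQR fences."""
    return value < lower or value > upper

outliers = [r for r in readings if is_outlier(r)]

# Option 1: remove the outliers.
cleaned = [r for r in readings if not is_outlier(r)]

# Option 2: impute them with the median of the remaining values.
imputed = [median(cleaned) if is_outlier(r) else r for r in readings]

print(outliers)
print(imputed)
```

Whether to remove or impute depends on the problem: if an extreme value reflects genuine variability rather than an error, dropping it may discard useful information.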

Conclusion: Understanding the type of data before analysis is very important. If we fit a model directly to a raw dataset, the accuracy can vary and the model may perform poorly. All the activities performed on a raw dataset to make it clean enough to feed into a data analysis model are called data munging activities.