Now, let’s discuss what EDA actually means.
Exploratory data analysis (EDA) uses data visualisation techniques to draw inferences and obtain insights from data. However, EDA is much more than plotting graphs or visualising data: it is about understanding and studying the given data in detail. Visualising data as plots or graphs is just one of the tools in the EDA process.
EDA also involves preparing data sets for analysis by removing irregularities in the data, so that they do not affect the later steps of data analysis and machine learning model building.
Let’s hear from Rahim to understand the essence of EDA with the use of some practical examples.
You have now seen what EDA is used for:
- Maximise the insight in the data set.
- Detect outliers and anomalies.
- Test underlying assumptions.
You also saw how the box plot of the banking data set suggests that more positive responses came from people with higher salaries: the box for the ‘yes’ responses, which holds the middle 50% of that group’s data, extends further into the higher salary region. This is despite the fact that the groups with positive and negative responses have almost the same median salary.
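To make this box plot reading concrete, here is a minimal sketch that computes the quartiles a box plot draws for each response group. The salary figures are invented for illustration (they are not the banking data set), and `box_stats` is a hypothetical helper name.

```python
from statistics import quantiles

# Hypothetical salaries (in thousands) grouped by campaign response.
# These numbers are invented for illustration only.
salary_by_response = {
    "no": [20, 30, 40, 50, 60, 70, 80],
    "yes": [20, 35, 50, 50, 80, 100, 120],
}


def box_stats(values):
    """Return the three quartiles that define the box in a box plot."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return {"q1": q1, "median": median, "q3": q3}


summary = {resp: box_stats(vals) for resp, vals in salary_by_response.items()}

for response, stats in summary.items():
    print(response, stats)
```

In this made-up sample, both groups share the same median, but the upper edge of the box (`q3`) sits noticeably higher for the "yes" group. A comparison of medians alone would miss this.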
It is generally believed that a higher discount means more sales. However, from the sales example covered previously, you saw that beyond a certain level of discount, sales actually start dropping. One possible inference is that customers may believe that a very high discount implies a compromise on quality.
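The same kind of check can be sketched in a few lines. The observations below are invented for illustration and are not taken from the sales example; `peak_discount` and `sales_fall_after` are hypothetical helper names.

```python
# Hypothetical (discount %, units sold) observations, invented for illustration.
observations = [(0, 100), (10, 130), (20, 170), (30, 210), (40, 190), (50, 150), (60, 120)]


def peak_discount(points):
    """Return the discount level at which units sold is highest."""
    best_discount, _ = max(points, key=lambda p: p[1])
    return best_discount


def sales_fall_after(points, discount):
    """Check that units sold strictly decrease from `discount` onwards."""
    later = [units for d, units in sorted(points) if d >= discount]
    return all(a > b for a, b in zip(later, later[1:]))


peak = peak_discount(observations)
print(f"Sales peak at a {peak}% discount")
print("Sales fall beyond the peak:", sales_fall_after(observations, peak))
```

A check like this only describes the pattern in the data. Why sales fall, for example because customers doubt the quality, remains an inference that needs further evidence.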
Also, through the e-commerce example, you saw that frequent buyers return products more often.
So, now you have understood that EDA is an important exercise before proceeding further with a data set. It does not involve merely finding irregularities in the data, such as missing values or outliers; it is a combination of fixing the data set for useful purposes and then deriving maximum insights from that data, by either plotting graphs or using statistical parameters.
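As a rough sketch of the "fixing" part, the snippet below drops missing values and flags outliers in a single column. The data is invented, the helper names are hypothetical, and the 1.5 × IQR fence is one common rule of thumb chosen here as an assumption. Whether an outlier should be removed or kept depends on the data set and the question being asked.

```python
from statistics import quantiles

# Hypothetical raw column: None marks a missing value, 900 is a likely data-entry error.
raw_ages = [23, 31, None, 27, 45, 38, 900, 29, None, 35]


def drop_missing(values):
    """Remove entries that are missing (represented here as None)."""
    return [v for v in values if v is not None]


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences using the k * IQR rule of thumb."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


present = drop_missing(raw_ages)
lower, upper = iqr_bounds(present)
outliers = [v for v in present if v < lower or v > upper]
cleaned = [v for v in present if lower <= v <= upper]

print("outliers:", outliers)
print("cleaned:", cleaned)
```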
An important takeaway is that EDA should be the first step in any data science or machine learning activity. Companies also base business decisions on the results of EDA, and these decisions can have repercussions later. Hence, we observe the following:
- If not performed properly, EDA can hamper the further steps in the machine learning model building process.
- If done well, it can improve the effectiveness of every step that follows.
In the next video, you will get an idea of how EDA has evolved and what kind of work has been done before in this field.
Let’s listen to a brief history of EDA from our expert, Rahim.
In 1977, John W. Tukey wrote a book on EDA and developed box plots, which are also called Tukey’s box plots. Since then, many books have been written in the field. You may refer to this link for resources on the evolution of EDA.
In the next segment, you will learn about data sourcing and the types of data sources.
FREQUENTLY ASKED QUESTIONS (FAQ)