Data Cleaning Introduction

Welcome to ‘Data Cleaning’, the next step in the exploratory data analysis (EDA) process.

In the last session, you learnt about data sourcing techniques. Once you have sourced the data, it is essential to find and fix its irregularities to improve its quality.

You can encounter different kinds of issues in a dataset. Irregularities may appear as missing values, anomalies/outliers, incorrect formats, inconsistent spellings, etc. If left uncorrected, they carry over into every assumption and analysis based on that dataset and can hamper any machine learning model built on it later. Hence, data cleaning is a very important step in EDA.

In this session

In this session, you will learn the process of data cleaning using a case study on the ‘Bank Marketing Campaign Dataset’. Data cleaning is often done in a somewhat haphazard manner, and it is difficult to define a ‘single structured process’ for it. Even so, you will study data cleaning through the following steps:

  • Identifying the data types.
  • Fixing the rows and columns.
  • Imputing/removing missing values.
  • Handling outliers.
  • Standardising the values.
  • Fixing invalid values.
  • Filtering the data.
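To make these steps concrete before the case study begins, here is a minimal sketch that applies each of them to a tiny, made-up extract. Everything in it is an assumption for illustration: the column names, the values, the plausible age range (0 to 120) and the balance cap of 100,000 are hypothetical and are not taken from the bank marketing dataset. It uses only the Python standard library so that it runs anywhere; in the course notebook you will work with the actual CSV file instead. Note that the order differs slightly from the list above: invalid values are marked as missing before imputation so that they do not distort the median.

```python
import csv
import io
import statistics

# Hypothetical raw extract (invented for illustration only).
RAW_CSV = """customer_id,age,job,balance,response
1,34,Admin.,1200,yes
2,,technician,300,no
3,45,ADMIN.,250000,no
4,-5,technician,800,YES
5,51,services,,no
6,29,services,500,unknown
"""

BALANCE_CAP = 100_000  # assumed ceiling used to cap extreme balances


def to_number(text):
    """Return a float, or None when the cell is empty (a missing value)."""
    text = text.strip()
    return float(text) if text else None


def clean(raw_csv):
    rows = list(csv.DictReader(io.StringIO(raw_csv)))

    for row in rows:
        # Identify the data types: CSV cells arrive as text, so the
        # numeric columns are converted explicitly.
        row["age"] = to_number(row["age"])
        row["balance"] = to_number(row["balance"])

        # Fix rows and columns: drop an identifier that is not analysed.
        del row["customer_id"]

        # Standardise the values: one spelling per category.
        row["job"] = row["job"].strip().lower().rstrip(".")
        row["response"] = row["response"].strip().lower()

        # Fix invalid values: an implausible age is treated as missing.
        if row["age"] is not None and not 0 < row["age"] < 120:
            row["age"] = None

    # Impute/remove missing values: fill age with the median of the
    # known ages, and drop rows whose balance is missing.
    median_age = statistics.median(
        r["age"] for r in rows if r["age"] is not None
    )
    for row in rows:
        if row["age"] is None:
            row["age"] = median_age
    rows = [r for r in rows if r["balance"] is not None]

    # Handle outliers: cap extreme balances at the assumed ceiling.
    for row in rows:
        row["balance"] = min(row["balance"], BALANCE_CAP)

    # Filter the data: keep only rows with a usable response label.
    return [r for r in rows if r["response"] in {"yes", "no"}]


cleaned = clean(RAW_CSV)
for row in cleaned:
    print(row)
```

Capping and median imputation are only two of several possible choices. Whether to cap, remove or keep an outlier, and whether to impute or drop a missing value, depends on the column and on the question being asked.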

Before going any further, you should get familiar with the problem statement that you will solve in this module, so that you can see how EDA is applied in practice.

Problem statement

A bank provides financial services/products such as savings accounts, current accounts, debit cards, etc. to its customers. To increase its overall revenue, the bank conducts various marketing campaigns for financial products such as credit cards, term deposits, loans, etc. These campaigns are intended for the bank’s existing customers. However, the campaigns need to be cost-efficient so that the bank increases not only its overall revenue but also its total profit. You need to apply your knowledge of EDA to the given dataset to analyse the patterns and provide inferences/solutions for future marketing campaigns.

Download the CSV file of the bank marketing dataset from the following attachment.

A bank conducted a telemarketing campaign for one of its financial products called ‘term deposits’ to help foster long-term relationships with existing customers. The dataset contains information about all the customers who were contacted during a particular year to open term deposit accounts with the bank.

What is a term deposit?

Term deposits, also called fixed deposits, are cash investments made for a specific time period, ranging from 1 month to 5 years, at a predetermined fixed interest rate. The fixed interest rates offered for term deposits are higher than the regular interest rates for savings accounts. The customer receives the total amount (investment plus interest) at the end of the maturity period. The money can be withdrawn without penalty only at the end of the maturity period: withdrawing it earlier results in penalty charges, and the customer does not receive any interest.
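The payout rule described above can be illustrated with a small arithmetic sketch. It is only an illustration: the text does not say how interest is calculated, so simple interest is assumed here, and the function name, the figures and the flat penalty amount are all hypothetical.

```python
def term_deposit_payout(principal, annual_rate, term_years,
                        held_years=None, penalty=0.0):
    """Amount paid to the customer, rounded to 2 decimal places.

    Assumes simple interest (an assumption, not stated in the text).
    If the deposit is withdrawn before the term ends, no interest is
    paid and a flat, hypothetical penalty is deducted.
    """
    if held_years is None or held_years >= term_years:
        # Held to maturity: investment plus interest.
        return round(principal * (1 + annual_rate * term_years), 2)
    # Early withdrawal: no interest, minus the penalty.
    return round(principal - penalty, 2)


# Hypothetical figures: 10,000 invested for 2 years at 6% per year.
at_maturity = term_deposit_payout(10_000, 0.06, 2)
withdrawn_early = term_deposit_payout(10_000, 0.06, 2,
                                      held_years=1, penalty=100)
print(at_maturity, withdrawn_early)
```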

Important Note:

To enhance the learning outcome, you are expected to code along with the instructor as you watch the videos. So, please pace yourself accordingly. To assist you, you are provided with a structured, blank Python notebook to code in. This is a must-do task, as you will need it to answer certain in-segment questions, and it also serves as practice. The final notebook will act as a reference for you in the future as well.

Please do not expect a complete solution notebook attached at the end of this module.

Guidelines for in-module questions

The in-video and in-content questions for this module are not graded. Note that graded questions are given on a separate page labelled ‘Graded Questions’ at the end of each session. The graded questions in these sessions will adhere to the following guidelines:

People you will hear from in this session

Subject Matter Expert

Mirza Rahim Baig

Analytics Lead, Flipkart

Flipkart is one of the leading e-commerce companies in India. It started with selling books and has now expanded its business to almost every product category, including consumer electronics, fashion and lifestyle products. Rahim is currently the Analytics Lead at Flipkart. He holds a graduate degree from BITS Pilani, a premier educational institute in India.

Subject Matter Expert

Anand S

CEO, Gramener

Gramener is one of the most prominent data analytics and visualisation companies in India. Anand, currently the CEO, was previously the Chief Data Scientist at Gramener and also has extensive experience in management consulting and equity research.
