Let’s summarise what you learnt about data sourcing in this session.

There are two main types of data:

  • Public data: Data that is made publicly available for research and learning.
  • Private data: Internal organisational data, subject to security and privacy constraints; company approval is needed to access it. It is useful for internal policymaking and business strategy.

Given below are some open sources of public datasets that you may explore to get data:

  • GitHub: Awesome Public Datasets
  • Open Government Data portal
  • Kaggle datasets
  • UCI Machine Learning Repository

Apart from public and private data sources, you learnt about a data-fetching technique called web scraping, which lets you fetch data directly from web pages. It is useful in many applications, such as e-commerce price comparison, real estate and the share market.

Web scraping broadly involves four steps:

  • HTML loading and reading: Load the HTML page into Python. The requests library is used here to request the HTML page.
  • HTML parsing: Convert the raw HTML into a structured, navigable format. The BeautifulSoup class from Python’s bs4 library is used here to parse the markup.
  • Data extraction: Extract the data you need from the parsed page using HTML elements such as tags and attributes.
  • Transformation into the required format: Once you have the data, save it in your required format, such as CSV.
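The four steps above can be sketched in Python. To keep the example runnable offline, it parses a small canned HTML snippet (a hypothetical price table, not a real site) instead of fetching a live page; the commented-out `requests.get` call shows where step 1 would normally happen.

```python
# A minimal sketch of the four web-scraping steps with BeautifulSoup.
# The HTML snippet and URL are illustrative placeholders, not a real site.

import csv
import io

from bs4 import BeautifulSoup

# Step 1 - HTML loading: in a real scrape you would fetch the page, e.g.
#   import requests
#   html = requests.get("https://example.com/listings").text
# Here we use a canned snippet so the example runs offline.
html = """
<table id="prices">
  <tr><th>Product</th><th>Price</th></tr>
  <tr><td>Laptop</td><td>799</td></tr>
  <tr><td>Phone</td><td>499</td></tr>
</table>
"""

# Step 2 - HTML parsing: turn the raw markup into a navigable tree.
soup = BeautifulSoup(html, "html.parser")

# Step 3 - Data extraction: pull out rows using tags and attributes,
# skipping the header row.
rows = []
for tr in soup.find("table", id="prices").find_all("tr")[1:]:
    rows.append([td.get_text(strip=True) for td in tr.find_all("td")])

# Step 4 - Transformation: save the extracted rows as CSV.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Product", "Price"])
writer.writerows(rows)
print(buffer.getvalue())
```

In a real scrape you would replace the canned snippet with the response from `requests.get`, and write the CSV to a file instead of an in-memory buffer.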

Data sourcing is the very first step of exploratory data analysis (EDA), and once the data is in the required file format, the next step is to clean it. In the next session, you will learn the end-to-end process of cleaning a dataset with the help of a practical case study.
