Web Scraping-II

So, you have a very basic understanding of what an HTML page looks like. Now, let’s come to the application part, i.e., how you can fetch website data using Python.

Let’s understand the HTML structure of an IMDB web page with Rahim in the next video.

You took a look at the web page of the top 50 IMDB movies, which presents each movie’s name, rating, votes, director details and cast in a specific container-like structure. Now you will learn how to write Python code to fetch data from this web page.

Here is a summary of the major takeaways from the video provided above:

  • requests library: A Python library used to read web page data from the URL of the corresponding page.
  • BeautifulSoup: It is a Python package that helps in parsing and extracting data from HTML and XML files.
  • The web scraping process can be divided into four major parts:
      • Reading: Fetching the HTML page from its URL and loading it into Python.
      • Parsing: Converting the raw HTML code into a structured, understandable format.
      • Extraction: Pulling the required data out of the parsed web page.
      • Transformation: Converting the extracted information into the required format, e.g., CSV.
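
The four parts listed above can be sketched in a few lines of Python. This is only a minimal illustration, not the notebook from the video: the HTML snippet, the class names (movie, title, rating) and the helper function names are all assumptions made for the example, and the real IMDB page uses its own tags and attributes. To keep the sketch runnable offline, the reading step is defined but not called, and the remaining steps work on a small inline HTML sample.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded page. The real IMDB page
# uses different tags and class names, which you must inspect yourself.
SAMPLE_HTML = """
<div class="movie"><h3 class="title">Movie A</h3><span class="rating">9.2</span></div>
<div class="movie"><h3 class="title">Movie B</h3><span class="rating">9.0</span></div>
"""


def read_page(url):
    """Reading: download the raw HTML of a page (not called in this offline demo)."""
    import requests  # third-party; imported here so the demo below needs no network setup

    response = requests.get(url, timeout=10)
    response.raise_for_status()  # stop early on HTTP errors such as 404
    return response.text


def extract_movies(html):
    """Parsing and extraction: build a parse tree, then pull out the fields."""
    soup = BeautifulSoup(html, "html.parser")
    movies = []
    for container in soup.find_all("div", class_="movie"):
        title = container.find("h3", class_="title").get_text(strip=True)
        rating = container.find("span", class_="rating").get_text(strip=True)
        movies.append({"title": title, "rating": float(rating)})
    return movies


def to_csv(rows):
    """Transformation: serialise the extracted rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["title", "rating"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


movies = extract_movies(SAMPLE_HTML)
csv_text = to_csv(movies)
print(csv_text)
```

For a live page, you would pass the result of read_page(url) to extract_movies instead of SAMPLE_HTML, after adjusting the tag and class names to match what you see in the page’s HTML.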

You are provided with the well-commented Jupyter notebook that was covered in the video. It is for your reference only and shows a basic web scraping example. There are many other techniques and concepts in web scraping, but they are beyond the scope of this module. Still, you now have an idea of the web scraping process and a basic understanding of HTML.

You have gone through the top 50 movies page on IMDB’s website and seen the scraping process using Python. This page contains information about each movie, such as its name, rating, votes, runtime, genre, director details, actors and plot. As explained earlier, if you want to fetch information from the web page into a CSV file, you need to look into its HTML code to get an idea of the tags and attributes that hold the data.
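
As a small illustration of what “tags and attributes” means in practice, the sketch below parses a one-line HTML fragment with BeautifulSoup and reads a tag’s name, its attributes and its text. The fragment, including the class name, the data-rank attribute and the link path, is made up for this example and is not the real IMDB markup.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; the attribute names and values are illustrative only.
html = '<div class="movie" data-rank="1"><a href="/title/example/">Movie A</a></div>'

soup = BeautifulSoup(html, "html.parser")
container = soup.find("div")   # first <div> tag in the fragment
link = container.find("a")     # first <a> tag inside that <div>

tag_name = container.name          # the tag itself: "div"
classes = container.get("class")   # class can hold several values, so this is a list
rank = container["data-rank"]      # any other attribute is read like a dictionary key
href = link["href"]                # the link target stored in the href attribute
text = link.get_text()             # the visible text between <a> and </a>

print(tag_name, classes, rank, href, text)
```

Once you know which tag and attribute hold a given piece of information, you can pass the same names to find or find_all to extract it from every movie container on the page.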

Now try to answer the following questions on web scraping of the same IMDB top 50 movies web page.

