The IMDB website provides movie-related information such as release date, runtime, cast, genre and ratings. Consider a specific page on the IMDB website that lists the 50 top-rated movies along with details such as the number of votes each one received. What if you want to perform a deeper analysis to answer questions such as the following?
- Which director has the highest number of movies in the top 50 rated movies?
- Which genre has the highest rating among the top 50 rated movies?
- What is the gross expenditure of the lowest-rated movies as compared with the highest-rated ones?
One approach to answering these kinds of questions is manual: you check each piece of information on the page and enter it in a spreadsheet. Does this not seem to be a tedious and mundane task? This is where the technique of web scraping comes into the picture. It automates fetching the data from the web page and stores it in a structured format that is easy to process. This helps you perform deeper analyses and answer the aforementioned questions.
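To make the idea of a structured format concrete, here is a minimal sketch of how two of the questions above could be answered once the data has been scraped. The records, titles and director names are made-up placeholders, not real IMDB data, and the helper names `top_director` and `best_genre` are hypothetical.

```python
from collections import Counter

# Placeholder records standing in for rows scraped from a top-rated list.
# None of these values are real IMDB data.
movies = [
    {"title": "Movie A", "director": "Director X", "genre": "Drama", "rating": 9.0},
    {"title": "Movie B", "director": "Director Y", "genre": "Crime", "rating": 8.8},
    {"title": "Movie C", "director": "Director X", "genre": "Drama", "rating": 8.6},
    {"title": "Movie D", "director": "Director Z", "genre": "Crime", "rating": 8.5},
]


def top_director(records):
    """Return (director, count) for the director with the most movies."""
    counts = Counter(record["director"] for record in records)
    return counts.most_common(1)[0]


def best_genre(records):
    """Return the genre with the highest average rating."""
    ratings_by_genre = {}
    for record in records:
        ratings_by_genre.setdefault(record["genre"], []).append(record["rating"])
    averages = {g: sum(vals) / len(vals) for g, vals in ratings_by_genre.items()}
    return max(averages, key=averages.get)


print(top_director(movies))
print(best_genre(movies))
```

Once the data is in a list of records like this, each question becomes a few lines of code instead of a manual pass over the page.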
So, web scraping helps to fetch such information from websites. You will learn about web scraping in three parts:
- Need for and application of web scraping.
- The basics of an HTML page.
- Python libraries and codes for web scraping.
So, let’s listen to Rahim to understand how useful web scraping is.
Here, Rahim has given you an idea of how web scraping is useful for price comparison for e-commerce companies. It is also useful for tracking the stock market to identify the right time to buy or sell shares. Web scraping may also be helpful in the real estate sector to get the right property for a suitable price at the right location, etc.
You can refer to this website to get an idea of the practical need for web scraping.
Note:
Please note that web scraping is not legal for every website. Some websites explicitly permit others to scrape data from their pages, while others do not. Another important aspect is copyright: if the content is copyrighted (such as videos, pictures and articles), it is not illegal to scrape it, but it is illegal to republish it. Also, you cannot scrape a website just to build a duplicate competing site; it is acceptable to scrape data as long as you are using it to create something new.
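One common way for a site to state which pages automated clients may fetch is a robots.txt file. This convention is not covered in the lesson itself, so treat the following as an illustrative sketch only. It uses Python's standard `urllib.robotparser` module on a made-up robots.txt; the rules, the user agent name `my-scraper` and the example.com URLs are all assumptions.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for illustration. A real site serves its own
# file at https://<site>/robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "https://example.com/chart/top"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "https://example.com/private/data"))
```

A check like this only reports what the robots.txt rules say; the site's terms of use still apply.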
The Basics of an HTML Page
The basic requirement of web scraping is the web page that we are going to scrape. All web pages are written in HTML. So, you can perform web scraping using Python only after you understand the basic structure of an HTML page.
So, let’s get into the basics of how an HTML page looks and what its tags are.
Note:
Here, we are not going to have an end-to-end discussion on HTML; you will gain an understanding of only those concepts that are useful for fetching data from a web page in the scraping process.
HTML stands for ‘HyperText Markup Language’. It is used for creating electronic documents that are displayed on the World Wide Web. Each page that you see on the internet is written in HTML.
You can learn the basics of HTML using the Wikipedia page on ‘Machine learning’, which is shown below as a snapshot.
You can check the HTML code of any web page by following the instructions provided below:
Open web page -> Right-click -> Inspect
Once you click on ‘Inspect’, you will see the HTML code of this particular page on the right side of the screen, as shown in the screenshot provided below.
HTML code has a tree-like, nested structure in which elements sit inside other elements. At the top level, it contains a head and a body. The content that you see on screen comes from the body, which holds most of the code that matters for web scraping.
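The nested structure can be made visible with a short sketch that uses Python's built-in `html.parser` module. The sample HTML and the `TreePrinter` class are hypothetical and written only for this illustration. The class indents each tag by its depth in the tree.

```python
from html.parser import HTMLParser

# A small hand-written page used only for illustration.
SAMPLE_HTML = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <h1>Heading</h1>
    <div><p>A paragraph with a <a href="https://example.com">link</a>.</p></div>
  </body>
</html>
"""


class TreePrinter(HTMLParser):
    """Record each opening tag, indented by its nesting depth.

    This simple version assumes every tag has a closing tag, which holds
    for the sample above (it has no void elements such as <br>).
    """

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1


printer = TreePrinter()
printer.feed(SAMPLE_HTML)
print("\n".join(printer.lines))
```

In the printed outline, `head` and `body` appear one level below `html`, and the heading, div, paragraph and link are nested inside `body`.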
An HTML page is broadly built from two basic components:
- Attributes: These describe the characteristics of an element. The most commonly used ones are class, id and href. They act as labels that identify the different segments of a web page.
- Tags: A tag is the way an HTML element is represented in the code. The most commonly used ones are h (heading), p (paragraph), a (hyperlink) and div.
Let’s briefly go through the attributes one by one.
- Class: The class attribute specifies one or more class names for an HTML element. Several elements can share the same class name.
- Id: This attribute gives an element an ID that is meant to be unique on the page.
- href: This attribute holds the URL of the web page that a hyperlink in the text points to.
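As a rough illustration of how these three attributes appear in code, the sketch below collects every class, id and href value from a small piece of HTML using the built-in `html.parser` module. The sample markup, the example.com URL and the `AttributeCollector` class are made up for this example.

```python
from html.parser import HTMLParser

# Hand-written markup for illustration only.
SAMPLE_HTML = """
<div id="content" class="article main">
  <p class="intro">See the <a href="https://example.com/ml">article</a>.</p>
</div>
"""


class AttributeCollector(HTMLParser):
    """Collect (tag, attribute, value) for class, id and href attributes."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        attr_map = dict(attrs)
        for name in ("class", "id", "href"):
            if name in attr_map:
                self.found.append((tag, name, attr_map[name]))


collector = AttributeCollector()
collector.feed(SAMPLE_HTML)
for tag, name, value in collector.found:
    print(f"<{tag}> {name} = {value}")
```

Note that the div carries two class names, `article` and `main`, in a single class attribute separated by a space.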
A group of elements may share the same attributes even though they have different tags. Let’s go through the tags using the Wikipedia page examples to understand the concept better.
- Heading: It is represented by ‘h’ followed by a level number, from h1 to h6, in HTML code. It is used to place the headings of sections on a web page.
- Paragraph: It is represented by ‘p’ in HTML code. It is used to place a paragraph on the web page.
- Hyperlink: It is represented by ‘a’ in HTML code. It is used to link from the present web page to any other web page.
- Div: It is represented by ‘div’ in HTML code. It is used to structure the HTML page by acting as a container for other HTML elements, so that related elements are grouped together in one block.
- Span: It is represented by ‘span’ in HTML code. It is used for grouping inline elements and applying styles to them.
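The sketch below shows one possible way to pull out the text held by some of these tags, again using only the built-in `html.parser` module. The sample HTML and the `TextByTag` class are hypothetical. Each piece of text is credited to the innermost tag that encloses it.

```python
from html.parser import HTMLParser

# Hand-written markup for illustration only.
SAMPLE_HTML = """
<div>
  <h2>Overview</h2>
  <p>First paragraph.</p>
  <p>Read <a href="https://example.com">more</a> <span>here</span></p>
</div>
"""


class TextByTag(HTMLParser):
    """Group text by the innermost enclosing tag, for a chosen set of tags."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.stack = []  # currently open tags, outermost first
        self.texts = {tag: [] for tag in wanted}

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack and self.stack[-1] in self.wanted:
            self.texts[self.stack[-1]].append(text)


extractor = TextByTag(("h2", "p", "a", "span"))
extractor.feed(SAMPLE_HTML)
for tag, pieces in extractor.texts.items():
    print(tag, pieces)
```

The div itself contributes no text here, because it only acts as a container for the other elements.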
This is the basic information required to understand the structure of an HTML page. We will not be covering HTML in greater depth.
Now, try to answer the following questions on HTML basics.
FREQUENTLY ASKED QUESTIONS (FAQ)