You can refer to my video tutorial for the same on YouTube
Repository for Ultimate Resource in python. Drop a star if you find it useful! Got anything to add? Open a PR on the same!
Coronavirus cases are increasing rapidly worldwide. This tutorial will guide you on how to web scrape coronavirus data and export it into MS Excel.
What will be covered in this blog
- Introduction to Web Scraping
- Understanding HTML basics
- How to scrape a website
- How to export the data into excel file
Prerequisites
- Python
- Beautiful Soup
- pandas
- HTML
- CSS
What is Web Scraping?
Introduction
If you’ve ever copied and pasted information from a website, you’ve performed the same function as any web scraper, only on a microscopic, manual scale.
Web scraping, also known as web data extraction, is the process of retrieving or “scraping” data from a website. This information is collected and then exported into a format that is more useful for the user, be it a spreadsheet or an API.
Two important points to be taken into consideration here:
Always be respectful and try to get permission to scrape; do not bombard a website with scraping requests, otherwise your IP address may get blocked!
Be aware that websites change often, meaning your code could go from working to totally broken from one day to the next.
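As a quick illustration of the first point, polite scraping usually means identifying yourself and pausing between requests. This is a minimal sketch; the URLs and the User-Agent string here are just placeholders:

```python
import time

import requests

# A descriptive User-Agent is a courtesy; this value is only an illustration.
headers = {"User-Agent": "covid-tutorial-scraper/1.0"}

# Placeholder URLs, just to show the pattern.
urls = ["https://example.com/page1", "https://example.com/page2"]

for url in urls:
    response = requests.get(url, headers=headers)
    # ... process the response here ...
    time.sleep(5)  # pause between requests so we don't hammer the server
```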
The process: three simple steps
- Request for a response from the webpage
- Parse and extract with the help of Beautiful soup and lxml
- Download and export the data with pandas into excel
Its Uses
It can serve several purposes; the most popular ones are Investment Decision Making, Competitor Monitoring, News Monitoring, Market Trend Analysis, Appraising Property Value, Estimating Rental Yields, Politics and Campaigns, and many more.
If you wish to know more about it, I am attaching the Wikipedia link here. You can have a look.
What is Coronavirus?
I do not think Coronavirus needs an introduction, but just in case someone does not know, Coronavirus disease (COVID-19) is an infectious disease caused by a newly discovered coronavirus.
The COVID-19 virus spreads primarily through droplets of saliva or discharge from the nose when an infected person coughs or sneezes.
If you wish to know more about it, I am attaching the Wikipedia link here. You can have a look.
The data source
We need a webpage to fetch the coronavirus data from, so I am using the Worldometer website here.
You can use the attached link to navigate to the website. You can also refer to the WHO website. Worldometer's webpage will look something like this:
We are interested in the data contained in a table on Worldometer's website, which lists all the countries together with their current reported coronavirus cases, new cases for the day, total deaths, new deaths for the day, etc.
Now we are ready to start the coding.
Time to Code
You can find the code at my GitHub Repository
Let's Understand it
There are a few libraries you will need, so first, you need to install them. Go to your command line and install them.
```
pip install requests
pip install lxml
pip install bs4
pip install pandas
```
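Once installed, a quick sanity check (assuming a standard Python environment) is to import each library and print its version:

```python
import bs4
import pandas
import requests
from lxml import etree

# If any of these imports fail, the corresponding install did not succeed.
print("requests", requests.__version__)
print("beautifulsoup4", bs4.__version__)
print("lxml", etree.__version__)
print("pandas", pandas.__version__)
```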
Now let's see what we can do with these libraries.
Understanding HTML basics
This is the basic syntax of an HTML webpage. Every tag serves as a block inside the webpage:
- `<!DOCTYPE html>`: HTML documents must start with a type declaration.
- The HTML document is contained between `<html>` and `</html>`.
- The meta and script declarations of the HTML document are between `<head>` and `</head>`.
- The visible part of the HTML document is between the `<body>` and `</body>` tags.
- Title headings are defined with the `<h1>` through `<h6>` tags.
- Paragraphs are defined with the `<p>` tag.
- Other useful tags include `<a>` for hyperlinks, `<table>` for tables, `<tr>` for table rows, and `<td>` for table columns.
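To make this concrete, here is a small, made-up HTML document (held in a Python string) that uses the tags described above:

```python
# A minimal, made-up HTML document illustrating the tags described above.
sample_html = """<!DOCTYPE html>
<html>
<head>
    <title>Sample page</title>
</head>
<body>
    <h1>A title heading</h1>
    <p>A paragraph with a <a href="https://example.com">hyperlink</a>.</p>
    <table>
        <tr><td>row 1, cell 1</td><td>row 1, cell 2</td></tr>
    </table>
</body>
</html>"""
```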
requests
- Use the requests library to grab the page.
- This may fail if you have a firewall blocking Python/Jupyter.
- Sometimes you need to run this twice if it fails the first time.
```python
import requests

# make a request to the webpage
result = requests.get('https://www.worldometers.info/coronavirus/country/india/')
print(result.text)
```
The requests library that we downloaded goes and gets a response; to make a request to the webpage, we use the `requests.get(website URL)` method. If the request is successful, it will be stored as a giant Python string, and we can fetch the complete webpage source code by running `result.text`. But the code will not be structured.
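It is also worth checking that the request actually succeeded before parsing anything. A small sketch, using only standard requests features:

```python
import requests

result = requests.get('https://www.worldometers.info/coronavirus/country/india/')

# A status code of 200 means the request succeeded.
print(result.status_code)

# Or raise an exception automatically on a failed (4xx/5xx) response.
result.raise_for_status()
```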
Beautiful Soup
BeautifulSoup library already has lots of built-in tools and methods to grab information from a string of this nature (basically an HTML file). It is a Python library for pulling data out of HTML and XML files.
Using BeautifulSoup we can create a "soup" object that contains all the "ingredients" of the webpage.
If you wish to know more about BeautifulSoup, I am attaching the documentation link here. You can have a look.
```python
import bs4

soup = bs4.BeautifulSoup(result.text, 'lxml')
print(soup)
```
Now we import bs4, and the next step is to actually create the `soup` variable. We pass two things here: the `result.text` string and `'lxml'` as a string. lxml goes through the HTML document and figures out what is a CSS class, what is a CSS ID, which are the different HTML elements and tags, and so on.
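As a quick illustration of what the parsed tree gives you, a minimal sketch (rebuilding the `soup` so the snippet is self-contained):

```python
import bs4
import requests

result = requests.get('https://www.worldometers.info/coronavirus/country/india/')
soup = bs4.BeautifulSoup(result.text, 'lxml')

# The <title> tag of the page, located via the parsed tree.
print(soup.title.string)

# Count every hyperlink (<a> tag) that the parser identified in the document.
print(len(soup.find_all('a')))
```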
Extracting the data
Find the div. To find the element we need, right-click and hit Inspect on the number of cases. Refer to the attached snapshot below.
We need to find the right class; `class_='maincounter-number'` serves our purpose. Refer to the attached snapshot below.
The Beautiful Soup object has been created in our Python script and the HTML data of the website has been scraped off of the page. Next, we need to get the data that we are interested in, out of the HTML code.
```python
cases = soup.find_all('div', class_='maincounter-number')
print(cases)
```
There is still a lot of HTML code that we do not want. Our desired data entries are wrapped in div elements with `class_='maincounter-number'`. We can use this knowledge to further clean up the scraped data.
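As a side note, BeautifulSoup also accepts CSS selectors via its `select` method; continuing with the `soup` object from above, this is equivalent to the `find_all` call:

```python
# CSS-selector equivalent of soup.find_all('div', class_='maincounter-number')
cases = soup.select('div.maincounter-number')
print(cases)
```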
Storing the data
We need to save the scraped data in some form that can be used effectively. For this project, all the data will be saved in a Python list. To achieve this, we will:
```python
data = []

# find the span inside each div and get the data from it
for i in cases:
    span = i.find('span')
    data.append(span.string)

print(data)
```
We use `span` to fetch the data from each `div`. We need the numbers, and we do not want to be dealing with the tags, so we use `span.string` to get the numbers and append them to the `data` list.
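Note that the scraped values are strings such as '1,234,567', with commas and possibly surrounding whitespace. If you want to work with actual numbers, a small cleaning step on top of the `data` list helps (a sketch, assuming every entry parses as an integer):

```python
# Convert the scraped strings (e.g. '1,234,567') into plain integers.
numbers = [int(value.strip().replace(',', '')) for value in data]
print(numbers)
```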
Now that we have the numbers we are ready to export our data into an excel file.
Exporting the data
Our last step is to export the data to MS Excel. To fulfil the purpose, I am making use of pandas.
To load the pandas package and start working with it, import the package. The community-agreed alias for pandas is `pd`, so loading pandas as `pd` is assumed standard practice.
If you wish to know more about pandas, I am attaching the documentation link here. You can have a look.
```python
import pandas as pd

df = pd.DataFrame({"CoronaData": data})

# naming the rows
df.index = ['TotalCases', 'Deaths', 'Recovered']

# file name
df.to_csv('Corona_Data.csv')
```
A DataFrame is a 2-dimensional, potentially heterogeneous tabular data structure with labelled axes (rows and columns).
`df = pd.DataFrame({"CoronaData": data})` is used to create a DataFrame, name its column, and map it to the `data` list that we created earlier.
Next, we give the row labels with `df.index` (note that `df.index` labels the rows, not the columns). It will look something like this.
Final step: we are ready to export the data into MS Excel. We will use `df.to_csv` for the same.
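`df.to_csv` produces a comma-separated file that Excel opens directly. If you would rather have a native `.xlsx` workbook, pandas also provides `DataFrame.to_excel` (this needs the `openpyxl` package installed):

```python
# Requires: pip install openpyxl
df.to_excel('Corona_Data.xlsx')
```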
Here's our result
NOTE: The output depends on the current statistics
You can find the code at my GitHub Repository
You can also connect with me on Twitter
I hope this helped you in understanding how to web scrape coronavirus data into Excel. Also, have a look at my other blogs:
- Python 3.9: All You need to know
- The Ultimate Python Resource hub
- GitHub CLI 1.0: All you need to know
- Become a Better Programmer
- How to make your own Google Chrome Extension
- Create your own Audiobook from any pdf with Python
- You are Important & so is your Mental Health!
If you have any Queries or Suggestions, feel free to reach out to me in the Comments Section below.
Resources: