Reading a file from the web with pandas¶

Overview:

Teaching: 10 min
Exercises: 5 min

Questions

How can I read data from the web into pandas?

Objectives

Use pandas to download and read in a CSV file for the web.

Increasingly data is being made openly available on the internet. All research data that supports published work should where possible be published and the UK government has adopted an open data policy to increase transparency e.g centrally via https://data.gov.uk/ and the Office of National Statistics: https://www.ons.gov.uk/.

For this lesson we will focus on data related to COVID-19/Cornoavirus which are available at https://coronavirus.data.gov.uk/details/download. We have used this to generate a link to the latest daily and cumulated recorded cases and deaths in csv format. You can explore the page to extract different data and generate your own link. We will read this directly into pandas so that whenever we run our analysis we will always be working with the latest data.

First we will assign the url to a variable:

data_url='https://api.coronavirus.data.gov.uk/v2/data?areaType=overview&metric=cumCasesBySpecimenDate&metric=cumDeaths28DaysByDeathDate&metric=newCasesBySpecimenDate&metric=newDeaths28DaysByDeathDate&format=csv'

Because this link will generate a csv file, we can use the pandas function read_csv() just as before, to read the file and convert it to a DataFrame.

import pandas as pd

covid_data=pd.read_csv(data_url)
covid_data

We can see that by default it's done a good job of parsing the file, as the open data is designed to be as usable as possible. In selecting a subset of this data it would be useful to treat the date as a datetime value rather than a string and possibly use this as the index of the dataset. First let's first check what datatype have been assigned to each of the columns:

dtypes = covid_data.dtypes
print(dtypes)
print(type(dtypes))

date                           object
areaType                       object
areaCode                       object
areaName                       object
cumCasesBySpecimenDate        float64
cumDeaths28DaysByDeathDate    float64
newCasesBySpecimenDate        float64
newDeaths28DaysByDeathDate    float64
dtype: object
<class 'pandas.core.series.Series'>

This tells is that the date is of type of object which is how pandas treats strings. We can use an inbuilt pandas function to convert this one column to a datetime format and reassign it to the date column. In order to reduce the amount of data printed to the screen we can also use dataframe_name.head() to print just the first few lines of a dataframe.

covid_data['date'] = pd.to_datetime(covid_data['date'])
covid_data.head()

We don't see any difference in the output but can verify that the datatype has been converted by checking the types again:

dtypes = covid_data.dtypes
print(dtypes)
print(type(dtypes))

date                          datetime64[ns]
areaType                              object
areaCode                              object
areaName                              object
cumCasesBySpecimenDate               float64
cumDeaths28DaysByDeathDate           float64
newCasesBySpecimenDate               float64
newDeaths28DaysByDeathDate           float64
dtype: object
<class 'pandas.core.series.Series'>

Let's verify that we can select a range of dates. We need to import the datetime library explicitly and assign start and end dates, for which we will choose 1st October 2020 and 1st November respectively.

import datetime
start_date=datetime.datetime(year=2020,month=10,day=1)
end_date=datetime.datetime(year=2020,month=10,day=31)

We can now use these to select a subset of the original dataframe for the month of October. To do this we need to identify all data where the date is greater than or equal to the start date AND less than the end date. As we saw in the first episode of the lesson we can do this by generating a boolean mask. First to select data after the start date

covid_data['date'] >= start_date

0       True
1       True
2       True
3       True
4       True
       ...  
324    False
325    False
326    False
327    False
328    False
Name: date, Length: 329, dtype: bool

second before the end date:

covid_data['date'] <= end_date

0      False
1      False
2      False
3      False
4      False
       ...  
324     True
325     True
326     True
327     True
328     True
Name: date, Length: 329, dtype: bool

But we need both of these to be true so we combine them with the boolean and operator:

(covid_data['date'] >= start_date) & (covid_data['date'] <= end_date)

0      False
1      False
2      False
3      False
4      False
       ...  
324    False
325    False
326    False
327    False
328    False
Name: date, Length: 329, dtype: bool

While it is useful to know about this way of combining masks there is also a useful helper function in pandas between that can handle the logic for us:

covid_data['date'].between(start_date,end_date)

0      False
1      False
2      False
3      False
4      False
       ...  
324    False
325    False
326    False
327    False
328    False
Name: date, Length: 329, dtype: bool

Now we can use this mask to select the data that we want to plot or analyse further:

covid_oct = covid_data[covid_data['date'].between(start_date,end_date)]
covid_oct.head()

Let's now plot the number of daily cases against the date:

import matplotlib.pyplot as plt

covid_oct.plot(x='date', y='newCasesBySpecimenDate', label='Total UK')
plt.xlabel('Date')
plt.ylabel('Daily Cases')

Text(0, 0.5, 'Daily Cases')

Explore data through plots

Create plots of different periods and datasets to explore how to use pandas to plot data.
Can you create a plot of the ratio of COVID deaths to cases

Key Points:

Pandas read_csv() function can also be used to read CSV files directly from the web.
We can then manipulate and plot the data or subsets of it.

	date	areaType	areaCode	areaName	cumCasesBySpecimenDate	cumDeaths28DaysByDeathDate	newCasesBySpecimenDate	newDeaths28DaysByDeathDate
0	2020-11-26	overview	K02000001	United Kingdom	NaN	NaN	NaN	NaN
1	2020-11-25	overview	K02000001	United Kingdom	NaN	NaN	NaN	NaN
2	2020-11-24	overview	K02000001	United Kingdom	1572057.0	56857.0	11242.0	300.0
3	2020-11-23	overview	K02000001	United Kingdom	1560815.0	56557.0	17798.0	360.0
4	2020-11-22	overview	K02000001	United Kingdom	1543017.0	56197.0	11729.0	407.0
...	...	...	...	...	...	...	...	...
324	2020-01-07	overview	K02000001	United Kingdom	NaN	NaN	NaN	NaN
325	2020-01-06	overview	K02000001	United Kingdom	NaN	NaN	NaN	NaN
326	2020-01-05	overview	K02000001	United Kingdom	NaN	NaN	NaN	NaN
327	2020-01-04	overview	K02000001	United Kingdom	NaN	NaN	NaN	NaN
328	2020-01-03	overview	K02000001	United Kingdom	NaN	NaN	NaN	NaN

	date	areaType	areaCode	areaName	cumCasesBySpecimenDate	cumDeaths28DaysByDeathDate	newCasesBySpecimenDate	newDeaths28DaysByDeathDate
0	2020-11-26	overview	K02000001	United Kingdom	NaN	NaN	NaN	NaN
1	2020-11-25	overview	K02000001	United Kingdom	NaN	NaN	NaN	NaN
2	2020-11-24	overview	K02000001	United Kingdom	1572057.0	56857.0	11242.0	300.0
3	2020-11-23	overview	K02000001	United Kingdom	1560815.0	56557.0	17798.0	360.0
4	2020-11-22	overview	K02000001	United Kingdom	1543017.0	56197.0	11729.0	407.0

	date	areaType	areaCode	areaName	cumCasesBySpecimenDate	cumDeaths28DaysByDeathDate	newCasesBySpecimenDate	newDeaths28DaysByDeathDate
26	2020-10-31	overview	K02000001	United Kingdom	1063139.0	47466.0	16503.0	319.0
27	2020-10-30	overview	K02000001	United Kingdom	1046636.0	47147.0	22682.0	338.0
28	2020-10-29	overview	K02000001	United Kingdom	1023954.0	46809.0	23335.0	310.0
29	2020-10-28	overview	K02000001	United Kingdom	1000619.0	46499.0	23622.0	281.0
30	2020-10-27	overview	K02000001	United Kingdom	976997.0	46218.0	24113.0	266.0

Reading a file from the web with pandas¶

Overview:

Explore data through plots

Key Points:

Previous

Schedule

Next