Data is increasingly being made openly available on the internet. Where possible, all research data that supports published work should itself be published, and the UK government has adopted an open data policy to increase transparency, e.g. centrally via https://data.gov.uk/ and through the Office for National Statistics: https://www.ons.gov.uk/.
For this lesson we will focus on data related to COVID-19/Coronavirus, which are available at https://coronavirus.data.gov.uk/details/download. We have used this page to generate a link to the latest daily and cumulative recorded cases and deaths in csv format. You can explore the page to extract different data and generate your own link. We will read the data directly into pandas so that whenever we run our analysis we are working with the latest data.
First we will assign the url to a variable:
data_url='https://api.coronavirus.data.gov.uk/v2/data?areaType=overview&metric=cumCasesBySpecimenDate&metric=cumDeaths28DaysByDeathDate&metric=newCasesBySpecimenDate&metric=newDeaths28DaysByDeathDate&format=csv'
Because this link will generate a csv file, we can use the pandas function read_csv(), just as before, to read the file and convert it to a DataFrame.
import pandas as pd
covid_data=pd.read_csv(data_url)
covid_data
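If the link is unreachable, the same call can be tried offline. The sketch below assumes a tiny hand-made CSV that reuses the date column and two of the metric names requested in the link; the name sample_csv and all the numbers are invented placeholders, not real counts.

```python
import io

import pandas as pd

# Hypothetical sample mimicking part of the layout of the downloaded file.
# The values are invented placeholders, NOT real case or death counts.
sample_csv = """date,newCasesBySpecimenDate,newDeaths28DaysByDeathDate
2020-10-03,30,3
2020-10-02,20,2
2020-10-01,10,1
"""

# read_csv accepts a URL, a file path, or any file-like object,
# so wrapping the text in StringIO stands in for the download.
sample_data = pd.read_csv(io.StringIO(sample_csv))

print(sample_data.shape)
print(list(sample_data.columns))
```

Swapping io.StringIO(sample_csv) for data_url gives the call used in the lesson.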
We can see that by default it has done a good job of parsing the file, as the open data is designed to be as usable as possible. To select a subset of this data it would be useful to treat the date as a datetime value rather than a string, and possibly to use it as the index of the dataset. First let's check what datatypes have been assigned to each of the columns:
dtypes = covid_data.dtypes
print(dtypes)
print(type(dtypes))
This tells us that the date column is of type object, which is how pandas treats strings. We can use the built-in pandas function to_datetime() to convert this one column to a datetime format and reassign it to the date column. To reduce the amount of data printed to the screen we can also use dataframe_name.head() to print just the first few lines of a dataframe.
covid_data['date'] = pd.to_datetime(covid_data['date'])
covid_data.head()
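As an alternative to converting after loading, read_csv() can be asked to parse the column while reading, through its parse_dates argument. This is a minimal sketch on an invented two-row sample (sample_csv and its values are placeholders), not a change to the approach used in this lesson.

```python
import io

import pandas as pd
from pandas.api.types import is_datetime64_any_dtype

# Invented placeholder rows; only the column names follow the lesson's data.
sample_csv = "date,newCasesBySpecimenDate\n2020-10-02,20\n2020-10-01,10\n"

# Default behaviour: the date column is read as text.
as_text = pd.read_csv(io.StringIO(sample_csv))

# With parse_dates, the named column is converted to datetime on load.
as_dates = pd.read_csv(io.StringIO(sample_csv), parse_dates=["date"])

print(is_datetime64_any_dtype(as_text["date"]))   # False
print(is_datetime64_any_dtype(as_dates["date"]))  # True
```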
We don't see any difference in the output but can verify that the datatype has been converted by checking the types again:
dtypes = covid_data.dtypes
print(dtypes)
print(type(dtypes))
Let's verify that we can select a range of dates. We need to import the datetime library explicitly and assign start and end dates, for which we will choose 1st October 2020 and 31st October 2020 respectively.
import datetime
start_date=datetime.datetime(year=2020,month=10,day=1)
end_date=datetime.datetime(year=2020,month=10,day=31)
We can now use these to select a subset of the original dataframe for the month of October. To do this we need to identify all rows where the date is greater than or equal to the start date AND less than or equal to the end date. As we saw in the first episode of the lesson, we can do this by generating a boolean mask. First, to select data on or after the start date:
covid_data['date'] >= start_date
Second, to select data on or before the end date:
covid_data['date'] <= end_date
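But we need both of these to be true, so we combine them with the element-wise boolean AND operator, &. The parentheses are required because & binds more tightly than the comparison operators: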
(covid_data['date'] >= start_date) & (covid_data['date'] <= end_date)
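To see what the combined mask does row by row, here is a small sketch on five hand-picked dates (the Series dates is invented for illustration). It also shows why Python's and keyword cannot be used in place of &.

```python
import datetime

import pandas as pd

# Hand-picked dates straddling October 2020 (illustrative only).
dates = pd.Series(pd.to_datetime(
    ["2020-09-30", "2020-10-01", "2020-10-15", "2020-10-31", "2020-11-01"]
))
start_date = datetime.datetime(year=2020, month=10, day=1)
end_date = datetime.datetime(year=2020, month=10, day=31)

after_start = dates >= start_date
before_end = dates <= end_date

# & compares the two masks element by element.
in_october = after_start & before_end
print(in_october.tolist())  # [False, True, True, True, False]

# `and` asks for a single True/False for the whole Series, which pandas refuses.
try:
    after_start and before_end
except ValueError as err:
    print("ValueError:", err)
```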
While it is useful to know about this way of combining masks, pandas also provides a helper method, between(), that handles the logic for us and includes both end points by default:
covid_data['date'].between(start_date,end_date)
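The sketch below checks, on a few invented dates, that between() gives the same mask as the manual combination. It also shows the inclusive argument, which, assuming pandas 1.3 or later, accepts "both", "neither", "left" or "right"; using "left" with 1st November as the end point is another way to pick out October.

```python
import datetime

import pandas as pd

# Illustrative dates only.
dates = pd.Series(pd.to_datetime(
    ["2020-09-30", "2020-10-01", "2020-10-31", "2020-11-01"]
))
start_date = datetime.datetime(year=2020, month=10, day=1)
end_date = datetime.datetime(year=2020, month=10, day=31)

# By default both end points are included, matching >= and <=.
default_mask = dates.between(start_date, end_date)
manual_mask = (dates >= start_date) & (dates <= end_date)
print(default_mask.equals(manual_mask))  # True

# Half-open interval: include the start, exclude the end.
# Assumes pandas >= 1.3, where `inclusive` takes these string values.
november_first = datetime.datetime(year=2020, month=11, day=1)
half_open = dates.between(start_date, november_first, inclusive="left")
print(half_open.tolist())  # [False, True, True, False]
```

The half-open form can be handy if the dates ever carry a time of day, since a timestamp later on 31st October would fall outside an end point of midnight on the 31st.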
Now we can use this mask to select the data that we want to plot or analyse further:
covid_oct = covid_data[covid_data['date'].between(start_date,end_date)]
covid_oct.head()
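As mentioned earlier, the date could also serve as the index of the dataset. A minimal sketch of that alternative, on an invented four-row frame (the values are placeholders), uses set_index() followed by label slicing with .loc. Sorting the index first is an assumption made here so that slicing behaves predictably whatever order the rows arrive in.

```python
import pandas as pd

# Invented placeholder values, listed newest first.
sample = pd.DataFrame({
    "date": pd.to_datetime(
        ["2020-11-01", "2020-10-31", "2020-10-01", "2020-09-30"]
    ),
    "newCasesBySpecimenDate": [40, 30, 20, 10],
})

# Use the date as the index and put it in ascending order.
indexed = sample.set_index("date").sort_index()

# With a sorted DatetimeIndex, .loc slices by date label, end points included.
october = indexed.loc["2020-10-01":"2020-10-31"]
print(october)
```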
Let's now plot the number of daily cases against the date:
import matplotlib.pyplot as plt
covid_oct.plot(x='date', y='newCasesBySpecimenDate', label='Total UK')
plt.xlabel('Date')
plt.ylabel('Daily Cases')
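DataFrame.plot() returns a matplotlib Axes object, so the labels can also be set on that object directly, which helps once a figure holds more than one plot. The sketch below uses invented placeholder values and selects the non-interactive Agg backend so that it runs as a plain script; in a notebook that line can be dropped.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: running without a display

import matplotlib.pyplot as plt
import pandas as pd

# Invented placeholder values, NOT real case counts.
sample = pd.DataFrame({
    "date": pd.date_range("2020-10-01", periods=5, freq="D"),
    "newCasesBySpecimenDate": [10, 20, 15, 30, 25],
})

# plot() hands back the Axes it drew on.
ax = sample.plot(x="date", y="newCasesBySpecimenDate", label="Total UK")
ax.set_xlabel("Date")
ax.set_ylabel("Daily Cases")
```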