numpy
to fit your data¶In addition to documentation/references given throughout, this episode draws on the material in https://towardsdatascience.com/regression-plots-with-pandas-and-numpy-faf2edbfad4f.
Generally it is good practice to import all libraries you will need at the tpo of your note book so that it your dependencies are clear:
import pandas as pd
import matplotlib.pyplot as plt
import datetime
import numpy as np
We are importing pandas, datetime, matplotlib.pyplot and numpy. numpy is a numerical library which underpins the data structures used in pandas. Here we are going to use the features of numpy to analyse the data. First we will download the covid data again and extract a subset to analyse from the 'first wave'.
data_url='https://api.coronavirus.data.gov.uk/v2/data?areaType=overview&metric=cumCasesBySpecimenDate&metric=cumDeaths28DaysByDeathDate&metric=newCasesBySpecimenDate&metric=newDeaths28DaysByDeathDate&format=csv'
covid_data=pd.read_csv(data_url)
covid_data.head()
As before we need to convert the date to a datetime
format:
covid_data['date'] = pd.to_datetime(covid_data['date'])
To focus on the first wave we will set start and end dates to 1st March 2020 and 1st June 2020:
start_date=datetime.datetime(year=2020,month=3,day=1)
end_date=datetime.datetime(year=2020,month=5,day=31)
As we saw earlier we can now use these to select a range of data
covid_wave1 = covid_data[covid_data['date'].between(start_date, end_date)]
covid_wave1.head()
Let's now plot the number of daily cases against the date:
covid_wave1.plot(x='date', y='newCasesBySpecimenDate', label='Total UK')
plt.xlabel('Date')
plt.ylabel('Daily Cases')
This data is showing very different behaviour at different times so lets select a smaller range covering just March
start_date=datetime.datetime(year=2020,month=3,day=1)
end_date=datetime.datetime(year=2020,month=3,day=30)
covid_march = covid_data[covid_data['date'].between(start_date, end_date) ]
covid_march.plot(x='date', y='newCasesBySpecimenDate', label='Total UK')
plt.xlabel('Date')
plt.ylabel('Daily Cases')
Now we can generate a fit to the data. There are many libraries such as scipy that we could use to do this see e.g. the docs and lectures however we are going to use numpy's polyfit
function.
First we need to modify the dataset as it is currently in reverse time order:
covid_march.head()
To reverse the order into chronological we need to invert the dataframe using iloc[::-1] and reset the index using reset_index:
covid_march=covid_march.iloc[::-1].reset_index(drop=True)
covid_march.head()
Now we can fit the data. polyfit
requires us to specify the x
and y
datasets and a 'degree'. The degree
refers to the highest power of x
you want to use, we will being with degree 1 which means fitting the data using a function of the form $y=ax+b$. We will use the index for the x values, rather than the date
, think why this might be:
fit1dcoeff = np.polyfit(x=covid_march.index, y=covid_march['newCasesBySpecimenDate'],deg=1)
fit1dcoeff
The function returns the coefficients that have been obtained using a least squares fit. Note that the order of the coefficients is the opposite of what we might expect, the gradient, $a$, is the first value in the list and the intercept, $b$, is the second value. We could no generate the fitted data ourselves, however it is easier to use the numpy helper function, poly1d
to add the fitted data to the dataframe. First this creates a function fit
which we then use to calculate a value for each value of the index:
fit = np.poly1d(fit1dcoeff)
covid_march['casesFit']=fit(covid_march.index)
covid_march.head()
It's easier to explore this visually so let's plot our fitted function over the original data to see how good the fit is, note that we have to assign the plots to the same set of axes (ax
) in order to plot one over the other:
ax = covid_march.plot(x='date', y='newCasesBySpecimenDate', label='Total UK')
covid_march.plot(x='date', y='casesFit', label='Fit', ax=ax)
plt.xlabel('Date')
plt.ylabel('Daily Cases')
In discussing the number of daily cases we are often told that the growth is exponential. Can we use what we have been exploring to fit an exponential to the data?
We expect an exponential to have the form
$$ y = A\exp^{(Bx)} $$if we take natural logs of both sides we have:
$$ \ln y = \ln (A) + Bx. $$So if we fit $\ln(y)$ against $x$ the intercept will be $\ln A$ and the gradient $B$:
fitexpcoeff = np.polyfit(x=covid_march.index, y=np.log(covid_march['newCasesBySpecimenDate']),deg=1)
fitexpcoeff
As we did with the previous fit we can now add a new column calculating the fitted exponential value by noting that $A$ will be given by the exponential of the intercept of the fit i.e, $A = \exp^{\mathrm{intercept}}$ and $B$ will be the calculated gradient:
def fitexp(x, coeff):
return np.exp(coeff[1]) * np.exp(coeff[0]*x)
covid_march['casesFitexp']=fitexp(covid_march.index, fitexpcoeff)
covid_march.head()
ax = covid_march.plot(x='date', y='newCasesBySpecimenDate', label='Total UK')
covid_march.plot(x='date', y='casesFitexp', label='Fitexp', ax=ax)
plt.xlabel('Date')
plt.ylabel('Daily Cases')