Now Code

Processing Data from the Web

Overview

  • Teaching: 0 min
  • Exercises: 90 min

Questions

  • How can I access data from the web?
  • How can I mine these files for data?

Objectives

  • Be able to access data from websites
  • Process these files using pandas
  • Analyse and plot the data

Getting Started

Open Jupyter Notebooks and change to this directory. Then open a new notebook.

In this exercise you will download data from the web into a pandas dataframe and then write some analysis and plotting code. The dataset that we will analyse is taken from co2.earth, specifically the monthly data. Before processing the data, download and inspect it so that you know what it should look like when you open it in a dataframe.

1: Open the data

In a single cell write the code to open the dataset and display it. The full link for the dataset is:

ftp://data.iac.ethz.ch/CMIP6/input4MIPs/UoM/GHGConc/CMIP/mon/atmos/UoM-CMIP-1-1-0/GHGConc/gr3-GMNHSH/v20160701/mole_fraction_of_carbon_dioxide_in_air_input4MIPs_GHGConcentrations_CMIP_UoM-CMIP-1-1-0_gr3-GMNHSH_000001-201412.csv

Hint: You should be able to open csv files straight from a web link using the pandas function read_csv().

Pandas does a good job of interpreting the data (e.g. as float or int); however, you can tell it to import certain columns as specific data types. Inspect the data and think about which columns might need to be imported in this way.
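As a rough sketch of how data types can be set on import, the example below reads a small inline sample rather than the real file so that it runs offline. The column names and values in the sample are assumptions made for illustration; check them against what you see when you inspect the downloaded file.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for the real file.
# Column names and values are invented for illustration only.
sample_csv = io.StringIO(
    "datetime,year,month,data_mean_global\n"
    "15-Jan-2000 00:00:00,2000,1,368.9\n"
    "15-Feb-2000 00:00:00,2000,2,369.3\n"
    "15-Mar-2000 00:00:00,2000,3,369.8\n"
)

# dtype maps column names to the type pandas should use on import.
# Here the datetime column is kept as plain text and year/month as integers.
co2_sample = pd.read_csv(
    sample_csv,
    dtype={"datetime": str, "year": int, "month": int},
)

# With the real data you would pass the link instead, e.g.
# co2_data = pd.read_csv(url)   # url holds the full link given above
print(co2_sample.dtypes)
```

Printing dtypes after import is a quick way to confirm that each column was read as the type you intended.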

2: Plot the data

Now we wish to prepare a plot comparing the datasets for the five years from 2000 to 2005. To start with, plot just the dataset data_mean_global, with a line like co2_data.data_mean_global.plot().

By setting the labels, legend, limits and ticks, plot all three data sets to prepare a plot that looks like:

CO2 data plot

and once you have it looking correct, save it to a file.
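One possible shape for the plotting code is sketched below. It uses invented values, assumes the hemisphere columns are called data_mean_nh and data_mean_sh, assumes a numeric date column holding the decimal year, and assumes the concentrations are in ppm. The output file name is also hypothetical.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Invented monthly values for 2000 to 2005, for illustration only
months = range(72)
co2_sample = pd.DataFrame({
    "date": [2000 + m / 12 for m in months],          # assumed decimal-year column
    "data_mean_global": [369.0 + 0.15 * m for m in months],
    "data_mean_nh": [370.0 + 0.15 * m for m in months],  # assumed column name
    "data_mean_sh": [368.0 + 0.15 * m for m in months],  # assumed column name
})

fig, ax = plt.subplots()
for column, label in [
    ("data_mean_global", "Global"),
    ("data_mean_nh", "Northern hemisphere"),
    ("data_mean_sh", "Southern hemisphere"),
]:
    ax.plot(co2_sample["date"], co2_sample[column], label=label)

# Labels, limits, ticks and legend
ax.set_xlabel("Year")
ax.set_ylabel("CO2 concentration (ppm)")  # unit is an assumption
ax.set_xlim(2000, 2006)
ax.set_xticks(range(2000, 2007))
ax.legend()

# Save the finished figure to a file (hypothetical file name)
fig.savefig("co2_2000_2005.png")
```

Working with an explicit Axes object (ax) keeps all of the decoration calls in one place, which makes it easier to adjust the plot until it matches the target figure.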

We cannot easily make use of the column datetime due to the way in which the Python datetime object works. So think about how you might use the ID and your knowledge of the data to generate the year, or alternatively create a new column, date, which expresses the year and month in a format that you can use.
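A minimal sketch of the ID-based approach follows. It assumes the rows are monthly, in chronological order with no gaps, and start in January of year 0 (as the 000001-201412 part of the file name suggests). The data values are invented, and the decimal-year form of the date column is just one choice of usable format.

```python
import pandas as pd

# Hypothetical stand-in: 36 monthly rows with invented values
co2_sample = pd.DataFrame(
    {"data_mean_global": [280.0 + 0.01 * i for i in range(36)]}
)

# The default integer index plays the role of the row ID: 0, 1, 2, ...
row_id = co2_sample.index.to_numpy()

# Twelve rows per year, so integer division gives the year
# and the remainder gives the month (1 to 12).
co2_sample["year"] = row_id // 12
co2_sample["month"] = row_id % 12 + 1

# A decimal year is a plain float, so it can be plotted and filtered
# without needing a datetime object at all.
co2_sample["date"] = co2_sample["year"] + (co2_sample["month"] - 1) / 12

print(co2_sample.head(14))
```

With a numeric year column in place, selecting a range of years becomes an ordinary comparison, for example co2_sample[co2_sample.year >= 1].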

3: Aggregate and Collect

Now we wish to calculate the mean, maximum and minimum annual CO2 levels globally and in each hemisphere. Construct a single data frame which has 9 'columns', three aggregate values for each of the three datasets, with the year as the index and appropriate column labels.

You will need to make use of the aggregate function that we saw in 'Working with Data and Plotting'. You may also find the functions drop, columns and concat useful; they are explored in more detail in the pandas documentation.
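The sketch below shows one way the pieces could fit together, using groupby with aggregate for each dataset and concat to join the results side by side. The data values are invented, the hemisphere column names are assumptions, and the naming scheme for the nine output columns is only a suggestion.

```python
import pandas as pd

# Invented values: three months in each of two years, for illustration only
co2_sample = pd.DataFrame({
    "year": [2000, 2000, 2000, 2001, 2001, 2001],
    "data_mean_global": [369.0, 370.0, 368.0, 371.0, 372.0, 370.0],
    "data_mean_nh": [370.0, 372.0, 368.0, 372.0, 374.0, 370.0],  # assumed name
    "data_mean_sh": [368.0, 368.5, 367.5, 370.0, 370.5, 369.5],  # assumed name
})

data_columns = ["data_mean_global", "data_mean_nh", "data_mean_sh"]

pieces = []
for column in data_columns:
    # One row per year, one column per aggregate value
    stats = co2_sample.groupby("year")[column].aggregate(["mean", "max", "min"])
    # Rename so the columns stay distinct once the pieces are joined
    stats.columns = [f"{column}_{stat}" for stat in stats.columns]
    pieces.append(stats)

# axis=1 places the three pieces side by side, aligned on the year index
annual = pd.concat(pieces, axis=1)

print(annual)
```

Grouping by year makes the year the index of the result automatically, so no separate step is needed to set it.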

Sandpit

Using the data that you produced in the previous exercise, can you prepare a plot of the mean CO2 levels that adds the min and max as 'error' bars? This will make use of the yerr keyword. As always, you are encouraged to look at the documentation and any examples you might find. Your final plot should look something like:
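As a hedged starting point, the sketch below uses matplotlib's errorbar function, whose yerr argument accepts a two-row array of lower and upper bar lengths. Because yerr expects distances from the plotted point rather than absolute values, the min and max are first converted to offsets from the mean. The annual values and column names here are invented for illustration.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Invented annual summary with hypothetical column names
annual = pd.DataFrame(
    {
        "global_mean": [369.0, 371.0, 373.0],
        "global_min": [367.5, 369.5, 371.0],
        "global_max": [370.5, 372.0, 375.0],
    },
    index=pd.Index([2000, 2001, 2002], name="year"),
)

# Distances below and above the mean; both rows must be non-negative
lower = annual["global_mean"] - annual["global_min"]
upper = annual["global_max"] - annual["global_mean"]
yerr = np.vstack([lower, upper])  # shape (2, N): row 0 lower, row 1 upper

fig, ax = plt.subplots()
ax.errorbar(annual.index, annual["global_mean"], yerr=yerr, fmt="o-", capsize=3)
ax.set_xlabel("Year")
ax.set_ylabel("CO2 concentration (ppm)")  # unit is an assumption
ax.set_xticks(annual.index)
```

Passing a two-row array allows the bars to be asymmetric, which matters here because the minimum and maximum are generally not the same distance from the mean.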

Annual CO2 plot

Key Points

  • The pandas read_csv function can be used to read data straight from the web.
  • The native pandas datetime format is unable to deal with long time ranges.
  • Inspect your data to understand how you need to treat it.
  • Make use of the extensive documentation for python libraries!
  • Decorate your plots with labels, ticks, legends and error bars.