Reading a file with pandas

Overview:

  • Teaching: 10 min
  • Exercises: 5 min

Questions

  • How can I read my data file into pandas?

Objectives

  • Use pandas to read in a CSV file.

One of the most common situations is that you have some data file containing the data you want to read. Perhaps this is data you've produced yourself or maybe it's from a collegue. In an ideal world the file will be perfectly formatted and will be trivial to import into pandas but since this is so often not the case, it provides a number of features to make your ife easier.

Full information on reading and writing is available in the pandas manual on IO tools but first it's worth noting the common formats that pandas can work with:

  • Comma separated tables (or tab-separated or space-separated etc.)
  • Excel spreadsheets
  • HDF5 files
  • SQL databases

For this lesson we will focus on plain-text CSV files as they are perhaps the most common format. Imagine we have a CSV file like (if you are not running on notebooks.azure.com you will need to download this file from city_pop.csv):

In [1]:
!cat ./data/city_pop.csv  # Uses the IPython 'magic' !cat to print the file
This is an example CSV file
The text at the top here is not part of the data but instead is here
to describe the file. You'll see this quite often in real-world data.
A -1 signifies a missing value.

year;London;Paris;Rome
2001;7.322;2.148;2.547
2006;7.652;;2.627
2008;-1;2.211;
2009;-1;2.234;2.734
2011;8.174;;
2012;-1;2.244;2.627
2015;8.615;;

We can use the pandas function read_csv() to read the file and convert it to a DataFrame. Full documentation for this function can be found in the manual or, as with any Python object, directly in the notebook by typing help(pd.read_csv).

In [4]:
import pandas as pd

csv_file = './data/city_pop.csv'
pd.read_csv(csv_file)
Out[4]:
This is an example CSV file
0 The text at the top here is not part of the da...
1 to describe the file. You'll see this quite of...
2 A -1 signifies a missing value.
3 year;London;Paris;Rome
4 2001;7.322;2.148;2.547
5 2006;7.652;;2.627
6 2008;-1;2.211;
7 2009;-1;2.234;2.734
8 2011;8.174;;
9 2012;-1;2.244;2.627
10 2015;8.615;;

We can see that by default it's done a fairly bad job of parsing the file (this is mostly because it has been construsted to be as obtuse as possible). It's making a lot of assumptions about the structure of the file but in general it's taking quite a naïve approach.

The first this we notice is that it's treating the text at the top of the file as though it's data. Checking the documentation we see that the simplest way to solve this is to use the skiprows argument to the function to which we give an integer giving the number of rows to skip:

In [6]:
pd.read_csv(csv_file,
            skiprows=5,
            )
Out[6]:
year;London;Paris;Rome
0 2001;7.322;2.148;2.547
1 2006;7.652;;2.627
2 2008;-1;2.211;
3 2009;-1;2.234;2.734
4 2011;8.174;;
5 2012;-1;2.244;2.627
6 2015;8.615;;

Editing cells

If you are following along with this material in a notebook, don't forget you can edit a cell and execute it again. In this lesson, you can just keep modifying the input to the read_csv() function and re-execute the cell, rather than making a new cell for each modification.

The next most obvious problem is that it is not separating the columns at all. This is controlled by the sep argument which is set to ',' by default (hence comma separated values). We can simply set it to the appropriate semi-colon:

In [7]:
pd.read_csv(csv_file,
            skiprows=5,
            sep=';'
           )
Out[7]:
year London Paris Rome
0 2001 7.322 2.148 2.547
1 2006 7.652 NaN 2.627
2 2008 -1.000 2.211 NaN
3 2009 -1.000 2.234 2.734
4 2011 8.174 NaN NaN
5 2012 -1.000 2.244 2.627
6 2015 8.615 NaN NaN

Reading the descriptive header of our data file we see that a value of -1 signifies a missing reading so we should mark those too. This can be done after the fact but it is simplest to do it at import-time using the na_values argument:

In [9]:
pd.read_csv(csv_file,
            skiprows=5,
            sep=';',
            na_values='-1'
           )
Out[9]:
year London Paris Rome
0 2001 7.322 2.148 2.547
1 2006 7.652 NaN 2.627
2 2008 NaN 2.211 NaN
3 2009 NaN 2.234 2.734
4 2011 8.174 NaN NaN
5 2012 NaN 2.244 2.627
6 2015 8.615 NaN NaN

The last this we want to do is use the year column as the index for the DataFrame. This can be done by passing the name of the column to the index_col argument:

In [10]:
df3 = pd.read_csv(csv_file,
                  skiprows=5,
                  sep=';',
                  na_values='-1',
                  index_col='year'
                 )
df3
Out[10]:
London Paris Rome
year
2001 7.322 2.148 2.547
2006 7.652 NaN 2.627
2008 NaN 2.211 NaN
2009 NaN 2.234 2.734
2011 8.174 NaN NaN
2012 NaN 2.244 2.627
2015 8.615 NaN NaN

Comma separated files

  • There is another file called cetml1659on.dat (available from here). This contains some historical weather data for a location in the UK. Import that file as a Pandas DataFrame using read_csv(), making sure that you cover all the NaN values. Be sure to look at the documentation for read_csv().
  • How many years had a negative average temperature in January?
  • What was the average temperature in June over the years in the data set? Tip: look in the documentation for which method to call.

We will come back to this data set at a later stage.

Hints for the first part:

  • The syntax for whitespace delimited data is sep='\s+', which is not immediately obvious from the documentation.
  • The data is almost comlete (which is unusual for scientific data) and there are only two invalid entries. Look at the last row of the file and, given that the data is temperature data, deduce which values need to be na_values. (You can use a list to give multiple na_values)
  • If you can't work out how to do the first part of this exercise, take a look at the solutions.

Solution

Key Points:

  • Pandas provides the read_csv() function for reading in CSV files.
  • Although it saves us a lot of work the syntax can be quite tricky.