Introduction to pandas

Overview:

  • Teaching: 20 min
  • Exercises: 10 min

Questions

  • What is pandas?
  • Why should I use series and data frames?

Objectives

  • Use pandas to convert lists to series.
  • Learn about slicing and broadcasting series (and by extension NumPy arrays).
  • Use pandas to convert dicts to data frames.

Pandas is a library providing high-performance, easy-to-use data structures and data analysis tools. The core of pandas is its dataframe which is essentially a table of data. Pandas provides easy and powerful ways to import data from a variety of sources and export it to just as many. It is also explicitly designed to handle missing data elegantly which is a very common problem in data from the real world.

The offical pandas documentation is very comprehensive and you will be answer a lot of questions in there, however, it can sometimes be hard to find the right page. Don't be afraid to use Google to find help.

Just like numpy, pandas has a standard convention for importing it:

In [1]:
import pandas as pd

We also explicitly import Series and DataFrame as we will be using them a lot.

In [2]:
from pandas import Series, DataFrame

Series

The simplest of pandas' data structures is the Series. It is a one-dimensional list-like structure. Let's create one from a list:

In [3]:
Series([14, 7, 3, -7, 8])
Out[3]:
0    14
1     7
2     3
3    -7
4     8
dtype: int64

There are three main components to this output. The first column (0, 2, etc.) is the index, by default this is numbers each row starting from zero. The second column is our data, stored i the same order we entered it in our list. Finally at the bottom there is the dtype which stands for 'data type' which is telling us that all our data is being stored as a 64-bit integer. Usually you can ignore the dtype until you start doing more advanced things.

We previously came across dtypes when learing about NumPy. This is because pandas uses NumPy as its underlying library. A pandas.Series is essentially a np.array with some extra features wrapped around it.

In the first example above we allowed pandas to automatically create an index for our Series (this is the 0, 1, 2, etc. in the left column) but often you will want to specify one yourself

In [4]:
s = Series([14, 7, 3, -7, 8], index=['a', 'b', 'c', 'd', 'e'])
print(s)
a    14
b     7
c     3
d    -7
e     8
dtype: int64

We can use this index to retrieve individual rows

In [5]:
s['a']
Out[5]:
14

to replace values in the series

In [6]:
s['c'] = -1

or to get a set of rows

In [7]:
s[['a', 'c', 'd']]
Out[7]:
a    14
c    -1
d    -7
dtype: int64

Make a Series

  • Create a Pandas Series with 10 or so elements where the indices are years and the values are numbers.
  • Experiment with retrieving elements from the Series.
  • Try making another Series with duplicate values in the index, what happens when you access those elements?
  • How does a Pandas Series differ from a Python list or dict?

Solution

Series operations

A Series is list-like in the sense that it is an ordered set of values. It is also dict-like since its entries can be accessed via key lookup. One very important way in which is differs is how it allows operations to be done over the whole Series in one go, a technique often referred to as 'broadcasting'. It should also be noted, that since these series objects are based on NumPy arrays, any slicing or bradcasting operation in this section can also be applied to a NumPy array, with the same result.

A simple example is wanting to double the value of every entry in a set of data. In standard Python, you might have a list like

In [12]:
my_list = [3, 6, 8, 4, 10]

If you wanted to double every entry you might try simply multiplying the list by 2:

In [13]:
my_list * 2
Out[13]:
[3, 6, 8, 4, 10, 3, 6, 8, 4, 10]

but as you can see, that simply duplicated the elements. Instead you would have to use a for loop or a list comprehension:

In [14]:
[i * 2 for i in my_list]
Out[14]:
[6, 12, 16, 8, 20]

With a pandas Series, however, you can perform bulk mathematical operations to the whole series in one go:

In [15]:
my_series = Series(my_list)
print(my_series)
0     3
1     6
2     8
3     4
4    10
dtype: int64
In [16]:
my_series * 2
Out[16]:
0     6
1    12
2    16
3     8
4    20
dtype: int64

As well as bulk modifications, you can perform bulk selections by putting more complex statements in the square brackets:

In [17]:
s[s < 0]  # All negative entries
Out[17]:
c   -1
d   -7
dtype: int64
In [18]:
s[(s * 2) > 4]  # All entries which, when doubled are greater than 4
Out[18]:
a    14
b     7
e     8
dtype: int64

These operations work because the Series index selection can be passed a series of True and False values which it then uses to filter the result:

In [19]:
(s * 2) > 4
Out[19]:
a     True
b     True
c    False
d    False
e     True
dtype: bool

Here you can see that the rows a, b and e are True while the others are False. Passing this to s[...] will only show rows that are True.

Multi-Series operations

It is also possible to perform operations between two Series objects:

In [20]:
s2 = Series([23,5,34,7,5])
s3 = Series([7, 6, 5,4,3])
s2 - s3
Out[20]:
0    16
1    -1
2    29
3     3
4     2
dtype: int64

Broadcasting

  • Create two Series objects of equal length with no specified index and containing any values you like. Perform some mathematical operations on them and experiment to make sure it works how you think.
  • What happens then you perform an operation on two series which have different lengths? How does this change when you give the series some indices?
  • Using the Series from the first exercise with the years for the index, Select all entries with even-numbered years. Also, select all those with odd-numbered years.

Solution

DataFrame

While you can think of the Series as a one-dimensional list of data, pandas' DataFrame is a two (or possibly more) dimensional table of data. You can think of each column in the table as being a Series.

In [23]:
data = {'city': ['Paris', 'Paris', 'Paris', 'Paris',
                 'London', 'London', 'London', 'London',
                 'Rome', 'Rome', 'Rome', 'Rome'],
        'year': [2001, 2008, 2009, 2010,
                 2001, 2006, 2011, 2015,
                 2001, 2006, 2009, 2012],
        'pop': [2.148, 2.211, 2.234, 2.244,
                7.322, 7.657, 8.174, 8.615,
                2.547, 2.627, 2.734, 2.627]}
df = DataFrame(data)

This has created a DataFrame from the dictionary data. The keys will become the column headers and the values will be the values in each column. As with the Series, an index will be created automatically.

In [24]:
df
Out[24]:
city year pop
0 Paris 2001 2.148
1 Paris 2008 2.211
2 Paris 2009 2.234
3 Paris 2010 2.244
4 London 2001 7.322
5 London 2006 7.657
6 London 2011 8.174
7 London 2015 8.615
8 Rome 2001 2.547
9 Rome 2006 2.627
10 Rome 2009 2.734
11 Rome 2012 2.627

Or, if you just want a peek at the data, you can just grab the first few rows with:

In [25]:
df.head(3)
Out[25]:
city year pop
0 Paris 2001 2.148
1 Paris 2008 2.211
2 Paris 2009 2.234

Since we passed in a dictionary to the DataFrame constructor, the order of the columns will not necessarilly match the order in which you defined them. To enforce a certain order, you can pass a columns argument to the constructor giving a list of the columns in the order you want them:

In [26]:
DataFrame(data, columns=['year', 'city', 'pop'])
Out[26]:
year city pop
0 2001 Paris 2.148
1 2008 Paris 2.211
2 2009 Paris 2.234
3 2010 Paris 2.244
4 2001 London 7.322
5 2006 London 7.657
6 2011 London 8.174
7 2015 London 8.615
8 2001 Rome 2.547
9 2006 Rome 2.627
10 2009 Rome 2.734
11 2012 Rome 2.627

When we accessed elements from a Series object, it would select an element by row. However, by default DataFrames index primarily by column. You can access any column directly by using square brackets or by named attributes:

In [27]:
df['year']
Out[27]:
0     2001
1     2008
2     2009
3     2010
4     2001
5     2006
6     2011
7     2015
8     2001
9     2006
10    2009
11    2012
Name: year, dtype: int64
In [28]:
df.city
Out[28]:
0      Paris
1      Paris
2      Paris
3      Paris
4     London
5     London
6     London
7     London
8       Rome
9       Rome
10      Rome
11      Rome
Name: city, dtype: object

Accessing a column like this returns a Series which will act in the same way as those we were using earlier.

Note that there is one additional part to this output, Name: city. Pandas has remembered that this Series was created from the 'city' column in the DataFrame.

In [29]:
type(df.city)
Out[29]:
pandas.core.series.Series
In [30]:
df.city == 'Paris'
Out[30]:
0      True
1      True
2      True
3      True
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
Name: city, dtype: bool

This has created a new Series which has True set where the city is Paris and False elsewhere.

We can use filtered Series like this to filter the DataFrame as a whole. df.city == 'Paris' has returned a Series containing booleans. Passing it back into df as an indexing operation will use it to filter based on the 'city' column.

In [31]:
df[df.city == 'Paris']
Out[31]:
city year pop
0 Paris 2001 2.148
1 Paris 2008 2.211
2 Paris 2009 2.234
3 Paris 2010 2.244

You can then carry on and grab another column after that filter:

In [32]:
df[df.city == 'Paris'].year
Out[32]:
0    2001
1    2008
2    2009
3    2010
Name: year, dtype: int64

If you want to select a row from a DataFrame then you can use the .loc attribute which allows you to pass index values like:

In [33]:
df.loc[2]
Out[33]:
city    Paris
year     2009
pop     2.234
Name: 2, dtype: object
In [34]:
df.loc[2]['city']
Out[34]:
'Paris'

Adding new columns

New columns can be added to a DataFrame simply by assigning them by index (as you would for a Python dict) and can be deleted with the del keyword in the same way:

In [38]:
df['continental'] = (df.city != 'London')
df
Out[38]:
city year pop continental
0 Paris 2001 2.148 True
1 Paris 2008 2.211 True
2 Paris 2009 2.234 True
3 Paris 2010 2.244 True
4 London 2001 7.322 False
5 London 2006 7.657 False
6 London 2011 8.174 False
7 London 2015 8.615 False
8 Rome 2001 2.547 True
9 Rome 2006 2.627 True
10 Rome 2009 2.734 True
11 Rome 2012 2.627 True
In [39]:
del df['continental']

Making your own dataframe

  • Create the DataFrame containing the census data for the three cities as we did above.
  • Select the data for the year 2001. Which city had the smallest population that year?
  • Find all the cities which had a population smaller than 2.6 million.

Solution

Key Points:

  • Series converts lists to pandas series.
  • Series can be sliced and broadcast together.
  • Pandas is a wrapper around NumPy.
  • By extension NumPy arrays can be sliced and broadcast together in the same way.
  • DataFrame converts dicts to pandas data frames.