Pandas is a library providing high-performance, easy-to-use data structures and data analysis tools. The core of pandas is its DataFrame, which is essentially a table of data. Pandas provides easy and powerful ways to import data from a variety of sources and export it to just as many. It is also explicitly designed to handle missing data elegantly, which is a very common problem in real-world data.
The official pandas documentation is very comprehensive and you will be able to answer a lot of questions there. However, it can sometimes be hard to find the right page, so don't be afraid to use Google to find help.
Just like numpy, pandas has a standard convention for importing it:
import pandas as pd
We also explicitly import Series and DataFrame, as we will be using them a lot.
from pandas import Series, DataFrame
The simplest of pandas' data structures is the Series. It is a one-dimensional list-like structure.
Let's create one from a list:
Series([14, 7, 3, -7, 8])
There are three main components to this output.
The first column (0, 1, etc.) is the index; by default this numbers each row starting from zero.
The second column is our data, stored in the same order we entered it in our list.
Finally, at the bottom there is the dtype, which stands for 'data type'. It tells us that all our data is being stored as 64-bit integers.
Usually you can ignore the dtype until you start doing more advanced things.
We previously came across dtypes when learning about NumPy. This is because pandas uses NumPy as its underlying library. A pandas.Series is essentially a NumPy array (np.ndarray) with some extra features wrapped around it.
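As a minimal sketch, the three parts of the printed output can also be inspected individually. The variable names here are illustrative and not part of the original example, and the exact integer width reported may depend on your platform and library versions.

```python
import numpy as np
from pandas import Series

s_default = Series([14, 7, 3, -7, 8])

# The three parts of the printed output, accessed one at a time.
index_labels = list(s_default.index)   # the automatically created row labels
values = s_default.to_numpy()          # the underlying NumPy array
data_type = s_default.dtype            # an integer dtype

print(index_labels)
print(type(values))
print(data_type)
```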
In the first example above we allowed pandas to automatically create an index for our Series (this is the 0, 1, 2, etc. in the left column), but often you will want to specify one yourself:
s = Series([14, 7, 3, -7, 8], index=['a', 'b', 'c', 'd', 'e'])
print(s)
We can use this index to retrieve individual rows:
s['a']
to replace values in the series:
s['c'] = -1
or to get a set of rows:
s[['a', 'c', 'd']]
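Put together, the three kinds of label-based access look roughly like this. The sketch also uses .loc, the explicit label-based accessor, which is not introduced above for Series but should behave the same way for these lookups; the variable names are illustrative.

```python
from pandas import Series

s = Series([14, 7, 3, -7, 8], index=['a', 'b', 'c', 'd', 'e'])

first = s['a']               # a single label returns a single value
s['c'] = -1                  # overwrite the value labelled 'c'
subset = s[['a', 'c', 'd']]  # a list of labels returns a new Series

# .loc makes it explicit that the lookup is by label.
same_first = s.loc['a']

print(first)
print(subset)
```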
A Series is list-like in the sense that it is an ordered collection of values. It is also dict-like, since its entries can be accessed via key lookup. One very important way in which it differs from both is that it allows operations to be done over the whole Series in one go, a technique often referred to as 'broadcasting'. It should also be noted that, since these Series objects are based on NumPy arrays, the broadcasting and boolean-selection operations in this section can also be applied to a NumPy array, with the same resulting values.
A simple example is wanting to double the value of every entry in a set of data. In standard Python, you might have a list like
my_list = [3, 6, 8, 4, 10]
If you wanted to double every entry you might try simply multiplying the list by 2:
my_list * 2
but as you can see, that simply duplicated the elements. Instead you would have to use a for loop or a list comprehension:
[i * 2 for i in my_list]
With a pandas Series, however, you can perform bulk mathematical operations on the whole series in one go:
my_series = Series(my_list)
print(my_series)
my_series * 2
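For comparison, here is a small side-by-side sketch of the four approaches, including the NumPy equivalent. The variable names are illustrative.

```python
import numpy as np
from pandas import Series

my_list = [3, 6, 8, 4, 10]

repeated = my_list * 2                    # list repetition: the list twice over
doubled_list = [i * 2 for i in my_list]   # element-wise, via a comprehension
doubled_series = Series(my_list) * 2      # element-wise, no explicit loop
doubled_array = np.array(my_list) * 2     # the same operation on a NumPy array

print(len(repeated))
print(doubled_series.tolist())
```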
As well as bulk modifications, you can perform bulk selections by putting more complex statements in the square brackets:
s[s < 0]  # All negative entries
s[(s * 2) > 4]  # All entries which, when doubled, are greater than 4
These operations work because the Series index selection can be passed a series of True and False values, which it then uses to filter the result:
(s * 2) > 4
Here you can see that the rows a, b and e are True while the others are False. Passing this to s[...] will only show rows that are True.
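The mask can be stored in a variable and reused, and several conditions can be combined. The following is a sketch that assumes s holds the values from above after s['c'] was set to -1; the names mask, selected and between are illustrative.

```python
from pandas import Series

s = Series([14, 7, -1, -7, 8], index=['a', 'b', 'c', 'd', 'e'])

mask = (s * 2) > 4          # a boolean Series with the same index as s
selected = s[mask]          # keeps only the rows where mask is True

# Conditions can be combined with & (and), | (or) and ~ (not).
# Each condition needs its own parentheses because of operator precedence.
between = s[(s > 0) & (s < 10)]

print(mask.tolist())
print(between)
```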
It is also possible to perform operations between two Series objects:
s2 = Series([23, 5, 34, 7, 5])
s3 = Series([7, 6, 5, 4, 3])
s2 - s3
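One detail worth knowing is that pandas pairs values by index label rather than by position. In the example above both Series share the default index, so the two coincide. The sketch below uses two hypothetical Series, left and right, with only partly overlapping labels to show what happens otherwise.

```python
from pandas import Series

s2 = Series([23, 5, 34, 7, 5])
s3 = Series([7, 6, 5, 4, 3])
difference = s2 - s3            # both use the default index 0..4

# With different indexes, values are paired by label, not by position.
left = Series([1, 2, 3], index=['a', 'b', 'c'])
right = Series([10, 20, 30], index=['b', 'c', 'd'])
total = left + right            # 'a' and 'd' have no partner, so they become NaN

print(difference.tolist())
print(total)
```

Labels present in only one of the two Series produce missing values (NaN) in the result, which ties in with the handling of missing data mentioned at the start.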
While you can think of the Series as a one-dimensional list of data, pandas' DataFrame is a two-dimensional table of data (higher-dimensional data can be represented using hierarchical indexes). You can think of each column in the table as being a Series.
data = {'city': ['Paris', 'Paris', 'Paris', 'Paris',
                 'London', 'London', 'London', 'London',
                 'Rome', 'Rome', 'Rome', 'Rome'],
        'year': [2001, 2008, 2009, 2010,
                 2001, 2006, 2011, 2015,
                 2001, 2006, 2009, 2012],
        'pop': [2.148, 2.211, 2.234, 2.244,
                7.322, 7.657, 8.174, 8.615,
                2.547, 2.627, 2.734, 2.627]}
df = DataFrame(data)
This has created a DataFrame from the dictionary data. The keys will become the column headers and the values will be the values in each column. As with the Series, an index will be created automatically.
df
Or, if you only want a peek at the data, you can grab the first few rows with:
df.head(3)
Since we passed in a dictionary to the DataFrame constructor, the order of the columns will not necessarily match the order in which you defined them on older versions of Python and pandas (recent versions keep the dictionary's insertion order). To enforce a certain order, you can pass a columns argument to the constructor giving a list of the columns in the order you want them:
DataFrame(data, columns=['year', 'city', 'pop'])
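To check the effect, you can compare the columns attribute of the two results. This sketch uses a shortened, three-row version of the data purely to stay compact; the variable names are illustrative.

```python
from pandas import DataFrame

# A shortened version of the data dictionary, for illustration only.
data = {'city': ['Paris', 'London', 'Rome'],
        'year': [2001, 2001, 2001],
        'pop': [2.148, 7.322, 2.547]}

default_order = DataFrame(data)
custom_order = DataFrame(data, columns=['year', 'city', 'pop'])

print(list(default_order.columns))
print(list(custom_order.columns))
```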
When we accessed elements from a Series object, it would select an element by row. However, by default DataFrames index primarily by column. You can access any column directly by using square brackets or as a named attribute:
df['year']
df.city
Accessing a column like this returns a Series which will act in the same way as those we were using earlier.
Note that there is one additional part to this output, Name: city. Pandas has remembered that this Series was created from the 'city' column in the DataFrame.
type(df.city)
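One caveat with attribute access is that it only works when the column name does not clash with an existing DataFrame attribute or method. The 'pop' column in this data is such a case, since DataFrame already has a pop method. The sketch below uses a shortened version of the data to show the difference; square brackets are the safer choice in general.

```python
from pandas import DataFrame, Series

# A shortened version of the data, for illustration only.
df = DataFrame({'city': ['Paris', 'London', 'Rome'],
                'year': [2001, 2001, 2001],
                'pop': [2.148, 7.322, 2.547]})

city_column = df.city        # attribute access works for 'city'
pop_column = df['pop']       # square brackets always return the column
pop_attribute = df.pop       # this is the DataFrame.pop method, not the column

print(type(city_column))
print(callable(pop_attribute))
```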
df.city == 'Paris'
This has created a new Series which has True set where the city is Paris and False elsewhere.
We can use a boolean Series like this to filter the DataFrame as a whole. df.city == 'Paris' has returned a Series containing booleans. Passing it back into df as an indexing operation will use it to filter based on the 'city' column.
df[df.city == 'Paris']
You can then carry on and grab another column after that filter:
df[df.city == 'Paris'].year
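The same idea extends to several conditions at once, and .loc can select the filtered rows and a column in a single step. This is a sketch on a shortened version of the data; the threshold values and variable names are chosen purely for illustration.

```python
from pandas import DataFrame

# A shortened version of the data, for illustration only.
df = DataFrame({'city': ['Paris', 'Paris', 'London', 'London', 'Rome'],
                'year': [2001, 2010, 2001, 2015, 2009],
                'pop': [2.148, 2.244, 7.322, 8.615, 2.734]})

is_paris = df['city'] == 'Paris'          # boolean Series, one entry per row
paris_rows = df[is_paris]                 # DataFrame with only the Paris rows
paris_years = df.loc[is_paris, 'year']    # rows and column selected in one step

# Several conditions can be combined with & and |, each in parentheses.
recent_big = df[(df['year'] > 2005) & (df['pop'] > 5)]

print(paris_years.tolist())
print(recent_big)
```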
If you want to select a row from a DataFrame then you can use the .loc attribute, which allows you to pass index values like:
df.loc[2]
df.loc[2]['city']
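.loc also accepts a row label and a column label together, which avoids the two-step lookup and is generally the safer form when assigning values. A small sketch on a shortened version of the data, with illustrative variable names:

```python
from pandas import DataFrame

# A shortened version of the data, for illustration only.
df = DataFrame({'city': ['Paris', 'London', 'Rome'],
                'year': [2001, 2006, 2012],
                'pop': [2.148, 7.657, 2.627]})

row = df.loc[2]                 # one row, returned as a Series
city = df.loc[2, 'city']        # row label and column label together
first_two = df.loc[0:1]         # label slices include both end points

print(row)
print(city)
```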
New columns can be added to a DataFrame simply by assigning them by index (as you would for a Python dict) and can be deleted with the del keyword in the same way:
df['continental'] = (df.city != 'London')
df
del df['continental']
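New columns can also be computed from existing ones, and the drop method offers an alternative to del that returns a new DataFrame instead of modifying the original. In this sketch the column name pop_doubled and the shortened data are hypothetical, used only to illustrate the pattern.

```python
from pandas import DataFrame

# A shortened version of the data, for illustration only.
df = DataFrame({'city': ['Paris', 'London', 'Rome'],
                'pop': [2.148, 7.322, 2.547]})

df['continental'] = (df['city'] != 'London')   # new column from a boolean Series
df['pop_doubled'] = df['pop'] * 2              # new column computed from another

del df['continental']                          # removes the column in place
trimmed = df.drop(columns=['pop_doubled'])     # returns a copy; df is unchanged

print(list(df.columns))
print(list(trimmed.columns))
```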