One of the most common situations is that you have some data file containing the data you want to read. Perhaps this is data you've produced yourself or maybe it's from a collegue. In an ideal world the file will be perfectly formatted and will be trivial to import into pandas but since this is so often not the case, it provides a number of features to make your ife easier.
Full information on reading and writing is available in the pandas manual on IO tools but first it's worth noting the common formats that pandas can work with:
For this lesson we will focus on plain-text CSV files as they are perhaps the most common format. Imagine we have a CSV file like (if you are not running on notebooks.azure.com you will need to download this file from city_pop.csv):
!cat ./data/city_pop.csv # Uses the IPython 'magic' !cat to print the file
We can use the pandas function read_csv()
to read the file and convert it to a DataFrame
. Full documentation for this function can be found in the manual or, as with any Python object, directly in the notebook by typing help(pd.read_csv)
.
import pandas as pd
csv_file = './data/city_pop.csv'
pd.read_csv(csv_file)
We can see that by default it's done a fairly bad job of parsing the file (this is mostly because it has been construsted to be as obtuse as possible). It's making a lot of assumptions about the structure of the file but in general it's taking quite a naïve approach.
The first this we notice is that it's treating the text at the top of the file as though it's data. Checking the documentation we see that the simplest way to solve this is to use the skiprows
argument to the function to which we give an integer giving the number of rows to skip:
pd.read_csv(csv_file,
skiprows=5,
)
The next most obvious problem is that it is not separating the columns at all. This is controlled by the sep
argument which is set to ','
by default (hence comma separated values). We can simply set it to the appropriate semi-colon:
pd.read_csv(csv_file,
skiprows=5,
sep=';'
)
Reading the descriptive header of our data file we see that a value of -1
signifies a missing reading so we should mark those too. This can be done after the fact but it is simplest to do it at import-time using the na_values
argument:
pd.read_csv(csv_file,
skiprows=5,
sep=';',
na_values='-1'
)
The last this we want to do is use the year
column as the index for the DataFrame
. This can be done by passing the name of the column to the index_col
argument:
df3 = pd.read_csv(csv_file,
skiprows=5,
sep=';',
na_values='-1',
index_col='year'
)
df3