Reading Data

Overview:

Teaching: 15 min
Exercises: 20 min

Questions

How can I read data into my program?
What libraries can I use to help me read in data?

Objectives

Use input to read from the keyboard.
Know that you need to open files and then read the contents.
Use libraries to read in files in standard formats.

Read from the keyboard

Occasionally you will want your user to input data via the keyboard; for instance, this might be the name of a file to analyse. We can do this using the function input:

text = input("Enter some text: ")

Enter some text: Hello World!

We can verify that this has been read in correctly by printing the value in text.

print(text)

Hello World!

The value read in by input is always a string, which we can check with type. First, let's run input again, this time entering a number.

text = input("Now try entering a number: ")

Now try entering a number: 42

print(text, type(text))

42 <class 'str'>

input

We can ask our users to input text and read it from the keyboard using the function input. The value read in will always be a string and if we want to use text as a number we will first have to convert it with int or float.
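As a minimal sketch of that conversion, the snippet below uses a fixed string in place of a call to input so that it runs without a keyboard. The variable names whole and decimal are illustrative only.

```python
# Stand-in for: text = input("Enter a number: ")
# input always returns a string, so we start from one here.
text = "42"

# Convert the string before doing arithmetic with it.
whole = int(text)        # integer conversion
decimal = float(text)    # floating point conversion

print(whole + 1, type(whole))
print(decimal / 2, type(decimal))

# Text that does not look like a number raises a ValueError.
try:
    int("forty-two")
except ValueError as error:
    print("Could not convert:", error)
```

Wrapping the conversion in try/except is one way to cope with users who type something unexpected.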

Aside

We will now read in some data from files that were provided with this library. In order to do this we need to unzip the file. We do this using Python magic, which allows us to run system commands within the Jupyter notebook. The ! means do not run a Python command but run a system command instead. The command we wish to run is unzip:

!unzip intro-python-data.zip

Archive:  ../RS50001/python-novice-inflammation-data.zip
   creating: data/
  inflating: data/inflammation-01.csv  
  inflating: data/inflammation-02.csv  
  inflating: data/inflammation-03.csv  
  inflating: data/inflammation-04.csv  
  inflating: data/inflammation-05.csv  
  inflating: data/inflammation-06.csv  
  inflating: data/inflammation-07.csv  
  inflating: data/inflammation-08.csv  
  inflating: data/inflammation-09.csv  
  inflating: data/inflammation-10.csv  
  inflating: data/inflammation-11.csv  
  inflating: data/inflammation-12.csv  
 extracting: data/small-01.csv       
 extracting: data/small-02.csv       
 extracting: data/small-03.csv

You will also see Python magic which uses a % instead of !. These are for specific Python commands rather than system commands.

But all my data is in files

Most of the time the data we want to read will be in files. In order to read from a file we first need to open the file:

file = open("data/inflammation-01.csv")

When we open the file it is a bit like picking a book off the shelf and opening it at the first page. We have not yet read in any of the data contained in the file. In order to do this we must read from the file. In Python and many languages we can do this by reading each line of the file in turn.

line = file.readline()
print(line)

0,0,1,3,1,2,4,7,8,3,3,3,10,5,7,4,7,7,12,18,6,13,11,11,7,7,4,6,8,8,4,4,5,7,3,4,2,3,0,0

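Python allows us to treat the file as a collection of lines, which it reads in automatically, so we can use the more readable form below. Because we have already read the first line with readline, the loop continues from the second line: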

for line in file:
    print(line)

0,1,2,1,2,1,3,2,2,6,10,11,5,9,4,4,7,16,8,6,18,4,12,5,12,7,11,5,11,3,3,5,4,4,5,5,1,1,0,1

0,1,1,3,3,2,6,2,5,9,5,7,4,5,4,15,5,11,9,10,19,14,12,17,7,12,11,7,4,2,10,5,4,2,2,3,2,2,1,1

0,0,2,0,4,2,2,1,6,7,10,7,9,13,8,8,15,10,10,7,17,4,4,7,6,15,6,4,9,11,3,5,6,3,3,4,2,3,2,1

0,1,1,3,3,1,3,5,2,4,4,7,6,5,3,10,8,10,6,17,9,14,9,7,13,9,12,6,7,7,9,6,3,2,2,4,2,0,1,1

0,0,1,2,2,4,2,1,6,4,7,6,6,9,9,15,4,16,18,12,12,5,18,9,5,3,10,3,12,7,8,4,7,3,5,4,4,3,2,1

0,0,2,2,4,2,2,5,5,8,6,5,11,9,4,13,5,12,10,6,9,17,15,8,9,3,13,7,8,2,8,8,4,2,3,5,4,1,1,1

0,0,1,2,3,1,2,3,5,3,7,8,8,5,10,9,15,11,18,19,20,8,5,13,15,10,6,10,6,7,4,9,3,5,2,5,3,2,2,1

0,0,0,3,1,5,6,5,5,8,2,4,11,12,10,11,9,10,17,11,6,16,12,6,8,14,6,13,10,11,4,6,4,7,6,3,2,1,0,0

0,1,1,2,1,3,5,3,5,8,6,8,12,5,13,6,13,8,16,8,18,15,16,14,12,7,3,8,9,11,2,5,4,5,1,4,1,2,0,0

0,1,0,0,4,3,3,5,5,4,5,8,7,10,13,3,7,13,15,18,8,15,15,16,11,14,12,4,10,10,4,3,4,5,5,3,3,2,2,1

0,1,0,0,3,4,2,7,8,5,2,8,11,5,5,8,14,11,6,11,9,16,18,6,12,5,4,3,5,7,8,3,5,4,5,5,4,0,1,1

0,0,2,1,4,3,6,4,6,7,9,9,3,11,6,12,4,17,13,15,13,12,8,7,4,7,12,9,5,6,5,4,7,3,5,4,2,3,0,1

0,0,0,0,1,3,1,6,6,5,5,6,3,6,13,3,10,13,9,16,15,9,11,4,6,4,11,11,12,3,5,8,7,4,6,4,1,3,0,0

0,1,2,1,1,1,4,1,5,2,3,3,10,7,13,5,7,17,6,9,12,13,10,4,12,4,6,7,6,10,8,2,5,1,3,4,2,0,2,0

0,1,1,0,1,2,4,3,6,4,7,5,5,7,5,10,7,8,18,17,9,8,12,11,11,11,14,6,11,2,10,9,5,6,5,3,4,2,2,0

0,0,0,0,2,3,6,5,7,4,3,2,10,7,9,11,12,5,12,9,13,19,14,17,5,13,8,11,5,10,9,8,7,5,3,1,4,0,2,1

0,0,0,1,2,1,4,3,6,7,4,2,12,6,12,4,14,7,8,14,13,19,6,9,12,6,4,13,6,7,2,3,6,5,4,2,3,0,1,0

0,0,2,1,2,5,4,2,7,8,4,7,11,9,8,11,15,17,11,12,7,12,7,6,7,4,13,5,7,6,6,9,2,1,1,2,2,0,1,0

0,1,2,0,1,4,3,2,2,7,3,3,12,13,11,13,6,5,9,16,9,19,16,11,8,9,14,12,11,9,6,6,6,1,1,2,4,3,1,1

0,1,1,3,1,4,4,1,8,2,2,3,12,12,10,15,13,6,5,5,18,19,9,6,11,12,7,6,3,6,3,2,4,3,1,5,4,2,2,0

0,0,2,3,2,3,2,6,3,8,7,4,6,6,9,5,12,12,8,5,12,10,16,7,14,12,5,4,6,9,8,5,6,6,1,4,3,0,2,0

0,0,0,3,4,5,1,7,7,8,2,5,12,4,10,14,5,5,17,13,16,15,13,6,12,9,10,3,3,7,4,4,8,2,6,5,1,0,1,0

0,1,1,1,1,3,3,2,6,3,9,7,8,8,4,13,7,14,11,15,14,13,5,13,7,14,9,10,5,11,5,3,5,1,1,4,4,1,2,0

0,1,1,1,2,3,5,3,6,3,7,10,3,8,12,4,12,9,15,5,17,16,5,10,10,15,7,5,3,11,5,5,6,1,1,1,1,0,2,1

0,0,2,1,3,3,2,7,4,4,3,8,12,9,12,9,5,16,8,17,7,11,14,7,13,11,7,12,12,7,8,5,7,2,2,4,1,1,1,0

0,0,1,2,4,2,2,3,5,7,10,5,5,12,3,13,4,13,7,15,9,12,18,14,16,12,3,11,3,2,7,4,8,2,2,1,3,0,1,1

0,0,1,1,1,5,1,5,2,2,4,10,4,8,14,6,15,6,12,15,15,13,7,17,4,5,11,4,8,7,9,4,5,3,2,5,4,3,2,1

0,0,2,2,3,4,6,3,7,6,4,5,8,4,7,7,6,11,12,19,20,18,9,5,4,7,14,8,4,3,7,7,8,3,5,4,1,3,1,0

0,0,0,1,4,4,6,3,8,6,4,10,12,3,3,6,8,7,17,16,14,15,17,4,14,13,4,4,12,11,6,9,5,5,2,5,2,1,0,1

0,1,1,0,3,2,4,6,8,6,2,3,11,3,14,14,12,8,8,16,13,7,6,9,15,7,6,4,10,8,10,4,2,6,5,5,2,3,2,1

0,0,2,3,3,4,5,3,6,7,10,5,10,13,14,3,8,10,9,9,19,15,15,6,8,8,11,5,5,7,3,6,6,4,5,2,2,3,0,0

0,1,2,2,2,3,6,6,6,7,6,3,11,12,13,15,15,10,14,11,11,8,6,12,10,5,12,7,7,11,5,8,5,2,5,5,2,0,2,1

0,0,2,1,3,5,6,7,5,8,9,3,12,10,12,4,12,9,13,10,10,6,10,11,4,15,13,7,3,4,2,9,7,2,4,2,1,2,1,1

0,0,1,2,4,1,5,5,2,3,4,8,8,12,5,15,9,17,7,19,14,18,12,17,14,4,13,13,8,11,5,6,6,2,3,5,2,1,1,1

0,0,0,3,1,3,6,4,3,4,8,3,4,8,3,11,5,7,10,5,15,9,16,17,16,3,8,9,8,3,3,9,5,1,6,5,4,2,2,0

0,1,2,2,2,5,5,1,4,6,3,6,5,9,6,7,4,7,16,7,16,13,9,16,12,6,7,9,10,3,6,4,5,4,6,3,4,3,2,1

0,1,1,2,3,1,5,1,2,2,5,7,6,6,5,10,6,7,17,13,15,16,17,14,4,4,10,10,10,11,9,9,5,4,4,2,1,0,1,0

0,1,0,3,2,4,1,1,5,9,10,7,12,10,9,15,12,13,13,6,19,9,10,6,13,5,13,6,7,2,5,5,2,1,1,1,1,3,0,1

0,1,1,3,1,1,5,5,3,7,2,2,3,12,4,6,8,15,16,16,15,4,14,5,13,10,7,10,6,3,2,3,6,3,3,5,4,3,2,1

0,0,0,2,2,1,3,4,5,5,6,5,5,12,13,5,7,5,11,15,18,7,9,10,14,12,11,9,10,3,2,9,6,2,2,5,3,0,0,1

0,0,1,3,3,1,2,1,8,9,2,8,10,3,8,6,10,13,11,17,19,6,4,11,6,12,7,5,5,4,4,8,2,6,6,4,2,2,0,0

0,1,1,3,4,5,2,1,3,7,9,6,10,5,8,15,11,12,15,6,12,16,6,4,14,3,12,9,6,11,5,8,5,5,6,1,2,1,2,0

0,0,1,3,1,4,3,6,7,8,5,7,11,3,6,11,6,10,6,19,18,14,6,10,7,9,8,5,8,3,10,2,5,1,5,4,2,1,0,1

0,1,1,3,3,4,4,6,3,4,9,9,7,6,8,15,12,15,6,11,6,18,5,14,15,12,9,8,3,6,10,6,8,7,2,5,4,3,1,1

0,1,2,2,4,3,1,4,8,9,5,10,10,3,4,6,7,11,16,6,14,9,11,10,10,7,10,8,8,4,5,8,4,4,5,2,4,1,1,0

0,0,2,3,4,5,4,6,2,9,7,4,9,10,8,11,16,12,15,17,19,10,18,13,15,11,8,4,7,11,6,7,6,5,1,3,1,0,0,0

0,1,1,3,1,4,6,2,8,2,10,3,11,9,13,15,5,15,6,10,10,5,14,15,12,7,4,5,11,4,6,9,5,6,1,1,2,1,2,1

0,0,1,3,2,5,1,2,7,6,6,3,12,9,4,14,4,6,12,9,12,7,11,7,16,8,13,6,7,6,10,7,6,3,1,5,4,3,0,0

0,0,1,2,3,4,5,7,5,4,10,5,12,12,5,4,7,9,18,16,16,10,15,15,10,4,3,7,5,9,4,6,2,4,1,4,2,2,2,1

0,1,2,1,1,3,5,3,6,3,10,10,11,10,13,10,13,6,6,14,5,4,5,5,9,4,12,7,7,4,7,9,3,3,6,3,4,1,2,0

0,1,2,2,3,5,2,4,5,6,8,3,5,4,3,15,15,12,16,7,20,15,12,8,9,6,12,5,8,3,8,5,4,1,3,2,1,3,1,0

0,0,0,2,4,4,5,3,3,3,10,4,4,4,14,11,15,13,10,14,11,17,9,11,11,7,10,12,10,10,10,8,7,5,2,2,4,1,2,1

0,0,2,1,1,4,4,7,2,9,4,10,12,7,6,6,11,12,9,15,15,6,6,13,5,12,9,6,4,7,7,6,5,4,1,4,2,2,2,1

0,1,2,1,1,4,5,4,4,5,9,7,10,3,13,13,8,9,17,16,16,15,12,13,5,12,10,9,11,9,4,5,5,2,2,5,1,0,0,1

0,0,1,3,2,3,6,4,5,7,2,4,11,11,3,8,8,16,5,13,16,5,8,8,6,9,10,10,9,3,3,5,3,5,4,5,3,3,0,1

0,1,1,2,2,5,1,7,4,2,5,5,4,6,6,4,16,11,14,16,14,14,8,17,4,14,13,7,6,3,7,7,5,6,3,4,2,2,1,1

0,1,1,1,4,1,6,4,6,3,6,5,6,4,14,13,13,9,12,19,9,10,15,10,9,10,10,7,5,6,8,6,6,4,3,5,2,1,1,1

0,0,0,1,4,5,6,3,8,7,9,10,8,6,5,12,15,5,10,5,8,13,18,17,14,9,13,4,10,11,10,8,8,6,5,5,2,0,2,0

0,0,1,0,3,2,5,4,8,2,9,3,3,10,12,9,14,11,13,8,6,18,11,9,13,11,8,5,5,2,8,5,3,5,4,1,3,1,1,0

This is equivalent to reading all the remaining lines in a book. If we run this again:

for line in file:
    print(line)

There is no output. It is as if we have reached the end of the book and are stuck. We need to close the book and then we could open it and read it again, or choose another book and read that instead:

file.close()
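The behaviour described above can be reproduced with a small stand-in file. This is a sketch only: the temporary file and the names first_pass, second_pass and third_pass are assumptions made so the example runs on its own, whereas the lesson itself uses data/inflammation-01.csv.

```python
import os
import tempfile

# Create a small stand-in file so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "example.csv")
with open(path, "w") as handle:
    handle.write("0,0,1\n0,1,2\n")

file = open(path)
first_pass = [line for line in file]    # reads every line
second_pass = [line for line in file]   # nothing left to read
file.close()

file = open(path)                       # reopen: back at the first page
third_pass = [line for line in file]
file.close()

print(len(first_pass), len(second_pass), len(third_pass))

# Each line keeps its trailing newline character.
print(repr(first_pass[0]))
```

The trailing newline on each line is also why print(line) leaves a blank line after every row in the output above: print adds a newline of its own.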

Pythonic reading

Having to open, read and close files in this way is common in many programming languages. However, typically we just want to read in a whole file for processing in one go. Python has a particular structure, the with statement, that opens the file and closes it for us automatically when the indented block ends. We can combine this with the method readlines to read an entire file with one command:

with open("data/inflammation-01.csv") as file:
    read_file = file.readlines()

for line in read_file:
    print(line)

0,0,1,3,1,2,4,7,8,3,3,3,10,5,7,4,7,7,12,18,6,13,11,11,7,7,4,6,8,8,4,4,5,7,3,4,2,3,0,0

0,1,2,1,2,1,3,2,2,6,10,11,5,9,4,4,7,16,8,6,18,4,12,5,12,7,11,5,11,3,3,5,4,4,5,5,1,1,0,1

0,1,1,3,3,2,6,2,5,9,5,7,4,5,4,15,5,11,9,10,19,14,12,17,7,12,11,7,4,2,10,5,4,2,2,3,2,2,1,1

0,0,2,0,4,2,2,1,6,7,10,7,9,13,8,8,15,10,10,7,17,4,4,7,6,15,6,4,9,11,3,5,6,3,3,4,2,3,2,1

0,1,1,3,3,1,3,5,2,4,4,7,6,5,3,10,8,10,6,17,9,14,9,7,13,9,12,6,7,7,9,6,3,2,2,4,2,0,1,1

0,0,1,2,2,4,2,1,6,4,7,6,6,9,9,15,4,16,18,12,12,5,18,9,5,3,10,3,12,7,8,4,7,3,5,4,4,3,2,1

0,0,2,2,4,2,2,5,5,8,6,5,11,9,4,13,5,12,10,6,9,17,15,8,9,3,13,7,8,2,8,8,4,2,3,5,4,1,1,1

0,0,1,2,3,1,2,3,5,3,7,8,8,5,10,9,15,11,18,19,20,8,5,13,15,10,6,10,6,7,4,9,3,5,2,5,3,2,2,1

0,0,0,3,1,5,6,5,5,8,2,4,11,12,10,11,9,10,17,11,6,16,12,6,8,14,6,13,10,11,4,6,4,7,6,3,2,1,0,0

0,1,1,2,1,3,5,3,5,8,6,8,12,5,13,6,13,8,16,8,18,15,16,14,12,7,3,8,9,11,2,5,4,5,1,4,1,2,0,0

0,1,0,0,4,3,3,5,5,4,5,8,7,10,13,3,7,13,15,18,8,15,15,16,11,14,12,4,10,10,4,3,4,5,5,3,3,2,2,1

0,1,0,0,3,4,2,7,8,5,2,8,11,5,5,8,14,11,6,11,9,16,18,6,12,5,4,3,5,7,8,3,5,4,5,5,4,0,1,1

0,0,2,1,4,3,6,4,6,7,9,9,3,11,6,12,4,17,13,15,13,12,8,7,4,7,12,9,5,6,5,4,7,3,5,4,2,3,0,1

0,0,0,0,1,3,1,6,6,5,5,6,3,6,13,3,10,13,9,16,15,9,11,4,6,4,11,11,12,3,5,8,7,4,6,4,1,3,0,0

0,1,2,1,1,1,4,1,5,2,3,3,10,7,13,5,7,17,6,9,12,13,10,4,12,4,6,7,6,10,8,2,5,1,3,4,2,0,2,0

0,1,1,0,1,2,4,3,6,4,7,5,5,7,5,10,7,8,18,17,9,8,12,11,11,11,14,6,11,2,10,9,5,6,5,3,4,2,2,0

0,0,0,0,2,3,6,5,7,4,3,2,10,7,9,11,12,5,12,9,13,19,14,17,5,13,8,11,5,10,9,8,7,5,3,1,4,0,2,1

0,0,0,1,2,1,4,3,6,7,4,2,12,6,12,4,14,7,8,14,13,19,6,9,12,6,4,13,6,7,2,3,6,5,4,2,3,0,1,0

0,0,2,1,2,5,4,2,7,8,4,7,11,9,8,11,15,17,11,12,7,12,7,6,7,4,13,5,7,6,6,9,2,1,1,2,2,0,1,0

0,1,2,0,1,4,3,2,2,7,3,3,12,13,11,13,6,5,9,16,9,19,16,11,8,9,14,12,11,9,6,6,6,1,1,2,4,3,1,1

0,1,1,3,1,4,4,1,8,2,2,3,12,12,10,15,13,6,5,5,18,19,9,6,11,12,7,6,3,6,3,2,4,3,1,5,4,2,2,0

0,0,2,3,2,3,2,6,3,8,7,4,6,6,9,5,12,12,8,5,12,10,16,7,14,12,5,4,6,9,8,5,6,6,1,4,3,0,2,0

0,0,0,3,4,5,1,7,7,8,2,5,12,4,10,14,5,5,17,13,16,15,13,6,12,9,10,3,3,7,4,4,8,2,6,5,1,0,1,0

0,1,1,1,1,3,3,2,6,3,9,7,8,8,4,13,7,14,11,15,14,13,5,13,7,14,9,10,5,11,5,3,5,1,1,4,4,1,2,0

0,1,1,1,2,3,5,3,6,3,7,10,3,8,12,4,12,9,15,5,17,16,5,10,10,15,7,5,3,11,5,5,6,1,1,1,1,0,2,1

0,0,2,1,3,3,2,7,4,4,3,8,12,9,12,9,5,16,8,17,7,11,14,7,13,11,7,12,12,7,8,5,7,2,2,4,1,1,1,0

0,0,1,2,4,2,2,3,5,7,10,5,5,12,3,13,4,13,7,15,9,12,18,14,16,12,3,11,3,2,7,4,8,2,2,1,3,0,1,1

0,0,1,1,1,5,1,5,2,2,4,10,4,8,14,6,15,6,12,15,15,13,7,17,4,5,11,4,8,7,9,4,5,3,2,5,4,3,2,1

0,0,2,2,3,4,6,3,7,6,4,5,8,4,7,7,6,11,12,19,20,18,9,5,4,7,14,8,4,3,7,7,8,3,5,4,1,3,1,0

0,0,0,1,4,4,6,3,8,6,4,10,12,3,3,6,8,7,17,16,14,15,17,4,14,13,4,4,12,11,6,9,5,5,2,5,2,1,0,1

0,1,1,0,3,2,4,6,8,6,2,3,11,3,14,14,12,8,8,16,13,7,6,9,15,7,6,4,10,8,10,4,2,6,5,5,2,3,2,1

0,0,2,3,3,4,5,3,6,7,10,5,10,13,14,3,8,10,9,9,19,15,15,6,8,8,11,5,5,7,3,6,6,4,5,2,2,3,0,0

0,1,2,2,2,3,6,6,6,7,6,3,11,12,13,15,15,10,14,11,11,8,6,12,10,5,12,7,7,11,5,8,5,2,5,5,2,0,2,1

0,0,2,1,3,5,6,7,5,8,9,3,12,10,12,4,12,9,13,10,10,6,10,11,4,15,13,7,3,4,2,9,7,2,4,2,1,2,1,1

0,0,1,2,4,1,5,5,2,3,4,8,8,12,5,15,9,17,7,19,14,18,12,17,14,4,13,13,8,11,5,6,6,2,3,5,2,1,1,1

0,0,0,3,1,3,6,4,3,4,8,3,4,8,3,11,5,7,10,5,15,9,16,17,16,3,8,9,8,3,3,9,5,1,6,5,4,2,2,0

0,1,2,2,2,5,5,1,4,6,3,6,5,9,6,7,4,7,16,7,16,13,9,16,12,6,7,9,10,3,6,4,5,4,6,3,4,3,2,1

0,1,1,2,3,1,5,1,2,2,5,7,6,6,5,10,6,7,17,13,15,16,17,14,4,4,10,10,10,11,9,9,5,4,4,2,1,0,1,0

0,1,0,3,2,4,1,1,5,9,10,7,12,10,9,15,12,13,13,6,19,9,10,6,13,5,13,6,7,2,5,5,2,1,1,1,1,3,0,1

0,1,1,3,1,1,5,5,3,7,2,2,3,12,4,6,8,15,16,16,15,4,14,5,13,10,7,10,6,3,2,3,6,3,3,5,4,3,2,1

0,0,0,2,2,1,3,4,5,5,6,5,5,12,13,5,7,5,11,15,18,7,9,10,14,12,11,9,10,3,2,9,6,2,2,5,3,0,0,1

0,0,1,3,3,1,2,1,8,9,2,8,10,3,8,6,10,13,11,17,19,6,4,11,6,12,7,5,5,4,4,8,2,6,6,4,2,2,0,0

0,1,1,3,4,5,2,1,3,7,9,6,10,5,8,15,11,12,15,6,12,16,6,4,14,3,12,9,6,11,5,8,5,5,6,1,2,1,2,0

0,0,1,3,1,4,3,6,7,8,5,7,11,3,6,11,6,10,6,19,18,14,6,10,7,9,8,5,8,3,10,2,5,1,5,4,2,1,0,1

0,1,1,3,3,4,4,6,3,4,9,9,7,6,8,15,12,15,6,11,6,18,5,14,15,12,9,8,3,6,10,6,8,7,2,5,4,3,1,1

0,1,2,2,4,3,1,4,8,9,5,10,10,3,4,6,7,11,16,6,14,9,11,10,10,7,10,8,8,4,5,8,4,4,5,2,4,1,1,0

0,0,2,3,4,5,4,6,2,9,7,4,9,10,8,11,16,12,15,17,19,10,18,13,15,11,8,4,7,11,6,7,6,5,1,3,1,0,0,0

0,1,1,3,1,4,6,2,8,2,10,3,11,9,13,15,5,15,6,10,10,5,14,15,12,7,4,5,11,4,6,9,5,6,1,1,2,1,2,1

0,0,1,3,2,5,1,2,7,6,6,3,12,9,4,14,4,6,12,9,12,7,11,7,16,8,13,6,7,6,10,7,6,3,1,5,4,3,0,0

0,0,1,2,3,4,5,7,5,4,10,5,12,12,5,4,7,9,18,16,16,10,15,15,10,4,3,7,5,9,4,6,2,4,1,4,2,2,2,1

0,1,2,1,1,3,5,3,6,3,10,10,11,10,13,10,13,6,6,14,5,4,5,5,9,4,12,7,7,4,7,9,3,3,6,3,4,1,2,0

0,1,2,2,3,5,2,4,5,6,8,3,5,4,3,15,15,12,16,7,20,15,12,8,9,6,12,5,8,3,8,5,4,1,3,2,1,3,1,0

0,0,0,2,4,4,5,3,3,3,10,4,4,4,14,11,15,13,10,14,11,17,9,11,11,7,10,12,10,10,10,8,7,5,2,2,4,1,2,1

0,0,2,1,1,4,4,7,2,9,4,10,12,7,6,6,11,12,9,15,15,6,6,13,5,12,9,6,4,7,7,6,5,4,1,4,2,2,2,1

0,1,2,1,1,4,5,4,4,5,9,7,10,3,13,13,8,9,17,16,16,15,12,13,5,12,10,9,11,9,4,5,5,2,2,5,1,0,0,1

0,0,1,3,2,3,6,4,5,7,2,4,11,11,3,8,8,16,5,13,16,5,8,8,6,9,10,10,9,3,3,5,3,5,4,5,3,3,0,1

0,1,1,2,2,5,1,7,4,2,5,5,4,6,6,4,16,11,14,16,14,14,8,17,4,14,13,7,6,3,7,7,5,6,3,4,2,2,1,1

0,1,1,1,4,1,6,4,6,3,6,5,6,4,14,13,13,9,12,19,9,10,15,10,9,10,10,7,5,6,8,6,6,4,3,5,2,1,1,1

0,0,0,1,4,5,6,3,8,7,9,10,8,6,5,12,15,5,10,5,8,13,18,17,14,9,13,4,10,11,10,8,8,6,5,5,2,0,2,0

0,0,1,0,3,2,5,4,8,2,9,3,3,10,12,9,14,11,13,8,6,18,11,9,13,11,8,5,5,2,8,5,3,5,4,1,3,1,1,0

big files!

Note that the above method will read the entire file into your computer's memory. This means that if your file is particularly large it may use up all the available memory, which can make the computer very slow or cause your program to crash!
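One common way to avoid holding a whole file in memory is to loop over the open file inside the with block, so that only one line is held at a time. The sketch below assumes a small temporary file as a stand-in for a large one; line_count and longest are illustrative names.

```python
import os
import tempfile

# Small stand-in file; a real use case would be a file too large for memory.
path = os.path.join(tempfile.mkdtemp(), "example.csv")
with open(path, "w") as handle:
    handle.write("0,0,1\n0,1,2\n3,1,0\n")

line_count = 0
longest = 0
with open(path) as file:
    for line in file:                      # one line in memory at a time
        line_count += 1
        longest = max(longest, len(line.strip()))

print(line_count, longest)
```

The trade-off is that each line must be processed as it is read, because nothing is kept once the loop moves on.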

This is incredibly powerful and we can now process each of the lines in turn. However even for a standard format like a 'csv' file this is not trivial. First we have to split each line to turn it into a list. At this point all the items in the list are strings as when we used input to read from the keyboard, so each value needs to be turned into a numerical value with int or float. Each individual value then needs to be appended to a list, and finally the data from all lines needs to be assembled into a list of lists. This is given as an exercise at the end of the episode.
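To make the first of these steps concrete, here is a minimal sketch for a single line only, using a made-up string rather than the real file. Assembling every line of a file into a list of lists is still left for the exercise.

```python
# One line as it would come from the file, including the trailing newline.
line = "0,0,1,3,1,2\n"

# strip removes the newline, split(",") cuts the text at each comma.
fields = line.strip().split(",")

values = []
for field in fields:
    values.append(float(field))   # convert each string to a number

print(fields)
print(values)
```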

For now, wouldn't it just be much easier if someone had written a library that we could use instead?

Numpy

There are of course a number of libraries that we can use to read in our data. One of the most useful of these is numpy, a numerical library for Python that has a host of features and optimised routines for performing efficient calculations. In particular, for our purposes, it has a function for reading 'csv' files and converting them automatically into numerical values, if possible. First we must import the library, and according to near universal tradition, when we import numpy we use the alias np.

import numpy as np

You do not have to use the alias np; if you prefer, you can just import numpy and write numpy everywhere we use np in the code that follows. However, we mention and use the alias because if you look at anyone else's code it is almost certain that this is how they will use the library. To read in a 'csv' file we can now use the single command:

data = np.loadtxt(fname='data/inflammation-01.csv', delimiter=',')

Let's check that something has happened and that the data has been read in:

print(data)

[[0. 0. 1. ... 3. 0. 0.]
 [0. 1. 2. ... 1. 0. 1.]
 [0. 1. 1. ... 2. 1. 1.]
 ...
 [0. 1. 1. ... 1. 1. 1.]
 [0. 0. 0. ... 0. 2. 0.]
 [0. 0. 1. ... 1. 1. 0.]]

We can see that data contains values and that printing them results in something different from just printing out each line in the file. When print is used with a numpy object and the data is bigger than can be neatly printed, only the first and last few values are printed. In between, an ellipsis is printed to indicate the values that are present in data but omitted for clarity. To check that all the data has been read in as before we can access each 'line' of the data with:

for line in data:
    print(line)

[ 0.  0.  1.  3.  1.  2.  4.  7.  8.  3.  3.  3. 10.  5.  7.  4.  7.  7.
 12. 18.  6. 13. 11. 11.  7.  7.  4.  6.  8.  8.  4.  4.  5.  7.  3.  4.
  2.  3.  0.  0.]
[ 0.  1.  2.  1.  2.  1.  3.  2.  2.  6. 10. 11.  5.  9.  4.  4.  7. 16.
  8.  6. 18.  4. 12.  5. 12.  7. 11.  5. 11.  3.  3.  5.  4.  4.  5.  5.
  1.  1.  0.  1.]
[ 0.  1.  1.  3.  3.  2.  6.  2.  5.  9.  5.  7.  4.  5.  4. 15.  5. 11.
  9. 10. 19. 14. 12. 17.  7. 12. 11.  7.  4.  2. 10.  5.  4.  2.  2.  3.
  2.  2.  1.  1.]
[ 0.  0.  2.  0.  4.  2.  2.  1.  6.  7. 10.  7.  9. 13.  8.  8. 15. 10.
 10.  7. 17.  4.  4.  7.  6. 15.  6.  4.  9. 11.  3.  5.  6.  3.  3.  4.
  2.  3.  2.  1.]
[ 0.  1.  1.  3.  3.  1.  3.  5.  2.  4.  4.  7.  6.  5.  3. 10.  8. 10.
  6. 17.  9. 14.  9.  7. 13.  9. 12.  6.  7.  7.  9.  6.  3.  2.  2.  4.
  2.  0.  1.  1.]
[ 0.  0.  1.  2.  2.  4.  2.  1.  6.  4.  7.  6.  6.  9.  9. 15.  4. 16.
 18. 12. 12.  5. 18.  9.  5.  3. 10.  3. 12.  7.  8.  4.  7.  3.  5.  4.
  4.  3.  2.  1.]
[ 0.  0.  2.  2.  4.  2.  2.  5.  5.  8.  6.  5. 11.  9.  4. 13.  5. 12.
 10.  6.  9. 17. 15.  8.  9.  3. 13.  7.  8.  2.  8.  8.  4.  2.  3.  5.
  4.  1.  1.  1.]
[ 0.  0.  1.  2.  3.  1.  2.  3.  5.  3.  7.  8.  8.  5. 10.  9. 15. 11.
 18. 19. 20.  8.  5. 13. 15. 10.  6. 10.  6.  7.  4.  9.  3.  5.  2.  5.
  3.  2.  2.  1.]
[ 0.  0.  0.  3.  1.  5.  6.  5.  5.  8.  2.  4. 11. 12. 10. 11.  9. 10.
 17. 11.  6. 16. 12.  6.  8. 14.  6. 13. 10. 11.  4.  6.  4.  7.  6.  3.
  2.  1.  0.  0.]
[ 0.  1.  1.  2.  1.  3.  5.  3.  5.  8.  6.  8. 12.  5. 13.  6. 13.  8.
 16.  8. 18. 15. 16. 14. 12.  7.  3.  8.  9. 11.  2.  5.  4.  5.  1.  4.
  1.  2.  0.  0.]
[ 0.  1.  0.  0.  4.  3.  3.  5.  5.  4.  5.  8.  7. 10. 13.  3.  7. 13.
 15. 18.  8. 15. 15. 16. 11. 14. 12.  4. 10. 10.  4.  3.  4.  5.  5.  3.
  3.  2.  2.  1.]
[ 0.  1.  0.  0.  3.  4.  2.  7.  8.  5.  2.  8. 11.  5.  5.  8. 14. 11.
  6. 11.  9. 16. 18.  6. 12.  5.  4.  3.  5.  7.  8.  3.  5.  4.  5.  5.
  4.  0.  1.  1.]
[ 0.  0.  2.  1.  4.  3.  6.  4.  6.  7.  9.  9.  3. 11.  6. 12.  4. 17.
 13. 15. 13. 12.  8.  7.  4.  7. 12.  9.  5.  6.  5.  4.  7.  3.  5.  4.
  2.  3.  0.  1.]
[ 0.  0.  0.  0.  1.  3.  1.  6.  6.  5.  5.  6.  3.  6. 13.  3. 10. 13.
  9. 16. 15.  9. 11.  4.  6.  4. 11. 11. 12.  3.  5.  8.  7.  4.  6.  4.
  1.  3.  0.  0.]
[ 0.  1.  2.  1.  1.  1.  4.  1.  5.  2.  3.  3. 10.  7. 13.  5.  7. 17.
  6.  9. 12. 13. 10.  4. 12.  4.  6.  7.  6. 10.  8.  2.  5.  1.  3.  4.
  2.  0.  2.  0.]
[ 0.  1.  1.  0.  1.  2.  4.  3.  6.  4.  7.  5.  5.  7.  5. 10.  7.  8.
 18. 17.  9.  8. 12. 11. 11. 11. 14.  6. 11.  2. 10.  9.  5.  6.  5.  3.
  4.  2.  2.  0.]
[ 0.  0.  0.  0.  2.  3.  6.  5.  7.  4.  3.  2. 10.  7.  9. 11. 12.  5.
 12.  9. 13. 19. 14. 17.  5. 13.  8. 11.  5. 10.  9.  8.  7.  5.  3.  1.
  4.  0.  2.  1.]
[ 0.  0.  0.  1.  2.  1.  4.  3.  6.  7.  4.  2. 12.  6. 12.  4. 14.  7.
  8. 14. 13. 19.  6.  9. 12.  6.  4. 13.  6.  7.  2.  3.  6.  5.  4.  2.
  3.  0.  1.  0.]
[ 0.  0.  2.  1.  2.  5.  4.  2.  7.  8.  4.  7. 11.  9.  8. 11. 15. 17.
 11. 12.  7. 12.  7.  6.  7.  4. 13.  5.  7.  6.  6.  9.  2.  1.  1.  2.
  2.  0.  1.  0.]
[ 0.  1.  2.  0.  1.  4.  3.  2.  2.  7.  3.  3. 12. 13. 11. 13.  6.  5.
  9. 16.  9. 19. 16. 11.  8.  9. 14. 12. 11.  9.  6.  6.  6.  1.  1.  2.
  4.  3.  1.  1.]
[ 0.  1.  1.  3.  1.  4.  4.  1.  8.  2.  2.  3. 12. 12. 10. 15. 13.  6.
  5.  5. 18. 19.  9.  6. 11. 12.  7.  6.  3.  6.  3.  2.  4.  3.  1.  5.
  4.  2.  2.  0.]
[ 0.  0.  2.  3.  2.  3.  2.  6.  3.  8.  7.  4.  6.  6.  9.  5. 12. 12.
  8.  5. 12. 10. 16.  7. 14. 12.  5.  4.  6.  9.  8.  5.  6.  6.  1.  4.
  3.  0.  2.  0.]
[ 0.  0.  0.  3.  4.  5.  1.  7.  7.  8.  2.  5. 12.  4. 10. 14.  5.  5.
 17. 13. 16. 15. 13.  6. 12.  9. 10.  3.  3.  7.  4.  4.  8.  2.  6.  5.
  1.  0.  1.  0.]
[ 0.  1.  1.  1.  1.  3.  3.  2.  6.  3.  9.  7.  8.  8.  4. 13.  7. 14.
 11. 15. 14. 13.  5. 13.  7. 14.  9. 10.  5. 11.  5.  3.  5.  1.  1.  4.
  4.  1.  2.  0.]
[ 0.  1.  1.  1.  2.  3.  5.  3.  6.  3.  7. 10.  3.  8. 12.  4. 12.  9.
 15.  5. 17. 16.  5. 10. 10. 15.  7.  5.  3. 11.  5.  5.  6.  1.  1.  1.
  1.  0.  2.  1.]
[ 0.  0.  2.  1.  3.  3.  2.  7.  4.  4.  3.  8. 12.  9. 12.  9.  5. 16.
  8. 17.  7. 11. 14.  7. 13. 11.  7. 12. 12.  7.  8.  5.  7.  2.  2.  4.
  1.  1.  1.  0.]
[ 0.  0.  1.  2.  4.  2.  2.  3.  5.  7. 10.  5.  5. 12.  3. 13.  4. 13.
  7. 15.  9. 12. 18. 14. 16. 12.  3. 11.  3.  2.  7.  4.  8.  2.  2.  1.
  3.  0.  1.  1.]
[ 0.  0.  1.  1.  1.  5.  1.  5.  2.  2.  4. 10.  4.  8. 14.  6. 15.  6.
 12. 15. 15. 13.  7. 17.  4.  5. 11.  4.  8.  7.  9.  4.  5.  3.  2.  5.
  4.  3.  2.  1.]
[ 0.  0.  2.  2.  3.  4.  6.  3.  7.  6.  4.  5.  8.  4.  7.  7.  6. 11.
 12. 19. 20. 18.  9.  5.  4.  7. 14.  8.  4.  3.  7.  7.  8.  3.  5.  4.
  1.  3.  1.  0.]
[ 0.  0.  0.  1.  4.  4.  6.  3.  8.  6.  4. 10. 12.  3.  3.  6.  8.  7.
 17. 16. 14. 15. 17.  4. 14. 13.  4.  4. 12. 11.  6.  9.  5.  5.  2.  5.
  2.  1.  0.  1.]
[ 0.  1.  1.  0.  3.  2.  4.  6.  8.  6.  2.  3. 11.  3. 14. 14. 12.  8.
  8. 16. 13.  7.  6.  9. 15.  7.  6.  4. 10.  8. 10.  4.  2.  6.  5.  5.
  2.  3.  2.  1.]
[ 0.  0.  2.  3.  3.  4.  5.  3.  6.  7. 10.  5. 10. 13. 14.  3.  8. 10.
  9.  9. 19. 15. 15.  6.  8.  8. 11.  5.  5.  7.  3.  6.  6.  4.  5.  2.
  2.  3.  0.  0.]
[ 0.  1.  2.  2.  2.  3.  6.  6.  6.  7.  6.  3. 11. 12. 13. 15. 15. 10.
 14. 11. 11.  8.  6. 12. 10.  5. 12.  7.  7. 11.  5.  8.  5.  2.  5.  5.
  2.  0.  2.  1.]
[ 0.  0.  2.  1.  3.  5.  6.  7.  5.  8.  9.  3. 12. 10. 12.  4. 12.  9.
 13. 10. 10.  6. 10. 11.  4. 15. 13.  7.  3.  4.  2.  9.  7.  2.  4.  2.
  1.  2.  1.  1.]
[ 0.  0.  1.  2.  4.  1.  5.  5.  2.  3.  4.  8.  8. 12.  5. 15.  9. 17.
  7. 19. 14. 18. 12. 17. 14.  4. 13. 13.  8. 11.  5.  6.  6.  2.  3.  5.
  2.  1.  1.  1.]
[ 0.  0.  0.  3.  1.  3.  6.  4.  3.  4.  8.  3.  4.  8.  3. 11.  5.  7.
 10.  5. 15.  9. 16. 17. 16.  3.  8.  9.  8.  3.  3.  9.  5.  1.  6.  5.
  4.  2.  2.  0.]
[ 0.  1.  2.  2.  2.  5.  5.  1.  4.  6.  3.  6.  5.  9.  6.  7.  4.  7.
 16.  7. 16. 13.  9. 16. 12.  6.  7.  9. 10.  3.  6.  4.  5.  4.  6.  3.
  4.  3.  2.  1.]
[ 0.  1.  1.  2.  3.  1.  5.  1.  2.  2.  5.  7.  6.  6.  5. 10.  6.  7.
 17. 13. 15. 16. 17. 14.  4.  4. 10. 10. 10. 11.  9.  9.  5.  4.  4.  2.
  1.  0.  1.  0.]
[ 0.  1.  0.  3.  2.  4.  1.  1.  5.  9. 10.  7. 12. 10.  9. 15. 12. 13.
 13.  6. 19.  9. 10.  6. 13.  5. 13.  6.  7.  2.  5.  5.  2.  1.  1.  1.
  1.  3.  0.  1.]
[ 0.  1.  1.  3.  1.  1.  5.  5.  3.  7.  2.  2.  3. 12.  4.  6.  8. 15.
 16. 16. 15.  4. 14.  5. 13. 10.  7. 10.  6.  3.  2.  3.  6.  3.  3.  5.
  4.  3.  2.  1.]
[ 0.  0.  0.  2.  2.  1.  3.  4.  5.  5.  6.  5.  5. 12. 13.  5.  7.  5.
 11. 15. 18.  7.  9. 10. 14. 12. 11.  9. 10.  3.  2.  9.  6.  2.  2.  5.
  3.  0.  0.  1.]
[ 0.  0.  1.  3.  3.  1.  2.  1.  8.  9.  2.  8. 10.  3.  8.  6. 10. 13.
 11. 17. 19.  6.  4. 11.  6. 12.  7.  5.  5.  4.  4.  8.  2.  6.  6.  4.
  2.  2.  0.  0.]
[ 0.  1.  1.  3.  4.  5.  2.  1.  3.  7.  9.  6. 10.  5.  8. 15. 11. 12.
 15.  6. 12. 16.  6.  4. 14.  3. 12.  9.  6. 11.  5.  8.  5.  5.  6.  1.
  2.  1.  2.  0.]
[ 0.  0.  1.  3.  1.  4.  3.  6.  7.  8.  5.  7. 11.  3.  6. 11.  6. 10.
  6. 19. 18. 14.  6. 10.  7.  9.  8.  5.  8.  3. 10.  2.  5.  1.  5.  4.
  2.  1.  0.  1.]
[ 0.  1.  1.  3.  3.  4.  4.  6.  3.  4.  9.  9.  7.  6.  8. 15. 12. 15.
  6. 11.  6. 18.  5. 14. 15. 12.  9.  8.  3.  6. 10.  6.  8.  7.  2.  5.
  4.  3.  1.  1.]
[ 0.  1.  2.  2.  4.  3.  1.  4.  8.  9.  5. 10. 10.  3.  4.  6.  7. 11.
 16.  6. 14.  9. 11. 10. 10.  7. 10.  8.  8.  4.  5.  8.  4.  4.  5.  2.
  4.  1.  1.  0.]
[ 0.  0.  2.  3.  4.  5.  4.  6.  2.  9.  7.  4.  9. 10.  8. 11. 16. 12.
 15. 17. 19. 10. 18. 13. 15. 11.  8.  4.  7. 11.  6.  7.  6.  5.  1.  3.
  1.  0.  0.  0.]
[ 0.  1.  1.  3.  1.  4.  6.  2.  8.  2. 10.  3. 11.  9. 13. 15.  5. 15.
  6. 10. 10.  5. 14. 15. 12.  7.  4.  5. 11.  4.  6.  9.  5.  6.  1.  1.
  2.  1.  2.  1.]
[ 0.  0.  1.  3.  2.  5.  1.  2.  7.  6.  6.  3. 12.  9.  4. 14.  4.  6.
 12.  9. 12.  7. 11.  7. 16.  8. 13.  6.  7.  6. 10.  7.  6.  3.  1.  5.
  4.  3.  0.  0.]
[ 0.  0.  1.  2.  3.  4.  5.  7.  5.  4. 10.  5. 12. 12.  5.  4.  7.  9.
 18. 16. 16. 10. 15. 15. 10.  4.  3.  7.  5.  9.  4.  6.  2.  4.  1.  4.
  2.  2.  2.  1.]
[ 0.  1.  2.  1.  1.  3.  5.  3.  6.  3. 10. 10. 11. 10. 13. 10. 13.  6.
  6. 14.  5.  4.  5.  5.  9.  4. 12.  7.  7.  4.  7.  9.  3.  3.  6.  3.
  4.  1.  2.  0.]
[ 0.  1.  2.  2.  3.  5.  2.  4.  5.  6.  8.  3.  5.  4.  3. 15. 15. 12.
 16.  7. 20. 15. 12.  8.  9.  6. 12.  5.  8.  3.  8.  5.  4.  1.  3.  2.
  1.  3.  1.  0.]
[ 0.  0.  0.  2.  4.  4.  5.  3.  3.  3. 10.  4.  4.  4. 14. 11. 15. 13.
 10. 14. 11. 17.  9. 11. 11.  7. 10. 12. 10. 10. 10.  8.  7.  5.  2.  2.
  4.  1.  2.  1.]
[ 0.  0.  2.  1.  1.  4.  4.  7.  2.  9.  4. 10. 12.  7.  6.  6. 11. 12.
  9. 15. 15.  6.  6. 13.  5. 12.  9.  6.  4.  7.  7.  6.  5.  4.  1.  4.
  2.  2.  2.  1.]
[ 0.  1.  2.  1.  1.  4.  5.  4.  4.  5.  9.  7. 10.  3. 13. 13.  8.  9.
 17. 16. 16. 15. 12. 13.  5. 12. 10.  9. 11.  9.  4.  5.  5.  2.  2.  5.
  1.  0.  0.  1.]
[ 0.  0.  1.  3.  2.  3.  6.  4.  5.  7.  2.  4. 11. 11.  3.  8.  8. 16.
  5. 13. 16.  5.  8.  8.  6.  9. 10. 10.  9.  3.  3.  5.  3.  5.  4.  5.
  3.  3.  0.  1.]
[ 0.  1.  1.  2.  2.  5.  1.  7.  4.  2.  5.  5.  4.  6.  6.  4. 16. 11.
 14. 16. 14. 14.  8. 17.  4. 14. 13.  7.  6.  3.  7.  7.  5.  6.  3.  4.
  2.  2.  1.  1.]
[ 0.  1.  1.  1.  4.  1.  6.  4.  6.  3.  6.  5.  6.  4. 14. 13. 13.  9.
 12. 19.  9. 10. 15. 10.  9. 10. 10.  7.  5.  6.  8.  6.  6.  4.  3.  5.
  2.  1.  1.  1.]
[ 0.  0.  0.  1.  4.  5.  6.  3.  8.  7.  9. 10.  8.  6.  5. 12. 15.  5.
 10.  5.  8. 13. 18. 17. 14.  9. 13.  4. 10. 11. 10.  8.  8.  6.  5.  5.
  2.  0.  2.  0.]
[ 0.  0.  1.  0.  3.  2.  5.  4.  8.  2.  9.  3.  3. 10. 12.  9. 14. 11.
 13.  8.  6. 18. 11.  9. 13. 11.  8.  5.  5.  2.  8.  5.  3.  5.  4.  1.
  3.  1.  1.  0.]

Now, as before when we read and printed each line of the file, the full data set is visible. The format is slightly different: all the commas have been removed and, instead of the original strings, each value has now been converted to a float, as indicated by the decimal point.

The use of the loop shows that we can treat data much like a list. Each line of the original data can be indexed as though it were an item in a list:

print(data[0])
print(data[17])

[ 0.  0.  1.  3.  1.  2.  4.  7.  8.  3.  3.  3. 10.  5.  7.  4.  7.  7.
 12. 18.  6. 13. 11. 11.  7.  7.  4.  6.  8.  8.  4.  4.  5.  7.  3.  4.
  2.  3.  0.  0.]
[ 0.  0.  0.  1.  2.  1.  4.  3.  6.  7.  4.  2. 12.  6. 12.  4. 14.  7.
  8. 14. 13. 19.  6.  9. 12.  6.  4. 13.  6.  7.  2.  3.  6.  5.  4.  2.
  3.  0.  1.  0.]

We can access individual items in the dataset with two indices and also use the slice that we applied to lists earlier:

print(data[0][0])
print(data[0][1])
print(data[0][2])
print(data[0][:3])
print(data[17][-3:])

0.0
0.0
1.0
[0. 0. 1.]
[0. 1. 0.]

We can also verify that the value in the dataset has been converted to a numerical type with type:

print(type(data[0][0]))

<class 'numpy.float64'>

This reveals that the value is not a plain Python float but a numpy.float64. The 64 refers to the number of bits of memory used to store the value, which determines how precisely the computer can represent it.
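The checks in this section can be tried without the data files. In the sketch below an io.StringIO object holding two rows of made-up values is passed to np.loadtxt in place of a filename. The shape and dtype attributes and the data[0, 2] form of indexing are not covered in the text above and are shown here only as a supplement.

```python
import io
import numpy as np

# Two rows of made-up values standing in for the inflammation file.
text = io.StringIO("0,0,1,3\n0,1,2,1\n")
data = np.loadtxt(text, delimiter=",")

print(data.shape)                # (number of rows, number of columns)
print(data.dtype)                # element type shared by every value
print(data[0][2], data[0, 2])    # two equivalent ways to index one item
```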

Processing a string

When we have data in a standard format, libraries will generally be available to help us read in files, even for proprietary formats (see e.g. the numpy and pandas libraries). However, on occasion we will have to write our own parser. For instance, our colleague might pass us a text file of marks from a piece of coursework, with each line of data looking like:

#Firstname Surname Mark1 Mark2 Total
James Grant 33 21 54

Write a function that takes a string of this form, i.e. datum = "James Grant 33 21 54", and returns a list, list = ['James', 'Grant', 33, 21, 54]. The line beginning with a # indicates that this is a comment. Remember that for the integer values you will also need to convert them from strings!

Suggestion: Before writing any code write out in natural language each of the steps that your function will perform.

Hint: my_string.split() takes a string and splits it into a list. The default 'delimiter' is whitespace (spaces, tabs, newlines). If you need to split on commas or another character instead, you need to specify this as a parameter, e.g. comma_separated_string.split(',').
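As a quick illustration of the hint, and not a solution to the exercise, the sketch below applies split to two made-up strings. The name spaced is illustrative.

```python
# Default behaviour: split on any run of whitespace.
spaced = "alpha  beta\tgamma\n"
print(spaced.split())

# Passing a delimiter: split on commas instead.
comma_separated_string = "1,2,3"
print(comma_separated_string.split(","))
```

Note that the pieces returned by split are always strings, even when they look like numbers.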

Solution

Re-invent the wheel

We would always advise you to use existing libraries wherever possible; however, parsing files is a useful practice ground for the ideas we have been covering. Write a function that reads in the csv formatted inflammation data, using a specified filename passed as a string, and returns a 2D list (list of lists) having converted all entries to floats.

Begin by writing in natural language each of the steps that your function will need to perform. Then implement your function and verify that it produces similar output to the numpy parser we introduced above.