Occasionally you will want your user to input data via the keyboard, for example the name of a file to analyse. We can do this using the function input():
text = input("Enter some text: ")
We can verify that this has been read in correctly by printing the value in text:
print(text)
The value read in by input() is always a string, which we can check with type(). First, let's run input() again, but this time enter a number:
text = input("Now try entering a number: ")
print(text, type(text))
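Because input() always returns a string, a number typed at the prompt has to be converted before it can be used in arithmetic. The following is a minimal sketch of that conversion; it uses fixed strings in place of keyboard entry so that it runs without a prompt, and the variable names are only illustrative.

```python
# Stand-in for what input() would return if the user typed 42;
# input() always hands back a string, even for numeric entries.
text = "42"

# Convert explicitly before doing arithmetic.
number = int(text)
print(number + 1, type(number))

# float() handles entries that contain a decimal point.
value = float("3.5")
print(value * 2, type(value))
```

If the text cannot be interpreted as a number (for example "forty"), int() and float() raise a ValueError rather than guessing.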
!unzip intro-python-data.zip
You will also see Python magic commands, which use a % instead of a !. These are for specific Python commands rather than system commands.
Most of the time the data we want to read will be in files. In order to read from a file we first need to open the file:
file = open("data/inflammation-01.csv")
When we open the file it is a bit like picking a book off the shelf and opening it at the first page. We have not yet read in any of the data contained in the file. In order to do this we must read from the file. In Python, as in many languages, we can do this by reading each line of the file in turn:
line = file.readline()
print(line)
Python allows us to treat the file as a collection of lines which it reads in automatically so we can use the more readable form:
for line in file:
    print(line)
This is the equivalent of reading all the remaining lines in a book. If we run this again:
for line in file:
    print(line)
There is no output. It is as if we have reached the end of the book and are stuck. We need to close the book, and then we can open it and read it again, or choose another book and read that instead:
file.close()
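The book analogy can be checked with a small experiment. The sketch below creates a throwaway two-line file (the file name and contents are made up for illustration), loops over it twice without closing it, and then reopens it and loops once more, counting the lines seen on each pass.

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "example.csv")
file = open(path, "w")
file.write("0,1,2\n3,4,5\n")
file.close()

file = open(path)
first_pass = 0
for line in file:      # reads through to the end of the file
    first_pass += 1
second_pass = 0
for line in file:      # already at the end, so nothing is left to read
    second_pass += 1
file.close()

file = open(path)      # reopening starts again from the first line
third_pass = 0
for line in file:
    third_pass += 1
file.close()

print(first_pass, second_pass, third_pass)  # prints: 2 0 2
```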
Having to open, read and close files in this way is common in many programming languages. However, typically we just want to read in a whole file for processing in one go. Python has a particular structure, the with statement, that allows us to do this in a very compact form, and we can combine it with the function readlines() to read an entire file with one command:
with open("data/inflammation-01.csv") as file:
    read_file = file.readlines()
for line in read_file:
    print(line)
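One detail worth seeing for yourself is that the file is closed automatically when the indented block ends, so no explicit close is needed. The sketch below demonstrates this on a throwaway file (the path and contents are made up for illustration) using the closed attribute of the file object.

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "example.csv")
with open(path, "w") as out_file:
    out_file.write("0,1,2\n3,4,5\n")

with open(path) as file:
    read_file = file.readlines()
    print(file.closed)   # False: still open inside the block

print(file.closed)       # True: closed automatically once the block ends
print(read_file)         # the lines remain available as a list of strings
```

Each string in the list still ends with a newline character, which is why printing the lines one by one leaves a blank line between them.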
This is very powerful and we can now process each of the lines in turn. However, even for a standard format like a 'csv' file this is not trivial. First we have to split each line to turn it into a list. At this point all the items in the list are strings, as when we used input() to read from the keyboard, so each value needs to be turned into a numerical value with int() or float(). Each individual value then needs to be appended to a list, and finally the data from all lines needs to be assembled into a list of lists. This is given as an exercise at the end of the episode.
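As a starting point, the first few of those steps can be sketched for a single line. The line below is made up for illustration rather than taken from the data file, and assembling all lines into a list of lists is left for the exercise.

```python
# One line as it might come back from readlines(); the values are made up.
line = "0,0,1,3,1,2\n"

# strip() removes the trailing newline, split(",") cuts the text at each comma.
items = line.strip().split(",")
print(items)    # every item is still a string

# Convert each string to a number and append it to a new list.
values = []
for item in items:
    values.append(float(item))
print(values)
```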
For now, wouldn't it just be much easier if someone had written a library that we could use instead?
There are of course a number of libraries that we can use to read in our data. One of the most useful of these is numpy, a numerical library for Python that has a host of features and optimised routines for performing efficient calculations. In particular, for our purposes, it has a function for reading 'csv' files and converting them automatically into numerical values, if possible. First we must import the library and, by near-universal convention, when we import numpy we use the alias np:
import numpy as np
You do not have to use the alias np; you can just import numpy and write numpy everywhere we use np in the code that follows. However, we mention and use it because if you look at anyone else's code it is almost certain that this is how they will use the library. To read in a 'csv' file we can now use the single command:
data = np.loadtxt(fname='data/inflammation-01.csv', delimiter=',')
Let's check that something has happened and that the data has been read in:
print(data)
We can see that data contains values and that printing them results in something different from just printing out each line in the file. When print() is used with a numpy object and the data is bigger than can be neatly printed, only the first and last few values are printed. In between, an ellipsis is printed to indicate data that is present in data but omitted for clarity. To check that all the data has been read in as before we can access each 'line' of the data with:
for line in data:
    print(line)
Now, as before when we read and printed each line of the file, the full data set is visible. The format is slightly different: all the commas have been removed and, instead of the original strings, each value has now been converted to a float, as indicated by the decimal point.
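To see the conversion in isolation, the same function can be tried on a couple of lines of text held in memory. This is a small sketch that assumes numpy is installed; the two rows are made up for illustration, and io.StringIO is used only so that the text can stand in for a file on disk.

```python
import io

import numpy as np

# Two made-up rows of comma-separated text standing in for a file on disk.
csv_text = io.StringIO("0,0,1,3\n0,1,2,1\n")

small = np.loadtxt(csv_text, delimiter=",")
print(small)
print(small.shape)   # (number of rows, number of columns)
print(small.dtype)   # the numerical type every value was converted to
```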
The use of the loop shows that we can treat data like a list. Each line of the original data can be indexed as though it were an item in a list:
print(data[0])
print(data[17])
We can access individual items in the dataset with two indices and also use the slice that we applied to lists earlier:
print(data[0][0])
print(data[0][1])
print(data[0][2])
print(data[0][:3])
print(data[17][-3:])
We can also verify that the value in the dataset has been converted to a numerical type with type():
print(type(data[0][0]))
This reveals that the value is not simply a float but a special numpy.float64. The 64 refers to the number of bits of memory allocated to the value, which we can think of as determining how accurately the computer can represent the value.
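The relationship between this numpy type and an ordinary Python float can be explored with a short sketch, assuming numpy is installed. The value 0.1 is arbitrary, and np.finfo is used here simply to report how the type is stored.

```python
import numpy as np

value = np.float64(0.1)
print(type(value))

# numpy.float64 is a subclass of Python's float, so it works wherever a float does.
print(isinstance(value, float))

# finfo reports the storage size in bits and the approximate number of
# decimal digits that can be represented reliably.
info = np.finfo(np.float64)
print(info.bits, info.precision)
```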