Carrying on from yesterday, we will continue learning how to manipulate data in numpy before using matplotlib to plot our data.
import numpy as np
You may have noticed that, in some instances, array elements are displayed with a trailing dot (e.g. 2. vs 2). This is due to a difference in the data-type used:
a = np.array([1, 2, 3])
a.dtype
b = np.array([1., 2., 3.])
b.dtype
Different data-types allow us to store data more compactly in memory, but most of the time we simply work with floating point numbers. Note that, in the example above, NumPy auto-detects the data-type from the input but you can specify it explicitly:
c = np.array([1, 2, 3], dtype=float)
c.dtype
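To make the point about compact storage concrete, here is a small sketch comparing the memory used by the same number of elements under different data-types. The variable names are illustrative only, and the attributes itemsize and nbytes are standard NumPy array attributes that are not covered elsewhere in this section.

```python
import numpy as np

# The same 1000 zeros stored with three different data-types.
small = np.zeros(1000, dtype=np.int8)      # 1 byte per element
medium = np.zeros(1000, dtype=np.float32)  # 4 bytes per element
large = np.zeros(1000, dtype=np.float64)   # 8 bytes per element

for arr in (small, medium, large):
    # itemsize is bytes per element, nbytes is the total for the array
    print(arr.dtype, arr.itemsize, arr.nbytes)
```

The smaller types save memory but can hold a narrower range of values, which is one reason floating point is usually the convenient choice.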
For array-creation functions such as np.ones, the default data type is floating point.
d = np.ones((3, 3))
d.dtype
There are other data types as well:
e = np.array([1+2j, 3+4j, 5+6*1j])
type(1j)
#e.dtype
f = np.array([True, False, False, True])
f.dtype
g = np.array(['Bonjour', 'Hello', 'Hallo',])
g.dtype # <--- strings containing max. 7 letters
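One consequence worth seeing, sketched below with illustrative variable names, is that an array's data-type is fixed when it is created. Values assigned later are converted to fit it, which can silently lose information.

```python
import numpy as np

ints = np.array([1, 2, 3])
ints[0] = 1.9               # the fractional part is dropped to fit the integer dtype
print(ints)                 # [1 2 3]

words = np.array(['Bonjour', 'Hello', 'Hallo'])
words[1] = 'Guten Morgen'   # longer than 7 characters, so it is cut short
print(words[1])             # Guten M
```

If you need to store other kinds of values, create a new array with a suitable dtype rather than relying on assignment.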
We previously came across dtypes when learning about pandas. This is because pandas uses NumPy as its underlying library. A pandas.Series is essentially a np.array with some extra features wrapped around it.
Recreate some of the arrays we created in yesterday's session and look at what dtype they have.
To show some of the advantages of NumPy over a standard Python list, let's do some benchmarking. It's an important habit in programming that whenever you think one method may be faster than another, you check to see whether your assumption is true.
Python provides some tools to make this easier, particularly via the timeit module. Using this functionality, IPython provides a %timeit magic function to make our life easier. To use the %timeit magic, simply put it at the beginning of a line and it will give you information about how long it took to run. It doesn't always work as you would expect so, to make your life easier, put whatever code you want to benchmark inside a function and time that function call.
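Outside IPython the %timeit magic is not available, but the timeit module can be called directly. The following is a minimal sketch, where sum_of_squares is a hypothetical workload invented purely for illustration:

```python
import timeit

def sum_of_squares(n):
    """Hypothetical workload used only for illustration."""
    return sum(i * i for i in range(n))

# Run the call 100 times and get the total elapsed time in seconds.
total = timeit.timeit(lambda: sum_of_squares(1000), number=100)
print(f"{total / 100:.2e} seconds per call")
```

Unlike %timeit, which chooses the number of repetitions for you, timeit.timeit needs to be told how many times to run the code via number.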
We start by making a list and an array of 100000 items each, with values counting from 0 to 99999:
python_list = list(range(100000))
numpy_array = np.arange(100000)
We are going to go through each item in the list and double its value in-place, such that the list is changed after the operation. To do this with a Python list we need a for loop:
def python_double(a):
    for i, val in enumerate(a):
        a[i] = val * 2
%timeit python_double(python_list)
To do the same operation in NumPy we can use the fact that multiplying a NumPy array by a value will apply that operation to each of its elements:
def numpy_double(a):
    a *= 2
%timeit numpy_double(numpy_array)
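It is worth seeing why the for loop is needed for the list. In the short sketch below (variable names are illustrative), multiplying a Python list by 2 repeats the list rather than doubling its elements, while the same expression on a NumPy array works element by element:

```python
import numpy as np

values = [1, 2, 3]

print(values * 2)            # [1, 2, 3, 1, 2, 3]
print(np.array(values) * 2)  # [2 4 6]
```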
As you can see, the NumPy version is at least 10 times faster, sometimes up to 100 times faster.
Have a think about why this might be: what is NumPy doing to make this so much faster? There are two main parts to the answer.
A slicing operation (like reshaping before) creates a view on the original array, which is just a way of accessing array data. Thus the original array is not copied in memory. This means you can do this to large arrays without any great performance hit. You can use np.may_share_memory() to check if two arrays share the same memory block. Note however, that this uses heuristics and may give you false positives.
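As an illustration of such a false positive, consider two interleaved slices of the same array. They never touch the same element, but their memory ranges overlap. The sketch below also uses np.shares_memory, a NumPy function not otherwise covered here, which performs an exact (and potentially slower) check:

```python
import numpy as np

a = np.arange(10)
evens = a[::2]   # elements 0, 2, 4, 6, 8
odds = a[1::2]   # elements 1, 3, 5, 7, 9

# The memory ranges overlap, so the quick bounds check reports True...
print(np.may_share_memory(evens, odds))  # True
# ...but no element is actually shared between the two views.
print(np.shares_memory(evens, odds))     # False
```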
When modifying the view, the original array is modified as well:
a = np.arange(10)
a
b = a[3:7]
np.may_share_memory(a, b)
b[0] = 12
b
a # (!)
a = np.arange(10)
c = a[::2].copy() # force a copy
c[0] = 12
a
np.may_share_memory(a, c) # we made a copy so there is no shared memory
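Another way to tell a view from a copy, sketched here as a complement to np.may_share_memory, is the base attribute of an array. It is not used elsewhere in this section: for a view it refers to the array that owns the data, and for an array that owns its own data it is None.

```python
import numpy as np

a = np.arange(10)
view = a[3:7]          # a slice is a view onto a's data
copy = a[3:7].copy()   # an explicit copy owns its own data

print(view.base is a)     # True
print(copy.base is None)  # True
```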
Whether you make a view or a copy can affect the speed of your code significantly. Get in the habit of checking whether your code is doing unnecessary work. Also, be sure to benchmark your code as you work on it so that you notice any slowdowns and know which parts are slow, so that you speed up the right bits.
Using %timeit, time how long finding the square roots of a list of numbers would take under both standard Python and numpy. Python's square root function is math.sqrt. numpy's is np.sqrt.
# Answer
import math
python_list_2 = list(range(100000))
def python_sqrt(a):
    for i, val in enumerate(a):
        a[i] = math.sqrt(val)
%timeit python_sqrt(python_list_2)
# Answer
numpy_array_2 = np.arange(100000)
def numpy_sqrt(a):
    return np.sqrt(a)
%timeit numpy_sqrt(numpy_array_2)