Performant Python

Continuing NumPy

Carrying on from yesterday, we will continue learning how to manipulate data in NumPy before using matplotlib to plot our data.

In [1]:
import numpy as np

Basic data types

You may have noticed that, in some instances, array elements are displayed with a trailing dot (e.g. 2. vs 2). This is due to a difference in the data-type used:

In [2]:
a = np.array([1, 2, 3])
a.dtype
Out[2]:
dtype('int64')
In [3]:
b = np.array([1., 2., 3.])
b.dtype
Out[3]:
dtype('float64')

Different data-types allow us to store data more compactly in memory, but most of the time we simply work with floating point numbers. Note that, in the example above, NumPy auto-detects the data-type from the input but you can specify it explicitly:

In [4]:
c = np.array([1, 2, 3], dtype=float)
c.dtype
Out[4]:
dtype('float64')

The default data type is floating point.

In [5]:
d = np.ones((3, 3))
d.dtype
Out[5]:
dtype('float64')

There are other data types as well:

In [6]:
e = np.array([1+2j, 3+4j, 5+6*1j])
type(1j)
# e.dtype would show dtype('complex128')
Out[6]:
complex
In [7]:
f = np.array([True, False, False, True])
f.dtype
Out[7]:
dtype('bool')
In [8]:
g = np.array(['Bonjour', 'Hello', 'Hallo',])
g.dtype     # <--- strings containing max. 7 letters
Out[8]:
dtype('<U7')

We previously came across dtypes when learning about pandas. This is because pandas uses NumPy as its underlying library. A pandas.Series is essentially a NumPy array (np.ndarray) with some extra features wrapped around it.
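
For example, you can pull the underlying NumPy array out of a Series (a quick sketch, assuming pandas is available):

import pandas as pd

s = pd.Series([1., 2., 3.])
s.dtype        # dtype('float64'), the same dtype machinery as NumPy
s.to_numpy()   # array([1., 2., 3.]), the underlying np.ndarray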

Exercise 1

Recreate some of the arrays we created in yesterday's session and look at what dtype they have.
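
A possible answer (a sketch; we don't know exactly which arrays you made yesterday, so these are just representative examples):

# Answer (one possibility)
np.arange(10).dtype          # dtype('int64') on most 64-bit systems
np.linspace(0, 1, 5).dtype   # dtype('float64')
np.zeros((2, 2)).dtype       # dtype('float64')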

Why NumPy

To show some of the advantages of NumPy over a standard Python list, let's do some benchmarking. It's an important habit in programming that whenever you think one method may be faster than another, you check to see whether your assumption is true.

Python provides some tools to make this easier, particularly via the timeit module. Building on this, IPython provides a %timeit magic function to make our lives easier. To use the %timeit magic, simply put it at the beginning of a line and it will report how long that line took to run. It doesn't always work as you would expect, so to make your life easier, put whatever code you want to benchmark inside a function and time that function call.
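
For reference, the timeit module can also be used directly outside IPython; a minimal sketch (the function and the number of repeats here are just illustrative):

import timeit

def work():
    return sum(range(1000))

# total time in seconds for 1,000 calls of work()
timeit.timeit(work, number=1000)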

We start by making a list and an array of 100,000 items each, with values counting from 0 to 99,999:

In [9]:
python_list = list(range(100000))
numpy_array = np.arange(100000)

We are going to go through each item in the list and double its value in-place, such that the list is changed after the operation. To do this with a Python list we need a for loop:

In [10]:
def python_double(a):
    for i, val in enumerate(a):
        a[i] = val * 2

%timeit python_double(python_list)
10.9 ms ± 1.03 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

To do the same operation in NumPy we can use the fact that multiplying a NumPy array by a value will apply that operation to each of its elements:

In [11]:
def numpy_double(a):
    a *= 2

%timeit numpy_double(numpy_array)
55.4 µs ± 697 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

As you can see, the NumPy version is dramatically faster: here around 200 times (10.9 ms versus 55.4 µs), though the exact ratio will vary between machines.

Have a think about why this might be: what is NumPy doing to make this so much faster? There are two main parts to the answer.
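
One hint, as a rough sketch: compare how the two containers store their elements. A Python list holds full Python objects, while a NumPy array packs raw, fixed-size values of a single dtype into one contiguous block of memory:

import sys

sys.getsizeof(python_list[0])   # a full Python int object, ~24-28 bytes
numpy_array.itemsize            # a raw int64 value: 8 bytes
numpy_array.dtype               # one dtype shared by every element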

Copies and views

A slicing operation (like the reshaping we saw before) creates a view on the original array, which is just a way of accessing the array's data. The original array is not copied in memory, so you can do this to large arrays without any great performance hit. You can use np.may_share_memory() to check whether two arrays share the same memory block. Note, however, that this uses heuristics and may give you false positives.

When modifying the view, the original array is modified as well:

In [12]:
a = np.arange(10)
a
Out[12]:
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [13]:
b = a[3:7]

np.may_share_memory(a, b)
Out[13]:
True
In [14]:
b[0] = 12
b
Out[14]:
array([12,  4,  5,  6])
In [15]:
a   # (!)
Out[15]:
array([ 0,  1,  2, 12,  4,  5,  6,  7,  8,  9])
In [16]:
a = np.arange(10)
c = a[::2].copy()  # force a copy
c[0] = 12
a
Out[16]:
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [17]:
np.may_share_memory(a, c)  # we made a copy so there is no shared memory
Out[17]:
False

Whether you make a view or a copy can affect the speed of your code significantly. Be in the habit of checking whether your code is doing unnecessary work. Also, be sure to benchmark your code as you work on it, so that you notice any slowdowns and so that you know which parts are slow and can speed up the right bits.
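
As a quick illustration (a sketch; exact timings will vary from machine to machine), taking a view is nearly free, while forcing a copy has to allocate and fill new memory:

big = np.arange(10_000_000)

%timeit big[::2]         # view: no data is copied
%timeit big[::2].copy()  # copy: allocates and fills a new array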

Exercise 2

  • Using %timeit, time how long finding the square roots of a list of numbers takes under both standard Python and NumPy.
    • Tip: Python's square root function is math.sqrt; NumPy's is np.sqrt.
In [18]:
# Answer

import math

python_list_2 = list(range(100000))

def python_sqrt(a):
    for i, val in enumerate(a):
        a[i] = math.sqrt(val)

%timeit python_sqrt(python_list_2)
13.2 ms ± 195 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [19]:
# Answer

numpy_array_2 = np.arange(100000)

def numpy_sqrt(a):
    np.sqrt(a)

%timeit numpy_sqrt(numpy_array_2)
192 µs ± 6.06 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)