Introduction to Data and Plotting

A bit more NumPy

Overview

  • Teaching: 10 min
  • Exercises: 10 min

Questions

  • Why do I need NumPy arrays?
  • What else can NumPy do?

Objectives

  • Learn about the data types that NumPy uses.
  • See that NumPy is often faster than vanilla Python.
  • Learn about copies, views and slices.

As always, we must first import NumPy:

In [2]:
import numpy as np

Data types

You may have noticed that, in some instances, array elements are displayed with a trailing dot (e.g. 2. vs 2). This is due to a difference in the data type used:

In [3]:
a = np.array([1, 2, 3])
a.dtype
Out[3]:
dtype('int64')
In [4]:
b = np.array([1., 2., 3.])
b.dtype
Out[4]:
dtype('float64')

We met the concept of type in "Introduction to Python".

Different data types allow us to store data more compactly in memory, but most of the time we simply work with floating-point numbers. Note that, in the example above, NumPy auto-detects the data type from the input, but you can specify it explicitly:

In [5]:
c = np.array([1, 2, 3], dtype=float)
c.dtype
Out[5]:
dtype('float64')

The default data type is floating point.

In [6]:
d = np.ones((3, 3))
d.dtype
Out[6]:
dtype('float64')

There are other data types as well:

In [7]:
e = np.array([1+2j, 3+4j, 5+6*1j])
e.dtype
Out[7]:
dtype('complex128')
In [8]:
f = np.array([True, False, False, True])
f.dtype
Out[8]:
dtype('bool')
In [9]:
g = np.array(['Bonjour', 'Hello', 'Hallo',])
g.dtype     # <--- strings containing max. 7 letters
Out[9]:
dtype('<U7')
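
Different dtypes really do change how compactly values are stored. A minimal sketch using the .itemsize attribute (this assumes the default integer dtype is int64, as on most 64-bit platforms):

compact = np.array([1, 2, 3], dtype=np.int8)  # 1 byte per element
default = np.array([1, 2, 3])                 # typically int64: 8 bytes per element

compact.itemsize, default.itemsize            # e.g. (1, 8)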

dtypes

Recreate some of the arrays we created in the previous lesson and look at what dtype they have. Try looking at the solutions to the exercise "Different arrays".

Solution

Why NumPy?

To show some of the advantages of NumPy over a standard Python list, let's do some benchmarking. It's an important habit in programming that whenever you think one method may be faster than another, you check to see whether your assumption is true.

Python provides some tools to make this easier, particularly via the timeit module. Building on this functionality, IPython and Jupyter provide a %timeit "magic" command to make our lives easier. To use the %timeit magic, simply put it at the beginning of a line and it will report how long the rest of that line took to run. It doesn't always work as you would expect, so to make your life easier, put whatever code you want to benchmark inside a function and time that function call.
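
For reference, outside IPython or Jupyter you can call the standard-library timeit module directly; a minimal sketch (the function being timed here is just a placeholder workload):

import timeit

def work():
    sum(range(1000))  # placeholder workload to benchmark

# total time in seconds for 100 calls of work()
timeit.timeit(work, number=100)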

We start by making a list and an array of 100,000 items each, with values counting from 0 to 99,999:

In [16]:
python_list = list(range(100000))
numpy_array = np.arange(100000)

We are going to go through each item in the list and double its value in place, so that the list is changed after the operation. To do this with a Python list we need a for loop:

In [17]:
def python_double(a):
    for i, val in enumerate(a):
        a[i] = val * 2

%timeit python_double(python_list)
16.2 ms ± 1.13 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

On this machine, each loop takes on the order of tens of milliseconds to execute.

To do the same operation in NumPy we can use the fact that multiplying a NumPy array by a value will apply that operation to each of its elements:

In [18]:
def numpy_double(a):
    a *= 2

%timeit numpy_double(numpy_array)
97.1 µs ± 926 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

On this machine, each loop takes on the order of hundreds of microseconds to execute. As you can see, the NumPy version is dramatically faster: more than 100 times faster in this case.

Have a think about why this might be: what is NumPy doing to make this so much faster? There are two main parts to the answer.
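
One hint at the first part of the answer is memory layout: a Python list stores a pointer to a separate, full Python object for every element, while a NumPy array stores the raw values contiguously, all with the same dtype. A small sketch (exact byte counts are implementation- and platform-dependent):

import sys

numpy_array = np.arange(100000)

numpy_array.itemsize   # bytes of raw data per element, e.g. 8 for int64
numpy_array.nbytes     # total bytes of raw data in the array

sys.getsizeof(12345)   # bytes for one standalone Python int object, e.g. 28

The second part is that the loop over those raw values runs in optimised, compiled code inside NumPy rather than in the Python interpreter.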

enumerate()

We briefly introduced the function enumerate() earlier. How can you find out what it does?

Need for Speed

Using %timeit, time how long finding the square roots of a list of numbers would take under both standard Python and NumPy.

Hint: Python's square root function is math.sqrt. NumPy's is np.sqrt.

Solution

Copies and views

We saw slicing in the "Introduction to Python" lesson as a way to access parts of a list.

In NumPy, a slicing operation (like reshaping before) creates a view of the original array, which is just a way of accessing the array's data. The original array is not copied in memory, which means you can slice even large arrays without any great performance hit. You can use np.may_share_memory() to check whether two arrays share the same memory block. Note, however, that this uses heuristics and may give you false positives.

When modifying the view, the original array is modified as well:

In [19]:
a = np.arange(10)
a
Out[19]:
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [20]:
b = a[3:7]

np.may_share_memory(a, b)
Out[20]:
True
In [21]:
b[0] = 12
b
Out[21]:
array([12,  4,  5,  6])
In [22]:
a   # (!)
Out[22]:
array([ 0,  1,  2, 12,  4,  5,  6,  7,  8,  9])
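
For comparison, slicing a Python list produces a new list (a copy), so the same modification would leave the original list untouched:

lst = list(range(10))
sub = lst[3:7]   # list slicing copies the elements
sub[0] = 12
lst              # unchanged: still counts from 0 to 9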

To avoid the view sharing memory with the original array, we can explicitly create a copy using the .copy() method.

In [16]:
a = np.arange(10)
c = a[::2].copy()  # force a copy
c[0] = 12
a
Out[16]:
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [17]:
np.may_share_memory(a, c)  # we made a copy so there is no shared memory
Out[17]:
False

Whether you make a view or a copy can affect the speed of your code significantly. Get into the habit of checking whether your code is doing unnecessary work. Also, be sure to benchmark your code as you work on it so that you notice any slowdowns and know which parts are slow, letting you speed up the right bits.
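
Besides np.may_share_memory(), every array has a .base attribute: it is None for an array that owns its data, and it refers to the original array for a view. A quick sketch of how you might check what a slice produced:

a = np.arange(10)
view = a[3:7]
copied = a[3:7].copy()

view.base is a        # True: the view borrows a's memory
copied.base is None   # True: the copy owns its own memory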

Key Points

  • NumPy arrays consist of values that are all the same type (or dtype).
  • Python list elements do not all have to be the same type.
  • NumPy is often faster than plain Python, partly because array values are all the same type and partly because the work runs in optimised, compiled code.
  • NumPy arrays may share memory, so a change to one may change another.