As always we must:
import numpy as np
You may have noticed that, in some instances, array elements are displayed with a trailing dot (e.g. 2. vs 2). This is due to a difference in the data-type used:
a = np.array([1, 2, 3])
a.dtype
b = np.array([1., 2., 3.])
b.dtype
We met the concept of type
in "Introduction to Python".
Different data-types allow us to store data more compactly in memory, but most of the time we simply work with floating point numbers. Note that, in the example above, NumPy auto-detects the data-type from the input but you can specify it explicitly:
c = np.array([1, 2, 3], dtype=float)
c.dtype
The default data type is floating point.
d = np.ones((3, 3))
d.dtype
There are other data types as well:
e = np.array([1+2j, 3+4j, 5+6j])
e.dtype
f = np.array([True, False, False, True])
f.dtype
g = np.array(['Bonjour', 'Hello', 'Hallo'])
g.dtype # <--- strings containing max. 7 letters
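As a quick illustration of how the choice of data-type affects memory use, we can compare the per-element size of a few arrays via the itemsize attribute (the variable names below are our own):

```python
import numpy as np

# Each dtype stores one element in a fixed number of bytes
small = np.array([1, 2, 3], dtype=np.int8)   # 1 byte per element
default_int = np.array([1, 2, 3])            # platform default int, typically 8 bytes
floats = np.array([1., 2., 3.])              # float64: 8 bytes per element

print(small.itemsize, default_int.itemsize, floats.itemsize)
```

This is what "store data more compactly" means in practice: the same three values can occupy 3 bytes or 24, depending on the dtype.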
To show some of the advantages of NumPy over a standard Python list, let's do some benchmarking. It's an important habit in programming that whenever you think one method may be faster than another, you check to see whether your assumption is true.
Python provides some tools to make this easier, particularly via the timeit module. Building on this, IPython and Jupyter provide a %timeit "magic" function to make our life easier. To use the %timeit magic, simply put it at the beginning of a line and it will report how long that line took to run. It doesn't always behave as you might expect, so to make your life easier, put whatever code you want to benchmark inside a function and time that function call.
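Note that %timeit only works inside IPython and Jupyter. In a plain Python script, the same measurement can be made with the standard timeit module directly; a minimal sketch (the function being timed here is just a placeholder):

```python
import timeit

def work():
    # A stand-in for whatever code you want to benchmark
    return sum(range(1000))

# number=10000 calls the function 10000 times and returns the total seconds
elapsed = timeit.timeit(work, number=10000)
print(f"{elapsed:.4f} s for 10000 calls")
```

Like %timeit, this approach times a function call rather than a bare expression, which keeps the measurement predictable.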
We start by making a list and an array of 100,000 items each, with values counting from 0 to 99,999:
python_list = list(range(100000))
numpy_array = np.arange(100000)
We are going to go through each item in the list and double its value in-place using a function, such that the list is changed after the operation. To do this with a Python list we need a for loop:
def python_double(a):
    for i, val in enumerate(a):
        a[i] = val * 2
%timeit python_double(python_list)
On this machine, this takes on the order of tens of milliseconds per loop.
To do the same operation in NumPy we can use the fact that multiplying a NumPy array
by a value will apply that operation to each of its elements:
def numpy_double(a):
    a *= 2
%timeit numpy_double(numpy_array)
On this machine, this takes on the order of hundreds of microseconds per loop. As you can see, the NumPy version is at least 10 times faster, sometimes up to 100 times faster.
Have a think about why this might be, what is NumPy doing to make this so much faster? There are two main parts to the answer.
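One way to see part of the answer: a Python list stores references to full Python objects, while a NumPy array stores raw fixed-size values contiguously in memory. The comparison below is a rough sketch of that difference (sys.getsizeof reports the size of a single object, including its header, and is CPython-specific):

```python
import sys
import numpy as np

numbers = list(range(1000))
arr = np.arange(1000)

# Each Python int is a full object with its own header and type information...
print(sys.getsizeof(numbers[0]))   # bytes for one small int object (28 on CPython)

# ...while the array packs raw fixed-size integers side by side, so a
# compiled C loop can sweep through them without per-element type checks
print(arr.itemsize)                # bytes per element in the array
print(arr.nbytes)                  # total bytes of raw data (1000 * itemsize)
```

The two main parts of the answer, then: compact homogeneous storage, and loops that run in compiled code rather than in the Python interpreter.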
We saw slicing in the "Introduction to Python" lesson as a way to view parts of a list.
In NumPy, a slicing operation (like reshaping before) creates a view of the original array, which is just a way of accessing array data. Thus the original array is not copied in memory. This means you can do this to large arrays without any great performance hit. You can use np.may_share_memory()
to check if two arrays share the same memory block. Note however, that this uses heuristics and may give you false positives.
When modifying the view, the original array is modified as well:
a = np.arange(10)
a
b = a[3:7]
np.may_share_memory(a, b)
b[0] = 12
b
a # (!)
To avoid the view sharing memory, we can explicitly create a copy, using the .copy()
method.
a = np.arange(10)
c = a[::2].copy() # force a copy
c[0] = 12
a
np.may_share_memory(a, c) # we made a copy so there is no shared memory
Whether you make a view or a copy can affect the speed of your code significantly. Be in the habit of checking whether your code is doing unnecessary work. Also, benchmark your code as you work on it so that you notice any slowdowns and know which parts are slow, so that you speed up the right bits.
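To check the cost of an unnecessary copy yourself, a benchmark along these lines can be run (using the timeit module so it also works outside IPython; the array size here is arbitrary):

```python
import timeit
import numpy as np

big = np.arange(1_000_000)

# A view only builds a small wrapper object around the existing data...
view_time = timeit.timeit(lambda: big[::2], number=1000)

# ...while a copy has to move half a million elements to new memory
copy_time = timeit.timeit(lambda: big[::2].copy(), number=1000)

print(f"view: {view_time:.4f} s, copy: {copy_time:.4f} s")
```

On a typical machine the copy is dramatically slower than the view, which is why copying only when you actually need independent data pays off.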