Libraries and Modules in Python

Regular Expressions

Overview:

  • Teaching: 10 min
  • Exercises: 5 min

Questions

  • How can I handle different string formats consistently and with minimal code?
  • I've used * and ?. What else can I do with regular expressions?

Objectives

  • Understand how to use regex to handle different formats with minimal coding.

In the last session we tried to interpret strings as valid heights and weights. This involved looking for text such as "meter" or "kilogram" in the string, and then extracting the number. This process is called pattern matching, and is best undertaken using a regular expression.

Regular expressions have a long history and are available in most programming languages. Python provides them through the standard library module re, whose pattern syntax is similar to Perl's.

In [1]:
import re

Let's create a string that contains a height and see if we can use a regular expression to match that...

In [2]:
h = "2 meters"

To search for the text "meters" in a string, use re.search, e.g.

In [3]:
if re.search("meters", h):
    print("String contains 'meters'")
else:
    print("No match")
String contains 'meters'

re.search returns a match object if there is a match, or None if there isn't.

In [4]:
m = re.search("meters", h)
In [5]:
m
Out[5]:
<re.Match object; span=(2, 8), match='meters'>
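
The match object can be queried for what was matched and where. The short sketch below uses the group() and span() methods of the match object, and also shows a search that fails; the variable names and the search text "feet" are only illustrative.

```python
import re

h = "2 meters"

# A successful search returns a match object describing the match.
m = re.search("meters", h)
matched_text = m.group()   # the text that matched
start, end = m.span()      # where the match starts and ends in h

# An unsuccessful search returns None, which is why "if m:" works as a test.
no_match = re.search("feet", h)

print(matched_text, start, end, no_match)
```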

This matches "meters", but what about "meter"? "meter" is "meters" without an "s". You can specify that a character is matched 0 or 1 times by putting "?" after it.

In [6]:
h = "2 meter"
In [7]:
m = re.search("meters?", h)
In [8]:
m
Out[8]:
<re.Match object; span=(2, 7), match='meter'>
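
As a quick check, the same pattern can be tried against both spellings. This is only a sketch; the variable names are illustrative.

```python
import re

pattern = "meters?"   # "s?" means the "s" may appear zero or one times

singular = re.search(pattern, "2 meter").group()
plural = re.search(pattern, "2 meters").group()

# The "?" applies only to the "s", so the rest of the word is still required.
too_short = re.search(pattern, "2 met")

print(singular, plural, too_short)
```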

However, this pattern matches "meter" or "meters" anywhere in the string, so text such as "2 meters tall" would also match. We want the unit to match only at the end of the string. We do this using "$", which means "match at the end of the string".

In [9]:
m = re.search("meters?$", h)
In [10]:
m
Out[10]:
<re.Match object; span=(2, 7), match='meter'>
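
To see what "$" changes, we can try both patterns on a string that has extra text after the unit. The input "2 meters tall" is a made-up example.

```python
import re

text = "2 meters tall"   # hypothetical input with text after the unit

anywhere = re.search("meters?", text)   # finds "meters" inside the string
at_end = re.search("meters?$", text)    # None, as the string ends with "tall"

print(anywhere)
print(at_end)
```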

We also want to be able to match "m" as well as "meters". To do this, we need to use the "or" operator, which is "|". It is a good idea to put this in round brackets to make both sides of the "or" statement clear.

In [11]:
h = "2 m"
In [12]:
m = re.search("(m|meters?)$", h)
In [13]:
m
Out[13]:
<re.Match object; span=(2, 3), match='m'>
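
The brackets and the "$" both affect which alternative is chosen. The sketch below compares three variants of the pattern on the same string; the variable names are illustrative.

```python
import re

h = "2 meters"

# Without "$", the first alternative "m" succeeds straight away.
unanchored = re.search("(m|meters?)", h).group()

# With "$", "m" on its own does not reach the end of the string,
# so the second alternative is tried and matches.
anchored = re.search("(m|meters?)$", h).group()

# Without brackets, "|" splits the whole pattern into "m" OR "meters?$",
# so the "$" no longer applies to the "m" alternative.
unbracketed = re.search("m|meters?$", h).group()

print(unanchored, anchored, unbracketed)
```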

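Next, we want to match the number, e.g. "X meters", where "X" is a number. You can use "\d" to represent any single digit. Because the pattern now contains a backslash, it is safest to write it as a raw string (with an "r" before the opening quote), so that Python passes the backslash to re unchanged. For example

In [14]:
h = "2 meters"
In [15]:
m = re.search(r"\d (m|meters?)$", h)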
In [16]:
m
Out[16]:
<re.Match object; span=(0, 8), match='2 meters'>

A problem with the above example is that it only matches a number with a single digit, as "\d" only matches a single digit. To match one or more digits, we need to put a "+" afterwards, as this means "match one or more", e.g.

In [17]:
h = "10 meters"
In [18]:
m = re.search(r"\d+ (m|meters?)$", h)
In [19]:
m
Out[19]:
<re.Match object; span=(0, 9), match='10 meters'>
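
It is worth seeing what the single "\d" does with a two-digit number: the search does not fail, it quietly matches only the last digit. The comparison below is a sketch; the patterns are written as raw strings (the "r" prefix) so that the backslash reaches re unchanged.

```python
import re

h = "10 meters"

# "\d" matches exactly one digit, so the match starts at the "0".
single = re.search(r"\d (m|meters?)$", h).group()

# "\d+" matches one or more digits, so the whole number is included.
multiple = re.search(r"\d+ (m|meters?)$", h).group()

print(repr(single), repr(multiple))
```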

This match breaks if the number has a decimal point, as "\d" does not match ".". To match a decimal point, you need to use "\.", followed by "?", which means "match 0 or 1 decimal points", and then "\d*", which means "match 0 or more digits".

In [20]:
h = "1.5 meters"
In [21]:
m = re.search(r"\d+\.?\d* (m|meters?)$", h)
In [22]:
m
Out[22]:
<re.Match object; span=(0, 10), match='1.5 meters'>
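
To get a feel for which numbers this pattern accepts, we can test the number part on its own. The sketch below uses re.fullmatch, which only succeeds if the whole string matches the pattern; the list of candidates is made up for illustration.

```python
import re

number = r"\d+\.?\d*"   # digits, an optional decimal point, optional digits

candidates = ["10", "1.5", "1.", ".5", "1.5.2"]
accepted = [c for c in candidates if re.fullmatch(number, c)]

print(accepted)
```

Note that "1." is accepted, while ".5" is rejected because "\d+" needs at least one digit before the decimal point.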

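The number must also match at the beginning of the string. We use "^" to mean "match at the start of the string". In the example below the string starts with "some", so there is no match: m is None and nothing is printed.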

In [23]:
h = "some 1.8 meters"
In [24]:
m = re.search(r"^\d+\.?\d* (m|meters?)$", h)
In [25]:
m
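
The effect of "^" is easiest to see by running the search with and without it. This is a sketch using the same string; the variable names are illustrative.

```python
import re

h = "some 1.8 meters"

# Without "^", the search skips over "some " and matches the rest.
without_caret = re.search(r"\d+\.?\d* (m|meters?)$", h)

# With "^", the number has to start at the very beginning of the string.
with_caret = re.search(r"^\d+\.?\d* (m|meters?)$", h)

print(without_caret)
print(with_caret)
```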

Finally, we want this match to be case insensitive, and we would like the user to be free to put any amount of whitespace between the number and the unit, and before or after the text. To do this we use "\s*" to match zero or more whitespace characters, and pass re.IGNORECASE to re.search.

In [26]:
h = "   1.8 METers   "
In [27]:
m = re.search(r"^\s*\d+\.?\d*\s*(m|meters?)\s*$", h, re.IGNORECASE)
In [28]:
m
Out[28]:
<re.Match object; span=(0, 16), match='   1.8 METers   '>
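
A quick way to check a pattern like this is to run it over a handful of inputs. The candidates below are made up for illustration; note that "\s" also matches tabs, not just spaces.

```python
import re

pattern = r"^\s*\d+\.?\d*\s*(m|meters?)\s*$"

candidates = ["1.8m", "1.8 M", "  1.8   Meters", "1.8\tmeter", "1.8 miles"]

# Record True if the candidate matches the pattern, False otherwise.
results = {c: bool(re.search(pattern, c, re.IGNORECASE)) for c in candidates}

for text, ok in results.items():
    print(repr(text), ok)
```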

The round brackets do more than just group parts of your search. They also allow you to extract the parts that match.

In [29]:
m.groups()
Out[29]:
('METers',)

You can place round brackets around the parts of the match you want to capture. In this case, we want to get the number...

In [30]:
m = re.search(r"^\s*(\d+\.?\d*)\s*(m|meters?)\s*$", h, re.IGNORECASE)
In [31]:
m.groups()
Out[31]:
('1.8', 'METers')
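
If counting brackets becomes awkward, Python's re module also lets you name a group with "(?P<name>...)" and retrieve it by name. The sketch below is an alternative to indexing m.groups(); the group names "number" and "unit" are our own choice.

```python
import re

h = "   1.8 METers   "

# Same pattern as above, but each group is given a name.
pattern = r"^\s*(?P<number>\d+\.?\d*)\s*(?P<unit>m|meters?)\s*$"
m = re.search(pattern, h, re.IGNORECASE)

number = m.group("number")   # look the group up by name...
unit = m.group("unit")
first = m.group(1)           # ...or by position, counting from 1

print(number, unit, first)
```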

As m.groups()[0] contains the text matched by the first set of round brackets (which is the number), we can use it to get the number. This enables us to rewrite the string_to_height function from the last section as:

In [32]:
def string_to_height(height):
    """Parse the passed string as a height. Valid formats are 'X m', 'X meters' etc."""
    m = re.search(r"^\s*(\d+\.?\d*)\s*(m|meters?)\s*$", height, re.IGNORECASE)
    
    if m:
        return float(m.groups()[0])
    else:
        raise TypeError("Cannot extract a valid height from '%s'" % height)
In [33]:
h = string_to_height("   1.5    meters   ")
In [34]:
h
Out[34]:
1.5
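
It is also worth checking that the function rejects strings that are not valid heights. The sketch below repeats the function so that it runs on its own, and tries a few made-up inputs, recording None for any that raise the TypeError.

```python
import re

def string_to_height(height):
    """Same function as above, repeated so that this example is self-contained."""
    m = re.search(r"^\s*(\d+\.?\d*)\s*(m|meters?)\s*$", height, re.IGNORECASE)

    if m:
        return float(m.groups()[0])
    else:
        raise TypeError("Cannot extract a valid height from '%s'" % height)

results = {}

for text in ["1.8m", "2 METERS", "tall", "1.8 feet", "meters"]:
    try:
        results[text] = string_to_height(text)
    except TypeError as e:
        # Invalid heights raise an exception; record None and show the message.
        results[text] = None
        print(e)

print(results)
```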

Exercises

1

Rewrite your string_to_weight function using regular expressions. Check that it responds correctly to a range of valid and invalid weights.

2

Update string_to_height so that it can also understand heights in both meters and centimeters (returning the height in meters), and update string_to_weight so that it can also understand weights in both grams and kilograms (returning the weight in kilograms). Note that you may find it easier to separate the number from the units. You can do this using the below function to divide the string into the number and units. This uses "\w" to match any word character.
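
One possible form of such a function is sketched below. This is an assumption about what the helper could look like, not the original: the name split_number_and_units, the choice to return None on failure, and the lower-casing of the units are all hypothetical. The units must start with a letter and may then contain any word characters ("\w"), because "\w" on its own would also match digits.

```python
import re

def split_number_and_units(text):
    """Hypothetical helper: split '150 cm' into (150.0, 'cm').

    Returns None if the text is not a number followed by a unit.
    """
    # ([a-z]\w*) captures a unit that starts with a letter and continues
    # with any word characters; re.IGNORECASE lets it start with A-Z too.
    m = re.search(r"^\s*(\d+\.?\d*)\s*([a-z]\w*)\s*$", text, re.IGNORECASE)

    if m is None:
        return None

    number, units = m.groups()
    return float(number), units.lower()

print(split_number_and_units("150 cm"))
print(split_number_and_units("1.8 Meters"))
print(split_number_and_units("15"))
```

The calling function can then compare the returned units against the names it understands (for example "cm" or "meters") and convert the number accordingly.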

Key Points:

  • Regular expressions are a powerful way of searching strings
  • Appropriate use of regex allows you to write a single (but complex) line of code to manipulate or extract values from strings