In the last session we tried to interpret strings as valid heights and weights. This involved looking for text such as "meter" or "kilogram" in the string, and then extracting the number. This process is called pattern matching, and is best undertaken using a regular expression.
Regular expressions have a long history and are available in most programming languages. Python's standard library provides regular expressions, with a Perl-style syntax, in a module called re.
import re
Let's create a string that contains a height and see if we can use a regular expression to match that...
h = "2 meters"
To search for the text "meters" in a string, use re.search, e.g.
if re.search("meters", h):
    print("String contains 'meters'")
else:
    print("No match")
re.search returns a match object if there is a match, or None if there isn't.
m = re.search("meters", h)
m
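As a minimal sketch of what the match object offers, the example below reads back the matched text and its position using the group() and span() methods of the match object. The variable names matched_text, matched_span and no_match are only illustrative.

```python
import re

h = "2 meters"
m = re.search("meters", h)

# A successful search returns a match object, which is truthy.
matched_text = m.group()   # the text that matched
matched_span = m.span()    # (start, end) positions of the match within h
print(matched_text, matched_span)

# A failed search returns None, which is falsy.
no_match = re.search("kilograms", h)
print(no_match)
```

Because None is falsy and a match object is truthy, the result of re.search can be used directly in an if statement.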
This matches "meters", but what about "meter"? "meter" is "meters" without the "s". You can specify that a character is matched 0 or 1 times by putting "?" after it.
h = "2 meter"
m = re.search("meters?", h)
m
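A short sketch of the optional "s", assuming nothing beyond the re module: it checks both spellings, and also shows that the unanchored pattern is found inside a longer string. The variable names are illustrative.

```python
import re

# "s?" makes the final "s" optional, so both spellings are found.
singular = re.search("meters?", "2 meter")
plural = re.search("meters?", "2 meters")
print(singular.group(), plural.group())

# Nothing ties the pattern to a position, so it is also found inside longer text.
inside = re.search("meters?", "2 meters tall")
print(inside.group())
```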
However, this pattern is still too permissive, as it also matches "meters" in the middle of a string. We need to match "meters" only at the end of the string. We do this using "$", which means match at the end of the string.
m = re.search("meters?$", h)
m
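To see the effect of "$", the sketch below runs the anchored pattern over a few example strings (made up for illustration) and records whether each one matches.

```python
import re

pattern = "meters?$"
examples = ["2 meter", "2 meters", "2 meters tall", "2 meterss"]

# Map each example to True (pattern found) or False (no match).
results = {text: re.search(pattern, text) is not None for text in examples}
for text, found in results.items():
    print(repr(text), found)
```

Only the strings that end in "meter" or "meters" are accepted; trailing text or an extra "s" causes the search to fail.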
We also want to be able to match "m" as well as "meters". To do this, we need to use the "or" operator, which is "|". It is a good idea to put this in round brackets to make both sides of the "or" statement clear.
h = "2 m"
m = re.search("(m|meters?)$", h)
m
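The brackets matter because "|" has very low precedence. The sketch below compares the pattern with and without brackets on an example string chosen for illustration.

```python
import re

h = "2 m tall"

# Without brackets, "|" splits the whole pattern: either "m" anywhere,
# or "meters?" at the end of the string.
ungrouped = re.search("m|meters?$", h)

# With brackets, "$" applies to both alternatives.
grouped = re.search("(m|meters?)$", h)

print(ungrouped)  # a match object, because of the lone "m"
print(grouped)    # None, because the string ends in "tall"
```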
Next, we want to match the number, e.g. "X meters", where "X" is a number. You can use "\d" to represent any single digit. For example
h = "2 meters"
# The r prefix makes a raw string, so Python passes the backslash in "\d" to re unchanged
m = re.search(r"\d (m|meters?)$", h)
m
A problem with the above example is that it only matches a single digit, as "\d" matches exactly one digit. To match one or more digits, we need to put a "+" afterwards, as this means "match one or more", e.g.
h = "10 meters"
m = re.search(r"\d+ (m|meters?)$", h)
m
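The difference between "\d" and "\d+" is easiest to see by printing what each pattern actually matched. This is a small sketch; the patterns are written as raw strings (the r prefix) so that the backslash reaches the re module unchanged.

```python
import re

h = "10 meters"

# re.search scans forward for the first position where the pattern fits.
one_digit = re.search(r"\d (m|meters?)$", h)
many_digits = re.search(r"\d+ (m|meters?)$", h)

print(one_digit.group())    # only the last digit is part of the match
print(many_digits.group())  # the whole number is part of the match
```

With a single "\d" the search still succeeds, but the match starts at the "0" and leaves the "1" out.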
This match breaks if the number has a decimal point, as "\d" does not match ".". To match a decimal point, you need to use "\." (an unescaped "." would match almost any character), followed by "?", which means "match 0 or 1 decimal points", and then "\d*", which means "match 0 or more digits"
h = "1.5 meters"
m = re.search(r"\d+\.?\d* (m|meters?)$", h)
m
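As a rough check of which number formats this pattern accepts, the sketch below reports the text matched for a few example strings. The helper matched_text is hypothetical and exists only for this illustration.

```python
import re

pattern = r"\d+\.?\d* (m|meters?)$"

def matched_text(text):
    """Return the part of text that the pattern matched, or None."""
    m = re.search(pattern, text)
    return m.group() if m else None

for text in ["2 meters", "1.5 meters", "2. meters", ".5 meters"]:
    print(repr(text), "->", matched_text(text))
```

Note that "2." is accepted because the digits after the point are optional, and that for ".5 meters" the match starts at the "5", leaving the leading "." outside the match.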
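The number must match at the beginning of the string, so that strings with extra text before the number are rejected. We use "^" to mean match at the start. The search below returns None, because "some" comes before the number...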
h = "some 1.8 meters"
m = re.search(r"^\d+\.?\d* (m|meters?)$", h)
m
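The sketch below shows the anchored pattern rejecting strings that do not start with a digit. It also shows re.fullmatch, which is not used in this session but is a standard alternative that requires the whole string to match without writing "^" and "$".

```python
import re

anchored = r"^\d+\.?\d* (m|meters?)$"

print(re.search(anchored, "some 1.8 meters"))  # None: text before the number
print(re.search(anchored, ".5 meters"))        # None: no digit at the start
print(re.search(anchored, "1.8 meters"))       # a match object

# re.fullmatch requires the whole string to match, so no "^" or "$" is needed.
full = re.fullmatch(r"\d+\.?\d* (m|meters?)", "1.8 meters")
print(full.group())
```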
Finally, we want this match to be case insensitive, and would like the user to be free to use as many spaces as they want between the number and the unit, before the number or after the unit. To do this we use "\s*" to represent any amount of whitespace (including none), and match using re.IGNORECASE.
h = " 1.8 METers "
m = re.search(r"^\s*\d+\.?\d*\s*(m|meters?)\s*$", h, re.IGNORECASE)
m
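To get a feel for what the finished pattern accepts, the sketch below sorts a handful of made-up inputs into accepted and rejected lists. The example strings are assumptions chosen for illustration.

```python
import re

pattern = r"^\s*\d+\.?\d*\s*(m|meters?)\s*$"
examples = [" 1.8 METers ", "1.8m", "1.8\tM", "1.8 metres", "1.8 cm"]

# Keep the strings for which the case-insensitive search succeeds.
accepted = [text for text in examples if re.search(pattern, text, re.IGNORECASE)]
rejected = [text for text in examples if text not in accepted]

print("accepted:", accepted)
print("rejected:", rejected)
```

"\s" matches tabs as well as spaces, so "1.8\tM" is accepted. The spelling "metres" and the unit "cm" are rejected, as the pattern only knows "m", "meter" and "meters".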
The round brackets do more than just group parts of your search. They also allow you to extract the parts that match.
m.groups()
You can place round brackets around the parts of the match you want to capture. In this case, we want to get the number...
m = re.search(r"^\s*(\d+\.?\d*)\s*(m|meters?)\s*$", h, re.IGNORECASE)
m.groups()
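The sketch below shows what the captured groups contain for the example string, and how individual groups can be read with m.group(n). It also shows named groups, written (?P<name>...), which are not used in this session but are a standard way to refer to a group by name. The names number and unit are chosen here for illustration.

```python
import re

h = " 1.8 METers "
m = re.search(r"^\s*(\d+\.?\d*)\s*(m|meters?)\s*$", h, re.IGNORECASE)

print(m.groups())   # all captured groups as a tuple of strings
print(m.group(1))   # first group: the number
print(m.group(2))   # second group: the unit

# Named groups: (?P<name>...) lets you look a group up by name.
named = re.search(
    r"^\s*(?P<number>\d+\.?\d*)\s*(?P<unit>m|meters?)\s*$", h, re.IGNORECASE
)
print(named.group("number"), named.group("unit"))
```

Captured groups are always strings, so the number still has to be converted with float before it can be used in arithmetic.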
As m.groups()[0] contains the match of the first set of round brackets (which is the number), we can use it to get the number as a string. This enables us to rewrite the string_to_height function from the last section as:
def string_to_height(height):
    """Parse the passed string as a height. Valid formats are 'X m', 'X meters' etc."""
    m = re.search(r"^\s*(\d+\.?\d*)\s*(m|meters?)\s*$", height, re.IGNORECASE)
    if m:
        # The first captured group is the number, as a string
        return float(m.groups()[0])
    else:
        raise TypeError("Cannot extract a valid height from '%s'" % height)
h = string_to_height(" 1.5 meters ")
h
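The same approach should carry over to weights. The function below is a hypothetical string_to_weight, sketched by analogy with string_to_height; the accepted units ("kg", "kilogram", "kilograms") are an assumption, and the function from the last section may have differed.

```python
import re

def string_to_weight(weight):
    """Parse the passed string as a weight, e.g. 'X kg' or 'X kilograms'.

    Hypothetical counterpart to string_to_height, for illustration only.
    """
    m = re.search(r"^\s*(\d+\.?\d*)\s*(kg|kilograms?)\s*$", weight, re.IGNORECASE)
    if m:
        return float(m.groups()[0])
    else:
        raise TypeError("Cannot extract a valid weight from '%s'" % weight)

print(string_to_weight(" 75.5 KG "))

# Strings that do not fit the pattern raise an error, which the caller can catch.
try:
    string_to_weight("heavy")
except TypeError as e:
    print(e)
```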