Unleash the power of Python regular expressions

How to use the regular expression functions provided by the ‘re’ library to match, search, and replace text in your Python programs.

Regular expressions, or “regex,” are a system for finding complex patterns in text. Nearly every major language supports regular expressions, either through an add-on library or as a built-in feature.

Python comes with regex support out of the box, as part of its standard library. Here, we’ll take a quick tour of the Python regular expression library and how to get the most out of it. For more details about regular expressions generally, see this introduction in the Python documentation.

Python regex basics

To start using regexes in Python, simply import the Python regex library, re, which is included as part of Python’s standard library.

import re

The easiest way to use re is to make quick-and-dirty searches, using a particular regex against a particular string. Here is an easy example.

import re
text = 'b213 a13 x15'
print(re.search(r'a\d*\W', text)[0])  # prints a13 followed by a space

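Here, we use the regular expression 'a\d*\W', which looks for the letter a followed by zero or more digits and then a single non-word character (\W matches anything that is not a letter, digit, or underscore, such as the space here). re.search() takes that regular expression and looks for the first match against it in the provided string text. In the above example, the match is a13 followed by a space, because \W consumes the space after the digits.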

When re.search() finds a match, it returns what is called a match object, which is a data structure that contains many details about the match. When it finds nothing, it returns None. (More on match objects in a bit.)

Most of the time, if you’re just looking for the first match, you can obtain it simply by indexing into the match object as we’ve done above (with the [0] index).
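One caveat about indexing directly: because re.search() returns None when nothing matches, indexing the result of a failed search raises a TypeError. Here is a minimal sketch of a guarded lookup. The helper name first_match is hypothetical, chosen only for this illustration, and is not part of the re library.

```python
import re

def first_match(pattern, text):
    """Return the text of the first match, or None if nothing matches."""
    # first_match is an illustrative helper, not an re function.
    match = re.search(pattern, text)
    if match is None:
        return None
    return match[0]  # same as match.group(0)

found = first_match(r'a\d*\W', 'b213 a13 x15')
missing = first_match(r'z\d+', 'b213 a13 x15')
print(repr(found))  # 'a13 '
print(missing)      # None
```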

Finally, note that we used Python raw strings to construct our regex. This is because the syntax of a regex, since it uses backslashes, can conflict with the way ordinary strings are escaped in Python. The r prefix before the string tells the Python interpreter, “Don't interpret the backslashes as escape codes.”
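To see why the r prefix matters, consider \b, which means “backspace” in an ordinary Python string but “word boundary” in a regex. The short sketch below compares the two forms; the sample strings are made up for illustration.

```python
import re

# In an ordinary string, '\b' is a single backspace character.
# In a raw string, r'\b' is a backslash followed by the letter b,
# which the regex engine reads as a word boundary.
plain = '\bcat\b'
raw = r'\bcat\b'

print(len(plain), len(raw))  # 5 7

plain_result = re.search(plain, 'a cat sat')
raw_result = re.search(raw, 'a cat sat')
print(plain_result)   # None
print(raw_result[0])  # cat
```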

More ways to match using Python regex

re.search() is not the only way to find patterns in text with re, and it’s far from the most flexible. These four other methods available in re might better fit your use case:

  • re.match() is like re.search(), but looks only for matches from the beginning of the string and nowhere else. This is useful if you know you’re not going to be scanning the rest of the string, and you want to optimize the matching method.
  • re.fullmatch() attempts to match the regex against the entire string, and only the entire string. Again, this optimizes the matching strategy in cases where you want an all-or-nothing match.
  • re.finditer() looks for all the matches available, and returns them in the form of an iterator. Each step of the iteration yields a match object, one for each match found. This is useful if you are working with a large string that might yield a great many matches, and you want to conserve memory by creating and consuming one match object at a time.
  • re.findall() looks for all matches, like re.finditer(), but returns the matches as a simple list. If you don’t want to bother with all the details of working with match objects, you can just use re.findall() to produce a Python list of all the matches found. (If the pattern contains capture groups, the list holds the contents of those groups rather than the whole matches.) The downside is that the list is generated all at once, not incrementally, so it may not be ideal for large strings that generate many matches.

If you’re looking for a single match, re.match() and re.fullmatch() give you two handy options. If your regular expression is likely to generate many matches, re.finditer() and re.findall() give you two different ways of consuming the results. Here is an example using re.finditer():

import re
text = 'a11 b213 a13 x15 c21 a40 a55 m34'
for match in re.finditer(r'a\d*\W', text):
    print(match[0])
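For comparison, here is a short sketch of how the other matching functions behave on the same kind of input. The sample string and patterns are chosen only for illustration.

```python
import re

text = 'a11 b213 a13'

# re.search() scans the whole string for the first match.
searched = re.search(r'b\d+', text)
print(searched[0])  # b213

# re.match() only tries at the very start of the string.
match_b = re.match(r'b\d+', text)
match_a = re.match(r'a\d+', text)
print(match_b)     # None
print(match_a[0])  # a11

# re.fullmatch() succeeds only if the pattern covers the entire string.
full_long = re.fullmatch(r'a\d+', text)
full_short = re.fullmatch(r'a\d+', 'a11')
print(full_long)      # None
print(full_short[0])  # a11

# re.findall() returns plain strings rather than match objects.
all_matches = re.findall(r'a\d+', text)
print(all_matches)  # ['a11', 'a13']
```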

Regular expressions use many characters in a way that is specific to the regex syntax, such as the dot (.) or square brackets ([ and ]). If you want to search for those characters literally, you will need to escape them with backslashes in your expression. But if you’re working with arbitrary input that you want to escape automatically (for instance, when searching for some user input), you can use re.escape() to transform a string into its regex-escaped version.
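The sketch below shows what can go wrong without escaping. The user_input value is a made-up search term standing in for whatever a user might type.

```python
import re

user_input = '3.5 [USD]'  # hypothetical user-supplied search term
text = 'Price: 3.5 [USD], not 345 U'

# Unescaped, the dot matches any character and [USD] is a character class,
# so the pattern matches text the user never asked for.
unescaped = re.search(user_input, text)
print(unescaped[0])  # 345 U

# re.escape() makes every special character literal.
escaped = re.search(re.escape(user_input), text)
print(escaped[0])    # 3.5 [USD]
```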

Search and replace using Python regex

Regular expressions can also be used to perform search and replace operations, with re.sub(). Here is one example:

import re
text = 'a11 b213 a13 x15 c21 a40 a55 m34'
print(re.sub(r'a(\d*\W)', r'b\1', text))

This call replaces every occurrence of a followed by any number of digits and a non-word character (here, a space) with b, followed by those same digits and that same character. The output is b11 b213 b13 x15 c21 b40 b55 m34.

Note that search and replace typically makes use of some special features in regular expressions. The parenthetical part of the regex is what is called a “capture group,” which is a way to single out and refer to portions of a match. The replacement string, 'b\1', uses the \1 to refer to the first capture group in the match expression — essentially saying, “Insert the contents of that capture group here.”
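A replacement string can refer to more than one capture group. The following sketch, using made-up patterns, swaps two groups and also shows the \g<1> form, which is a way to write a group reference unambiguously when a literal digit comes right after it.

```python
import re

text = 'a11 b213 a13'

# Two capture groups: the letter and the digits. The replacement swaps them.
swapped = re.sub(r'([a-z])(\d+)', r'\2\1', text)
print(swapped)  # 11a 213b 13a

# \g<1> refers to group 1; writing \10 here would mean group 10 instead.
padded = re.sub(r'a(\d+)', r'b\g<1>0', text)
print(padded)   # b110 b213 b130
```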

Match objects in Python regex

Match objects contain information about a particular regex match — the position in the string where the match was found, the contents of any capture groups for the match, and so on. You can work with match objects using these methods:

  • match.group() returns the match from the string. In our first example, this would be a13 followed by a space.
  • match.start() and match.end() return the start and end indexes of the match. These are the same as the start and stop indexes of a Python slice, so you can use them for exactly that purpose if need be. If you want both at once in a tuple, you can use match.span().
  • match.group(x, y) returns capture groups found in a match. Capture groups let you use parentheses to indicate different parts of a match: match.group() or match.group(0) returns the entire match, match.group(1) returns the first capture group, a combination of arguments (match.group(1,2)) produces a tuple with the contents of the listed capture groups, and match.groups() produces all the capture groups in a single tuple.
  • match.groupdict() returns a dictionary of named capture groups. Normally, capture groups are referred to by an index, but you can assign names to them if you want. match.groupdict() lets you refer to those capture groups by name, as you would the contents of any other dictionary.
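Here is a brief sketch that exercises the methods listed above on a single match. The group names letter and number are arbitrary choices made for this example.

```python
import re

text = 'b213 a13 x15'
# (?P<name>...) assigns a name to a capture group.
match = re.search(r'(?P<letter>a)(?P<number>\d+)', text)

print(match.group())                     # a13
print(match.start(), match.end())        # 5 8
print(match.span())                      # (5, 8)
print(text[match.start():match.end()])   # a13
print(match.group(1, 2))                 # ('a', '13')
print(match.groups())                    # ('a', '13')
print(match.groupdict())                 # {'letter': 'a', 'number': '13'}
print(match['number'])                   # 13
```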

Python regex options

When you create a Python regex, you can pass a number of options that control its behavior. Here are some of the most useful:

  • re.IGNORECASE performs case-insensitive matching throughout the regular expression. Normally regexes are case-sensitive, but if you don’t want to manually encode case-insensitivity into your match expression, you can use this option instead.
  • re.MULTILINE changes the way the regular expression handles the tokens for the beginning and end of a string (^ and $, respectively). When enabled, those tokens also match the beginnings and ends of lines within the string. If you are processing text that spans multiple lines and you want to be aware of linebreaks in your regex, use this option.
  • re.DOTALL changes the way the dot (.) character in a regular expression matches text. When enabled, the dot not only matches all text characters, but also newlines.

Here is an example of how these options might be used. Note that these options are flag values, so they’re combined with the bitwise “or” operator (|). Using “and” (&) instead would leave no flags set at all:

import re
# Each entry sits on its own line, separated by \n.
text = 'A11\nb213\na13\nx15\nc21\na40\nA55\nM34'
for match in re.finditer(r'a\d*\W', text, re.IGNORECASE | re.MULTILINE):
    print(match[0])
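The effect of each flag is easier to see with a pattern that uses the ^ and $ anchors. The following sketch, with a made-up three-line string, runs the same pattern under different flag combinations and then shows re.DOTALL letting the dot cross a line break.

```python
import re

text = 'A11\nb213\na13'
pattern = r'^a\d+$'

# No flags: ^ and $ anchor to the whole string, and matching is case-sensitive.
no_flags = re.findall(pattern, text)
# MULTILINE: ^ and $ also match at the start and end of each line.
multiline = re.findall(pattern, text, re.MULTILINE)
# Both flags, combined with bitwise OR.
both = re.findall(pattern, text, re.IGNORECASE | re.MULTILINE)

print(no_flags)   # []
print(multiline)  # ['a13']
print(both)       # ['A11', 'a13']

# By default the dot stops at a newline; re.DOTALL lets it match one.
dot_default = re.search(r'A11.b', text)
dot_all = re.search(r'A11.b', text, re.DOTALL)
print(dot_default)          # None
print(repr(dot_all[0]))     # 'A11\nb'
```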

Precompiling Python regexes

If you’re performing a regular expression match only once in the lifetime of a script, using re.search() or re.match() works fine. But if you’re performing many matches in a script, or doing many matches in a loop, there’s a performance cost associated with handing re the same pattern string over and over again. (re keeps a cache of recently used patterns, which softens that cost but doesn’t eliminate it.) In cases like this, it makes sense to use a precompiled regex.

To create a precompiled regex, use re.compile(). Pass it a regular expression string, and you’ll get back a pattern object whose methods (search(), match(), finditer(), sub(), and so on) mirror the functions in the re library:

import re
strings = ['rally','master','lasso','difficult','easy','manias']
compiled_re = re.compile(r'a.')
for string in strings:
    for match in compiled_re.finditer(string):
        print(match.group())

In this example, we loop through a collection of strings to look in each one for one or more occurrences of the letter a and the character immediately following it, and then loop through all the matches on that string for that regex. Because we’re using the same regex on each iteration of the loop, it makes sense to create the regex object only once and reuse it.
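A compiled pattern also accepts flags, which are passed once at compile time rather than on every call. The sketch below reuses the a. pattern from the example above with re.IGNORECASE added, and calls a few of the pattern object's methods.

```python
import re

# Flags are supplied once, when the pattern is compiled.
compiled_re = re.compile(r'a.', re.IGNORECASE)

print(compiled_re.pattern)  # a.

first = compiled_re.search('Master')
print(first[0])             # as

found = compiled_re.findall('manias')
print(found)                # ['an', 'as']

replaced = compiled_re.sub('--', 'rally')
print(replaced)             # r--ly

# match() still anchors at the start of the string.
at_start = compiled_re.match('rally')
print(at_start)             # None
```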

Python regex library

The re library included in Python’s standard library isn’t the only regular expression system available for Python. A third-party library, regex, offers some additional functionality. Regex can, for instance, perform case-insensitive matches in Unicode. Its most significant feature, though, is being able to run concurrently — it can perform matching operations outside of Python’s GIL, so regex operations don’t block other Python threads. For casual use, the conventional re library is fine, but look into using regex if you find yourself performing many matches in tight loops.

Copyright © 2021 IDG Communications, Inc.