Python RegEx

Introduction

A Python RegEx (regular expression) is a sequence of characters that defines a search pattern for finding a string or a set of strings. It can test whether text matches a given pattern, and it can break a pattern into one or more sub-patterns. Python provides regex support through the built-in re module. A typical use is searching a string for a pattern with re.search(), which returns the first match, or None if there is no match.

Meta Characters Used in RegEx

Metacharacters are characters with a special meaning in a regular expression, and the functions of the re module rely on them to interpret a pattern. The following is a list of metacharacters.

MetaCharacter   Description
\               Drops the special meaning of the character following it
[]              Represents a character class
^               Matches the beginning
$               Matches the end
.               Matches any character except newline
|               Means OR (matches any of the patterns separated by it)
?               Matches zero or one occurrence
*               Any number of occurrences (including 0 occurrences)
+               One or more occurrences
{}              Indicates the number of occurrences of the preceding regex to match
()              Encloses a group of regex

MetaCharacter: Backslash “\”

The backslash (\) removes the special meaning of the character that follows it, so it acts as a way of escaping metacharacters. For example, the dot (.) is one of the metacharacters (as shown in the above table), so a search for a plain dot matches any character rather than a literal dot. To match a literal dot, place the backslash (\) right before the dot (.) to remove its special meaning. For a better understanding, consider the example below.

Code Example:

import re

string = 'softhunt.net'

# without using \
match = re.search(r'.', string)
print(match)

# using \
match = re.search(r'\.', string)
print(match)

Output:

<re.Match object; span=(0, 1), match='s'>
<re.Match object; span=(8, 9), match='.'>

MetaCharacter: Square Brackets “[]”

Square Brackets ([]) represent a character class made up of a collection of characters that we want to match. The character class [abc], for example, will match any single a, b, or c.

We can also specify a range of characters using - inside the square brackets. For example,

  • [0-3] is the same as [0123]
  • [a-c] is the same as [abc]

We can also invert the character class using the caret(^) symbol. For example,

  • [^0-3] means any character except 0, 1, 2, or 3
  • [^a-c] means any character except a, b, or c
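
The character classes described above can be tried out with re.findall(). This is a minimal sketch; the sample strings are made up for illustration and are not part of the original article.

```python
import re

text = "room 204, bay c3"

# [abc] matches any single character that is a, b, or c.
letters = re.findall(r"[abc]", text)

# [0-3] matches any single digit from 0 through 3.
low_digits = re.findall(r"[0-3]", text)

# [^0-3] matches any single character that is NOT 0, 1, 2, or 3.
others = re.findall(r"[^0-3]", "2a4")

print(letters)     # ['b', 'a', 'c']
print(low_digits)  # ['2', '0', '3']
print(others)      # ['a', '4']
```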

MetaCharacter: Caret “^”

Caret (^) symbol matches the beginning of the string i.e. checks whether the string starts with the given character(s) or not. For example,

  • ^s will check if the string starts with s such as softhunt, sony, science, etc.
  • ^so will check if the string starts with so such as softhunt, sony, solona etc.

MetaCharacter: Dollar “$”

Dollar($) symbol matches the end of the string i.e checks whether the string ends with the given character(s) or not. For example,

  • t$ will check for the string that ends with t such as softhunt, Ankit, Ranjeet, etc.
  • nt$ will check for the string that ends with nt such as softhunt, acknowledgement, entertainment, etc.
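
As a small illustration of the two anchors, the sketch below filters a word list with ^so and nt$. The word list is an assumption chosen to mirror the examples above.

```python
import re

words = ["softhunt", "sony", "science", "entertainment", "table"]

# ^so only matches when the string starts with "so".
starts_with_so = [w for w in words if re.search(r"^so", w)]

# nt$ only matches when the string ends with "nt".
ends_with_nt = [w for w in words if re.search(r"nt$", w)]

print(starts_with_so)  # ['softhunt', 'sony']
print(ends_with_nt)    # ['softhunt', 'entertainment']
```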

MetaCharacter: Dot “.”

Dot (.) symbol matches any single character except the newline character (\n). For example,

  • a.b will check for the string that contains any character at the place of the dot such as acb, acbd, abbb, etc
  • .. will check if the string contains at least 2 characters
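
The behaviour of the dot can be checked with a few short searches. This is a sketch with made-up inputs; the re.DOTALL flag shown at the end is not covered in this article, but it is a standard flag of the re module that lets the dot match a newline as well.

```python
import re

# The dot stands for exactly one character, so "a.b" needs something between a and b.
one_between = bool(re.search(r"a.b", "acb"))
none_between = bool(re.search(r"a.b", "ab"))

# By default the dot does not match a newline character.
newline_between = bool(re.search(r"a.b", "a\nb"))

# With the re.DOTALL flag the dot matches a newline as well.
newline_dotall = bool(re.search(r"a.b", "a\nb", flags=re.DOTALL))

print(one_between, none_between, newline_between, newline_dotall)
# prints: True False False True
```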

MetaCharacter: Or “|”

Or symbol works as the or operator meaning it checks whether the pattern before or after the or symbol is present in the string or not. For example

  • a|b will match any string that contains a or b such as acd, bcd, abcd, etc.
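
A short sketch of the or operator follows. The sample strings and the cat|dog pattern are illustrative assumptions; the second part shows that | separates whole sub-patterns, not just single characters.

```python
import re

samples = ["acd", "bcd", "xyz"]

# a|b matches if the string contains an "a" or a "b" anywhere.
has_a_or_b = {s: bool(re.search(r"a|b", s)) for s in samples}

# The alternatives can be longer than one character.
animals = re.findall(r"cat|dog", "cat, dog, cow")

print(has_a_or_b)  # {'acd': True, 'bcd': True, 'xyz': False}
print(animals)     # ['cat', 'dog']
```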

MetaCharacter: Question Mark “?”

Question mark (?) matches zero or one occurrence of the regex preceding it, i.e. the preceding element is optional. For example,

  • ab?c will be matched for the string ac, acb, dabc but will not be matched for abbc because there are two b. Similarly, it will not be matched for abdc because b is not followed by c.

MetaCharacter: Star “*”

Star (*) symbol matches zero or more occurrences of the regex preceding the * symbol. For example,

  • ab*c will be matched for the string ac, abc, abbbc, dabc, etc. but will not be matched for abdc because b is not followed by c.

MetaCharacter: Plus “+”

Plus (+) symbol matches one or more occurrences of the regex preceding the + symbol. For example,

  • ab+c will be matched for the string abc, abbc, dabc, but will not be matched for ac, abdc because there is no b in ac and b is not followed by c in abdc.
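
To compare the three quantifiers ?, * and + side by side, the sketch below runs ab?c, ab*c and ab+c against the same four sample strings. The helper dictionary is only for illustration.

```python
import re

samples = ["ac", "abc", "abbc", "abdc"]

# For each pattern, collect the sample strings in which it finds a match.
results = {
    pattern: [s for s in samples if re.search(pattern, s)]
    for pattern in (r"ab?c", r"ab*c", r"ab+c")
}

print(results["ab?c"])  # ['ac', 'abc']          zero or one b
print(results["ab*c"])  # ['ac', 'abc', 'abbc']  zero or more b
print(results["ab+c"])  # ['abc', 'abbc']        one or more b
```

None of the patterns matches abdc, because in that string b is never directly followed by c.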

MetaCharacter: Braces “{}”

Braces {m,n} match from m to n repetitions (both inclusive) of the preceding regex. Do not put a space after the comma, because Python then treats the braces as literal characters. For example,

  • a{2,4} will be matched for the strings aaab, baaaac, gaad, but will not be matched for strings like abc, bc because there is only one a or no a in those cases.
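
The sketch below applies a{2,4} to the strings from the example above. The last check is an illustration of why the pattern is best written without a space: with a space inside the braces, Python's re module reads the braces as literal text rather than as a repetition count.

```python
import re

samples = ["aaab", "baaaac", "gaad", "abc", "bc"]

# a{2,4} matches between two and four consecutive "a" characters.
found = {s: re.findall(r"a{2,4}", s) for s in samples}

# With a space after the comma the braces are treated as literal characters,
# so the pattern no longer means "two to four a's".
with_space = re.search(r"a{2, 4}", "aaa")

print(found)
# {'aaab': ['aaa'], 'baaaac': ['aaaa'], 'gaad': ['aa'], 'abc': [], 'bc': []}
print(with_space)  # None
```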

MetaCharacter: Group “()”

Group symbol is used to group sub-patterns. For example,

  • (a|b)cd will match for strings like acd, abcd, gacd, etc.
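
Besides grouping, parentheses also capture the text they matched. The following sketch uses the (a|b)cd pattern from above; the sample strings are taken from the same example.

```python
import re

# group(0) is the whole match, group(1) is what the first pair of parentheses matched.
first = re.search(r"(a|b)cd", "gacd")
second = re.search(r"(a|b)cd", "abcd")

print(first.group(0), first.group(1))    # acd a
print(second.group(0), second.group(1))  # bcd b
```

In abcd the match is bcd, because cd must come directly after the single character matched by the group.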

Python RegEx Special Characters

Special Character   Description
\A                  Matches only at the beginning of the string
\b                  Matches at a word boundary. \b(string) will check for the beginning of the word and (string)\b will check for the ending of the word.
\B                  The opposite of \b, i.e. matches only where there is no word boundary, so the given regex should not start or end a word.
\d                  Matches any decimal digit; for ASCII text this is equivalent to the set class [0-9]
\D                  Matches any non-digit character; for ASCII text this is equivalent to the set class [^0-9]
\s                  Matches any whitespace character
\S                  Matches any non-whitespace character
\w                  Matches any alphanumeric character or underscore; for ASCII text this is equivalent to the class [a-zA-Z0-9_]
\W                  Matches any character that is not alphanumeric or underscore
\Z                  Matches only at the end of the string
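
A few of these special sequences are easier to follow with positions. The sketch below uses re.finditer() to list where cat is found in a made-up sample string under \b, \B, \A and \Z.

```python
import re

text = "cat concat catalog"

# \bcat: "cat" at the beginning of a word.
word_start = [m.start() for m in re.finditer(r"\bcat", text)]

# cat\b: "cat" at the end of a word.
word_end = [m.start() for m in re.finditer(r"cat\b", text)]

# \Bcat: "cat" that does not begin a word.
inside = [m.start() for m in re.finditer(r"\Bcat", text)]

# \A and \Z anchor to the beginning and the end of the whole string.
at_start = bool(re.search(r"\Acat", text))
at_end = bool(re.search(r"cat\Z", text))

print(word_start)  # [0, 11]
print(word_end)    # [0, 7]
print(inside)      # [7]
print(at_start, at_end)  # True False
```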

Python RegEx Library

Regular expressions in Python are handled by a module called re. Using the import statement, we can import this module.

import re

Let’s see various functions provided by this module to work with regex in Python.

Python RegEx Findall – re.findall()

Returns all non-overlapping matches of the pattern in the string as a list of strings. The string is scanned from left to right, and matches are returned in the order they are found.

# A Python program to demonstrate working of findall()
import re

# A sample text string where regular expression is searched.
string = """Todays date is 27 , month is 05 and year is 2022"""

# A sample regular expression to find digits.
regex = r'\d+'

match = re.findall(regex, string)
print(match)

Output:

['27', '05', '2022']

Python RegEx Compile – re.compile()

re.compile() compiles a regular expression into a pattern object, which has methods for searching for pattern matches and performing string replacements.

Code Example 01:

# Module Regular Expression is imported
import re

# compile() creates the regular expression character class [a-e], which is equivalent to [abcde].
# The class [abcde] matches any single 'a', 'b', 'c', 'd', or 'e'.
p = re.compile('[a-e]')

# findall() searches for the regular expression and returns a list of all matches
print(p.findall("Hello, Welcome to Softhunt.net Tutorial Website"))

Output:

['e', 'e', 'c', 'e', 'e', 'a', 'e', 'b', 'e']

Code Example 02: \d matches a single digit, and \d+ matches a run of one or more digits

import re

# \d is equivalent to [0-9].
p = re.compile(r'\d')
print(p.findall("I went to him at 9 A.M. on 5th March 1999"))

# \d+ matches a run of one or more
# consecutive digits
p = re.compile(r'\d+')
print(p.findall("I went to him at 9 A.M. on 5th July 1999"))

Output:

['9', '5', '1', '9', '9', '9']
['9', '5', '1999']

Code Example 03:

import re

# \w is equivalent to [a-zA-Z0-9_].
p = re.compile(r'\w')
print(p.findall("He said * in some_lang."))

# \w+ matches a run of alphanumeric characters (including underscore).
p = re.compile(r'\w+')
print(p.findall("I went to him at 9 A.M., he said *** in some_language."))

# \W matches non-alphanumeric characters.
p = re.compile(r'\W')
print(p.findall("he said *** in some_language."))

Output:

['H', 'e', 's', 'a', 'i', 'd', 'i', 'n', 's', 'o', 'm', 'e', '_', 'l', 'a', 'n', 'g']
['I', 'went', 'to', 'him', 'at', '9', 'A', 'M', 'he', 'said', 'in', 'some_language']
[' ', ' ', '*', '*', '*', ' ', ' ', '.']

Code Example 04:

import re

# '*' matches zero or more occurrences of the preceding character, so 'so*' matches 's', 'so', 'soo', ...
p = re.compile(r'so*')
print(p.findall("softhuntsamsungsolonaspidersonysorry"))

Output:

['so', 's', 's', 'so', 's', 'so', 'so']

Python RegEx Split – re.split()

Splits a string at each occurrence of a character or pattern and returns the pieces between the matches as a list. If the pattern is not found, the list contains the whole string as its only element.

Syntax:

re.split(pattern, string, maxsplit=0, flags=0)

The first argument, pattern, is the regular expression. The second, string, is the string that is searched for the pattern and split. maxsplit defaults to 0, which means there is no limit on the number of splits; if a nonzero value is given, at most that many splits occur. With maxsplit=1, the string is split only once, resulting in a list of length 2. flags is optional and changes how the pattern is matched; for example, with flags=re.IGNORECASE the split ignores case, so lowercase and uppercase letters are treated the same.

Code Example 01:

from re import split

# '\W+' denotes one or more non-alphanumeric characters.
# On finding ',' or whitespace ' ', split() splits the string at that point.
print(split(r'\W+', 'Words, words , Words'))
print(split(r'\W+', "Word's words Words"))

# Here ':', ' ', ',' are not alphanumeric, so they are the points where splitting occurs
print(split(r'\W+', 'On 5th Jan 1999, at 11:02 AM'))

# '\d+' denotes one or more digits. Splitting occurs at '5', '1999', '11', '02' only
print(split(r'\d+', 'On 5th Jan 1999, at 11:02 AM'))

Output:

['Words', 'words', 'Words']
['Word', 's', 'words', 'Words']
['On', '5th', 'Jan', '1999', 'at', '11', '02', 'AM']
['On ', 'th Jan ', ', at ', ':', ' AM']

Code Example 02:

import re

# Splitting will occur only once, at '05', so the returned list will have length 2
print(re.split(r'\d+', 'On 05th Jan 1999, at 11:02 AM', maxsplit=1))

# 'Boy' and 'boy' will be treated the same when flags = re.IGNORECASE
print(re.split('[a-f]+', 'Aey, Boy oh boy, come here', flags=re.IGNORECASE))
print(re.split('[a-f]+', 'Aey, Boy oh boy, come here'))

Output:

['On ', 'th Jan 1999, at 11:02 AM']
['', 'y, ', 'oy oh ', 'oy, ', 'om', ' h', 'r', '']
['A', 'y, Boy oh ', 'oy, ', 'om', ' h', 'r', '']

Python RegEx Substitute – re.sub()

The name sub stands for substitute. The regular expression pattern is searched for in the provided string (3rd parameter), and each match is replaced by repl (2nd parameter). count sets the maximum number of replacements; the default of 0 replaces every match.

Syntax:

re.sub(pattern, repl, string, count=0, flags=0)

Code Example 01:

import re

# The pattern 'ub' is matched against "Subject" and "Uber". Because case is ignored
# through the flag, 'ub' matches twice: 'ub' in "Subject" and 'Ub' in "Uber"
# are both replaced by '~*'.
print(re.sub('ub', '~*', 'Subject has Uber booked already',
			flags=re.IGNORECASE))

# With case sensitivity, 'Ub' in "Uber" will not be replaced.
print(re.sub('ub', '~*', 'Subject has Uber booked already'))

# As count has been given the value 1, at most one replacement occurs
print(re.sub('ub', '~*', 'Subject has Uber booked already',
			count=1, flags=re.IGNORECASE))

# The 'r' prefix makes the pattern a raw string; \s matches a whitespace character
# on each side of AND.
print(re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam',
			flags=re.IGNORECASE))

Output:

S~*ject has ~*er booked already
S~*ject has Uber booked already
S~*ject has Uber booked already
Baked Beans & Spam

Python RegEx Subn – re.subn()

Except for what it returns, subn() is identical to sub() in every respect. Rather than just the string, it returns a tuple containing the new string and the number of replacements made.

Syntax:

re.subn(pattern, repl, string, count=0, flags=0)

Code Example 01:

import re

print(re.subn('ub', '~*', 'Subject has Uber booked already'))

t = re.subn('ub', '~*', 'Subject has Uber booked already',
			flags=re.IGNORECASE)
print(t)
print(len(t))

# This will give same output as sub() would have
print(t[0])

Output:

('S~*ject has Uber booked already', 1)
('S~*ject has ~*er booked already', 2)
2
S~*ject has ~*er booked already

Python RegEx Escape – re.escape()

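Returns a copy of the string with the characters that have a special meaning in regular expressions backslashed; helpful for matching an arbitrary literal string that may contain regular expression metacharacters. Since Python 3.7 only such special characters are escaped, while older versions escaped nearly every non-alphanumeric character.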

Syntax:

re.escape(string)

Code Example 01:

import re

# escape() returns the string with a backslash '\' before each character that is special in a regex.
# In the 1st case only the spaces ' ' are escaped.
# In the 2nd case the spaces, '[', '-', ']', the tab and the caret '^' are escaped; the comma is left as-is.
print(re.escape("This is Awesome even 1 AM"))
print(re.escape("I Asked what is this [a-9], he said \t ^WoW"))

Output:

This\ is\ Awesome\ even\ 1\ AM
I\ Asked\ what\ is\ this\ \[a\-9\],\ he\ said\ \	\ \^WoW
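
To show why escaping matters when searching, the sketch below looks for the literal text 1.5 with and without re.escape(). The variable names and the sample text are assumptions made for this illustration.

```python
import re

literal = "1.5"
text = "prices: 1.5, 125, 1x5"

# Used directly as a pattern, the dot matches any character.
unescaped = re.findall(literal, text)

# After re.escape() the dot is matched literally.
escaped = re.findall(re.escape(literal), text)

print(unescaped)  # ['1.5', '125', '1x5']
print(escaped)    # ['1.5']
```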

Python RegEx Search – re.search()

If the pattern does not match, this function returns None; otherwise it returns a re.Match object with information about the matching part of the string. Because this function stops after the first match, it is better suited for testing a regular expression than for extracting all the data.

Code Example 01: Searching for an occurrence of the pattern

# A Python program to demonstrate working of re.search().
import re

# Let's use a regular expression to match a date string in the form of
# a month name followed by a day number
regex = r"([a-zA-Z]+) (\d+)"

match = re.search(regex, "I was born on march 5")

if match is not None:

	# We reach here when the expression "([a-zA-Z]+) (\d+)" matches the date string.
	# This will print 14, 21, since the match starts at index 14 and ends just before index 21.
	print("Match at index %s, %s" % (match.start(), match.end()))

	# We use the group() method to get the match and the captured groups.
	# match.group(0), or match.group(), always returns the fully matched string.
	# match.group(1), match.group(2), ... return the capture groups in order
	# from left to right. So this will print "march 5"
	print("Full match: %s" % (match.group(0)))

	# So this will print "march"
	print("Month: %s" % (match.group(1)))

	# So this will print "5"
	print("Day: %s" % (match.group(2)))

else:
	print("The regex pattern does not match.")

Output:

Match at index 14, 21
Full match: march 5
Month: march
Day: 5

Match Object

A Match object contains all of the information about the search and the result, and None is delivered if no match is discovered. Let’s look at some of the match object’s most regularly used methods and properties.

match.re attribute returns the regular expression passed and match.string attribute returns the string passed.

Code Example 01: Getting the string and the regex of the matched object

import re
 
s = "Welcome to Softhunt"
 
# here res is the match object
res = re.search(r"\bS", s)
 
print(res.re)
print(res.string)

Output:

re.compile('\\bS')
Welcome to Softhunt

Code Example 02: Getting index of matched object

  • start() method returns the starting index of the matched substring
  • end() method returns the ending index of the matched substring
  • span() method returns a tuple containing the starting and the ending index of the matched substring

import re
 
s = "Welcome to Softhunt"
 
# here res is the match object
res = re.search(r"\bS", s)
 
print(res.start())
print(res.end())
print(res.span())

Output:

11
12
(11, 12)

Conclusion

That’s all for this article. If anything is unclear, contact us through our website, by email at [email protected], or on LinkedIn.
