Python RegEx Operations

By Aniruddha Kalbande - September 26, 2020

Introduction To Python RegEx

The concept of regular expressions was introduced by American mathematician Stephen Cole Kleene in 1951. He described regular languages using a mathematical notation called regular events.

A Python regular expression (RegEx) is a special sequence of characters that defines a pattern for complex string-matching functionality.

Regular expressions (also called REs, regexes, regexps, or regex patterns) are essentially a tiny, highly specialized programming language embedded inside Python and made available through the re module. Using this language, you specify rules for the set of possible strings that you want to match. The regular expression language is relatively small and restricted, so not all possible string processing tasks can be done with it. The sections below show how to define patterns and use them to search strings.

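The simplest way to compare strings in Python does not need a pattern at all: the equality (==) operator checks whether two strings are exactly equal. Regular expressions are for cases where you need to match a pattern rather than one exact string.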
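The difference between an exact comparison and a pattern search can be sketched as follows. The variable names and sample strings are illustrative assumptions, not part of the original article.

```python
import re

first = "Fireblaze"
second = "Fireblaze"

# Equality compares the two strings character by character.
strings_equal = first == second

# A regular expression can instead ask whether a pattern occurs anywhere.
contains_blaze = re.search("blaze", first) is not None

print(strings_equal)   # True
print(contains_blaze)  # True
```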

Applications

Search engines
Search-and-replace dialogs of word processors
Text editors

re Module:

Python has a built-in package called re for working with regular expressions. The re module provides many functions for searching, splitting, and replacing text with patterns.

Import the re module before using it: import re

[] – Square Brackets

Square brackets specify a set of characters to match.

Expression	String	Matched
[xyz]	x	1 match
[xyz]	xy	2 matches
[xyz]	Hey man	1 match (y)
[xyz]	xyz yz yx	7 matches

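You can also specify a range of characters using a hyphen (-) inside square brackets.

For example:

[p-t] is the same as [pqrst].
[5-9] is the same as [56789]. A range covers single characters only, so Python rejects [5-10]: it reads it as the reversed range 5-1 followed by 0.

You can also complement (invert) the character set by placing a caret (^) at the start of the square brackets.

For example:

[^xyz] means any character except x, y, or z.

[^0-9] means any non-digit character.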
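As a small sketch of a range and its complement side by side, the snippet below splits a string into digits and non-digits. The sample string is an assumption chosen for illustration.

```python
import re

sample = "Room 42, Floor 7"

# [0-9] matches any single digit.
digits = re.findall("[0-9]", sample)

# [^0-9] matches any single character that is not a digit.
non_digits = re.findall("[^0-9]", sample)

print(digits)           # ['4', '2', '7']
print(len(non_digits))  # 13
```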

# square bracket
import re

sample = "Fireblaze AI School"

# Find all lowercase characters alphabetically between "a" and "m":

sample_square = re.findall("[a-m]", sample)
print(sample_square)

. – Period

Matches any single character except a newline.

Expression	String	Matched
..	x	No match
..	xy	1 match
..	xyz	1 match (xy)
..	wxyz	2 matches (wx and yz)
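The rows of the table can be checked directly with re.findall(). This is a minimal sketch; the last line uses an extra sample string (an assumption) to show that the period skips newlines by default.

```python
import re

# Each ".." consumes two characters, and matches do not overlap.
no_match = re.findall("..", "x")
one_match = re.findall("..", "xyz")
two_matches = re.findall("..", "wxyz")

# By default the period does not match a newline character.
across_lines = re.findall(".", "a\nb")

print(no_match)      # []
print(one_match)     # ['xy']
print(two_matches)   # ['wx', 'yz']
print(across_lines)  # ['a', 'b']
```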

^ – Caret

Checks whether a string starts with the given characters.

Expression	String	Matched
^x	x	1 match
^x	xy	1 match
^x	zyx	No match
^xy	xyz	1 match
^xy	zyx	No match

import re

sample = "Fireblaze AI School"

# Check if the string starts with 'Fireblaze':

x = re.findall("^Fireblaze", sample)
if x:
  print("Yes, the string starts with 'Fireblaze'")
else:
  print("No match")

$ – Dollar

Checks whether a string ends with the given characters.

Expression	String	Matched
x$	x	1 match
x$	Manx	1 match
x$	Hey man	No match

import re

sample = "Fireblaze AI School"

# Check if the string ends with 'School':

x = re.findall("School$", sample)
if x:
  print("Yes, the string ends with 'School'")
else:
  print("No match")

* – Star

The star symbol matches zero or more occurrences of the pattern to its left.

Expression	String	Matched
gir*l	gil	1 match (zero r)
gir*l	girl	1 match
gir*l	perl	No match

+ – Plus

The plus symbol matches one or more occurrences of the pattern to its left.

Expression	String	Matched
xma+n	xa	No match (no m or n)
xma+n	xman	1 match
xma+n	xmaaan	1 match

import re

txt = "Fireblaze AI School"

# Check if the string contains "Sch" followed by 1 or more "o" characters:

x = re.findall("Scho+", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

['Schoo']
Yes, there is at least one match!

? – Question Mark

The question mark matches zero or one occurrence of the pattern to its left.

Expression	String	Matched
xma?n	xa	No match (no m or n)
xma?n	xman	1 match
xma?n	xmaaan	No match (more than one a)
xma?n	xmn	1 match (zero a)
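To see how the star, plus, and question mark differ, it can help to run all three against the same strings. The sketch below is illustrative only; the strings and the results dictionary are assumptions chosen so that each quantifier behaves differently.

```python
import re

strings = ["xmn", "xman", "xmaaan"]
patterns = ["xma*n", "xma+n", "xma?n"]

# For each pattern, record whether it matches each string.
results = {
    pattern: [bool(re.search(pattern, s)) for s in strings]
    for pattern in patterns
}

for pattern, flags in results.items():
    print(pattern, flags)
# xma*n [True, True, True]
# xma+n [False, True, True]
# xma?n [True, True, False]
```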

{} – Braces

Consider {n,m}. This means at least n and at most m repetitions of the pattern to its left. In the table, [py]{2,3} matches a run of two or three characters, each of which is p or y.

Expression	String	Matched
[py]{2,3}	pqr xyz	No match
[py]{2,3}	pqr xyyz	1 match (yy)
[py]{2,3}	ppqr xyyyz	2 matches (pp and yyy)
[py]{2,3}	ppqr xyyyyz	2 matches (pp and yyy)

braces
import re

sample = "Fireblaze AI School"

# Check if the string contains "az" followed by exactly two "e" characters:

x = re.findall("aze{2}", sample)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")
[]
No match
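The {n,m} form with both bounds can be sketched in the same way. The numeric sample string is an assumption; note how a run of five digits is split because each match takes at most three.

```python
import re

sample = "7 42 123 98765"

# [0-9]{2,3} matches runs of two or three digits, taking three when it can.
runs = re.findall("[0-9]{2,3}", sample)
print(runs)  # ['42', '123', '987', '65']

# {2} alone means exactly two repetitions.
double_o = re.findall("o{2}", "Fireblaze AI School")
print(double_o)  # ['oo']
```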

| – Alternation

The vertical bar is used for alternation. It works as an 'or' operator between two patterns.

Expression	String	Matched
x|y	pqr	No match
x|y	xaz	1 match
x|y	wxypyz	3 matches (x, y, y)
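A short sketch of alternation with re.findall() follows. The second sample, with whole words on each side of the bar, is an assumption added to show that alternatives are not limited to single characters.

```python
import re

# Either alternative may match at each position.
single_chars = re.findall("x|y", "wxypyz")
print(single_chars)  # ['x', 'y', 'y']

# Alternatives can be longer than one character.
words = re.findall("cat|dog", "cat, dog, bird")
print(words)  # ['cat', 'dog']
```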

() – Group

Parentheses are used to group sub-patterns.

For example, (x|y|z)ab matches any string that contains x, y, or z followed by ab.

Expression	String	Matched
(x|y|z)ab	xy ab	No match
(x|y|z)ab	xyab	1 match (at yab)
(x|y|z)ab	xab zyab	2 matches (at xab and yab)
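One detail worth knowing when trying grouped patterns: if a pattern contains a group, re.findall() returns the text captured by the group rather than the whole match. The sketch below, using an assumed sample string, shows two common ways to get the full matches instead.

```python
import re

sample = "xab zyab"

# With one capturing group, findall returns only the group's text.
group_only = re.findall("(x|y|z)ab", sample)
print(group_only)  # ['x', 'y']

# finditer yields match objects, so group() gives the whole match.
full_matches = [m.group() for m in re.finditer("(x|y|z)ab", sample)]
print(full_matches)  # ['xab', 'yab']

# A non-capturing group (?:...) groups without capturing.
non_capturing = re.findall("(?:x|y|z)ab", sample)
print(non_capturing)  # ['xab', 'yab']
```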

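\ – Backslash

The backslash is used to escape characters, including all metacharacters, so that they are matched literally.

For example, \$x matches if a string contains $ followed by x. Here $ is not interpreted by the regex engine in a special way.

If you are unsure whether a punctuation character has a special meaning, you can put \ in front of it. Do not do this with letters or digits, because a backslash followed by a letter or digit often forms a special sequence such as \d.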
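Escaping can be sketched as follows. The sample strings are assumptions; re.escape() is a standard re function that adds the backslashes for you when a whole string should be matched literally.

```python
import re

# \$ matches a literal dollar sign instead of "end of string".
prices = re.findall(r"\$[0-9]+", "Price: $25 or $300")
print(prices)  # ['$25', '$300']

literal = "1+1=2"
text = "so 1+1=2 holds"

# Unescaped, the + is a quantifier, so the pattern looks for "11=2".
unescaped_hit = re.search(literal, text)
print(unescaped_hit)  # None

# re.escape() escapes the metacharacters so the text is matched as written.
escaped_hit = re.search(re.escape(literal), text)
print(escaped_hit.group())  # 1+1=2
```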

backslash
import re

sample = "That will be 123 rupees"

#Find all digit characters:

x = re.findall(r"\d", sample)
print(x)

['1', '2', '3']

Special Sequences

Special sequences make commonly used patterns easier to write.

The following special sequences are available:

\A, \B, \b, \D, \d, \S, \s, \W, \w, \Z.

\A – Matches if the specified characters are at the start of a string.

Expression	String	Matched
\Aman	man has	Match
\Aman	in man	No match

\B – Matches if the specified characters are not at the beginning or end of a word.

Expression	String	Matched
\Bfoo	football	No match
\Bfoo	A football	No match
\Bfoo	afootball	Match

\b – The opposite of \B. Matches if the specified characters are at the beginning or end of a word.

Expression	String	Matched
\bfoo	football	Match
\bfoo	A football	Match
\bfoo	afootball	No match
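The two boundary sequences can be compared on the strings from the tables. This is a minimal sketch; the variable names are assumptions. The patterns are written as raw strings because in an ordinary Python string "\b" is the backspace character.

```python
import re

# \b requires a word boundary before "foo".
at_word_start = re.findall(r"\bfoo", "football")
inside_word = re.findall(r"\bfoo", "afootball")
print(at_word_start)  # ['foo']
print(inside_word)    # []

# \B requires that there is no word boundary before "foo".
not_boundary = re.findall(r"\Bfoo", "afootball")
boundary_rejected = re.findall(r"\Bfoo", "football")
print(not_boundary)       # ['foo']
print(boundary_rejected)  # []
```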

\D – Matches any character that is not a decimal digit. Same as [^0-9].

Expression	String	Matched
\D	2xy56”90	3 matches (the non-digits)
\D	9876	No match

\d – The opposite of \D. Matches any decimal digit. Same as [0-9].

Expression	String	Matched
\d	54xyz3	3 matches (the digits)
\d	Data science	No match

\S – Matches any non-whitespace character.

It is similar to [^ \t\n\r\f\v].

Expression	String	Matched
\S	x y	2 matches
\S	(whitespace only)	No match

\s – Matches any whitespace character.

It is similar to [ \t\n\r\f\v].

Expression	String	Matched
\s	Machine Learning	1 match
\s	MachineLearning	No match

\W – Matches any non-alphanumeric character.

It is similar to [^a-zA-Z0-9_].

Expression	String	Matched
\W	1a2%c	1 match
\W	Machine Learning	1 match (the space)

\w – Matches any alphanumeric character (i.e. digits and letters).

It is similar to [a-zA-Z0-9_].

The underscore _ is also considered an alphanumeric character.

Expression	String	Matched
\w	12$”: ;a	3 matches
\w	%”>!	No match
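The six character-class sequences come in complementary pairs, which is easy to see when all of them run over one string. The sample string below is an assumption made up for this sketch.

```python
import re

sample = "ML 101!"

digits = re.findall(r"\d", sample)       # decimal digits
non_digits = re.findall(r"\D", sample)   # everything else
spaces = re.findall(r"\s", sample)       # whitespace
non_spaces = re.findall(r"\S", sample)   # non-whitespace
word_chars = re.findall(r"\w", sample)   # letters, digits, underscore
non_word = re.findall(r"\W", sample)     # everything else

print(digits)      # ['1', '0', '1']
print(non_digits)  # ['M', 'L', ' ', '!']
print(spaces)      # [' ']
print(non_spaces)  # ['M', 'L', '1', '0', '1', '!']
print(word_chars)  # ['M', 'L', '1', '0', '1']
print(non_word)    # [' ', '!']
```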

\Z – Matches if the specified characters are at the end of a string.

Expression	String	Matched
ML\Z	I like ML	1 match
ML\Z	I like ML program	No match
ML\Z	ML is good	No match
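The anchors \A and \Z can be checked with re.search(), which returns None when there is no match. This is a minimal sketch using the strings from the tables; the variable names are assumptions.

```python
import re

# \A anchors the pattern to the start of the string.
starts_with_man = bool(re.search(r"\Aman", "man has"))
man_later = bool(re.search(r"\Aman", "in man"))
print(starts_with_man)  # True
print(man_later)        # False

# \Z anchors the pattern to the end of the string.
ends_with_ml = bool(re.search(r"ML\Z", "I like ML"))
ml_first = bool(re.search(r"ML\Z", "ML is good"))
print(ends_with_ml)  # True
print(ml_first)      # False
```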

Match Object

You can list the methods and attributes of a match object using the dir() function.

Some commonly used methods and attributes are explained below.

match.group()

The group() method returns the part of the string where there is a match.

match object
import re

string = '39801 356, 2102 1111'

# Three digit number followed by space followed by two digit number
pattern = r'(\d{3}) (\d{2})'

# match variable contains a Match object.
match = re.search(pattern, string) 

if match:
  print(match.group())
else:
  print("pattern not found")

# Output: 801 35

801 35
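Because the pattern contains two parenthesized groups, each captured part can also be read on its own. The sketch below reuses the string and pattern from the example; the names first_part, second_part, and both are assumptions.

```python
import re

string = '39801 356, 2102 1111'
pattern = r'(\d{3}) (\d{2})'

match = re.search(pattern, string)

if match:
    first_part = match.group(1)   # text captured by the first group
    second_part = match.group(2)  # text captured by the second group
    both = match.groups()         # all captured groups as a tuple

    print(first_part)   # 801
    print(second_part)  # 35
    print(both)         # ('801', '35')
```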

match.start(), match.end(), and match.span()

The start() method returns the index where the matched substring starts.

The end() method returns the index just past the end of the matched substring.

The span() method returns a tuple containing the start and end indexes.

match start and end
match.start()
2
match.end()
8

match span
match.span()
(2, 8)
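The indexes returned by span() can be used to slice the original string, which gives back the matched text. The following sketch repeats the earlier search so that it runs on its own; the variable names are assumptions.

```python
import re

string = '39801 356, 2102 1111'
match = re.search(r'(\d{3}) (\d{2})', string)

# Slicing with the span reproduces match.group().
start, end = match.span()
matched_text = string[start:end]
print(matched_text)  # 801 35

# span() also accepts a group number.
first_group_span = match.span(1)
second_group_span = match.span(2)
print(first_group_span)   # (2, 5)
print(second_group_span)  # (6, 8)
```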

match.re and match.string

The re attribute of a match object returns the compiled regular expression object that produced the match.

The string attribute returns the string that was passed to the matching function.

match re
match.re
re.compile(r'(\d{3}) (\d{2})')
match string
match.string
'39801 356, 2102 1111'

r prefix:

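When an r or R prefix is placed before a string literal, the string is raw: backslashes are kept as ordinary characters instead of starting escape sequences.

For example, '\n' is a single newline character, while r'\n' is two characters: a backslash followed by n.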

r prefix
import re

string = '\n and \r are escape sequences.'

result = re.findall(r'[\n\r]', string) 
print(result)

['\n', '\r']
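The effect of the prefix is easiest to see by comparing string lengths. This is a small illustrative sketch; the variable names are assumptions.

```python
# An ordinary string turns \n into one newline character.
normal_length = len('\n')

# A raw string keeps the backslash and the n as two characters.
raw_length = len(r'\n')

print(normal_length)  # 1
print(raw_length)     # 2

# Without the prefix, a regex backslash has to be doubled.
same_pattern = '\\d' == r'\d'
print(same_pattern)  # True
```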
