Python RegEx Operations

Introduction To Python RegEx

The concept of regular expressions was introduced by American mathematician Stephen Cole Kleene in 1951. He described regular languages using his mathematical notation called regular events.

A regular expression (RegEx) in Python is a special sequence of characters that defines a pattern for complex string matching.

Regular expressions are also called REs, regexes, regexps, or regex patterns. They are essentially a small, highly specialized programming language embedded inside Python and made available through the re module. Using this language, you specify the rules for the set of strings you want to match. The regular expression language is relatively small and restricted, so not every string processing task can be done with it. The sections below show how to define patterns and use them to search and manipulate strings.

The simplest way to match strings in Python needs no regular expression at all:

  • To check whether two strings are equal, use the equality (==) operator.

Application

  • Used in Search Engines
  • Search and Replace dialogs of word processors
  • Text editors

re Module:

Python has a built-in package called re for working with regular expressions. The re module provides many functions for working with Python RegEx.

Import re module

import re

  1. [] – Square brackets

Square brackets specify a set of characters to match.

Expression    String       Matched
[xyz]         x            1 match
[xyz]         xy           2 matches
[xyz]         Hey man      1 match (the y in Hey)
[xyz]         xyz yz yx    7 matches

You can also specify a range of characters using a hyphen (-) inside square brackets.

For example:

  • [p-t] is the same as [pqrst].
  • [5-9] is the same as [56789]. A range covers single characters only, so [5-10] is not a valid way to write the numbers 5 to 10.

You can also complement (invert) the character set by placing a caret (^) at the start of the square brackets.

For example:

[^xyz] means any character except x, y, or z.

[^0-9] means any non-digit character.

# square bracket
import re

sample = "Fireblaze AI School"

# Find all lowercase characters alphabetically between "a" and "m":
sample_square = re.findall("[a-m]", sample)
print(sample_square)
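
To see ranges and negation in action, here is a minimal sketch. The test strings and variable names are made up for illustration and are not part of the original example.

```python
import re

# A range inside square brackets matches any single character in that range.
letters_p_to_t = re.findall("[p-t]", "pqrstuv")
digits_5_to_9 = re.findall("[5-9]", "0123456789")

# A leading caret inverts the set: match anything that is NOT a digit.
non_digits = re.findall("[^0-9]", "a1b2c3")

print(letters_p_to_t)
print(digits_5_to_9)
print(non_digits)
```

Each call returns a list with one entry per matched character, so the length of the list is the number of matches.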
  1. . – Period

The period matches any single character (except a newline, by default).

Expression    String    Matched
..            x         No match
..            xy        1 match
..            xyz       1 match
..            wxyz      2 matches
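
The period section has no code of its own, so here is a small sketch that counts matches of the two-character pattern .. with re.findall. The dictionary name period_results is an illustrative choice, not something from the original.

```python
import re

# ".." matches any two consecutive characters (newlines excluded by default).
pattern = ".."

# Map each test string to the list of non-overlapping matches found in it.
period_results = {s: re.findall(pattern, s) for s in ["x", "xy", "xyz", "wxyz"]}

for s, found in period_results.items():
    print(repr(s), "->", found)
```

Because matches do not overlap, "xyz" yields a single match and the trailing z is left over.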
  1. ^ – Caret

The caret checks whether a string starts with a given character or sequence of characters.

Expression    String    Matched
^x            x         1 match
^x            xy        1 match
^x            zyx       No match
^xy           xyz       1 match
^xy           zyx       No match

import re

sample = "Fireblaze AI School"

# Check if the string starts with 'Fireblaze':
x = re.findall("^Fireblaze", sample)
if x:
  print("Yes, the string starts with 'Fireblaze'")
else:
  print("No match")
  1. $ – Dollar

The dollar sign checks whether a string ends with a given character or sequence of characters.

Expression    String     Matched
x$            x          1 match
x$            Manx       1 match
x$            Hey man    No match

import re

sample = "Fireblaze AI School"

# Check if the string ends with 'School':
x = re.findall("School$", sample)
if x:
  print("Yes, the string ends with 'School'")
else:
  print("No match")
  1. * – Star

The star symbol matches zero or more occurrences of the pattern to its left.

Expression    String    Matched
gir*l         gi        No match (no l character)
gir*l         girl      1 match
gir*l         perl      No match
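
The "zero or more" behaviour is easiest to see side by side. The sketch below uses the pattern gir*l from the table. The extra test strings gil and girrrl are assumptions added for illustration.

```python
import re

# "gir*l" means: "gi", then zero or more "r" characters, then "l".
star_pattern = "gir*l"

star_results = {
    s: re.findall(star_pattern, s) for s in ["gil", "girl", "girrrl", "perl"]
}

for s, found in star_results.items():
    print(repr(s), "->", found)
```

The string gil matches because the star allows the r to be absent entirely.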
  1. + – Plus

The plus symbol matches one or more occurrences of the pattern to its left.

Expression    String    Matched
xma+n         xa        No match (no m character)
xma+n         xman      1 match
xma+n         xmaaan    1 match

import re

txt = "Fireblaze AI School"

# Check if the string contains "Sch" followed by one or more "o" characters:
x = re.findall("Scho+", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

['Schoo']
Yes, there is at least one match!
  1. ? – Question Mark

The question mark matches zero or one occurrence of the pattern to its left.

Expression    String    Matched
xma?n         xa        No match (no m character)
xma?n         xman      1 match
xma?n         xmaaan    No match (more than one a)
xma?n         xmn       1 match (zero a characters)
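
A short sketch can confirm how the optional character behaves. It assumes the pattern xma?n and three illustrative test strings. The name optional_results is hypothetical.

```python
import re

# "xma?n" means: "xm", then zero or one "a", then "n".
optional_pattern = "xma?n"

# re.search returns a match object or None, so bool() gives a yes/no answer.
optional_results = {
    s: bool(re.search(optional_pattern, s)) for s in ["xmn", "xman", "xmaaan"]
}

for s, matched in optional_results.items():
    print(repr(s), "->", "match" if matched else "no match")
```

Both xmn and xman match, while xmaaan does not because ? allows at most one a.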
  1. {} – Braces

Consider the form {n,m}. It means at least n and at most m repetitions of the pattern to its left.

Expression    String         Matched
[py]{2,3}     pqr xyz        No match
[py]{2,3}     pqr xyyz       1 match (at yy)
[py]{2,3}     ppqr xyyyz     2 matches (at pp and yyy)
[py]{2,3}     ppqr xyyyyz    2 matches (at pp and yyy; the fourth y is left over)

# braces
import re

sample = "Fireblaze AI School"

# Check if the string contains "az" followed by exactly two "e" characters:
x = re.findall("aze{2}", sample)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

[]
No match
  1. | – Alternation

The vertical bar is used for alternation. It works as an 'or' operator between patterns.

Expression    String    Matched
x|y           pqr       No match
x|y           xaz       1 match
x|y           wxypyz    3 matches
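
As a quick check of alternation, the following sketch runs the pattern x|y over the three strings from the table. The dictionary name alt_results is an illustrative choice.

```python
import re

# "x|y" matches either the character "x" or the character "y".
alt_pattern = "x|y"

alt_results = {s: re.findall(alt_pattern, s) for s in ["pqr", "xaz", "wxypyz"]}

for s, found in alt_results.items():
    print(repr(s), "->", len(found), "match(es):", found)
```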
  1. () – Group

Parentheses are used to group sub-patterns.

For example, (x|y|z)ab matches any string that contains x, y, or z followed by ab.

Expression    String       Matched
(x|y|z)ab     xy ab        No match
(x|y|z)ab     xyab         1 match (at yab)
(x|y|z)ab     xay cabxy    No match (ab is never preceded by x, y, or z)
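
Groups change what re.findall returns, which can be surprising. The sketch below is one way to list the full matched text for (x|y|z)ab. The helper full_matches and the test string "xab zab" are hypothetical additions for illustration.

```python
import re

group_pattern = "(x|y|z)ab"

def full_matches(pattern, text):
    # finditer yields match objects; group() gives the whole matched text.
    return [m.group() for m in re.finditer(pattern, text)]

group_results = {
    s: full_matches(group_pattern, s) for s in ["xy ab", "xyab", "xab zab"]
}

# With one capturing group, findall returns only the captured part.
captured_only = re.findall(group_pattern, "xyab")

print(group_results)
print(captured_only)
```

Using re.finditer here is a design choice: it keeps the whole match available even when the pattern contains a capturing group.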
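  1. \ – Backslash

The backslash is used to escape various characters, including all metacharacters.

For example, \$x matches if a string contains $ followed by x. Here, $ is not interpreted by the RegEx engine in a special way.

If you are not sure whether a punctuation character has a special meaning, you can simply put \ in front of it.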

# backslash
import re

sample = "That will be 123 rupees"

# Find all digit characters:
x = re.findall(r"\d", sample)
print(x)

['1', '2', '3']

Special Sequences

Special sequences make commonly used patterns easier to write.

The following is the list of special sequences:

\A, \B, \b, \D, \d, \S, \s, \W, \w, \Z.

  • \A – matches if the specified characters are at the start of a string.

Expression    String     Matched
\Aman         man has    Match
\Aman         in man     No match

  • \B – matches if the specified characters are not at the beginning or end of a word.

Expression    String        Matched
\Bfoo         football      No match
\Bfoo         A football    No match
\Bfoo         afootball     Match

  • \b – opposite of \B; matches if the specified characters are at the beginning or end of a word.

Expression    String        Matched
\bfoo         football      Match
\bfoo         A football    Match
\bfoo         afootball     No match

  • \D – matches any character that is not a decimal digit. Equivalent to [^0-9].

Expression    String      Matched
\D            2xy56”90    3 matches (the non-digit characters)
\D            9876        No match

  • \d – opposite of \D; matches any decimal digit.

Expression    String          Matched
\d            54xyz3          3 matches (the digits)
\d            Data science    No match

  • \S – matches any non-whitespace character. Equivalent to [^ \t\n\r\f\v].

Expression    String           Matched
\S            x y              2 matches
\S            (spaces only)    No match

  • \s – matches any whitespace character. Equivalent to [ \t\n\r\f\v].

Expression    String              Matched
\s            Machine Learning    1 match
\s            MachineLearning     No match

  • \W – matches any non-alphanumeric character. Equivalent to [^a-zA-Z0-9_].

Expression    String              Matched
\W            1a2%c               1 match
\W            Machine Learning    1 match (the space)

  • \w – matches any alphanumeric character (digits and letters). Equivalent to [a-zA-Z0-9_].

Note that the underscore _ is also counted as an alphanumeric character here.

Expression    String      Matched
\w            12$”: ;a    3 matches
\w            %”>!        No match

  • \Z – matches if the specified characters are at the end of a string.

Expression    String               Matched
ML\Z          I like ML            1 match
ML\Z          I like ML program    No match
ML\Z          ML is good           No match
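
To try several special sequences at once, here is a minimal sketch. The sample string "Room 42, Block_B" and all variable names are invented for illustration. Raw strings (r"...") are used so Python passes the backslashes through to the re module unchanged.

```python
import re

sample = "Room 42, Block_B"

digits = re.findall(r"\d", sample)          # decimal digits
spaces = re.findall(r"\s", sample)          # whitespace characters
non_word = re.findall(r"\W", sample)        # anything outside [a-zA-Z0-9_]

at_word_start = re.findall(r"\bBlock", sample)   # "Block" begins a word
not_word_start = re.findall(r"\BBlock", sample)  # so \B finds nothing

starts_with_room = bool(re.search(r"\ARoom", sample))
ends_with_b = bool(re.search(r"B\Z", sample))

print(digits, spaces, non_word)
print(at_word_start, not_word_start)
print(starts_with_room, ends_with_b)
```

The underscore in Block_B does not show up in non_word, because \w (and therefore \W) treats _ as a word character.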

Match Object

You can get the methods and attributes of a match object using the dir() function.

Some commonly used methods and attributes of a match object are:

  • match.group()

The group() method returns the part of the string where there is a match.

# match object
import re

string = '39801 356, 2102 1111'

# Three-digit number followed by a space followed by a two-digit number
pattern = r'(\d{3}) (\d{2})'

# The match variable contains a Match object.
match = re.search(pattern, string)

if match:
  print(match.group())
else:
  print("pattern not found")

# Output: 801 35
  • match.start(), match.end(), and match.span()

The start() method returns the index where the matched substring starts.

The end() method returns the index where the matched substring ends.

The span() method returns a tuple containing the start and end index of the match.

# match start and end
match.start()
# Output: 2
match.end()
# Output: 8

# match span
match.span()
# Output: (2, 8)
  • match.re and match.string

The re attribute of a match object returns the regular expression object that produced the match.

The string attribute returns the string that was passed to the matching function.

# match re
match.re
# Output: re.compile(r'(\d{3}) (\d{2})')

# match string
match.string
# Output: '39801 356, 2102 1111'

r prefix:

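When an r or R prefix is placed before a regular expression string, Python treats it as a raw string, so backslashes are not interpreted as escape characters.

For example, '\n' is a single newline character, whereas r'\n' means two characters: a backslash followed by n.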

# r prefix
import re

string = '\n and \r are escape sequences.'

result = re.findall(r'[\n\r]', string)
print(result)

# Output: ['\n', '\r']
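
The difference between a plain and a raw string literal can be checked directly. This is a small sketch with illustrative variable names. It also shows that a raw pattern and a pattern with doubled backslashes are the same string.

```python
import re

plain = "\n"    # one character: a newline
raw = r"\n"     # two characters: a backslash and the letter n

plain_length = len(plain)
raw_length = len(raw)

# These two literals spell the same pattern, so they find the same matches.
with_raw = re.findall(r"\d+", "a1b22")
with_doubled = re.findall("\\d+", "a1b22")

print(plain_length, raw_length)
print(with_raw, with_doubled)
```

Writing patterns as raw strings avoids having to double every backslash.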
