Beginner’s Guide to Regex

Introduction

If you come from a technical background, you are probably already aware of how resourceful RegEx is. RegEx is short for Regular Expression, and a working knowledge of it is essential in the world of programming. The most common examples that come to mind are the validation checks on emails (which must end with @___.__), mobile numbers (10-digit limit), and passwords (mandatory uppercase letter, lowercase letter, digit, and symbol), where certain criteria need to be met. These criteria are typically expressed as regular expressions. Regular expressions are essentially text patterns that can be used to search for or replace characters in a body of text.

Can you think of its applications in the vast domain of data science? The use cases are plentiful, especially in NLP (Natural Language Processing). Regular expressions can also be used to filter and preprocess raw data for better analysis. Additionally, they can automate some data transformation tasks and enable text mining. We will look at these applications in more detail further down in this article. What is important to understand is that regex is a handy tool that every data scientist should be aware of!

Let’s move on to some common regex functions and practices in Python. Note that regex is not specific to Python: it is available in many programming languages, such as Java, PHP, and C++, and has been around for quite some time.

Regex Modules in Python

Python comes with its own built-in module for creating and working with regular expressions, namely re. We will now check out some important functions within this module.

1. re.search()

This function returns the first occurrence of the pattern in the provided text, wherever it appears. It reports only a single occurrence, even if the pattern appears several times, and returns None if there is no match.

Syntax – re.search(pattern, string)
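As a minimal sketch of the behaviour described above, the snippet below searches a made-up sentence for the first run of digits. The sample text and variable names are assumptions chosen for illustration.

```python
import re

# Hypothetical sample text containing two numbers.
text = "Order 66 shipped on day 12"

# search() scans the whole string and stops at the first match.
match = re.search(r"\d+", text)

if match:
    print(match.group())  # the matched text: 66
    print(match.span())   # (start, end) position of the match: (6, 8)
else:
    print("no match")
```

Only the first number is reported; the later "12" is ignored because search() stops after one match.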

2. re.match()

The match() function checks a pattern against some text, but only at the beginning of the text: it succeeds only if the pattern matches starting at the very first character, and returns None otherwise. We will learn more about creating robust regex for matching in the latter part of this tutorial.

Syntax – re.match(pattern, string)
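The difference between match() and search() can be sketched as follows. The sample sentence is an assumption made up for this illustration.

```python
import re

# Hypothetical sample text.
text = "welcome to the regex guide"

# match() only succeeds if the pattern matches at position 0.
at_start = re.match(r"welcome", text)
not_at_start = re.match(r"regex", text)

print(at_start is not None)  # True: the text begins with "welcome"
print(not_at_start is None)  # True: "regex" occurs later, so match() gives None

# search() finds the same pattern anywhere in the text.
found = re.search(r"regex", text)
print(found.start())  # 15
```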

3. re.findall()

This function returns all non-overlapping occurrences of the pattern in the provided string, as a list. Use findall() when you need every match rather than just the first one. Note that it returns the matched strings rather than match objects, so search() and match() remain useful when you need details such as the position of a match.

Syntax – re.findall(pattern, string)
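Here is a small sketch of findall() on an invented sentence. The phone-like numbers and variable names are assumptions for illustration only.

```python
import re

# Hypothetical sample text with two phone-like numbers.
text = "Call 555-1234 or 555-9876 before 5 pm"

# Every non-overlapping match is returned as a string in a list.
numbers = re.findall(r"\d{3}-\d{4}", text)
print(numbers)  # ['555-1234', '555-9876']

# With one capturing group, findall() returns only the group contents.
prefixes = re.findall(r"(\d{3})-\d{4}", text)
print(prefixes)  # ['555', '555']

# When nothing matches, the result is an empty list rather than None.
missing = re.findall(r"[A-Z]{5}", text)
print(missing)  # []
```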

4. re.split()

The split() function is used to split the input string based on each occurrence of a pattern. Suppose you want to split the username and domain name in an email ID. This function is useful in such a scenario.

Syntax – re.split(pattern, string)
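The email scenario above might look like the sketch below. The address and the second delimiter example are assumptions added for illustration.

```python
import re

# Hypothetical email ID.
email = "jane.doe@example.com"

# Split on the "@" sign to separate the username from the domain.
username, domain = re.split(r"@", email)
print(username)  # jane.doe
print(domain)    # example.com

# A pattern can cover several delimiters at once, here a comma or
# semicolon followed by optional whitespace.
fields = re.split(r"[;,]\s*", "red, green;blue,  yellow")
print(fields)  # ['red', 'green', 'blue', 'yellow']
```

For a single fixed delimiter such as "@", the plain string method str.split() works just as well; re.split() pays off when the delimiter itself is a pattern.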

5. re.sub()

The re.sub() function replaces every part of the input text that matches a pattern with a replacement string. It takes three required parameters: the first is the pattern to look for, the second is the string we want in its place, and the third is the main input text itself.

Syntax – re.sub(pattern, repl, string)
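A short sketch of re.sub() in action follows. The sample strings are assumptions; the optional count argument is part of the standard re.sub() signature and limits how many replacements are made.

```python
import re

# Hypothetical sample text.
text = "Phone: 555-1234, backup: 555-9876"

# Replace every digit with "#".
masked = re.sub(r"\d", "#", text)
print(masked)  # Phone: ###-####, backup: ###-####

# Replace only the first matching number by passing count=1.
first_only = re.sub(r"\d{3}-\d{4}", "[hidden]", text, count=1)
print(first_only)  # Phone: [hidden], backup: 555-9876

# A common cleaning step: collapse runs of whitespace into one space.
cleaned = re.sub(r"\s+", " ", "too    many   spaces")
print(cleaned)  # too many spaces
```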

Basics of Regular Expression

First, let’s discuss some metacharacters and quantifiers commonly used in regex.

| Symbol/Character | Description | Example |
|---|---|---|
| . | Any single character (except a newline) | “wel...e” |
| ^ | Starts with | “^lcome” |
| $ | Ends with | “wel$” |
| * | Zero or more occurrences. Type of quantifier. | “com*” |
| + | One or more occurrences. Type of quantifier. | “com+” |
| {} | Used to specify the number of occurrences. Type of quantifier. | “[0-9]{10}” allows exactly 10 digits |
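To make the symbols in this table concrete, the sketch below tries each one on short made-up strings. The test strings are assumptions for illustration, and the anchors use "^wel" and "come$" so that both match the word "welcome".

```python
import re

# "." stands for any single character, so three dots cover "com".
dot_ok = bool(re.search(r"wel...e", "welcome"))
print(dot_ok)  # True

# "^" anchors the pattern to the start, "$" to the end of the text.
starts = bool(re.search(r"^wel", "welcome"))
ends = bool(re.search(r"come$", "welcome"))
print(starts, ends)  # True True

# "*" allows zero or more of the preceding character ("m" here).
star = re.findall(r"com*", "co com comm")
print(star)  # ['co', 'com', 'comm']

# "+" requires at least one of the preceding character.
plus = re.findall(r"com+", "co com comm")
print(plus)  # ['com', 'comm']

# "{10}" requires exactly ten repetitions; fullmatch() checks the
# whole string, which suits validation such as a 10-digit number.
ten_digits = bool(re.fullmatch(r"[0-9]{10}", "9876543210"))
five_digits = bool(re.fullmatch(r"[0-9]{10}", "98765"))
print(ten_digits, five_digits)  # True False
```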

Let’s also look into some special sequence characters.

| Character | Description |
|---|---|
| \A | Returns a match if the pattern is at the beginning of the string, e.g. “\AOnce” |
| \d | Matches a digit (0-9) |
| \D | Matches any character that is NOT a digit |
| \s | Matches a whitespace character |
| \S | Matches a non-whitespace character |
| \w | Matches a word character: letters, digits, and the underscore (a-z, A-Z, 0-9, _) |
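The special sequences can be tried out with a sketch like the one below. All sample strings are assumptions invented for this illustration.

```python
import re

# Hypothetical sample text.
story = "Once upon a time, 3 bears ate 12 apples"

# \A matches only at the very beginning of the string.
begins = bool(re.search(r"\AOnce", story))
print(begins)  # True

# \d matches digits; \D matches everything that is not a digit.
digits = re.findall(r"\d+", story)
non_digits = re.findall(r"\D+", "a1b22c")
print(digits)      # ['3', '12']
print(non_digits)  # ['a', 'b', 'c']

# \s matches whitespace (spaces, tabs, newlines); \S is the opposite.
by_space = re.split(r"\s+", "a  b\tc")
chunks = re.findall(r"\S+", "a  b\tc")
print(by_space)  # ['a', 'b', 'c']
print(chunks)    # ['a', 'b', 'c']

# \w matches letters, digits, and the underscore.
words = re.findall(r"\w+", "user_1 @ home!")
print(words)  # ['user_1', 'home']
```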

Applications of regex in dealing with datasets

Data scientists use regular expressions for a variety of data processing and wrangling operations. These range from data pre-processing and natural language processing to pattern matching, web scraping, data extraction, and numerous others!

Extracting Emails

There are many times when we need to extract important information from an email thread. Doing this with plain string manipulation methods would be a cumbersome task. Regex comes to the rescue! Here, the goal is to retrieve the email IDs mentioned in an email using the findall() method.
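One way this extraction could look is sketched below. The email thread and addresses are invented, and EMAIL_PATTERN is a deliberately simplified pattern assumed for illustration; it does not cover every address the email standards allow.

```python
import re

# Hypothetical email thread; the addresses are made up.
thread = (
    "From: alice@example.com\n"
    "To: bob.smith@mail.example.org\n"
    "Hi Bob, please loop in carol_99@example.net before Friday."
)

# Simplified pattern (an assumption, not a full validator):
#   [\w.+-]+          username: word characters, dots, plus, hyphen
#   @                 the separator
#   [\w-]+            first part of the domain
#   (?:\.[\w-]+)+     one or more ".something" parts, non-capturing
EMAIL_PATTERN = r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"

emails = re.findall(EMAIL_PATTERN, thread)
print(emails)
# ['alice@example.com', 'bob.smith@mail.example.org', 'carol_99@example.net']
```

The group is written as non-capturing, (?: ... ), so that findall() returns the whole address instead of only the last domain part.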

Web Scraping

Data collection and processing are an elementary part of a data scientist’s work and, by an often-quoted estimate, make up around 80% of it. With the growth of the Internet, it is easier than ever to find data on the web. One can simply scrape websites to collect or generate data.

Though web scraping gives us access to vast reserves of data, it has its own set of issues. The acquired data comes wrapped in HTML tags and is full of noise. This is where regex can be used effectively!

Web scraping is probably one of the most common applications of regex in the data science domain. Let’s work on scraping a particular part of our own blogging website, Fireblaze Blogs.

I found this particular list of different topics on the website along with their respective counts. Can we retrieve this data into a dataframe?

I have saved the page’s HTML code to a file and then loaded it. Notice how messy the raw data looks. Let’s try to retrieve some data from it.

First, we retrieve the topics mentioned in the HTML file. I have saved them in a pandas Series. Notice that the regex uses the \w and \s patterns for letters and white space.

Similarly, I have gathered the number of posts and the blog post tags for each type of article.

Finally, I merge all the series together into a dataframe for further manipulation.
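The steps above can be sketched roughly as follows. The page's real markup is not shown in this article, so the HTML snippet, the URL layout, the topic names, and the counts below are all assumptions invented for illustration. The sketch also uses only the standard library and collects the results as a list of records rather than pandas Series.

```python
import re

# Hypothetical HTML standing in for the saved page (not the real markup).
html = """
<ul>
  <li class="cat-item"><a href="/category/data-science/">Data Science</a> (12)</li>
  <li class="cat-item"><a href="/category/machine-learning/">Machine Learning</a> (7)</li>
  <li class="cat-item"><a href="/category/python/">Python</a> (21)</li>
</ul>
"""

# One pattern with three capturing groups:
#   1. the tag (URL slug), 2. the topic name, 3. the number of posts.
ROW_PATTERN = r'<a href="/category/([\w-]+)/">([\w\s]+)</a>\s*\((\d+)\)'

# With several groups, findall() returns one tuple per match.
rows = re.findall(ROW_PATTERN, html)

# Turn the tuples into records, converting the count to an integer.
records = [
    {"topic": topic, "tag": tag, "posts": int(count)}
    for tag, topic, count in rows
]

for record in records:
    print(record)
```

If pandas is installed, passing this list to pandas.DataFrame(records) should give a dataframe with topic, tag, and posts columns. For anything beyond a small, regular snippet like this one, a dedicated HTML parser is usually more robust than regex alone.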

Conclusion

Thus, in this article, we have seen the basics of regex and some of its data science applications. We have also implemented a few of its use cases, but this barely covers all that regex can offer a data scientist! It is also a vital element in NLP applications; to go further with NLP in Python, the NLTK library is a good place to start. Regex may look intimidating at first, but once you get the hang of it, you will undoubtedly unlock its immense potential. Its use has grown tremendously beyond its early applications in login portals and online money transactions.
