The Basics of Regex
sugarfi (602)

Regex, short for regular expressions, is a very useful tool for parsing and lexing text. However, a lot of people don't know how to use it. I decided to write this tutorial so people could learn.

What are Regular Expressions?

Like I said before, regular expressions are used for parsing and lexing text. But what does that mean, exactly? Well, lexing is short for lexical analysis: splitting a string of text up into tokens. This is especially useful when writing programming languages, because you must have a way to split a raw string of code into tokens the computer can understand. A token is just a bit of text with a specific type, like a name or a value. Regex are used to match those tokens. A regex is what is called a pattern: an outline or sketch of how text should look. Text matches a regex when it fits the design that the regex lays out.
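To make the idea of tokens a bit more concrete, here is a rough sketch of a tiny lexer in Python. The token types (NUMBER, NAME, OPERATOR) and the tokenize function are names I made up for illustration, and the pattern syntax is explained in the sections that follow, so treat this as a preview rather than a finished lexer.

```python
import re

# Hypothetical token types for a tiny made-up language (not from any real lexer).
TOKEN_TYPES = [
    ('NUMBER', re.compile(r'[0-9]+')),
    ('NAME', re.compile(r'[a-z]+')),
    ('OPERATOR', re.compile(r'[+=]')),
]

def tokenize(text):
    """Split text into (type, text) pairs, skipping spaces."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos] == ' ':
            pos += 1
            continue
        for kind, pattern in TOKEN_TYPES:
            # Try each pattern at the current position.
            m = pattern.match(text, pos)
            if m:
                tokens.append((kind, m.group()))
                pos = m.end()
                break
        else:
            # No pattern matched at this position.
            raise ValueError(f'unexpected character {text[pos]!r}')
    return tokens

print(tokenize('x = 42 + y'))
```

Each token keeps both its type and the text it came from, which is what a parser would work with next.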

Basic Regex

Regex are written using several special characters. Any character that is not one of the special characters is matched literally, meaning it is matched exactly as written. For example, the regex abc will match only the text abc. It will not match a, ab, or bc, or anything other than abc. However, literal matches are not very interesting. Let's get on to some of the special characters.
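Here is a small sketch of literal matching using Python's re module. I use fullmatch, which only succeeds when the entire string fits the pattern; the list of test strings is just my own choice.

```python
import re

pattern = re.compile(r'abc')

# fullmatch succeeds only when the whole string fits the pattern.
literal_results = {
    text: pattern.fullmatch(text) is not None
    for text in ['abc', 'a', 'ab', 'bc', 'abcd']
}
print(literal_results)
```

Only 'abc' comes out as True here. Note that other functions, such as re.search, would also find abc inside a longer string like abcd.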

Basic Special Characters

The most basic special character is the dot (.). The . matches any single character (in most regex engines, any character except a newline). For example, the regex a.c will match abc, aac, a1c, and any other combination of an a, one character, and a c. But still, that's not very interesting. We could do that easily without regex. Now it's time to add something to the mix: qualifiers.
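A quick sketch of the dot in Python, again using fullmatch so the whole string has to fit. The test strings are my own.

```python
import re

pattern = re.compile(r'a.c')

# The dot stands for exactly one character, so 'ac' and 'abbc' do not fit.
dot_results = {
    text: pattern.fullmatch(text) is not None
    for text in ['abc', 'aac', 'a1c', 'ac', 'abbc']
}
print(dot_results)
```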

Qualifiers

A qualifier (most references call it a quantifier) is a special character that makes the previous character or group of characters behave differently. The most basic qualifier is ?. It makes the previous character or group of characters optional. For example, ab? will match a or ab, because the b is optional. The real fun is when you use this with other special characters: you can use a.? to match a, or a followed by any character.
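Here is a small sketch of ? in Python. The helper fits is a name I made up; it just reports whether the whole string fits the pattern.

```python
import re

def fits(pattern, text):
    """Return True when the entire text fits the pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

# ab? : an a, then an optional b.
optional_b = [fits(r'ab?', text) for text in ['a', 'ab', 'abb']]

# a.? : an a, then an optional single character of any kind.
optional_any = [fits(r'a.?', text) for text in ['a', 'ax', 'axy']]

print(optional_b, optional_any)
```

In both lists the first two strings fit and the third does not, because ? allows at most one extra character.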

Repetition

There are more qualifiers, which are used to repeat characters. The + qualifier matches the previous character one or more times: a+ will match one or more as. * is like +, but it matches the character zero or more times: a* matches any number of as, including none at all (an empty string).

But what if you only want to match a certain number of characters? There is another qualifier for that. Let's say you wanted to match 3, and only 3, characters. You could write ..., but that gets tedious for larger counts. Instead, you can use the { and } qualifier. You put the character or group to repeat, then the {, then the number of times to match it, and finally a closing }. To match exactly 3 characters, you can use .{3}. But what if you want to match, say, between 1 and 3 characters? You use the same braces, but with two numbers separated by a comma: .{1,3}. The first number is the minimum and the second is the maximum. Do not put a space after the comma; Python's re module, for one, reads .{1, 3} as a . followed by the literal text {1, 3}.
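The repetition qualifiers can be tried out with a sketch like this one. The helper fits and the test strings are my own; the patterns are the ones from the text, with the range written without a space after the comma.

```python
import re

def fits(pattern, text):
    """Return True when the entire text fits the pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

# a+ needs at least one a; a* also accepts the empty string.
plus_results = [fits(r'a+', text) for text in ['', 'a', 'aaa']]
star_results = [fits(r'a*', text) for text in ['', 'a', 'aaa']]

# .{3} needs exactly three characters.
exact_results = [fits(r'.{3}', text) for text in ['xy', 'xyz', 'wxyz']]

# .{1,3} accepts one, two, or three characters.
range_results = [fits(r'.{1,3}', text) for text in ['', 'x', 'xyz', 'wxyz']]

print(plus_results, star_results, exact_results, range_results)
```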

Ranges

Sometimes you only want to match certain characters. That is what ranges are good for. A range (also called a character class or set) is just a set of characters, where exactly one of them will be matched. To define a range, you use the [ and ] characters. You put the characters to match inside the brackets. For example, [abc] will match either an a, a b, or a c. But sometimes even that is not enough. What if you want to match any digit? It would be cumbersome to type out something like [0123456789]. Luckily, regex provides a way to do that. You simply provide two characters separated by a dash inside of a range, and the range will match those characters and anything in between. For example, [0-9] would be the same as [0123456789]. Regex matches every character between 0 and 9, including 0 and 9.
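A short sketch of ranges in Python. It checks that [abc] accepts one of its three letters and nothing else, and that [0-9] and [0123456789] agree on every digit. The variable names are my own.

```python
import re

# [abc] matches exactly one character out of a, b, and c.
set_results = {
    text: re.fullmatch(r'[abc]', text) is not None
    for text in ['a', 'b', 'c', 'd', 'ab']
}

# [0-9] and [0123456789] should accept exactly the same single characters.
candidates = '0123456789x-'
same_behaviour = all(
    (re.fullmatch(r'[0-9]', ch) is None) == (re.fullmatch(r'[0123456789]', ch) is None)
    for ch in candidates
)

print(set_results, same_behaviour)
```

Note that 'ab' does not fit [abc], because a range stands for a single character; you would need a repetition qualifier to match more.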

Groups

The last topic I'll cover here is groups. A group is just part of a regex within parentheses that acts as a single character, meaning you can use qualifiers on it. Groups are useful when you want to match several characters at once. For example, if we did (abc)?, we could match either the characters abc or nothing. The group acts as a single character. There is one other thing groups are good for: matching two alternate strings. To do that, we put the alternatives within a group, separated by a |. For example, we could write (abc|def) to match either the characters abc or def, but not both. Note that you can nest groups, so I could do something like ((abc|def)123) to match either abc or def followed by 123.
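Here is a sketch of the three group patterns from this section in Python. The helper fits and the test strings are my own.

```python
import re

def fits(pattern, text):
    """Return True when the entire text fits the pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

# (abc)? : the whole group is optional, so 'abc' or nothing.
optional_group = [fits(r'(abc)?', text) for text in ['', 'abc', 'ab']]

# (abc|def) : one alternative or the other, but not both.
alternatives = [fits(r'(abc|def)', text) for text in ['abc', 'def', 'abcdef']]

# ((abc|def)123) : nested groups, either alternative followed by 123.
nested = [fits(r'((abc|def)123)', text) for text in ['abc123', 'def123', '123']]

print(optional_group, alternatives, nested)
```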

Escaping

What if you want to match one of the special characters? Maybe I want to match the literal text a. (an a followed by a dot). I have to do what is called escaping. I put a backslash, \, before the special character. Thus, a regex to match a. would be a\.. Any special character can be escaped this way, including brackets ({, (, [, etc.) and the backslash itself.
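The difference between an escaped and an unescaped dot shows up in a sketch like this. It also uses re.escape, a function in Python's re module that adds the backslashes for you; the variable names are my own.

```python
import re

# Unescaped: the dot matches any character, so 'ab' fits.
unescaped = [re.fullmatch(r'a.', text) is not None for text in ['a.', 'ab']]

# Escaped: only a literal dot is accepted after the a.
escaped = [re.fullmatch(r'a\.', text) is not None for text in ['a.', 'ab']]

# re.escape builds the escaped pattern from plain text.
auto_escaped = re.escape('a.')

print(unescaped, escaped, auto_escaped)
```

re.escape is handy when the text to match comes from user input and might contain special characters.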

Example

Finally, a tutorial is not much good without an example. For this example, we'll write a simple regex to match any decimal number (5, 5.6, -1, etc.). Then we'll write a simple Python script to test it out for us. First, the regex. We'll start off with [0-9]+. This simply matches one or more digits between 0 and 9. We want to support negative numbers, so we add a -? to the beginning, so the regex can match an optional -. This gives us -?[0-9]+. Finally, we want to add support for decimals. So we add (\.[0-9]+)? to the end. This defines a group, which matches a literal . followed by more digits, then makes it optional. Thus, our final regex is -?[0-9]+(\.[0-9]+)?. Lastly, we need to write a Python script to test this regex. Here it is:

import re

# Compile the pattern once; the r prefix keeps the backslash literal.
regex = re.compile(r'-?[0-9]+(\.[0-9]+)?')

while True:
    num = input('Type some text: ')
    # fullmatch requires the whole input to be a number, so '5abc' is rejected.
    if regex.fullmatch(num):
        print('That is a valid number!')
    else:
        print('That is not a number!')

First, we import the re module, for dealing with regex. Then we use re.compile to compile our regular expression into a Python object. You might wonder why we prefix the string with r. This tells Python that any backslashes in the string should be interpreted as literal backslashes, not as escaping the character after them. Finally, we ask the user to input some text. Then we call regex.fullmatch on it. We use fullmatch rather than match because match only checks the start of the text, so it would accept input like 5abc. If the regex matches the whole text, fullmatch returns a re.Match object. Otherwise it returns None. Try pasting this into repl.it to see it run!
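If you'd rather not type inputs by hand, here is a sketch that runs the same regex over a list of sample strings. It compares two methods of a compiled pattern: match, which only checks that the text starts with something that fits, and fullmatch, which requires the whole text to fit. The sample strings are my own.

```python
import re

regex = re.compile(r'-?[0-9]+(\.[0-9]+)?')

samples = ['5', '5.6', '-1', '5.', '.5', 'abc', '5abc']

# match: does the text START with a number?
prefix_results = [regex.match(text) is not None for text in samples]

# fullmatch: is the WHOLE text a number?
whole_results = [regex.fullmatch(text) is not None for text in samples]

for text, prefix_ok, whole_ok in zip(samples, prefix_results, whole_results):
    print(f'{text!r}: match={prefix_ok}, fullmatch={whole_ok}')
```

The two methods disagree on '5.' and '5abc': both start with a valid number, but neither is a valid number as a whole.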

End

Hopefully now you have at least a basic understanding of regex. If you want to learn more, check out the Python Regular Expression HOWTO for a more thorough tutorial, or check out RegExr for a great way to test and run your regex on various texts. I hope you enjoyed this tutorial, and thanks for reading!

sugarfi (602)

@StudentFires regexr.com is a great site for regex stuff, it seems fitting to recommend it.