Make a Full Lexer in Python!
Elderosa (23)

What is a Lexer?

A lexer is an analyzer that moves through your code character by character, grouping those characters into tokens.
This input

int a =5*5

Can be turned into

[('KeyWord', 'int'), ('ID', 'a'), ('assign', '='), ('num', 5), ('OP', '*'), ('num', 5)]

by the Lexer you will learn how to create.

What if I Have Problems?

If you have trouble understanding something or you get errors, tell me and I’ll try my best to tell you what’s wrong.

Let’s Get Started!

First, open a new Python repl with whatever name you choose.
Then create a function called lex. This will be the function that basically does everything :).
Then make a variable code set to input(). Make sure that line is outside the function.
Finally, call lex on code.

def lex(line):
    pass  # we'll fill this in as we go

code = input()
lex(code)

After this, set a new variable inside lex and name it count or lexeme_count or something similar, then set it to 0.

def lex(line):
    lexeme_count = 0
code = input()
lex(code)

The lexeme_count variable is going to keep track of how many characters you have already scanned.
Once you have that code, add a while loop that keeps scanning as long as the number of characters scanned is less than the length of the line.

def lex(line):
    lexeme_count = 0
    while lexeme_count < len(line):
        lexeme_count += 1
code = input()
lex(code)

We will then make it more powerful by grabbing the current lexeme so we know what we are looking at.

def lex(line):
    lexeme_count = 0
    while lexeme_count < len(line):
        lexeme = line[lexeme_count]
        lexeme_count += 1
code = input()
lex(code)

Then we can tell what the type is by using an if-elif-else statement to check the type of the lexeme.
Make sure to move the lexeme_count += 1 line into the else branch.

def lex(line):
    lexeme_count = 0
    while lexeme_count < len(line):
        lexeme = line[lexeme_count]
        if lexeme.isdigit():
            pass  # we'll fill this in next
        elif lexeme.isalpha():
            pass  # we'll fill this in next
        else:
            lexeme_count += 1
code = input()
lex(code)

Let’s fill in the blank conditional blocks.

def lex(line):
    lexeme_count = 0
    while lexeme_count < len(line):
        lexeme = line[lexeme_count]
        if lexeme.isdigit():
            typ, tok, consumed = lex_num(line[lexeme_count:])
            lexeme_count += consumed
        elif lexeme.isalpha():
            typ, tok, consumed = lex_str(line[lexeme_count:])
            lexeme_count += consumed
        else:
            lexeme_count += 1
code = input()
lex(code)

Whoa, Slow Down! What’s Going on?

What we're doing here is making three variables: one for the type of each token, one for the token itself, and one for the number of characters 'consumed', 'eaten', or 'scanned'.
Then we assign those variables to a function call which takes the rest of the line and lexes the rest of the token. We do this both for digits and strings.
After this, we add the number of characters consumed to lexeme_count so the two stay in sync.
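
For example, once lex_num is written later in this tutorial, calling it on the rest of a line that starts with digits would behave roughly like this (the input string here is just an illustration):

typ, tok, consumed = lex_num("25*5")
print(typ, tok, consumed)  # prints: num 25 2  -> two characters were consumed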

Is this it?

This is certainly not the full lexical analyzer, so let’s add some identifier lexing!
Once we have finished with that, we can scan for literals, conditionals, operators, keywords, etc.

Let’s Lex Some Identifiers!

Change the string branch so it triggers on a quote character (the start of a string literal), and add another elif to the if-elif-else statement that checks whether the lexeme is a letter of the alphabet; that branch will handle identifiers.

def lex(line):
    lexeme_count = 0
    while lexeme_count < len(line):
        lexeme = line[lexeme_count]
        if lexeme.isdigit():
            typ, tok, consumed = lex_num(line[lexeme_count:])
            lexeme_count += consumed
        elif lexeme == '"' or lexeme == "'":
            typ, tok, consumed = lex_str(line[lexeme_count:])
            lexeme_count += consumed
        elif lexeme.isalpha():
            pass  # we'll fill this in next
        else:
            lexeme_count += 1
code = input()
lex(code)

In this elif, we need to mirror what we did earlier, but with a call to a different function: lex_id().

def lex(line):
    lexeme_count = 0
    while lexeme_count < len(line):
        lexeme = line[lexeme_count]
        if lexeme.isdigit():
            typ, tok, consumed = lex_num(line[lexeme_count:])
            lexeme_count += consumed
        elif lexeme == '"' or lexeme == "'":
            typ, tok, consumed = lex_str(line[lexeme_count:])
            lexeme_count += consumed
        elif lexeme.isalpha():
            typ, tok, consumed = lex_id(line[lexeme_count:])
            lexeme_count += consumed
        else:
            lexeme_count += 1
code = input()
lex(code)

Time To Make The Functions!

We used three functions, but we haven’t defined them. Let’s go ahead and do that.

def lex_num(line):
    num = ""

def lex_str(line):
    delimiter = line[0]
    string = ""

def lex_id(line):
    id = ""

First we'll make the lex_num function scan characters until it hits something that isn't a digit, then return the number.

def lex_num(line):
    num = ""
    for c in line:
        if not c.isdigit():
            break
        num += c
    return 'num', int(num), len(num)

def lex_str(line):
    delimiter = line[0]
    string = ""

def lex_id(line):
    id = ""

We will then fill out the lex_str() function, doing the same thing as the number one but for a string instead.

def lex_num(line):
    num = ""
    for c in line:
        if not c.isdigit():
            break
        num += c
    return 'num', int(num), len(num)

def lex_str(line):
    delimiter = line[0]
    string = ""
    for c in line[1:]:  # skip the opening quote
        string += c
    return 'str', string, len(string)

def lex_id(line):
    id = ""

And now we will fill out the lex_id() function!

def lex_num(line):
    num = ""
    for c in line:
        if not c.isdigit():
            break
        num += c
    return 'num', int(num), len(num)

def lex_str(line):
    delimiter = line[0]
    string = ""
    for c in line[1:]:  # skip the opening quote
        if c == delimiter:
            break
        string += c
    return 'str', string, len(string) + 2  # +2 so the two quote characters are consumed too

def lex_id(line):
    id = ""
    for c in line:
        if not c.isdigit() and not c.isalpha() and c != "_":
            break
        id += c
    return 'ID', id, len(id)

What About KeyWords?

Yes, we will need to change the lex_id() function to know about keywords...
What are you waiting for? Read on!

We are going to make a list of keywords and check the id against it.

def lex_num(line):
    num = ""
    for c in line:
        if not c.isdigit():
            break
        num += c
    return 'num', int(num), len(num)

def lex_str(line):
    delimiter = line[0]
    string = ""
    for c in line[1:]:  # skip the opening quote
        if c == delimiter:
            break
        string += c
    return 'str', string, len(string) + 2  # +2 so the two quote characters are consumed too

def lex_id(line):
    keys = ['print', 'var', 'while', 'if', 'elif', 'else']
    id = ""
    for c in line:
        if not c.isdigit() and not c.isalpha() and c != "_":
            break
        id += c
    if id in keys:
        return 'key', id, len(id)
    else:
        return 'ID', id, len(id)

The Entire Code

I know you want to go out and try this yourself, but if you need it, here is the full working code.
By the way, if you copy and paste code straight from this post, watch out for curly quotes; Python only accepts straight quotes, so replace any that sneak in.
If you want to copy and paste, do it below on my better lexer.

def lex_num(line):
    num = ""
    for c in line:
        if not c.isdigit():
            break
        num += c
    return 'num', int(num), len(num)

def lex_str(line):
    delimiter = line[0]
    string = ""
    for c in line[1:]:  # skip the opening quote
        if c == delimiter:
            break
        string += c
    return 'str', string, len(string) + 2  # +2 so the two quote characters are consumed too

def lex_id(line):
    keys = ['print', 'var', 'while', 'if', 'elif', 'else']
    id = ""
    for c in line:
        if not c.isdigit() and not c.isalpha() and c != "_":
            break
        id += c
    if id in keys:
        return 'key', id, len(id)
    else:
        return 'ID', id, len(id)

def lex(line):
    lexeme_count = 0
    while lexeme_count < len(line):
        lexeme = line[lexeme_count]
        if lexeme.isdigit():
            typ, tok, consumed = lex_num(line[lexeme_count:])
            lexeme_count += consumed
        elif lexeme == '"' or lexeme == "'":
            typ, tok, consumed = lex_str(line[lexeme_count:])
            lexeme_count += consumed
        elif lexeme.isalpha():
            typ, tok, consumed = lex_id(line[lexeme_count:])
            lexeme_count += consumed
        else:
            lexeme_count += 1
code = input()
lex(code)
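
As written, lex() works out typ and tok but never stores them anywhere. Here is a quick sketch of one way you could rewrite lex() (and the final two lines) so it collects the tokens into a list like the example at the top of the post. Note that operators like = and * are still skipped by the else branch, so they won't show up yet:

def lex(line):
    tokens = []  # collect (type, token) pairs as we scan
    lexeme_count = 0
    while lexeme_count < len(line):
        lexeme = line[lexeme_count]
        if lexeme.isdigit():
            typ, tok, consumed = lex_num(line[lexeme_count:])
            tokens.append((typ, tok))
            lexeme_count += consumed
        elif lexeme == '"' or lexeme == "'":
            typ, tok, consumed = lex_str(line[lexeme_count:])
            tokens.append((typ, tok))
            lexeme_count += consumed
        elif lexeme.isalpha():
            typ, tok, consumed = lex_id(line[lexeme_count:])
            tokens.append((typ, tok))
            lexeme_count += consumed
        else:
            lexeme_count += 1
    return tokens

code = input()
print(lex(code))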
Comments
Wumi4 (460)

Nice! But I think that you should start with the token-making part first, for example, building a Token class that contains information about the token (like column, lexeme, line and type), then start with the lexer. It will make everything more convenient since you just need to create a Token instance, fill it with information, and return it to the user. It follows the DRY (Don't Repeat Yourself) rule very well! But anyway, nice one!
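
A minimal sketch of the kind of Token class being described here (the field names are just illustrative, not from the tutorial):

class Token:
    def __init__(self, typ, lexeme, line, column):
        self.typ = typ        # e.g. 'num', 'ID', 'key'
        self.lexeme = lexeme  # the raw text that was matched
        self.line = line      # line the token starts on
        self.column = column  # column of the token's first character

    def __repr__(self):
        return f"Token({self.typ!r}, {self.lexeme!r}, line={self.line}, col={self.column})"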

Elderosa (23)

I’m still working on it, @Wumi4

Wumi4 (460)

@Elderosa I know. It's just a suggestion.

Elderosa (23)

I think it’ll be better, especially if I am adding a parser to my project. I’ll need to make a class for that also, @Wumi4

NigamDahal (2)

Yeah, I agree with @Wumi4. This is more of a tokenizer rather than a lexer.

NigamDahal (2)

@Elderosa I'd love to see a parser. AST generation is not that simple, but good luck!

programmeruser (440)

@Elderosa parsing isn't that hard if you use recursive descent.

Elderosa (23)

btw, I’ve realized that handwritten compilers/interpreters are better :), @programmeruser

adl212 (153)

could you make a parser so that i don't need to code my own programming language? https://repl.it/@adl212/programmingLangScratch don't actually do it lol

Elderosa (23)

I will make a parser, but for myself. I might also make a tutorial on it like this, @adl212

Elderosa (23)

I’d like to see what this looks like in C++ or C. I kinda want it to be in a lower level language, @adl212

adl212 (153)

@Elderosa I think there are more possibilities for your coding language if you use C++ or C. For example, in Python you can only set variables one way, but in other languages you can do const <variable> = <value>, which makes it read-only.

Elderosa (23)

You can try and convert it to C++, @adl212

Elderosa (23)

I’d like to see if it is faster or better, including your const example. @adl212

adl212 (153)

@Elderosa I really don't know because I haven't coded c before lol

Elderosa (23)

What about C++? My mind goes blank when it comes to C, @adl212

adl212 (153)

@Elderosa nope i only do python lol

[deleted]

best tutorial i found on the subject

Elderosa (23)

Really! Thank you so much, @LucasAllori

CoolCoderSJ (301)

Btw, also, when appending to tokens use a list, not a tuple. It's easier to manage and call. So basically make tokens a 2D list.

Elderosa (23)

Can I make a list of lists? @CoolCoderSJ

CoolCoderSJ (301)

@Elderosa yes u can, i forked it and did it, and it worked. its called a 2d list.

I explain 2D lists thoroughly here

Once you make your account, go to option 4, the manual level picker, and choose level 8.
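
For example, the token list from the top of the post written as a 2D list (a list of lists) instead of a list of tuples:

tokens = [['KeyWord', 'int'], ['ID', 'a'], ['assign', '='], ['num', 5], ['OP', '*'], ['num', 5]]
tokens[3][1] = 25  # inner lists are mutable, so a token can be edited in place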

Elderosa (23)

I’m not done yet, @elipie

Elderosa (23)

With the tutorial at least. The Lexer is pretty good @Elderosa

elipie (329)

@Elderosa shouldn't have posted it, then. lol, anything with PL dev is an insta upvote from me

Elderosa (23)

I’m trying to figure out how I should make my parse tree? @elipie

elipie (329)

@Elderosa uhm, well, the syntax tree in a program isn't actually a tree. It literally is just doing stuff with the tokens. So don't try to make a tree, cuz you will get lost in trying to figure out where everything goes.

You know what a parse tree is, I am guessing, so it's just the coding. That's the hard part lol.

this probably didn't help you at all
Elderosa (23)

ik lol. So I should just like try to run the tokens then and there? @elipie

Elderosa (23)

This is not a finished post