Make a Full Lexer in Python!
Elderosa (23)

What is a Lexer?

A lexer is an analyzer that moves through your code looking at each character, and trying to create tokens out of them
This input

int a =5*5

Can be turned into

[(‘KeyWord’, ‘int’), (‘ID’, ‘a’), (‘assign’, ‘=‘), (‘num’, 5), (‘OP’, ‘*’), (‘num’, 5)]

by the Lexer you will learn how to create.

What if I Have Problems?

If you have trouble understanding something or you get errors, tell me and I’ll try my best to tell you what’s wrong.

Let’s Get Started!

First, open a new python repl with whatever name you choose.
Then create a function lex. This will be our function that basically does everything :).
Then, make a variable code set to input(). Make sure the code initialization is not in the function.
And call lex on code.

def lex(line):

code = input()
lex(code)

After this set a new variable, in lex, and name it count or lexeme_count or something, set it to 0.

def lex(line):
    lexeme_count = 0
code = input()
lex(code)

The lexeme_count variable is going to keep track of the chars you have already scanned.
Once you have that code, add a while loop saying that if chars you have scanned is less than length of line, keep scanning.

def lex(line):
    lexeme_count = 0
    while lexeme_count < len(line):
        lexeme_count += 1
code = input()
lex(code)

We will then make it more powerful by knowing what each lexeme is.

def lex(line):
    lexeme_count = 0
    while lexeme_count < len(line):
        lexeme = line[lexeme_count]
        lexeme_count += 1
code = input()
lex(code)

Then we can tell what the type is by using an if-elif-else statement to check the type of lexeme.
Make sure to move the lexeme_count += 1 part into the else.

def lex(line):
    lexeme_count = 0
    while lexeme_count < len(line):
        lexeme = line[lexeme_count]
        if lexeme.isdigit():
        
        elif lexeme.isalpha():

        else:
            lexeme_count += 1
code = input()
lex(code)

Let’s fill in the blank conditional blocks.

def lex(line):
    lexeme_count = 0
    while lexeme_count < len(line):
        lexeme = line[lexeme_count]
        if lexeme.isdigit():
            typ, tok, consumed = lex_num(line[lexeme_count:])
            lexeme_count += consumed
        elif lexeme.isalpha():
            typ, tok, consumed = lex_str(line[lexeme_count:])
            lexeme_count += consumed
        else:
            lexeme_count += 1
code = input()
lex(code)

Whoa, Slow Down! What’s Going on?

What we’re doing here, is we are making three variables. One for the type of each token, one for the token itself, and one for the amount of lexemes ‘consumed’, ‘eaten’, or ‘scanned’
Then we are assigning those variables to a function call which takes the rest of the line, and gets the rest of the token. We do this both for digits and strings
After this, we change the lexeme_count by the amount of chars consumed so they keep up with each other.

Is this it?

This is certainly not the full lexical analyzer, so let’s add some identifier lexing!
Once we have finished with that, we can scan for literals, conditionals, operators, keywords, etc

Let’s Lex Some Identifiers!

Add another elif to the if-elif-else statement this will check if lexeme is equal to a a letter of the alphabet.

def lex(line):
    lexeme_count = 0
    while lexeme_count < len(line):
        lexeme = line[lexeme_count]
        if lexeme.isdigit():
            typ, tok, consumed = lex_num(line[lexeme_count:])
            lexeme_count += consumed
        elif lexeme == ‘“‘ or lexeme == “‘“:
            typ, tok, consumed = lex_str(line[lexeme_count:])
            lexeme_count += consumed
        elif lexeme.isalpha():
            
        else:
            lexeme_count += 1
code = input()
lex(code)

In this elif, we need to mirror what we did earlier, but with a call to a different function; lex_id().

def lex(line):
    lexeme_count = 0
    while lexeme_count < len(line):
        lexeme = line[lexeme_count]
        if lexeme.isdigit():
            typ, tok, consumed = lex_num(line[lexeme_count:])
            lexeme_count += consumed
        elif lexeme == ‘“‘ or lexeme == “‘“:
            typ, tok, consumed = lex_str(line[lexeme_count:])
            lexeme_count += consumed
        elif lexeme.isalpha():
            typ, tok, consumed = lex_id(line[lexeme_count])
            lexeme_count += consumed
        else:
            lexeme_count += 1
code = input()
lex(code)

Time To Make The functions!

We used three functions, but we haven’t defined them. Let’s go ahead and do that.

def lex_num(line):
    num= “”

def lex_str(line):
    delimiter = line[0]
    string = “”

def lex_id(line):
    id = “”

First we’ll make the lex_num function go till the end of the line and return the number.

def lex_num(line):
    num= “”
    for c in line:
        if not c.isdigit():
            break
    return ‘num’, int(num), len(num)
    
def lex_str(line):
    delimiter = line[0]
    string = “”

def lex_id(line):
    id = “”

We will then fill out the lex_str() function doing the same thing as the digit one but for a string instead.

def lex_num(line):
    num= “”
    for c in line:
        if not c.isdigit():
            break
    return ‘num’, int(num), len(num)
    
def lex_str(line):
    delimiter = line[0]
    string = “”
    for c in line:
         string += c
    return ‘str’, string, len(string)

def lex_id(line):
    id = “”

And now we will fill out the lex_id() function!

def lex_num(line):
    num= “”
    for c in line:
        if not c.isdigit():
            break
    return ‘num’, int(num), len(num)
    
def lex_str(line):
    delimiter = line[0]
    string = “”
    for c in line:
         if c==delimiter:
             break
         string += c
    return ‘str’, string, len(string)

def lex_id(line):
    id = “”
    for c in line
        if not c.isdigit() and not c.isalpha and c != “_”:
            break
        id += c
    return ‘ID’, id, len(id)

What About KeyWords?

Yes, we will need to change the lex_id() function to know about key words...
What are you waiting for, read on!

We are going to make a list of keywords and check the id.

def lex_num(line):
    num= “”
    for c in line:
        if not c.isdigit():
            break
    return ‘num’, int(num), len(num)
    
def lex_str(line):
    delimiter = line[0]
    string = “”
    for c in line:
         if c==delimiter:
             break
         string += c
    return ‘str’, string, len(string)

def lex_id(line):
    keys = [‘print’, ‘var’, ‘while’, ‘if’, ‘elif’, ‘else’]
    id = “”
    for c in line
        if not c.isdigit() and not c.isalpha and c != “_”:
            break
        id += c
    if id in keys:
        return ‘key’, id, len(id)
    else:
        return ‘ID’, id, len(id)

The Entire Code

I know you want to go out and try this, but if you need it, here is the full working code.
BTW if you copy and paste this code, it will result in an error because I use curly quotes and those are not used in programming ‘“‘“‘“‘“‘“‘, I guess you either have to manually take them out and replace them XD, or just look at this code as a reference.
If you want to copy and paste :(, do it below on my better lexer

def lex_num(line):
    num= “”
    for c in line:
        if not c.isdigit():
            break
    return ‘num’, int(num), len(num)
    
def lex_str(line):
    delimiter = line[0]
    string = “”
    for c in line:
         if c==delimiter:
             break
         string += c
    return ‘str’, string, len(string)

def lex_id(line):
    keys = [‘print’, ‘var’, ‘while’, ‘if’, ‘elif’, ‘else’]
    id = “”
    for c in line
        if not c.isdigit() and not c.isalpha and c != “_”:
            break
        id += c
    if id in keys:
        return ‘key’, id, len(id)
    else:
        return ‘ID’, id, len(id)

def lex(line):
    lexeme_count = 0
    while lexeme_count < len(line):
        lexeme = line[lexeme_count]
        if lexeme.isdigit():
            typ, tok, consumed = lex_num(line[lexeme_count:])
            lexeme_count += consumed
        elif lexeme == ‘“‘ or lexeme == “‘“:
            typ, tok, consumed = lex_str(line[lexeme_count:])
            lexeme_count += consumed
        elif lexeme.isalpha():
            typ, tok, consumed = lex_id(line[lexeme_count])
            lexeme_count += consumed
        else:
            lexeme_count += 1
code = input()
lex(code)
You are viewing a single comment. View All
Elderosa (23)

I’m not done yet, @elipie

Elderosa (23)

With the tutorial at least. The Lexer is pretty good @Elderosa

elipie (323)

@Elderosa shouldnt have posted it, then. lol, anything with PL dev, is an insta upvote from me

Elderosa (23)

I’m trying to figure out how I should make my parse tree? @elipie

elipie (323)

@Elderosa uhm, well, the syntax tree in a program isnt actually a tree. It literally is just doing stuff with the tokens. So dont try to make a tree, cuz you will get lost in trying to figure out where everything goes.

You know what a parse tree i am guessing, so its just the coding. THats the hard part lol.

this probably didnt help you at all
Elderosa (23)

ik lol. So I should just like try to run the tokens then and there? @elipie