What is a Lexer?
A lexer is an analyzer that moves through your code, looking at each character and trying to create tokens out of them.
This input
int a = 5*5
can be turned into
[('KeyWord', 'int'), ('ID', 'a'), ('assign', '='), ('num', 5), ('OP', '*'), ('num', 5)]
by the lexer you will learn how to create.
What if I Have Problems?
If you have trouble understanding something or you get errors, tell me and I’ll try my best to tell you what’s wrong.
Let’s Get Started!
First, open a new Python repl with whatever name you choose. Then create a function lex; this will be our function that basically does everything :). Next, make a variable code set to input(). Make sure the code initialization is not inside the function, and call lex on code.
def lex(line):
    pass  # we will fill this in below

code = input()
lex(code)
After this, set a new variable in lex and name it count or lexeme_count or something, and set it to 0.
The lexeme_count variable is going to keep track of the characters you have already scanned. Once you have that, add a while loop saying that as long as the number of characters you have scanned is less than the length of line, keep scanning.
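At this point lex should look roughly like this (a sketch for now; the loop body is just a placeholder that we fill in next):

def lex(line):
    lexeme_count = 0                      # how many characters we have scanned so far
    while lexeme_count < len(line):       # keep scanning until the end of the line
        lexeme = line[lexeme_count]       # the character we are currently looking at
        lexeme_count += 1                 # move to the next character (for now)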
We will then make it more powerful by knowing what each lexeme is: we can tell what the type is by using an if-elif-else statement that checks the type of the lexeme. Make sure to move the lexeme_count += 1 part into the else. Let’s fill in the blank conditional blocks.
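Filled in for digits and strings, the loop looks something like this (the lex_num and lex_str helpers are written a little further down):

def lex(line):
    lexeme_count = 0
    while lexeme_count < len(line):
        lexeme = line[lexeme_count]
        if lexeme.isdigit():
            # hand the rest of the line to a helper and get back the token
            typ, tok, consumed = lex_num(line[lexeme_count:])
            lexeme_count += consumed
        elif lexeme == '"' or lexeme == "'":
            typ, tok, consumed = lex_str(line[lexeme_count:])
            lexeme_count += consumed
        else:
            lexeme_count += 1             # skip anything we don't recognise yet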
Whoa, Slow Down! What’s Going on?
What we’re doing here is making three variables: one for the type of each token, one for the token itself, and one for the number of characters ‘consumed’, ‘eaten’, or ‘scanned’. Then we assign those variables to a function call which takes the rest of the line and reads off the rest of the token. We do this for both digits and strings. After this, we increase lexeme_count by the number of characters consumed so the two keep up with each other.
Is this it?
This is certainly not the full lexical analyzer, so let’s add some identifier lexing! Once we have finished with that, we can scan for literals, conditionals, operators, keywords, etc.
Let’s Lex Some Identifiers!
Add another elif to the if-elif-else statement; this will check if the lexeme is a letter of the alphabet. In this elif, we need to mirror what we did earlier, but with a call to a different function: lex_id().
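With the new branch added, lex looks like this (lex_id(), like the other helpers, is defined in the next section):

def lex(line):
    lexeme_count = 0
    while lexeme_count < len(line):
        lexeme = line[lexeme_count]
        if lexeme.isdigit():
            typ, tok, consumed = lex_num(line[lexeme_count:])
            lexeme_count += consumed
        elif lexeme == '"' or lexeme == "'":
            typ, tok, consumed = lex_str(line[lexeme_count:])
            lexeme_count += consumed
        elif lexeme.isalpha():            # a letter starts an identifier (or keyword)
            typ, tok, consumed = lex_id(line[lexeme_count:])
            lexeme_count += consumed
        else:
            lexeme_count += 1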
Time To Make The Functions!
We used three functions, but we haven’t defined them. Let’s go ahead and do that. First we’ll make the lex_num function scan until it hits a character that isn’t a digit and return the number.
def lex_num(line):
    num = ""
    for c in line:
        if not c.isdigit():
            break
        num += c                          # build up the digits of the number
    return 'num', int(num), len(num)

def lex_str(line):
    delimiter = line[0]                   # the quote character that opened the string
    string = ""

def lex_id(line):
    id = ""
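You can already try lex_num on its own to see the (type, token, consumed) triple it returns:

print(lex_num("123 + 4"))   # -> ('num', 123, 3)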
We will then fill out the lex_str() function doing the same thing as the digit one but for a string instead.
def lex_num(line):
    num = ""
    for c in line:
        if not c.isdigit():
            break
        num += c
    return 'num', int(num), len(num)

def lex_str(line):
    delimiter = line[0]                   # the quote character that opened the string
    string = ""
    for c in line[1:]:                    # skip the opening quote itself
        if c == delimiter:                # stop at the closing quote
            break
        string += c
    return 'str', string, len(string) + 2   # +2 for the two quote characters

def lex_id(line):
    id = ""
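A quick check of lex_str (note that the consumed count includes both quote characters):

print(lex_str('"hello" + 1'))   # -> ('str', 'hello', 7)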
And now we will fill out the lex_id() function!
def lex_num(line):
    num = ""
    for c in line:
        if not c.isdigit():
            break
        num += c
    return 'num', int(num), len(num)

def lex_str(line):
    delimiter = line[0]
    string = ""
    for c in line[1:]:
        if c == delimiter:
            break
        string += c
    return 'str', string, len(string) + 2

def lex_id(line):
    id = ""
    for c in line:
        if not c.isdigit() and not c.isalpha() and c != "_":
            break                         # stop at the first character that can't be part of a name
        id += c
    return 'ID', id, len(id)
What About KeyWords?
Yes, we will need to change the lex_id() function to know about keywords... What are you waiting for, read on!
We are going to make a list of keywords and check the id against it.
def lex_num(line):
    num = ""
    for c in line:
        if not c.isdigit():
            break
        num += c
    return 'num', int(num), len(num)

def lex_str(line):
    delimiter = line[0]
    string = ""
    for c in line[1:]:
        if c == delimiter:
            break
        string += c
    return 'str', string, len(string) + 2

def lex_id(line):
    keys = ['print', 'var', 'while', 'if', 'elif', 'else']
    id = ""
    for c in line:
        if not c.isdigit() and not c.isalpha() and c != "_":
            break
        id += c
    if id in keys:                        # keywords get their own token type
        return 'key', id, len(id)
    else:
        return 'ID', id, len(id)
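A quick sanity check of the difference between a keyword and a plain identifier:

print(lex_id("print x"))     # -> ('key', 'print', 5)
print(lex_id("total = 3"))   # -> ('ID', 'total', 5)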
The Entire Code
I know you want to go out and try this yourself, but if you need it, here is the full working code. If you would rather just copy and paste :(, do it below on my better lexer.
def lex_num(line):
    num = ""
    for c in line:
        if not c.isdigit():
            break
        num += c
    return 'num', int(num), len(num)

def lex_str(line):
    delimiter = line[0]
    string = ""
    for c in line[1:]:
        if c == delimiter:
            break
        string += c
    return 'str', string, len(string) + 2

def lex_id(line):
    keys = ['print', 'var', 'while', 'if', 'elif', 'else']
    id = ""
    for c in line:
        if not c.isdigit() and not c.isalpha() and c != "_":
            break
        id += c
    if id in keys:
        return 'key', id, len(id)
    else:
        return 'ID', id, len(id)

def lex(line):
    lexeme_count = 0
    while lexeme_count < len(line):
        lexeme = line[lexeme_count]
        if lexeme.isdigit():
            typ, tok, consumed = lex_num(line[lexeme_count:])
            lexeme_count += consumed
        elif lexeme == '"' or lexeme == "'":
            typ, tok, consumed = lex_str(line[lexeme_count:])
            lexeme_count += consumed
        elif lexeme.isalpha():
            typ, tok, consumed = lex_id(line[lexeme_count:])
            lexeme_count += consumed
        else:
            lexeme_count += 1

code = input()
lex(code)
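As written, lex() scans the line but throws away what it finds. If you want it to hand back a token list like the one at the top of this tutorial, one way (a sketch, not the only option) is to collect the (type, token) pairs as you go; spaces and operators are still skipped because we haven’t lexed those yet:

def lex(line):
    tokens = []                           # collected (type, token) pairs
    lexeme_count = 0
    while lexeme_count < len(line):
        lexeme = line[lexeme_count]
        if lexeme.isdigit():
            typ, tok, consumed = lex_num(line[lexeme_count:])
        elif lexeme == '"' or lexeme == "'":
            typ, tok, consumed = lex_str(line[lexeme_count:])
        elif lexeme.isalpha():
            typ, tok, consumed = lex_id(line[lexeme_count:])
        else:
            lexeme_count += 1             # skip spaces, operators, and anything else for now
            continue
        tokens.append((typ, tok))
        lexeme_count += consumed
    return tokens

code = input()
print(lex(code))

Running it on var message "hi" 42 prints [('key', 'var'), ('ID', 'message'), ('str', 'hi'), ('num', 42)].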