Let's Make a programming language! 1. the Lexer
abodialhazmi02 (3)

How do programming languages work?

A programming language runs your code in two or more steps. The first one is lexing, which is just a fancy word for "tokenizing": the lexer takes the contents of a file and breaks it into tokens. Here is an example:

while True:
    print("Hello world!")

to:

WHILELOOP
BOOLEAN: True
COLON
PRINT FUNCTION
STRING: "Hello world!"

or something like that (a real lexer would usually also emit tokens for the parentheses).
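To make the idea concrete, here is a minimal hand-written tokenizer using only Python's built-in re module. It is a sketch, not how sly works internally: the token names and patterns in TOKEN_SPEC are made up for this illustration, and it only knows the handful of words used in the example above.

```python
import re

# Hypothetical token names and patterns, chosen only for this illustration.
TOKEN_SPEC = [
    ("WHILE",   r"while\b"),
    ("BOOLEAN", r"(?:True|False)\b"),
    ("PRINT",   r"print\b"),
    ("STRING",  r'"[^"\n]*"'),
    ("COLON",   r":"),
    ("LPAREN",  r"\("),
    ("RPAREN",  r"\)"),
    ("SKIP",    r"\s+"),  # whitespace is matched but never emitted
]

# Join every pattern into one regex with a named group per token type.
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(source):
    """Return a list of (type, text) pairs, or raise SyntaxError."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise SyntaxError(f"Illegal character {source[pos]!r} at index {pos}")
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens

for token in tokenize('while True:\n    print("Hello world!")'):
    print(token)
```

Each pattern is tried at the current position, the matching text becomes a token, and the position moves forward until the input runs out.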
The next step is parsing. The parser checks whether the tokens follow a pattern it knows. If they do not, like:

True while ("hello world!")print 

then it throws an error. But if the pattern is correct, the program gets evaluated, so:

while True:
    print("hello world!")

to:

hello world!
hello world!
hello world!
hello world!
hello world!
hello world!
hello world!
etc...
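The "is this a known pattern?" check can be sketched as a toy like the one below. It is only an illustration: the names in EXPECTED are hypothetical and copy the token list from the lexing example, and a real parser works from grammar rules instead of one fixed sequence.

```python
# Hypothetical token-type names, mirroring the token list shown earlier.
EXPECTED = ["WHILELOOP", "BOOLEAN", "COLON", "PRINT FUNCTION", "STRING"]

def check_order(token_types):
    """Return True if the types match EXPECTED, else raise SyntaxError."""
    for position, (got, want) in enumerate(zip(token_types, EXPECTED)):
        if got != want:
            raise SyntaxError(f"token {position}: expected {want}, got {got}")
    if len(token_types) != len(EXPECTED):
        raise SyntaxError(f"expected {len(EXPECTED)} tokens, got {len(token_types)}")
    return True

# The correct order passes the check.
print(check_order(["WHILELOOP", "BOOLEAN", "COLON", "PRINT FUNCTION", "STRING"]))

# The jumbled order is rejected at the first token that is out of place.
try:
    check_order(["BOOLEAN", "WHILELOOP", "STRING", "PRINT FUNCTION"])
except SyntaxError as error:
    print("error:", error)
```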

What we will be building

We will be building a lexer for a calculator, then in the next post we will make a parser for it.

CalcLang
> 1+1
Token(type='NUMBER', value='1', lineno=1, index=0)
Token(type='PLUS', value='+', lineno=1, index=1)
Token(type='NUMBER', value='1', lineno=1, index=2)
> 3+6-3
Token(type='NUMBER', value='3', lineno=1, index=0)
Token(type='PLUS', value='+', lineno=1, index=1)
Token(type='NUMBER', value='6', lineno=1, index=2)
Token(type='NEGATIVE', value='-', lineno=1, index=3)
Token(type='NUMBER', value='3', lineno=1, index=4)

Let's begin!

I will be using sly, which stands for "Sly Lex Yacc" (you can install it with pip install sly).

So create a new file and import sly's Lexer:

from sly import Lexer

Then create a class called lexer (or anything you want) and let that class inherit from sly's Lexer class.

At this point your code should look like this:

from sly import Lexer

class lexer(Lexer):
    pass

Inside the class, let us create a set called tokens. This is important: it MUST be called tokens, because that is where the lexer looks for the names of all our token types, like ints or floats.

Once you have created the tokens set, add the names NUMBER, PLUS and NEGATIVE to it, with no quotes. Your IDE, text editor or repl will probably flag them as undefined names, but ignore that: sly resolves them when the class is created.

tokens = {
    NUMBER,
    PLUS,
    NEGATIVE
}

By the way, they don't need to be called NUMBER, PLUS and NEGATIVE. You can call them NUM, ADD and SUB, or something else, but keep the names in ALL CAPS, because sly only recognizes unquoted token names written in uppercase.

Now let us tell the lexer what these tokens ACTUALLY look like. Luckily, sly makes this very easy. First, make a variable inside the class for each token; the variable must have the same name as the token, or else the lexer will get confused. Second, assign each variable a raw string containing a regular expression:

NUMBER = r'\d+' # regex for numbers
PLUS = r'\+' # must add a backslash, since + is a special character in regex.
NEGATIVE = r'\-'
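If you want to see what these patterns match before handing them to sly, you can try them with Python's built-in re module. This is just a quick experiment outside the lexer class, assuming the same three patterns as above:

```python
import re

NUMBER = r'\d+'    # one or more digits
PLUS = r'\+'       # the backslash turns + into a literal plus sign
NEGATIVE = r'\-'   # a literal minus sign

print(re.fullmatch(NUMBER, '42') is not None)    # True: several digits form one number
print(re.fullmatch(NUMBER, '4a') is not None)    # False: 'a' is not a digit
print(re.fullmatch(PLUS, '+') is not None)       # True
print(re.fullmatch(NEGATIVE, '-') is not None)   # True

# Without the backslash, + means "repeat the previous thing",
# and there is nothing before it to repeat.
try:
    re.compile('+')
except re.error as error:
    print('unescaped + is rejected:', error)
```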

Now let us try out the lexer by adding these two lines of code below the class:

for token in lexer().tokenize('1+1'):
    print(token)

which prints:

Token(type='NUMBER', value='1', lineno=1, index=0)
Token(type='PLUS', value='+', lineno=1, index=1)
Token(type='NUMBER', value='1', lineno=1, index=2)

Success! We have created a lexer! But you may notice a problem when you add spaces:

for token in lexer().tokenize('1 + 1'):
    print(token)

which prints:

Token(type='NUMBER', value='1', lineno=1, index=0)
Illegal character ' ' at index 1

Since we didn't tell the lexer what to do with whitespace, it throws an error at us. To fix this, add an "ignore" variable that lists the characters you want the lexer to skip. It is a plain string of characters, not a regex, so don't make it a raw string. Just add this into the lexer class:

ignore = " \t" # characters to skip: space and tab

and the error should go away!
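One detail worth checking here: as far as I understand sly's documentation, ignore is treated as a set of literal characters to skip, not as a regular expression. That means it matters whether the string is raw, because a raw string keeps the backslash instead of turning \t into a tab. The plain Python below (no sly needed) shows the difference:

```python
raw = r" \t"    # three characters: space, backslash, the letter t
plain = " \t"   # two characters: space, tab

print(len(raw), len(plain))          # 3 2
print("\t" in raw, "\t" in plain)    # False True
print("t" in raw, "t" in plain)      # True False
```

So a raw string would skip backslashes and the letter t, while tabs would still be reported as illegal characters.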

Now we have a fully functioning lexer!
I recommend adding your own tokens, like multiplication and division.

the full code is here:

from sly import Lexer

class lexer(Lexer):
    tokens = {
        NUMBER,
        PLUS,
        NEGATIVE
    }
    
    NUMBER = r'\d+'
    PLUS = r'\+'
    NEGATIVE = r'\-'
    
    ignore = " \t"


while True:
    data = input("> ")
    for token in lexer().tokenize(data):
        print(token)

I made some minor changes (the loop now reads input from you instead of a fixed string), but the rest of the code is still the same.

EpicGamer007 (1121)

plez continue this lol. the first python tutorial i am really interested in.