Making your own programming language with Python
Why make your own language?
When you write your own programming language, you control the entire programmer experience. This lets you shape exactly how each aspect of your language works and how a developer interacts with it, and it lets you build a language with the things you like from other languages and none of the stuff you don't. In addition, learning about programming language internals can help you better understand the internals of the languages you use every day, which can make you a better programmer.
How programming languages work
Every programming language runs differently, but many share a couple of fundamental steps: lexing and parsing.
Introduction to Lexing
Lexing is short for LEXical analysis. The lex step is where the language takes the raw code you've written and converts it into an easily parsable structure. This step interprets the syntax of your language and turns the text into special symbols called tokens. For example, let's say you have some code you want to parse. To keep it simple I'll use Python-like syntax, but it could be anything. It doesn't even have to be text.
# this is a comment
a = (1 + 1)
A lexer to parse this code might do the following:
Discard all comments
Produce a token that represents a variable name
Produce left and right parenthesis tokens
Convert literals like numbers or strings to tokens
Produce tokens for math operations like + - * / (and maybe bitwise/logical operators as well)
The lexer will take the raw code and turn it into a list of tokens. The lexer also ensures that two pieces of code that look slightly different, like 1 + 1 and 1+1, are still parsed the same way.
For the code above, it might generate tokens like this:
NAME(a) EQUALS LPAREN NUMBER(1) PLUS NUMBER(1) RPAREN
Tokens can be in many forms, but the main idea here is that they are a standard and easy to parse way of representing the code.
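For example, one simple way to represent tokens in Python would be a small named tuple with a type and a value (this is just an illustration; PLY's own token objects also track extra details like the line number):
from collections import namedtuple

# A hypothetical, minimal token: just a type and a value
Token = namedtuple('Token', ['type', 'value'])

# The tokens for a = (1 + 1) could then be stored as:
code_tokens = [
    Token('NAME', 'a'),
    Token('EQUALS', '='),
    Token('LPAREN', '('),
    Token('NUMBER', 1),
    Token('PLUS', '+'),
    Token('NUMBER', 1),
    Token('RPAREN', ')'),
]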
Introduction to Parsing
The parser is the next step in running your language. Now that the lexer has turned the text into consistent tokens, the parser simplifies and executes them. Parser rules recognize a sequence of tokens and do something with them. Let's start with a simple example that uses a GREET token and a NAME token.
A simple parser could just say:
If I see a GREET token and then a NAME token, print Hello, followed by the name.
A more complicated parser aiming to parse the code above might have these rules, which we will explore later:
Try to classify as much code as possible as an expression. By "as much code as possible" I mean the parser will first try to treat a full mathematical operation as an expression, and if that fails, treat a single variable or number as an expression. This ensures that as much code as possible is matched as an expression. The "expression" concept lets us catch many patterns of tokens with one piece of code, and we will use it in the next rule.
Now that we have the concept of an expression, we can tell the parser that if it sees the tokens NAME EQUALS and then an expression, a variable is being assigned (see the sketch below).
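Written out in the grammar notation we'll use later in this tutorial, those two rules could look roughly like this (the exact names here are placeholders):
statement  : NAME EQUALS expression

expression : expression PLUS expression
           | LPAREN expression RPAREN
           | NUMBER
           | NAME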
Using PLY to write your language
What is PLY?
Now that we know the basics of lexing and parsing, let's start writing some Python code to do it. PLY stands for Python Lex-Yacc. It is a library you can use to make your own programming language with Python. Lex is a well-known tool for writing lexers. Yacc stands for "Yet Another Compiler Compiler": it takes a description of a language's grammar and generates a parser for that language, which is a big part of building a compiler.
This tutorial is a short example, but the PLY documentation is an amazing resource with tons of examples. I would highly recommend that you check it out if you are using PLY.
For this example, we are going to be building a simple calculator with variables. If you want to see the fully completed example, you can fork this repl: [TODO!!]
Lexing with PLY lex
Lexer tokens
Let's start our example! Fire up a new Python repl and follow along with the code samples. To start off, we need to import PLY:
from ply import lex, yacc
Now let's define our first token. PLY requires you to have a tokens list which contains every token the lexer can produce. Our first token will be PLUS, for the plus sign:
tokens = [
    'PLUS',
]
t_PLUS = r'\+'
A string that looks like r'' is special in Python. The r prefix means "raw", which tells Python to treat backslashes as literal characters instead of escape characters. For example, to define the string \+ in Python, you could write either '\\+' or r'\+'. We are going to be using a lot of backslashes, so raw strings make things a lot easier.
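If you want to convince yourself of this, compare the two spellings in a Python repl:
# Both spellings produce the same two-character string: a backslash and a plus
print('\\+' == r'\+')  # True
print(len(r'\+'))      # 2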
But what does \+ mean? Well, in the lexer, tokens are mainly matched using regexes. A regex is like a special mini-language specifically for matching patterns in text. A great resource for regexes is regex101.com, where you can test your regexes with syntax highlighting and see explanations of each part.
I'm going to explain the regexes included in this tutorial, but if you want to learn more you can play around with regex101 or read one of the many good regex tutorials on the internet.
The regex \+ means "match a single + character". We have to put a backslash before it because + normally has a special meaning in regex (one or more of the previous thing), so we have to "escape" it to show we want to match a literal +.
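You can try this out with Python's built-in re module, which uses the same regex syntax as the lexer (this snippet is just for experimenting and isn't part of our language):
import re

# \+ matches a literal plus sign and nothing else
print(re.match(r'\+', '+'))  # a match object
print(re.match(r'\+', 'a'))  # None, no match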
We are also required to define a function that runs when the lexer encounters an error:
def t_error(t):
    print(f"Illegal character {t.value[0]!r}")
    t.lexer.skip(1)
This function just prints out a warning when the lexer hits a character it doesn't recognize and then skips it (the !r means repr, so the character is printed with quotes around it). You can change this to do whatever you want in your language, though.
Optionally, you can define a newline rule which doesn't produce a token in the lexer's output, but keeps track of the current line number.
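It can look like this (the same pattern the PLY documentation uses):
def t_newline(t):
    r'\n+'
    t.lexer.lineno += len(t.value)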
Since this token is defined as a function, we put the regex in the function's docstring instead. The function takes a parameter t, which is a special object representing the match that the lexer found. We can access the lexer itself through the t.lexer attribute.
This function matches one or more newline characters and then increases the line number by the number of newlines it saw. This lets the lexer know what line number it's on at all times, via the lexer.lineno variable.
Now we can use the line number in our error function:
def t_error(t):
    print(f"Illegal character {t.value[0]!r} on line {t.lexer.lineno}")
    t.lexer.skip(1)
Let's test out the lexer! This is just some temporary code; you don't have to know exactly what it does, because once we implement a parser, the parser will run the lexer for you.
lexer = lex.lex()
lexer.input('+')
for token in lexer:
    print(token)
Play around with the value passed to lexer.input. You should notice that any character other than a plus sign makes the error message print out, but doesn't crash the program. In your language, you can make it gracefully ignore lex errors like this or stop running entirely by editing the t_error function. If you add more lines to the input string, the line number in the error message should change.
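For example, an input with two lines and an unknown character might behave like this (your exact token printout may look a little different):
lexer.input('++\n+a')
for token in lexer:
    print(token)
# Prints a PLUS token for each +, plus something like:
# Illegal character 'a' on line 2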
More complicated tokens
Let's delete the test token and add some more complicated tokens. Replace your tokens list and the t_PLUS line with the following code:
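It could look something like this (the exact names are up to you):
reserved = {
    'greet': 'GREET',
}

tokens = [
    'NAME',
    'SPACE',
] + list(reserved.values())

def t_ID(t):
    r'[a-zA-Z_][a-zA-Z0-9_]*'
    if t.value in reserved:
        t.type = reserved[t.value]  # a reserved word: use its special token type
    else:
        t.type = 'NAME'             # otherwise it's a regular variable name
    return t

t_SPACE = r'[ ]'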
Let's explore the regex in the t_ID function. This regex is more complicated than the simple ones we've used before.
First, we have [a-zA-Z_]. This is a character class in regex; it matches any lowercase letter, uppercase letter, or underscore. Next we have [a-zA-Z0-9_], which is the same as above except digits are also included. Finally, we have *, which means "repeat the previous group or class zero or more times".
Why do we structure the regex like this? Having two separate classes means the first character has to match the first class for the whole thing to be a valid variable name. Because the first class excludes digits, the regex won't match a plain number, and it guarantees a variable name can't start with a digit. You can still have digits later in the name, because they are matched by the second class.
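You can check this behaviour with the re module again:
import re

name_re = re.compile(r'[a-zA-Z_][a-zA-Z0-9_]*')
print(name_re.fullmatch('my_var2'))  # matches: digits are fine after the first character
print(name_re.fullmatch('_hidden'))  # matches: starting with an underscore is fine
print(name_re.fullmatch('2cool'))    # None: a name can't start with a digit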
In the code, we first have a dictionary of reserved names. This is a mapping from names to the token types they should produce. The only entry we have says that greet should be mapped to the GREET token.
The code that sets up the tokens list takes all of the possible reserved token values (in this example that's just ['GREET']) and adds them onto the other token names, 'NAME' and 'SPACE', building the full tokens list automatically!
But why do we have to do this? Couldn't we just use something like the following code?
# Don't use this code! It doesn't work!
t_GREET = r'greet'
t_SPACE = r'[ ]'
t_NAME = r'[a-zA-Z_][a-zA-Z0-9_]*'
Actually, if we used that code, greet would never be matched! The lexer would match it with the NAME rule instead. To avoid this, we define a different kind of token rule: a function. This function has the regex as its docstring and is passed a t parameter, whose value attribute holds the text that was matched. The code inside the function simply checks whether this value is one of the reserved names we defined before. If it is, we set the t parameter's type attribute, which controls the type of token produced from the match. When the lexer sees the name greet, the function finds greet in the reserved names dictionary and produces a GREET token, because that is the corresponding value in the dictionary. Otherwise, it produces a NAME token, because it's a regular variable.
This also makes it easy to add more reserved terms later: it's as simple as adding another entry to the dictionary.
If needed, you could also make the keys of the reserved names dictionary regexes, and then match each regex against t.value inside the function.
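That variation isn't used in the rest of this tutorial, but a sketch of it could look like this (the greet(ing)? pattern is made up purely for illustration):
import re

reserved_patterns = {
    r'greet(ing)?': 'GREET',  # would match both "greet" and "greeting"
}

def t_ID(t):
    r'[a-zA-Z_][a-zA-Z0-9_]*'
    t.type = 'NAME'
    for pattern, token_type in reserved_patterns.items():
        # fullmatch checks the whole name against the pattern, not just a prefix
        if re.fullmatch(pattern, t.value):
            t.type = token_type
            break
    return t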
If you want to change these rules for your language, feel free!
Parsing with PLY yacc
Fair warning: yacc can sometimes be hard to use and debug, even if you know Python well. Keep in mind that you don't have to use both lex and yacc; if you want, you can use just lex and then write your own code to parse the tokens. With that said, let's get started.
Yacc basics
Before we get started, delete the lexer testing code (everything from lexer.input onward). When we run the parser, the lexer is automatically run for us.
Let's add our first parser rule!
def p_hello(t):
    'statement : GREET SPACE NAME'
    print(list(t))
    print(f"Hello, {t[3]}")
Let's break this down. Again, we have information about the rule in the docstring. This information is called a BNF grammar. A grammar rule consists of a non-terminal on the left-hand side and a sequence of terminals on the right-hand side. In the example above, statement is the non-terminal and GREET SPACE NAME are terminals. The left-hand side describes what is produced by the rule, and the right-hand side describes what matches the rule. The right-hand side can also contain non-terminals, just be careful to avoid infinite loops.
Basically, the yacc parser works by pushing tokens onto a stack and checking whether the tokens on the stack, together with the next token, match any rule it can use to simplify them. Here is a more in-depth explanation and example.
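As a rough illustration, here is how it might work through greet replit with the rule above (the real algorithm is a bit more involved):
# stack: (empty)            next: GREET  -> shift
# stack: GREET              next: SPACE  -> shift
# stack: GREET SPACE        next: NAME   -> shift
# stack: GREET SPACE NAME   next: (end)  -> matches "statement : GREET SPACE NAME", reduce
# stack: statement          done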
Before the above example can run, we still have to add some more code. Just like for the lexer, the error handler is required:
def p_error(t):
    if t is None:  # lexer error, already handled
        return
    print(f"Syntax Error: {t.value!r}")
Now let's create and run the parser:
parser = yacc.yacc()
parser.parse('greet replit')
If you run this code you should see:
[None, 'greet', ' ', 'replit']
Hello, replit
The first line is the list version of the object passed to the parser function. The first value is the statement that will be produced by the function, so it starts out as None. After that come the values of the tokens we specified in the rule. This is where the t[3] part comes from: index 3 of that list is the NAME token's value, so our parser prints out Hello, replit!
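In other words, the indexes line up like this:
# t[0] -> the result of the rule (None, since we never set it)
# t[1] -> 'greet'   (GREET)
# t[2] -> ' '       (SPACE)
# t[3] -> 'replit'  (NAME)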
Note: Creating the parser tables is a relatively expensive operation, so the parser creates a file called parsetab.py which it can load the parse tables from if they haven't changed. You can change this filename by passing a kwarg into the yacc initialization, like parser = yacc.yacc(tabmodule='fooparsetab')
More complicated parsing: Calculator
This example is different from our running example, so I will just show a full code example and explain it.
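Here is a sketch of the whole thing; details like the token names and the precedence table are my own choices, so adapt them as you like:
from ply import lex, yacc

# Tokens: numbers, the four math operations, and parentheses
tokens = ['NUMBER', 'PLUS', 'MINUS', 'TIMES', 'DIVIDE', 'LPAREN', 'RPAREN']

t_PLUS = r'\+'
t_MINUS = r'-'
t_TIMES = r'\*'
t_DIVIDE = r'/'
t_LPAREN = r'\('
t_RPAREN = r'\)'

def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)  # convert the matched text into an actual integer
    return t

t_ignore = ' \t'  # skip spaces and tabs entirely

def t_error(t):
    print(f"Illegal character {t.value[0]!r}")
    t.lexer.skip(1)

lexer = lex.lex()

# Give the operators their usual precedence so the grammar isn't ambiguous
precedence = (
    ('left', 'PLUS', 'MINUS'),
    ('left', 'TIMES', 'DIVIDE'),
)

def p_expression_binop(t):
    '''expression : expression PLUS expression
                  | expression MINUS expression
                  | expression TIMES expression
                  | expression DIVIDE expression'''
    if t[2] == '+':
        t[0] = t[1] + t[3]
    elif t[2] == '-':
        t[0] = t[1] - t[3]
    elif t[2] == '*':
        t[0] = t[1] * t[3]
    elif t[2] == '/':
        t[0] = t[1] / t[3]

def p_expression_parens(t):
    'expression : LPAREN expression RPAREN'
    t[0] = t[2]

def p_expression_number(t):
    'expression : NUMBER'
    t[0] = t[1]

def p_error(t):
    if t is None:
        return
    print(f"Syntax Error: {t.value!r}")

parser = yacc.yacc()
print(parser.parse('(1 + 2) * 3'))  # prints 9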
First we start off with the tokens: numbers, the mathematical operations, and parentheses. You might notice that I didn't use the reserved_tokens trick here, but you can implement it if you want.
Next we have a simple number token which matches one or more digits (0-9) with \d+ and then converts its value from a string to an integer.
The next piece of code we haven't used before is t_ignore. This variable holds all the characters the lexer should ignore; here it's a space and a tab character (' \t'). When the lexer sees these, it just skips them. This lets users add spaces without affecting the lexer.
Now we have three parser rules.
The first is a large one, producing an expression from four possible input patterns, one for each math operation. Each pattern has an expression on either side of the math operator. Inside this rule, we have some (pretty ugly) code that performs the correct operation based on the operator token it was given. If you want to make this prettier, consider a dictionary built with the Python stdlib operator module, as sketched below.
Next, we define an expression with parentheses around it as being the same as the expression inside. This substitutes the inner expression's value for the parenthesized group, which means the inside is evaluated first. With very little code we've created a rule that handles nested parentheses correctly.
Finally, we define a number as an expression on its own, which allows a plain number to be used as one of the expressions in the first rule.
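For example, the if/elif chain in the first rule could be rewritten with a dictionary like this (just a sketch of the idea):
import operator

# Map each operator symbol to the function that implements it
ops = {
    '+': operator.add,
    '-': operator.sub,
    '*': operator.mul,
    '/': operator.truediv,
}

def p_expression_binop(t):
    '''expression : expression PLUS expression
                  | expression MINUS expression
                  | expression TIMES expression
                  | expression DIVIDE expression'''
    t[0] = ops[t[2]](t[1], t[3])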
For a challenge, try adding variables into this calculator! You should be able to set variables by using syntax like varname = any_expression and you should be able to use variables in expressions. If you're stuck, see one solution from the PLY docs.
That's it!
Thanks for reading! If you have questions, feel free to ask in the Replit Discord's #help-and-reviews channel, or just in the comments. Have fun!